Team:Shenzhen BGIC 0101/Tutorial

From 2013.igem.org

Revision as of 17:03, 27 September 2013


Tutorial


1. NeoChr

The NeoChr module helps users manually collect related genes from different pathways, rewire the relationships between genes logically*, and replace genes with higher-scoring orthologs*. First, it lets users define gene order and orientation by drag and drop. Second, it decouples any overlapping genes so that all genes are non-redundant. Finally, it adds chromosome features to build a new chromosome and displays it in JBrowse. Users can also drag a window in JBrowse and delete any gene in that window.
Note:
*These functions are not available yet; please wait for version 2.
**You can also add anything here, including your own watermark.

2. Plugin Scripts

This module contains three plugins: Decouple.pl, Add.pl and Delete.pl.

2.1 Decouple.pl

This plugin decouples genes whose gene regions overlap. Overlapping genes can be decoupled if they meet the following conditions: (1) the 5’-UTR of the latter gene does not cover the start codon (ATG) of the former gene; (2) the start coordinate of the overlapping region lies in the coding DNA sequence (CDS) of the gene that needs to be decoupled; (3) the codon at the decoupling site in the CDS has a synonymous codon that can replace it. After decoupling, the non-redundant genes are used to generate a GFF file and a FASTA file.

2.1.1 Internal operation

First, the plugin extracts the base sequences from the genome file according to the gene order list and records the gene order. It then records the annotation information from the species GFF file; if the GFF file does not annotate the UTRs, the plugin extends each gene CDS 600bp upstream as the 5’-UTR and 100bp downstream as the 3’-UTR.
Second, the plugin detects overlapping genes on the same chromosome. When overlapping genes are detected, it checks whether the start site of the overlap is located in a CDS region and determines whether that site is in phase 0, 1 or 2.
Third, the plugin tries to break the start codon inside the CDS by substituting a synonymous codon. It then prints whether each gene was decoupled successfully, for example:
data
The non-redundant genes are then generated.
Finally, the plugin joins the non-redundant genes in the given gene order to construct a new chromosome.
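
The overlap detection and phase check in the second step can be sketched as follows. This is a minimal Python illustration, not the code of Decouple.pl: the Gene record, the function names, the plus-strand phase formula and the coordinates are all assumptions made for the example.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    """Hypothetical gene record; coordinates are 1-based and inclusive, as in GFF."""
    name: str
    chrom: str
    start: int
    end: int


def find_overlaps(genes):
    """Return (former, latter, overlap_start, overlap_end) for each overlapping pair
    of genes located on the same chromosome."""
    overlaps = []
    ordered = sorted(genes, key=lambda g: (g.chrom, g.start, g.end))
    for i, former in enumerate(ordered):
        for latter in ordered[i + 1:]:
            # Genes are sorted, so once one starts past the former gene
            # (or sits on another chromosome) no later gene can overlap it.
            if latter.chrom != former.chrom or latter.start > former.end:
                break
            overlaps.append(
                (former.name, latter.name, latter.start, min(former.end, latter.end))
            )
    return overlaps


def codon_phase(cds_start, site):
    """Phase (0, 1 or 2) of a site inside a plus-strand CDS (assumed convention)."""
    return (site - cds_start) % 3


# Invented example coordinates, for illustration only.
example_genes = [
    Gene("geneA", "chrI", 100, 500),
    Gene("geneB", "chrI", 450, 900),
    Gene("geneC", "chrI", 1000, 1200),
    Gene("geneD", "chrII", 100, 500),
]
```

Calling find_overlaps(example_genes) reports only the geneA/geneB pair, because geneC starts after geneB ends and geneD lies on another chromosome.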

2.1.2 Example

The plugin accepts the gene order list in two forms:
1. As a string:
perl GeneDecouple.pl --species saccharomyces_cerevisiae_chr --list_format string --gene_order="YAL054C -,YAL038W +,YBR019C -,YBR145W +,YCL040W +,YCR012W +,YCR105W +,YDL168W +,YPL017C -,YIL177C -,YIL177W-A +,YIL172C -,YIL171W-A +," --geneset_dir ../gene_set --upstream_extend 600 --downstream_extend 100 --neo_chr_gff neochr.gff --neo_chr_fa neochr.fa
2. As a file:
perl GeneDecouple.pl --species saccharomyces_cerevisiae_chr --list_format file --gene_order gene_ordre.list --geneset_dir ../gene_set --upstream_extend 600 --downstream_extend 100 --neo_chr_gff neochr.gff --neo_chr_fa neochr.fa

2.1.3 Parameters

Parameter	Description	Default	Selectable range
list_format	set the input form of the gene order list	string	string/file
gene_order	set the input gene order list file (includes pathway genes and additional genes)		
geneset_dir	set the species annotation directory		
upstream_extend	set the length of the gene upstream extension (bp)	600	
downstream_extend	set the length of the gene downstream extension (bp)	100	
neo_chr_gff	set the name of the output neochr GFF file		
neo_chr_fa	set the name of the output neochr FASTA file		
help	show help information		


2.1.4 The format of output file

The output files are the decoupled genes in standard GFF and FASTA formats.
  1. decoupled GFF file
data
  2. decoupled FASTA file
data

2.2 Add.pl

This plugin adds the LoxPsym sequence, the customized left and right telomeres, the centromere and the autonomously replicating sequence (ARS) to the FASTA and GFF files generated by Decouple.pl.

2.2.1 Internal operation

The plugin adds LoxPsym after the first 3bp of the 3’-UTR of each gene, and adds the telomeres, centromere and ARS in this layout:
left_telomere + gene1 + centromere + gene2 + ARS + gene3 + right_telomere
The distance between the centromere and the ARS is less than 30kb.
Finally, the user can view the chromosome with the newly added features in JBrowse.
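
The layout above amounts to concatenating the parts in order while recording where each one lands, so that the GFF coordinates stay in step with the FASTA sequence. A minimal Python sketch of that idea follows; the function name and the toy sequences are assumptions for illustration, and it does not reproduce Add.pl or its LoxPsym insertion.

```python
def assemble(parts):
    """Concatenate (name, sequence) parts in order.

    Returns the joined sequence and a list of (name, start, end) tuples
    with 1-based inclusive coordinates, as used in GFF.
    """
    pieces = []
    coords = []
    pos = 1
    for name, seq in parts:
        coords.append((name, pos, pos + len(seq) - 1))
        pieces.append(seq)
        pos += len(seq)
    return "".join(pieces), coords


# Toy sequences, far shorter than real features.
toy_parts = [
    ("left_telomere", "AAAA"),
    ("gene1", "CCCCCC"),
    ("centromere", "GG"),
    ("gene2", "TTT"),
    ("ARS", "AC"),
]
chromosome, features = assemble(toy_parts)
```

With the recorded coordinates, the distance between the centromere and the ARS can be checked against the 30kb limit before the files are written.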

2.2.2 Example

perl 04.Add.pl --loxp loxPsym.feat --left_telomere UTC_left.feat --right_telomere UTC_right.feat --ars chromosome_I_ARS108.feature --centromere chromosome_I_centromere.feat --chr_gff neochr.gff --chr_seq neochr.fa --neochr_seq neochr.final.fa --neochr_gff neochr.final.gff

Every feature file uses a four-line format, for example:
  name = site_specific_recombination_target_region
  type = loxPsym
  source = BIO
  sequence = ATAACTTCGTATAATGTACATTATACGAAGTTAT
Note: the first line is the detailed name of the feature, the second line is its type, the third line is its source and the last line is its sequence.
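
A reader for this four-line format could look like the sketch below. It is written in Python for illustration only; the function name parse_feature and the error handling are assumptions, not part of Add.pl.

```python
REQUIRED_KEYS = ("name", "type", "source", "sequence")


def parse_feature(text):
    """Parse a 'key = value' feature file into a dict and check the four keys."""
    feature = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition("=")
        feature[key.strip()] = value.strip()
    missing = [k for k in REQUIRED_KEYS if k not in feature]
    if missing:
        raise ValueError("feature file is missing: " + ", ".join(missing))
    return feature


# The loxPsym example shown above.
LOXPSYM_TEXT = """
  name = site_specific_recombination_target_region
  type = loxPsym
  source = BIO
  sequence = ATAACTTCGTATAATGTACATTATACGAAGTTAT
"""
loxpsym = parse_feature(LOXPSYM_TEXT)
```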

2.2.3 Parameters

Parameter	Description	Default	Selectable range
loxp	set the sequence of loxp	ATAACTTCGTATAATGTATGCTATACGAAGTTAT	
left_telomere	set the sequence of left telomere		
right_telomere	set the sequence of right telomere		
chr_gff	set the input neochr GFF file		
chr_seq	set the input neochr FASTA file		
neochr_seq	set the name of the output neochr FASTA file with loxPs and telomeres added		
neochr_gff	set the name of the output neochr GFF file with loxPs and telomeres added		

2.2.4 The format of output

The output files are the chromosome with the added features, in standard GFF and FASTA formats.
1. GFF file with added features
data

2.3 Delete.pl

This plugin modifies the GFF and FASTA files generated by Add.pl: the user drags a window in JBrowse and deletes any gene in that window.

2.3.1 Internal operation

First, the user drags a window with the mouse over the chromosome with added features shown in JBrowse, and JBrowse displays all the genes in this window. Second, the user decides which genes need to be deleted from the new chromosome, and the plugin deletes those genes from the GFF file and modifies the FASTA file at the same time.

2.3.2 Example

perl 05.delete.pl --delete="YAL054C,YAL038W" --neochr_gff neochr.refine.final.gff --neochr_fa neochr.refine.final.fa --slim_gff neochr.refine.delete.gff --slim_fa neochr.refine.delete.fa

2.3.3 Parameters

Parameter	Description	Default	Selectable range
delete	Set the list of genes to be deleted		
neochr_gff	Set the input GFF file which is generated by Add.pl		
neochr_fa	Set the input FASTA file which is generated by Add.pl		
slim_gff	Set the output GFF file		
slim_fa	Set the output FASTA file		

2.3.4 The format of output

The output files are the chromosome with the selected genes deleted, in standard GFF and FASTA formats.


3. SegmMan

This module cuts the chromosome into pieces of different sizes and adds Gibson, Golden Gate or homologous adaptors to them so that they can be assembled back into the whole chromosome experimentally.

Plugin Scripts


3-1. 01.whole2mega.pl

This utility splits a whole chromosome (at least 90kbp long) into segments of about 30kbp and adds homologous overlaps and adaptors, so that these fragments can be assembled into the whole chromosome experimentally.

Internal operation

First, this utility searches for the locations of the centromere and the ARSs (autonomously replicating sequences). The minimal distance between the centromere and an ARS should NOT be larger than a defined megachunk, which is about 30kbp long.
Second, this utility cuts out the first 30kbp sequence window containing the centromere and its adjacent ARS, and then adds the two original markers and the left and right telomeres to this megachunk.
Third, this utility continues to cut more megachunks from the original chromosome towards both ends. These megachunks are not independent: adjacent megachunks overlap by about 1kbp. Each of these newly split windows is given only one marker, chosen alternately, and only a left or a right telomere.
The output file is then processed by 02.globalREmarkup.pl. For more information about segmentation design, please refer to the page ASSEMBLY DESIGN PRINCIPLE.
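
The windowing arithmetic behind overlapping megachunks can be sketched as follows. This Python example is a deliberate simplification and an assumption, not the algorithm of 01.whole2mega.pl: it cuts windows from left to right only, and ignores the centromere/ARS anchoring of the first window, the markers and the telomeres. The function name is hypothetical; the defaults mirror the -ck and -ol values used in this tutorial.

```python
def split_into_megachunks(length, chunk=30000, overlap=1000):
    """Return 0-based half-open (start, end) windows covering a sequence of the
    given length, where each window shares `overlap` bases with the previous one."""
    if length <= 0:
        raise ValueError("length must be positive")
    if overlap < 0 or chunk <= overlap:
        raise ValueError("chunk must be larger than a non-negative overlap")
    windows = []
    start = 0
    while True:
        end = min(start + chunk, length)
        windows.append((start, end))
        if end == length:
            break
        # Step back by the overlap so neighbouring windows share sequence.
        start = end - overlap
    return windows


windows_90k = split_into_megachunks(90000)
```

In this simplified scheme the last window is shorter than the others; a real design would also have to decide how to handle such a remainder.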

Example (command line)

perl 01.whole2mega.pl -gff sce_chrI.gff -fa sce_chr01.fa -ol 1000 -ck 30000 -m1 LEU2 -m2 URA3 -m3 HIS3 -m4 TRP1 -ot sce_chrI.mega

Parameters

Parameter	Description	Default	Selectable range
gff	The GFF file of the chromosome to be segmented		
fa	The FASTA file of the chromosome to be segmented (the chromosome must be longer than 90kbp)		
ol	The length of overlap between megachunks	1000bp	
ck	The length of megachunks	30kbp	
m1	The first marker for alternate selection	LEU2 (1797bp)	LEU2/URA3/HIS3/TRP1
m2	The second marker for alternate selection	URA3 (1112bp)	LEU2/URA3/HIS3/TRP1
m3	The first marker originally residing in the first 30kbp segment	HIS3 (1774bp)	LEU2/URA3/HIS3/TRP1
m4	The second marker originally residing in the first 30kbp segment	TRP1 (1467bp)	LEU2/URA3/HIS3/TRP1
ot	The output file	Prefix (fa filename) + suffix (.mega)	

The format of output:

The output file is stored in /the path where you install GENOVO/Result/ 01.whole2mega. In addition, the process state and result are printed to the screen.
1. Screen output
2. 01.state
Stores the segmentation information:

Megachunk_ID	Corresponding location in the designed chromosome
Part ID	Location in the segmentation

Presentation from KGML

This module grabs the details of genes in different pathways from KEGG, using KEGG Markup Language (KGML) files, and exports a gene list and the relationships between genes. The goal is to visualize the pathway and rebuild it at the gene level.

Scripts
1. keggid_convert_gene.pl

This utility converts the KEGG IDs in a KGML file into gene names and rewrites the KGML file.

Internal operation

First, this utility changes the pathway name into the KEGG database name, then opens the file containing the entire list of genes and stores them in a hash.
Second, this utility reads in the original pathway XML file and replaces each KEGG ID with the gene name one by one. It also changes every type attribute to “gene”.
Third, the result replaces the original pathway XML file.
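
The replacement step can be pictured with the short sketch below. It is written in Python for illustration, whereas keggid_convert_gene.pl is a Perl script; the function name, the KEGG IDs and the gene name GENE_A are placeholders, and the standard-library ElementTree parser is an assumption about how the XML might be handled.

```python
import xml.etree.ElementTree as ET


def rename_entries(kgml_text, id_to_gene):
    """Replace the KEGG IDs in each entry's name attribute with gene names
    and set every entry type to 'gene'. IDs without a mapping are kept."""
    root = ET.fromstring(kgml_text)
    for entry in root.iter("entry"):
        ids = entry.get("name", "").split()
        entry.set("name", " ".join(id_to_gene.get(i, i) for i in ids))
        entry.set("type", "gene")
    return ET.tostring(root, encoding="unicode")


# Hand-written fragment with placeholder IDs, not taken from a real KGML file.
FRAGMENT = (
    '<pathway name="path:ko04010">'
    '<entry id="1" name="ko:K00001 ko:K00002" type="ortholog"/>'
    "</pathway>"
)
rewritten = rename_entries(FRAGMENT, {"ko:K00001": "GENE_A"})
```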

Example

perl keggid_convert_gene.pl ko04010

The format of output:

It rewrites the original pathway XML file. If we have the following statement:
After running this script, it turns into:

2. convert.py

This utility reads in a KGML file that has already been rewritten, grabs gene information such as gene names and gene relationships, and converts it into JSON.

Internal operation

First, this utility uses the parameter -f to determine the specified file and reads it in. The output files are named after the given file name.
Second, this utility converts the KGML file into JSON and grabs gene information, such as the reactions of genes and the relationships between genes.
Third, this utility integrates the information above into two files, ‘gene.json’ and ‘relation.json’, which can be used directly to rewrite the gene pathway.
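
A stripped-down version of this conversion might look like the sketch below. It is not the code of convert.py: it uses the standard-library xml.etree.ElementTree instead of lxml, the JSON layout is an assumption, and the KGML fragment is hand-written around the YCL061C/YPL153C relation discussed later in this tutorial rather than copied from sce04111.xml.

```python
import json
import xml.etree.ElementTree as ET

# Hand-written KGML fragment for illustration; ids and relation type are assumed.
SAMPLE_KGML = """
<pathway name="path:sce04111" org="sce" number="04111">
  <entry id="1" name="YCL061C" type="gene"/>
  <entry id="2" name="YPL153C" type="gene"/>
  <relation entry1="1" entry2="2" type="PPrel">
    <subtype name="activation" value="--&gt;"/>
  </relation>
</pathway>
""".strip()


def kgml_to_json(kgml_text):
    """Return (genes_json, relations_json) extracted from a KGML document."""
    root = ET.fromstring(kgml_text)
    # One record per entry, keyed by the entry id used inside the KGML file.
    genes = {
        e.get("id"): {"name": e.get("name"), "type": e.get("type")}
        for e in root.findall("entry")
    }
    relations = [
        {
            "type": rel.get("type"),
            "subtype": [s.get("name") for s in rel.findall("subtype")],
            "entry1": rel.get("entry1"),
            "entry2": rel.get("entry2"),
        }
        for rel in root.findall("relation")
    ]
    return json.dumps(genes, indent=2), json.dumps(relations, indent=2)


genes_json, relations_json = kgml_to_json(SAMPLE_KGML)
```

The two returned strings correspond to the gene list and the relation list; writing them to disk under the names described above is left out of the sketch.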

Example

python convert.py -f sce04111

Parameters:

-f/--file read KGML from FILENAME (omit '.xml') and produce two files: gene list and relation

The format of output:

The output files are stored in the directory where you run this program.
1. _gene.json
type:

ortholog	KO (orthology group)
enzyme	Enzyme
reaction	Reaction
gene	gene product (mostly a protein)
group	a complex of gene products (mostly a protein complex)
geneID:the unique identification of gene
name	the KEGGID of this gene 
type	the type of this gene
reaction:
name	the KEGGID of this reaction
reversible	true: reversible reaction; false: irreversible reaction
substrates	KEGGID of substrate node
products	the KEGGID of product node
related-reactions	relate to another pathway or gene
2. _relation.json
relations:
type	The type of this relation [ECrel, PPrel, GErel, PCrel, maplink]
subtype	Interaction/relation information[activation/inhibition]
entry1	The first (from) entry that defines this relation
entry2	The second (to) entry that defines this relation
entry1&2:
entry ID	The KEGGID of node which takes part in this relation
type	Have only two options: [gene/group]
name	The KEGGID of this gene
group	The node is a complex of gene products (mostly a protein complex)


Shortcoming

We cannot acquire KGML files automatically through the KEGG API, so every demo we have shown needs its KGML file to be downloaded beforehand. You can get the entire gene list of a database through the KEGG API; for example, http://rest.kegg.jp/list/ko shows the entire list of orthology genes. The method for downloading KGML files is described in the “How to finish this plug-in” part. Some genes relate to another pathway, but this is not shown in the pathway file, so we fail to grab the relationships between genes and pathways automatically. We therefore added two gene-pathway relations to the ko04010 demo manually: the TP53 gene is connected with ko04115 (p53 signaling pathway) and the NLK gene is connected with ko04310 (Wnt signaling pathway). We look forward to improving this plug-in to overcome these disadvantages.

How to finish this plug-in?

KEGG is a database resource for understanding high-level functions and utilities of the biological system, and KGML is an XML presentation of the KEGG pathway database, which enables automatic drawing of KEGG pathways and provides facilities for computational analysis and modeling of gene/protein networks and chemical networks. Here is the data structure of KGML.
It is really complex, and understanding a KGML file can be troublesome. Do not worry: I will show you how to understand a KGML file and then how to convert it into JSON.
First, how do you find or download a KGML file?
Method:
Use the “Download KGML” link on each pathway map page.
If you choose a pathway with the prefix “map”, you cannot find the download link on the page, because it is a reference map from which the ko/rn/ec/org files are generated.
For example, “map04111” has no download link. But if you change “map” into “sce” in the URL, you can get the file.
KEGG API: taking “sce04111” as an example, you can download the KGML file through the KEGG API.
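
As a hedged illustration of the KEGG API route, the sketch below only builds the request URL. It assumes that the download goes through the get/<pathway id>/kgml operation of the same rest.kegg.jp service quoted in the Shortcoming section; the function name and the identifier check are hypothetical, and no request is actually sent.

```python
import re

KEGG_REST = "http://rest.kegg.jp"


def kgml_url(pathway_id):
    """Build the URL assumed to return the KGML file of one pathway."""
    # Assumed identifier shape: a short lowercase prefix plus five digits.
    if not re.fullmatch(r"[a-z]{2,4}\d{5}", pathway_id):
        raise ValueError("unexpected pathway identifier: " + repr(pathway_id))
    # As noted above, reference pathways with the 'map' prefix offer no KGML.
    if pathway_id.startswith("map"):
        raise ValueError("'map' pathways have no KGML; use ko/rn/ec or an organism prefix")
    return f"{KEGG_REST}/get/{pathway_id}/kgml"


example_url = kgml_url("sce04111")
```

The resulting URL can be opened in a browser or passed to any HTTP client to save the file as sce04111.xml.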
Second, it is difficult to work out the relationships in a KGML file from the data structure shown above.
Here is a simple but straightforward tutorial on how to understand a KGML file.
Take “sce04111.xml” as an example. We can simplify the data structure as:

The entry element can be path/ko/ec/rn/cpd/gl/org/group; enzyme/protein/gene entries have relations, and gene entries also have compounds and reactions. We choose one relation as an example:
It means that gene (YCL061C) has an activating effect on gene (YPL153C). By the way, through the graphics elements, we can definitely rewrite the connection between these two genes.
Third, although we know how to get a KGML file and can understand it, the crucial problem is how to grab the gene information and convert it into JSON.
Here we use the Python programming language, install the library “lxml” for processing XML and HTML, and import the JSON library for the conversion.
Fourth, you may ask: is there any other software that can do this job, that is, read KGML files, convert them and rewrite them?
Of course, and we did try some, but failed because they are not open source or their source is difficult to modify. If you want to visualize pathways, Cytoscape is a good choice, and it also has cytoscape.js for drawing.
Finally, we did a great deal of preparatory work; in the end it may not be useful in this software, but it has expanded our horizons.