Team:USTC-Software/Project/Method

From 2013.igem.org

(Difference between revisions)
Line 59: Line 59:
   <h1 align="justify">Methodologies</h1>
   <h1 align="justify">Methodologies</h1>
   <p align="justify">In order to simulate the GRN's working and analyze the changing after exogenous gene imported, some advanced algorithms and classical methods are employed in the software. These algorithms and methods include Binary Tree method, Needle-Wunsch Algorithm, Decision Tree method, Hill Equation and PSO Algorithm.</br><br/>
   <p align="justify">In order to simulate the GRN's working and analyze the changing after exogenous gene imported, some advanced algorithms and classical methods are employed in the software. These algorithms and methods include Binary Tree method, Needle-Wunsch Algorithm, Decision Tree method, Hill Equation and PSO Algorithm.</br><br/>
-
There are five parts of methodologies: Fetch Database, Alignment Analyze, New Network Construction, Network Model and Predict.
+
There are four parts of methodologies: Database, Operon Theory and Regulatory Model, Forward Analysis and Reverse Analysis.
   </p>
   </p>
        
        
Line 68: Line 68:
<div id="Fetch_Database">
<div id="Fetch_Database">
-
<h2>Fetch Database</h2>
+
<h2>Database</h2>
<div id="jobs_container">
<div id="jobs_container">
-
        <div class="jobs_trigger"><strong>Fetch Database Abstract</strong></div>
+
        <div class="jobs_trigger"><strong>Abstract</strong></div>
-
<div class="jobs_item" style="display: none;">
+
-
<div class="jobs_trigger"><strong>Fetch Database Abstract</strong></div>
+
<div class="jobs_item" style="display: none;"><p class="bodytext"></p><p align="justify">To simulate and analyze a genetic regulatory network (GRN), we need to build an objects' array to store the complete information of each gene. It contains regulation relationships between genes, sequences of genes, sequences of promoters and so on. However, it's hard to find an appropriate database online containing all information we need in a simple file. RegulonDB has downloadable files about the regulation between transcription factors (TF) and genes. Files about genetic information, transcription unit information and promoter information can also be downloaded from the RegulonDB. All those files have been put into file “source data” in the root directory of our software. They contain all information the simulation needs and we use fetching module to achieve data extraction and integration. There are four steps: fetch regulation relationships, fetch gene information, fetch promoter information and integrate information above.
<div class="jobs_item" style="display: none;"><p class="bodytext"></p><p align="justify">To simulate and analyze a genetic regulatory network (GRN), we need to build an objects' array to store the complete information of each gene. It contains regulation relationships between genes, sequences of genes, sequences of promoters and so on. However, it's hard to find an appropriate database online containing all information we need in a simple file. RegulonDB has downloadable files about the regulation between transcription factors (TF) and genes. Files about genetic information, transcription unit information and promoter information can also be downloaded from the RegulonDB. All those files have been put into file “source data” in the root directory of our software. They contain all information the simulation needs and we use fetching module to achieve data extraction and integration. There are four steps: fetch regulation relationships, fetch gene information, fetch promoter information and integrate information above.
-
</p>
 
-
                </div><p class="bodytext"></p><p align="justify">To simulate and analyze a genetic regulatory network (GRN), we need to build an objects' array to store the complete information of each gene. It contains regulation relationships between genes, sequences of genes, sequences of promoters and so on. However, it's hard to find an appropriate database online containing all information we need in a simple file. RegulonDB has downloadable files about the regulation between transcription factors (TF) and genes. Files about genetic information, transcription unit information and promoter information can also be downloaded from the RegulonDB. All those files have been put into file “source data” in the root directory of our software. They contain all information the simulation needs and we use fetching module to achieve data extraction and integration. There are four steps: fetch regulation relationships, fetch gene information, fetch promoter information and integrate information above.
 
</p>
</p>
                 </div>
                 </div>
Line 81: Line 77:
   <div id="jobs_container">
   <div id="jobs_container">
        <div class="jobs_trigger"><strong>Fetch Regulation</strong></div>
        <div class="jobs_trigger"><strong>Fetch Regulation</strong></div>
-
<div class="jobs_item" style="display: none;"><p class="bodytext"></p><p align="justify">In GRN, there are two kinds of files: <a class="content" href="http://regulondb.ccg.unam.mx/menu/download/datasets/files/network_tf_tf.txt">TF to TF</a> and <a class="content" href="http://regulondb.ccg.unam.mx/menu/download/datasets/files/network_tf_gene.txt">TF to Gene</a>. Since the database about the regulation between TFs and Genes contains only one-way interaction, the matrix of GRN is a rectangle.</br></br>
+
<div class="jobs_item" style="display: none;"><p class="bodytext"></p><p align="justify">In GRN, there are two kinds of files: <a class="content" href="http://regulondb.ccg.unam.mx/menu/download/datasets/files/network_tf_tf.txt"> TF to TF</a> and <a class="content" href="http://regulondb.ccg.unam.mx/menu/download/datasets/files/network_tf_gene.txt">TF to Gene</a>. Since the database about the regulation between TFs and Genes contains only one-way interaction, the matrix of GRN is a rectangle.</br></br>
First of all, read the regulation relationship of TFs. Our software filters the documentation of RegulonDB on the head of all files and then reads the name of regulate and regulated TF, which is also the name of its genes, one by one. In the same time, our software numerates the genes and stores their names into an objects' array of genetic data. </br></br>
First of all, read the regulation relationship of TFs. Our software filters the documentation of RegulonDB on the head of all files and then reads the name of regulate and regulated TF, which is also the name of its genes, one by one. In the same time, our software numerates the genes and stores their names into an objects' array of genetic data. </br></br>
&nbsp;&nbsp;The format of regulation database:</br>
&nbsp;&nbsp;The format of regulation database:</br>
&nbsp;&nbsp;&nbsp;&nbsp;TF_name &nbsp;&nbsp;&nbsp;TF_name &nbsp;&nbsp;&nbsp;+/-/+-</br></br>
&nbsp;&nbsp;&nbsp;&nbsp;TF_name &nbsp;&nbsp;&nbsp;TF_name &nbsp;&nbsp;&nbsp;+/-/+-</br></br>
-
 
+
<img src="https://static.igem.org/mediawiki/2013/6/69/USTC_Software_TT.jpg"/>
The regulation of TFs has been put into a square matrix whose row is the regulator and column is the one regulated by. To make our GRN as complete as possible, the regulation between TF and genes has joined into the matrix. The one-way interaction results that we must read the TF in order to fulfill the regulator before completing the TF to gene's regulation in the same way of TF to TF. </br></br>
The regulation of TFs has been put into a square matrix whose row is the regulator and column is the one regulated by. To make our GRN as complete as possible, the regulation between TF and genes has joined into the matrix. The one-way interaction results that we must read the TF in order to fulfill the regulator before completing the TF to gene's regulation in the same way of TF to TF. </br></br>
&nbsp;&nbsp;The format of regulation database:</br>
&nbsp;&nbsp;The format of regulation database:</br>
&nbsp;&nbsp;&nbsp;&nbsp;TF_name &nbsp;&nbsp;&nbsp;Gene_name &nbsp;&nbsp;&nbsp;+/-/+-</br></br>
&nbsp;&nbsp;&nbsp;&nbsp;TF_name &nbsp;&nbsp;&nbsp;Gene_name &nbsp;&nbsp;&nbsp;+/-/+-</br></br>
-
 
+
<img src="https://static.igem.org/mediawiki/2013/4/47/USTC_Software_TG.jpg"/>
At last, a regulatory matrix whose row represents regulate gene (TF) and whose column represents gene regulated by (TF+Gene) has been output into a file called “old_GRN” in root directory. The values in GRN matrix are regulations in which “1” means positive activation, “-1” means repression and “0” means no relationship. There have been some regulations both positive and negative identified regulations are determined by the experimental environment. As a result, our software picks out those uncertain genes and stores them into a file named “uncertain_database”.</br></br>
At last, a regulatory matrix whose row represents regulate gene (TF) and whose column represents gene regulated by (TF+Gene) has been output into a file called “old_GRN” in root directory. The values in GRN matrix are regulations in which “1” means positive activation, “-1” means repression and “0” means no relationship. There have been some regulations both positive and negative identified regulations are determined by the experimental environment. As a result, our software picks out those uncertain genes and stores them into a file named “uncertain_database”.</br></br>
&nbsp;&nbsp;The format of uncertain database:</br>
&nbsp;&nbsp;The format of uncertain database:</br>
Line 103: Line 99:
&nbsp;&nbsp;The format of Gene Info database:</br>
&nbsp;&nbsp;The format of Gene Info database:</br>
&nbsp;&nbsp;&nbsp;&nbsp;ID_assigned_by_RegulonDB &nbsp;&nbsp;&nbsp;Gene_name &nbsp;&nbsp;&nbsp;Left_end_position &nbsp;&nbsp;&nbsp;Right_end_position &nbsp;&nbsp;&nbsp;DNA_strand &nbsp;&nbsp;&nbsp;Product_type &nbsp;&nbsp;&nbsp;&nbsp;Product_name &nbsp;&nbsp;&nbsp;Start_codon_sequence&nbsp;&nbsp;&nbsp;  Stop_codon_sequence &nbsp;&nbsp;&nbsp;Gene_sequence</br></br>
&nbsp;&nbsp;&nbsp;&nbsp;ID_assigned_by_RegulonDB &nbsp;&nbsp;&nbsp;Gene_name &nbsp;&nbsp;&nbsp;Left_end_position &nbsp;&nbsp;&nbsp;Right_end_position &nbsp;&nbsp;&nbsp;DNA_strand &nbsp;&nbsp;&nbsp;Product_type &nbsp;&nbsp;&nbsp;&nbsp;Product_name &nbsp;&nbsp;&nbsp;Start_codon_sequence&nbsp;&nbsp;&nbsp;  Stop_codon_sequence &nbsp;&nbsp;&nbsp;Gene_sequence</br></br>
-
 
+
<img src="https://static.igem.org/mediawiki/2013/4/45/USTC_Software_GI.jpg"/>
The label of the map vector is gene name which will be picked out based on the names read in regulation matrix before. It is really fast using the binary tree method to find the specific genetic information and store them into a specific object. Those information includes gene ID, left position, right position, gene description and gene sequence. The gene ID is used to link to RegulonDB's gene details; The left position is used to find its specific transcription unit; The right position is used to figure out the base amount; The description of genes is used to distinguish the RNA and protein; The sequence is used to predict the regulation by alignment.
The label of the map vector is gene name which will be picked out based on the names read in regulation matrix before. It is really fast using the binary tree method to find the specific genetic information and store them into a specific object. Those information includes gene ID, left position, right position, gene description and gene sequence. The gene ID is used to link to RegulonDB's gene details; The left position is used to find its specific transcription unit; The right position is used to figure out the base amount; The description of genes is used to distinguish the RNA and protein; The sequence is used to predict the regulation by alignment.
Line 115: Line 111:
&nbsp;&nbsp;The format of TU info database:</br>
&nbsp;&nbsp;The format of TU info database:</br>
&nbsp;&nbsp;&nbsp;&nbsp;Operon_name &nbsp;&nbsp;&nbsp;Unit_name &nbsp;&nbsp;&nbsp;promoter_name &nbsp;&nbsp;&nbsp;Transcription_start_site ......</br></br>
&nbsp;&nbsp;&nbsp;&nbsp;Operon_name &nbsp;&nbsp;&nbsp;Unit_name &nbsp;&nbsp;&nbsp;promoter_name &nbsp;&nbsp;&nbsp;Transcription_start_site ......</br></br>
-
 
+
<img src="https://static.igem.org/mediawiki/2013/1/1e/USTC_Software_TI.jpg"/>
The principle of fetching information of promoters is same as fetching genes's. Our software stores the promoter information from the file named “promoter_info” in a “map” which could be used to pick out the promoter sequence by searching promoter name through binary tree method.</br></br>
The principle of fetching information of promoters is same as fetching genes's. Our software stores the promoter information from the file named “promoter_info” in a “map” which could be used to pick out the promoter sequence by searching promoter name through binary tree method.</br></br>
&nbsp;&nbsp;The format of Promoter Info database:</br>
&nbsp;&nbsp;The format of Promoter Info database:</br>
&nbsp;&nbsp;&nbsp;&nbsp;Promoter_ID_assigned_by_RegulonDB &nbsp;&nbsp;&nbsp;Promoter_name</br></br>
&nbsp;&nbsp;&nbsp;&nbsp;Promoter_ID_assigned_by_RegulonDB &nbsp;&nbsp;&nbsp;Promoter_name</br></br>
-
 
+
<img src="https://static.igem.org/mediawiki/2013/8/8a/USTC_Software_PI.jpg"/>
The sequence of promoter will be used in the alignment method in next module which could make a prediction of exogenous genes' regulation pattern.
The sequence of promoter will be used in the alignment method in next module which could make a prediction of exogenous genes' regulation pattern.
</p>          </div>   
</p>          </div>   
Line 219: Line 215:
        <div class="jobs_item" style="display: none;"><p align="justify">The GRN matrix is the mathematical description of gene regulatory network in which “1” represents “enhance”, “-1” represents “repress” and “0” represents “no regulatory relationship”. The units(RU) in x-axis regulate the units in y-axis. A row can be seen as a vector containing all the information of the target(corresponding unit in the y-axis). Similarly, a column can be seen as a vector containing all the information of the regulator(corresponding unit in the x-axis).</p>           
        <div class="jobs_item" style="display: none;"><p align="justify">The GRN matrix is the mathematical description of gene regulatory network in which “1” represents “enhance”, “-1” represents “repress” and “0” represents “no regulatory relationship”. The units(RU) in x-axis regulate the units in y-axis. A row can be seen as a vector containing all the information of the target(corresponding unit in the y-axis). Similarly, a column can be seen as a vector containing all the information of the regulator(corresponding unit in the x-axis).</p>           
                 </div>
                 </div>
-
        <div class="jobs_item" style="display: none;"><p align="justify">The sequence similarity is obtained by sequence alignment based on Needleman-Wunsch algorithm[FIXME: wiki link here]. The Needleman-Wunsch algorithm performs a global alignment on two protein sequences or nucleotide sequences. It was the first application of dynamic programming to biological sequence comparison.
+
        <div class="jobs_item" style="display: none;"><p align="justify">The sequence similarity is obtained by sequence alignment based on Needleman-Wunsch algorithm[FIXME: wiki link here]. The Needleman-Wunsch algorithm performs a global alignment on two protein sequences or nucleotide sequences. It was the first application of dynamic programming to biological sequence comparison.<br /><br />
-
When dynamic programming is applicable, the method takes far less time than naive methods. Using a naive method, many of the subproblems are generated and sovled many times. The dynamic programming approach seeks to solve each subproblem only once. Once the solution to a given subproblem has been computed, it is stored to be looked up next time.
+
When dynamic programming is applicable, the method takes far less time than naive methods. Using a naive method, many of the subproblems are generated and sovled many times. The dynamic programming approach seeks to solve each subproblem only once. Once the solution to a given subproblem has been computed, it is stored to be looked up next time.<br /><br />
-
[Pic. 5 Dynamic programming and naive method]
+
Like the Needleman-Wunsch algorithm, of which it is a variation, Smith-Waterman is also a dynamic programming algorithm. But it is a local sequence alignment algorithm. The famous BLAST(Basic Local Alignment Search Tool) is improved from Smith-Waterman algorithm. Although local algorithm has the desirable property that it is guaranteed to find the optimal local alignment, we decided to choose the global one because we regarded the segment sequence as a unit.<br /><br />
-
Like the Needleman-Wunsch algorithm, of which it is a variation, Smith-Waterman is also a dynamic programming algorithm. But it is a local sequence alignment algorithm. The famous BLAST(Basic Local Alignment Search Tool) is improved from Smith-Waterman algorithm. Although local algorithm has the desirable property that it is guaranteed to find the optimal local alignment, we decided to choose the global one because we regarded the segment sequence as a unit.
+
Sequences are aligned with different detailed methods in different situations. In the regulated side, what we care about is the DNA sequence. In the regulating side, it is the amino acid sequence. When it comes to predict the regulated behavior, we use a DNA substitution matrix to align promoter and protein coding sequences. In the prediction of regulating behavior, the substitution matrix BLOSUM_50 is used to align the amino acid sequences translated from protein coding sequences.<br /><br />
-
 
+
-
Sequences are aligned with different detailed methods in different situations. In the regulated side, what we care about is the DNA sequence. In the regulating side, it is the amino acid sequence. When it comes to predict the regulated behavior, we use a DNA substitution matrix to align promoter and protein coding sequences. In the prediction of regulating behavior, the substitution matrix BLOSUM_50 is used to align the amino acid sequences translated from protein coding sequences.
+
The promoter similarities of the query unit and subject units are stored in a vector. The protein coding similarities are stored in another vector. These vectors are prepared to be used in the new network construction.
The promoter similarities of the query unit and subject units are stored in a vector. The protein coding similarities are stored in another vector. These vectors are prepared to be used in the new network construction.
Line 233: Line 227:
         </div>     
         </div>     
-
     </div><!--jobs container-->
+
     </div>
-
<div class="jobs_trigger"><strong>Needleman-Wunsch Algorithm</strong></div>
 
-
<div class="jobs_item" style="display: none;"><p align="justify">The Needleman-Wunsch algorithm was first published in1970 by Saul B. Needleman and Christian D. Wunsch. It performs a global alignment of two sequences and is mostly used in bioinformatics to align protein or nucleotide sequence. Our software applied this algorithm in the alignment of DNA and amino acid sequences.<br/><br/>
 
-
The Needleman-Wunsch algorithm is one kind of dynamic programming and It was the first attempt in biological sequence comparison of dynamic programming.<br/><br/>
+
<h2>Forward Analysis</h2>
 +
<div class="jobs_trigger"><strong>Construct New GRN</strong></div>
 +
  <div class="jobs_item" style="display: none;">
 +
    <h3>User Input</h3>
 +
    <p align="justify">
 +
      Some genes' regulation could be get from experiment. So, if users could get the unknow regulation between new gene and old ones, they could manually set the interactions which do not need model. Those regulations will be used in later simulation.
 +
    </p>
 +
    <h3>Simalarity Analysis</h3>
 +
    <p align="justify"><b>1.Sequence</b></br>
 +
      <h4>Needleman-Wunsch Algorithm</h4>
 +
      The Needleman-Wunsch algorithm was first published in1970 by Saul B. Needleman and Christian D. Wunsch. It performs a global alignment of two sequences and is mostly used in bioinformatics to align protein or nucleotide sequence. Our software applied this algorithm in the alignment of DNA and amino acid sequences.<br/><br/>
-
Here is an example of Needleman-Wunsch algorithm. S(a,b) is the similarity of character a and character b. The scores of characters are shown in the similarity matrix. We assume this matrix was
+
      The Needleman-Wunsch algorithm is one kind of dynamic programming and It was the first attempt in biological sequence comparison of dynamic programming.<br/><br/>
-
</p>
+
 
 +
      Here is an example of Needleman-Wunsch algorithm. S(a,b) is the similarity of character a and character b. The scores of characters are shown in the similarity matrix. We assume this matrix was
 +
    </p>
       <div align="center"><img src="https://static.igem.org/mediawiki/2013/5/52/USTC_Software_DNA_S_M.png"/></div>           
       <div align="center"><img src="https://static.igem.org/mediawiki/2013/5/52/USTC_Software_DNA_S_M.png"/></div>           
       <p>And we uses linear gap penalty, denoted by d, here, we set the gap penalty as -5.Then the alignment:</p>
       <p>And we uses linear gap penalty, denoted by d, here, we set the gap penalty as -5.Then the alignment:</p>
Line 249: Line 253:
       </em></strong></p>
       </em></strong></p>
-
<p>would have the following score:</p>
+
      <p>would have the following score:</p>
-
<p align="center"><strong><em>
+
      <p align="center"><strong><em>
-
  S(A,C)+S(G,C)+S(A,A)+(3)+S(G,G)+S(T,A)+S(T,C)+S(A,G)+S(C,T) = -3+7+10-(3x5)+7+(-4)+0+(-1)+0 = 1
+
      S(A,C)+S(G,C)+S(A,A)+(3)+S(G,G)+S(T,A)+S(T,C)+S(A,G)+S(C,T) = -3+7+10-(3x5)+7+(-4)+0+(-1)+0 = 1
-
    </em></strong></p>
+
      </em></strong></p>
-
    <p align="justify">To find the highest score of alignment, in this algorithm, a two dimensional matrix F with sequences and scores was allocated. The score in row i, column j is denoted by Fij. There is one column for each character in sequence A and one row for each character in sequence B. Therefore, if we align sequences with sizes of n and m, the amount of memory taken up here is O(n,m).<br/><br/>
+
      <p align="justify">To find the highest score of alignment, in this algorithm, a two dimensional matrix F with sequences and scores was allocated. The score in row i, column j is denoted by Fij. There is one column for each character in sequence A and one row for each character in sequence B. Therefore, if we align sequences with sizes of n and m, the amount of memory taken up here is O(n,m).<br/><br/>
-
As the algorithm going on, Fij was calculated to be the optimal score by the principle as following:<br/>
+
      As the algorithm going on, Fij was calculated to be the optimal score by the principle as following:<br/>
-
Basis:
+
      Basis:
-
</p>
+
      </p>
-
    <p align="center"><strong><em>Fi0 = d*i<br/>F0j = d*j</em></strong></p>
+
      <p align="center"><strong><em>Fi0 = d*i<br/>F0j = d*j</em></strong></p>
-
    <p>Recursion:</p>
+
      <p>Recursion:</p>
-
    <p align="center"><strong><em>Fij = max(F(i-1,j-1) + S(Ai,Bj), F(i-1,j) + d, F(i,j-1) + d)</em></strong></p><br/>
+
      <p align="center"><strong><em>Fij = max(F(i-1,j-1) + S(Ai,Bj), F(i-1,j) + d, F(i,j-1) + d)</em></strong></p><br/>
-
    <p>The pseudo-code of this algorithm would look like this:</p>
+
      <p>The pseudo-code of this algorithm would look like this:</p>
-
<br/>
+
      <br/>
-
<div id="pseudo"><p>
+
      <div id="pseudo"><p>
-
<strong> for</strong> i = 0 <strong>to length(A)</strong><br/>
+
      <strong> for</strong> i = 0 <strong>to length(A)</strong><br/>
-
  &nbsp;  F(i,0) <-- d*i<br/>
+
      &nbsp;  F(i,0) <-- d*i<br/>
-
<strong> for</strong> j = 0 <strong>to length(B)</strong><br/>
+
      <strong> for</strong> j = 0 <strong>to length(B)</strong><br/>
-
  &nbsp;  F(0,j) <-- d*j<br/>
+
      &nbsp;  F(0,j) <-- d*j<br/>
-
        <strong>for</strong> i = 0 <strong>to length(A)</strong><br/>
+
      <strong>for</strong> i = 0 <strong>to length(A)</strong><br/>
-
  &nbsp;  <strong>for</strong> j = 0 <strong>to length(B)</strong><br/>
+
      &nbsp;  <strong>for</strong> j = 0 <strong>to length(B)</strong><br/>
-
  &nbsp; {<br/>
+
      &nbsp; {<br/>
-
  &nbsp; &nbsp;  Match  <-- F(i-1,j-1) + S(Ai,Bj)<br/>
+
      &nbsp; &nbsp;  Match  <-- F(i-1,j-1) + S(Ai,Bj)<br/>
-
  &nbsp; &nbsp;  Delete <-- F(i-1,j) + d<br/>
+
      &nbsp; &nbsp;  Delete <-- F(i-1,j) + d<br/>
-
  &nbsp; &nbsp;  Insert <-- F(i,j-1) + d<br/>
+
      &nbsp; &nbsp;  Insert <-- F(i,j-1) + d<br/>
-
  &nbsp; &nbsp;  F(i,j) <-- <strong>max</strong>(Match, Insert, Delete)<br/>
+
      &nbsp; &nbsp;  F(i,j) <-- <strong>max</strong>(Match, Insert, Delete)<br/>
-
  &nbsp; }
+
      &nbsp; }
-
</p>
+
      </p>
-
</div>
+
      </div>
-
    <p align="justify">After the matrix F was computed, Fnm would be the maximum score among all possible alignment.<br/><br/>
+
      <p align="justify">After the matrix F was computed, Fnm would be the maximum score among all possible alignment.<br/><br/>
-
If you want to see the optimal alignment, you can trace back from Fnm by comparing three possible sources mentioned in the above code (Match, Insert and Delete). If Match, then Aj and Bi are aligned, if Insert, Bi was aligned with a gap and if Delete, then Aj and a gap are aligned. Also, you may find there are not only one optimal alignment.<br/><br/>
+
      If you want to see the optimal alignment, you can trace back from Fnm by comparing three possible sources mentioned in the above code (Match, Insert and Delete). If Match, then Aj and Bi are aligned, if Insert, Bi was aligned with a gap and if Delete, then Aj and a gap are aligned. Also, you may find there are not only one optimal alignment.<br/><br/>
-
As for the example, we would get the following matrix by applying Needleman Wunsch algorithm:</p>
+
      As for the example, we would get the following matrix by applying Needleman Wunsch algorithm:</p>
-
<div align="center"><img src="https://static.igem.org/mediawiki/2013/e/e2/USTC_Software_arrow_game.png"/></div>
+
      <div align="center"><img src="https://static.igem.org/mediawiki/2013/e/e2/USTC_Software_arrow_game.png"/></div>
-
<p>And the optimal alignment would be:</p>
+
      <p>And the optimal alignment would be:</p>
-
<p align="center"><strong><em>- - AGACTAGTTAC <br/>
+
      <p align="center"><strong><em>- - AGACTAGTTAC <br/>
                               CGAGAC - - GT - - -
                               CGAGAC - - GT - - -
-
</em></strong></p>
+
      </em></strong></p>
 +
      <h4>A Supplementary Game</h4>
 +
      <p align="justify">The rows and columns in the GRN matrix can be regarded as vectors containing the regulated or the regulating information. The behavior similarity of two units can be described by the dot product of two regulated vectors or two regulating vectors. Biologists usually think the more similar two sequences are, the more likely they have similar behaviors. Whether the ratio of genes with similar behaviors is positively correlated with gene similarity is essential to our project. So we obtained 1.6 million sets of data by pairwise alignment of all the 1748 units in the GRN of K-12. Each set of data consists of gene similarity and behavior similarity. The result is analyzed and plotted in the figure. The linear fit shows that the ratio is positively correlated with the similarity.</p><br/>
-
 
+
       <div align="center"><img style="width:600px; height:auto;"src="https://static.igem.org/mediawiki/2013/d/d0/USTC_Software_Simi-Ratio.png" />
-
       </div>
+
      <p><strong>Figure 4.</strong>Linear fit of ratio-similarity relationship.</p></div>
-
 
+
-
 
+
-
<div class="jobs_trigger"><strong>A Supplementary Game</strong></div>
+
-
<div class="jobs_item" style="display: none;"><p align="justify">The rows and columns in the GRN matrix can be regarded as vectors containing the regulated or the regulating information. The behavior similarity of two units can be described by the dot product of two regulated vectors or two regulating vectors. Biologists usually think the more similar two sequences are, the more likely they have similar behaviors. Whether the ratio of genes with similar behaviors is positively correlated with gene similarity is essential to our project. So we obtained 1.6 million sets of data by pairwise alignment of all the 1748 units in the GRN of K-12. Each set of data consists of gene similarity and behavior similarity. The result is analyzed and plotted in the figure. The linear fit shows that the ratio is positively correlated with the similarity.</p><br/>
+
-
 
+
-
                <div align="center"><img style="width:600px; height:auto;"src="https://static.igem.org/mediawiki/2013/d/d0/USTC_Software_Simi-Ratio.png
+
-
" />
+
-
                <p><strong>Figure 4.</strong>Linear fit of ratio-similarity relationship.</p></div>
+
       <p align="justify">Although there are examples that a slight change in DNA sequence will significantly change the property of the gene, for example, sickle-cell disease, the influence is usually determined by the location and scale of the mutation. So the result is still convincing to some degree.</p>
       <p align="justify">Although there are examples that a slight change in DNA sequence will significantly change the property of the gene, for example, sickle-cell disease, the influence is usually determined by the location and scale of the mutation. So the result is still convincing to some degree.</p>
-
      </div>
+
  <p>
-
 
+
    <b>2.Filtering</b></p>
-
 
+
    <h4>Random Noise</h4>
-
 
+
    <p class="bodytext"></p><p align="justify">Normally, the similarity of two sequences will not be zero. Some computational
-
 
+
-
</div>
+
-
 
+
-
<div id="New_Network_Construction">
+
-
 
+
-
<h2>New Network Construction</h2>
+
-
 
+
-
 
+
-
  <div id="jobs_container">
+
-
        <div class="jobs_trigger"><strong>Random Noise</strong></div>
+
-
<div class="jobs_item" style="display: none;"><p class="bodytext"></p><p align="justify">Normally, the similarity of two sequences will not be zero. Some computational
+
experiments were carried out to study the random sequence similarities. We randomly
experiments were carried out to study the random sequence similarities. We randomly
chose a gene in the network and generated 1000 random sequences. The alignment result
chose a gene in the network and generated 1000 random sequences. The alignment result
Line 326: Line 313:
<img src="https://static.igem.org/mediawiki/igem.org/8/89/USTC_Software_Figure_4.png" />
<img src="https://static.igem.org/mediawiki/igem.org/8/89/USTC_Software_Figure_4.png" />
<p><strong>Figure 5.</strong> Random similarity distribution</p></div>
<p><strong>Figure 5.</strong> Random similarity distribution</p></div>
-
 
+
    <h4>Filter</h4>
-
 
+
    <p align="justify">We need the genes highly similar to the exogenous one to interact with it. The program will
-
                </div>
+
-
+
-
<div class="jobs_trigger"><strong>Filter</strong></div>
+
-
<div class="jobs_item" style="display: none;"><p align="justify">We need the genes highly similar to the exogenous one to interact with it. The program will
+
align the exogenous gene(query) with genes in the network(subject) and get the original
align the exogenous gene(query) with genes in the network(subject) and get the original
similarities. In order to filter meaningless low values, a certain amount of random
similarities. In order to filter meaningless low values, a certain amount of random
Line 344: Line 327:
statistical significance.<br /><br />
statistical significance.<br /><br />
An example about filtring and consistency is presented in “Example”.
An example about filtring and consistency is presented in “Example”.
-
</p>        
+
</p>
-
                </div>  
+
    <p><b>3.Regulation Calculation</b></p>
-
               
+
    <p align="justify">If there is a three-unit network and they interact with each other as it is shown in the figure.
-
               
+
-
               
+
-
<div class="jobs_trigger"> <strong>Construct new GRN</strong></div>
+
-
<div class="jobs_item" style="display: block;"><p align="justify">If there is a three-unit network and they interact with each other as it is shown in the figure.
+
The regulation is described by the GRN matrix.</p>
The regulation is described by the GRN matrix.</p>
<div align="center"><img src="https://static.igem.org/mediawiki/igem.org/8/8a/USTC_Software_Figure_5.png" />
<div align="center"><img src="https://static.igem.org/mediawiki/igem.org/8/8a/USTC_Software_Figure_5.png" />
Line 383: Line 362:
<div align="center">
<div align="center">
<img src="https://static.igem.org/mediawiki/igem.org/c/c5/USTC_Software_Figure_7.png" />
<img src="https://static.igem.org/mediawiki/igem.org/c/c5/USTC_Software_Figure_7.png" />
-
<p><strong>Figure 8.</strong> Construct New GRN</p></div>
 
-
 
-
      </div>
 
-
  </div><!--jobs container-->
 
</div>
</div>
-
 
+
<p><strong>Figure 8.</strong> Construct New GRN</p>
-
 
+
    <h3>Clustering</h3>
-
<div id="Network_Model">
+
    <p>
-
<h2>Network Model</h2>
+
      Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics.</br></br>
-
 
+
      For get a better regulation, we use online database DAVID to cluster all the genes in our whole GRN. Avoid of supersoftless, we hope to create an online communication with DAVID. After getting the cluster of our genes, we multiply the genes simalarity with a factor if they are in the same cluster.</br></br>
-
<div id="jobs_container">
+
      Though the source code of this part has already done, we lack the experiment information to set a propriate factor. All source code were pushed up to our github.
-
 
+
    </p>
-
               
+
  </div>
-
+
<div class="jobs_trigger"><strong>Network Model</strong></div>
-
<div class="jobs_trigger"><strong>Network Model Abstract</strong></div>
+
  <div class="jobs_item" style="display: none;">
-
<div class="jobs_item" style="display: block;"><p align="justify">Network analysis includes finding stable condition of network, adding new gene, finding new stable condition and changes from original condition to new condition. We use densities of materials to describe network condition. If all material densities are time-invariant, we can say the network condition is stable.</p>
+
<p align="justify">Network analysis includes finding stable condition of network, adding new gene, finding new stable condition and changes from original condition to new condition. We use densities of materials to describe network condition. If all material densities are time-invariant, we can say the network condition is stable.</p>
-
      </div>
+
<p class="bodytext"></p><p align="justify">Regulation relationship in genetic network includes positive regulation, negative regulation, positive-or-negative regulation and no regulation. We store regulation relationship in matrix R. Rji means the unit in line j and row i. For the material of original network, Rji=1 means material i enhance material j, Rji=-1 means material i repress material j, Rji=0 means material i has no influence on material j, Rji=2 means material i enhance or repress material j. For the new material, Rji ranges from -1 to 1. Rji<0 means the possibility of positive regulation is Rji; Rji>0 means the possibility of negative regulation is –Rji; Rji=0 means there is no regulation from i to j.
-
 
+
-
 
+
-
        <div class="jobs_trigger"><strong>Hill Equations</strong></div>
+
-
<div class="jobs_item" style="display: none;"><p class="bodytext"></p><p align="justify">Regulation relationship in genetic network includes positive regulation, negative regulation, positive-or-negative regulation and no regulation. We store regulation relationship in matrix R. Rji means the unit in line j and row i. For the material of original network, Rji=1 means material i enhance material j, Rji=-1 means material i repress material j, Rji=0 means material i has no influence on material j, Rji=2 means material i enhance or repress material j. For the new material, Rji ranges from -1 to 1. Rji<0 means the possibility of positive regulation is Rji; Rji>0 means the possibility of negative regulation is –Rji; Rji=0 means there is no regulation from i to j.
+
We use Hill equations to describe intensity of regulation. Equations are like following:
We use Hill equations to describe intensity of regulation. Equations are like following:
<br/></br>
<br/></br>
<img src="https://static.igem.org/mediawiki/2013/e/e0/USTC_Software_1.png" style="width:600px;"/>
<img src="https://static.igem.org/mediawiki/2013/e/e0/USTC_Software_1.png" style="width:600px;"/>
<br/></br>
<br/></br>
-
The left side of the equation is the derivative x(density) on t(time).”qi”,”pi”,”ri”,”mi”,”ni” are parameters, which determine the intensity of regulation."ri" is degradation rate. Mji is exponent. M is a matrix whose dimensions are equivalent to R's. Mji is 0 or ranges from 0.5 to 1.2 or ranges from -1.2 to 0.5. For the material of original network, if Rji=1,Mji ranges from 0.5 to 1.2;if Rji=-1, Mji ranges from -1.2 to -0.5; if Rji=2;Mji ranges from -1.2 to 0.5 or 0.5 to 1. These Mjis' absolute values are given randomly by program. If Rji=0, Mji=0.
+
The left side of the equation is the derivative x(density) on t(time).”qi”,”pi”,”ri”,”mi”,”ni” are parameters, which determine the intensity of regulation."ri" is degradation rate. Mji is exponent. M is a matrix whose dimensions are equivalent to R's. Mji is 0 or ranges from 0.5 to 1.2 or ranges from -1.2 to 0.5. For the material of original network, if Rji=1,Mji ranges from 0.5 to 1.2;if Rji=-1, Mji ranges from -1.2 to -0.5; if Rji=2;Mji ranges from -1.2 to 0.5 or 0.5 to 1. These Mjis' absolute values are given randomly by program. If Rji=0, Mji=0.  
</br>For the new material,
</br>For the new material,
<br/></br>
<br/></br>
Line 415: Line 386:
</p>
</p>
-
                </div>
+
<p align="justify">
-
+
-
<div class="jobs_trigger"><strong>Find Stable Network Condition</strong></div>
+
-
<div class="jobs_item" style="display: none;"><p align="justify">
+
Stable condition is the condition in which densities are time-invariant. We store material densities in a vector and solve the differential equations with Euler's formula, which is like below
Stable condition is the condition in which densities are time-invariant. We store material densities in a vector and solve the differential equations with Euler's formula, which is like below
<br/></br>
<br/></br>
Line 426: Line 394:
</p>
</p>
-
                </div>
+
 
-
+
 
-
               
+
 
-
               
+
  </div>
-
            <div class="jobs_trigger"> <strong>Find Changes From Original Stable Condition to New Condition</strong></div>
+
<div class="jobs_trigger"><strong>Evaluate Network</strong></div>
-
        <div class="jobs_item" style="display: none;"><p align="justify">Record the original stable condition, set new material density to 0 and this is the new initial density vector. Solve new equations and record density vectors before the new condition is stable and store these data in a text file.</br></br>
+
  <div class="jobs_item" style="display: none;">
 +
<p align="justify">Record the original stable condition, set new material density to 0 and this is the new initial density vector. Solve new equations and record density vectors before the new condition is stable and store these data in a text file.</br></br>
To evaluate the new network, we introduce the grading system.
To evaluate the new network, we introduce the grading system.
Line 442: Line 411:
We did a lot of running and found that the “AbsValue” ranges from 0 to 370, so "ScoreA" ranges from 0 to 4.9.We get the integer part and store it in an array, which has five sections. Generate 100 or 200 matrix M from matrix R and run the original and new network for each M, so we can get 100 or 200 of "ScoreA"s. The section which has maximum "ScoreA"s is the eventual score.
We did a lot of running and found that the “AbsValue” ranges from 0 to 370, so "ScoreA" ranges from 0 to 4.9.We get the integer part and store it in an array, which has five sections. Generate 100 or 200 matrix M from matrix R and run the original and new network for each M, so we can get 100 or 200 of "ScoreA"s. The section which has maximum "ScoreA"s is the eventual score.
</p>           
</p>           
-
            </div>  
+
  </div>
-
 
-
 
-
 
-
    </div><!--jobs container-->
 
 +
<h2>Reverse Analysis</h2>
 +
<div class="jobs_trigger"><strong>Virtual Gene</strong></div>
 +
  <div class="jobs_item" style="display: none;">
 +
<p align="justify">Before reverse analysis, we use the same idea about constructing a new GRN. So we create a virtual gene which replace the gene what users want to get. In calculation, it means that we add a row and a column to the matrix of GRN.</p>
-
</div>
 
-
<div id="Predict">
+
  </div>
-
<h2>Predict</h2>
+
<div class="jobs_trigger"><strong>Expression Range</strong></div>
 +
  <div class="jobs_item" style="display: none;">
 +
<p align="justify">Before prediction, the expression of specific genes which the experimenter needs should be input into our software as well as the improvement or depression. The number of target gene is SIX at most.</br></br>
 +
It is a must that figuring out the strongest and weakest expression strength before inputting the extreme cases into the target expression. The way to find out the strongest and weakest expression is modeling the GRN's steady state by a large amount of random regulation from -1 and 1. We ran it for 1000 times to get the range of gene expression. On the other hand, the expression of genes unpicked by the users should be stable as much as possible. The initial strength of expression is calculated by modeling the original GRN with Hill's equation.
 +
</p>
-
<div id="jobs_container">
+
  </div>
-
 
+
<div class="jobs_trigger"><strong>Particle Swarm Optimaztion</strong></div>
-
               
+
  <div class="jobs_item" style="display: none;">
-
+
<p align="justify">
-
<div class="jobs_trigger"><strong>Predict Abstract</strong></div>
+
-
<div class="jobs_item" style="display: block;"><p align="justify">In some cases, importing exogenous gene is for enhancing or suppressing the expression of some specific genes in engineered bacteria itself. But it is hard to choose an appropriate regulatory gene. Our software analyzes the GRN forward as well as simulates by optimization algorithm backward for giving a reference of choosing to the users. Our software not only focused on the direct regulation but also focused on the global GRN. In the same time, controlling the expression of multiple genes in network has been realized by global prediction. What's more, Particle Swarm Optimization (PSO) Algorithm makes it possible.</p>
+
-
      </div>
+
-
 
+
-
 
+
-
        <div class="jobs_trigger"><strong>Input Target</strong></div>
+
-
<div class="jobs_item" style="display: none;"><p class="bodytext"></p><p align="justify">Before prediction, the expression of specific genes which the experimenter needs should be input into our software as well as the improvement or depression. The number of target gene is SEVEN at most.</br></br>
+
-
It is a must that figuring out the strongest and weakest expression strength before inputting the extreme cases into the target expression. The way to find out the strongest and weakest expression is modeling the GRN's steady state by a large amount of random regulation from -1 and 1. On the other hand, the expression of genes unpicked by the users should be stable as much as possible. The initial strength of expression is calculated by modeling the original GRN with Hill's equation.
+
-
</p>
+
-
                </div>
+
-
+
-
<div class="jobs_trigger"><strong>Particle Swarm Optimization</strong></div>
+
-
<div class="jobs_item" style="display: none;"><p align="justify">
+
For getting the best regulation, our software uses PSO algorithm based on 30 particles to simulate the GRN's changing. First of all, the interactions of regulator and regulated-by have been put into those particles in random so that each particle will have the whole set of regulation. Secondly, the variance between target expressions and stable expression of new GRN have been regarded as the optimize requirements in PSO algorithm. As a result, the minimal variance of 30 particles is the global optimum and the minimal variance of the procession in one particle is the local optimum. Then, taking a step towards global and local optimum as well as considering the inertia and perturbation avoids falling into the sub-optimal condition.</br></br>
For getting the best regulation, our software uses PSO algorithm based on 30 particles to simulate the GRN's changing. First of all, the interactions of regulator and regulated-by have been put into those particles in random so that each particle will have the whole set of regulation. Secondly, the variance between target expressions and stable expression of new GRN have been regarded as the optimize requirements in PSO algorithm. As a result, the minimal variance of 30 particles is the global optimum and the minimal variance of the procession in one particle is the local optimum. Then, taking a step towards global and local optimum as well as considering the inertia and perturbation avoids falling into the sub-optimal condition.</br></br>
At last, when the variance of expression reaches an acceptable range, our software picks out and saves the best global optimum particle following by the movement of those particles stop.</br></br>
At last, when the variance of expression reaches an acceptable range, our software picks out and saves the best global optimum particle following by the movement of those particles stop.</br></br>
We constantly revises the factors in PSO algorithm by machine learning method for accurate simulation with a fast PSO particle-motion equation. At the same time, our software also filter the result of regulatory value which is more intuitive.
We constantly revises the factors in PSO algorithm by machine learning method for accurate simulation with a fast PSO particle-motion equation. At the same time, our software also filter the result of regulatory value which is more intuitive.
</p>
</p>
-
                </div>
+
 
-
+
 
-
               
+
 
-
               
+
  </div>
-
            <div class="jobs_trigger"> <strong>Filter</strong></div>
+
<div class="jobs_trigger"><strong>Locate Optimal Target</strong></div>
-
        <div class="jobs_item" style="display: none;"><p align="justify">To improve the efficiency of choosing a suitable gene after getting a series of regulatory value, our software picks out some obvious regulation. The value of regulation is between -1 to 1 in which -1 means negative effect and 1 means positive effect. As a result, what our software has done is filtering out the absolute value which is lower than 0.9. Because the difference of regulatory intensity lower than 0.1 has very little effect to the stable expression, the final result of regulation is indicated by Boolean value.</br></br>
+
  <div class="jobs_item" style="display: none;">
 +
<p align="justify">To improve the efficiency of choosing a suitable gene after getting a series of regulatory value, our software picks out some obvious regulation. The value of regulation is between -1 to 1 in which -1 means negative effect and 1 means positive effect. As a result, what our software has done is filtering out the absolute value which is lower than 0.9. Because the difference of regulatory intensity lower than 0.1 has very little effect to the stable expression, the final result of regulation is indicated by Boolean value.</br></br>
The format of regulatory prediction in “Result”:</br>
The format of regulatory prediction in “Result”:</br>
Gene_name->Gene_name    regulation(+/-)
Gene_name->Gene_name    regulation(+/-)
</p>           
</p>           
-
            </div> 
 
-
 
-
 
-
 
-
    </div><!--jobs container-->
 
-
 
+
  </div>
-
</div>
+
-
 
+
-
 
+
-
 
+
-
<div id="Database">
+
-
<h2>Database</h2>
+
-
 
+
-
 
+
-
 
+
-
<div id="jobs_container">
+
-
 
+
-
               
+
-
+
-
<div class="jobs_trigger"><strong>TF-TF</strong></div>
+
-
<div class="jobs_item" style="display: block;"><p align="justify">This file contains the regulation between Transcription Factors.</p>
+
-
<img src="https://static.igem.org/mediawiki/2013/6/69/USTC_Software_TT.jpg"/>
+
-
      </div>
+
-
 
+
-
 
+
-
<div class="jobs_trigger"><strong>TF-Gene</strong></div>
+
-
<div class="jobs_item" style="display: block;"><p align="justify">This file contains the regulation between Transcription Factors and Genes</p>
+
-
<img src="https://static.igem.org/mediawiki/2013/4/47/USTC_Software_TG.jpg"/>
+
-
      </div>
+
-
+
-
<div class="jobs_trigger"><strong>Gene Info</strong></div>
+
-
<div class="jobs_item" style="display: block;"><p align="justify">This file contains the information about all genes in E-coli K-12</p>
+
-
<img src="https://static.igem.org/mediawiki/2013/4/45/USTC_Software_GI.jpg"/>
+
-
      </div>
+
-
 
+
-
 
+
-
<div class="jobs_trigger"><strong>Promoter Info</strong></div>
+
-
<div class="jobs_item" style="display: block;"><p align="justify">This file contains the information about all promoters in E-coli K-12</p>
+
-
<img src="https://static.igem.org/mediawiki/2013/8/8a/USTC_Software_PI.jpg"/>
+
-
      </div>
+
-
 
+
-
<div class="jobs_trigger"><strong>TU Info</strong></div>
+
-
<div class="jobs_item" style="display: block;"><p align="justify">This file contains the information about all Transcription Units in E-coli K-12</p>
+
-
<img src="https://static.igem.org/mediawiki/2013/1/1e/USTC_Software_TI.jpg"/>
+
-
      </div>
+
-
 
+
-
 
+
-
      </div>
+
-
 
+
-
 
+
-
<div id="jobs_container" style="display:none;"><div class="jobs_trigger"></div></div>
+
-
 
+
-
 
+
-
</div>
+
-
 
+
</body>
</body>
</html>
</html>

Revision as of 14:13, 27 October 2013

Slide

Take a gNAP before wearing your gloves! Genetic Network Analyze and Predict
The sketch and final GUI of gNAP!
We compare the result of our software with gene expression profile in literature.
We are USTC-Software!


Methodologies

Methodologies

In order to simulate the GRN's working and analyze the changing after exogenous gene imported, some advanced algorithms and classical methods are employed in the software. These algorithms and methods include Binary Tree method, Needle-Wunsch Algorithm, Decision Tree method, Hill Equation and PSO Algorithm.

There are four parts of methodologies: Database, Operon Theory and Regulatory Model, Forward Analysis and Reverse Analysis.

Database

Abstract
Fetch Regulation
Fetch Gene Info
Fetch Promoter Info
Integration

Our software integrates all information we picked out about genes and generates a file named “all_info” —— all information about genes —— for the output graphical interface's reading. In the meanwhile, the array of objects containing all information has been stored in computer memory which greatly improve the computing speed of our software.

  The format of all_info database:
    No.    promoter_sequence    gene_sequence    gene_name    ID    left_position    right_position    promoter_name     description
The fetching module generates three files: old_GRN, all_info and uncertain_database.

Operon Theory and Regulatory Model

Operon Theory
Regulatory Model
Similarity and Homology

Forward Analysis

Construct New GRN
Network Model
Evaluate Network

Reverse Analysis

Virtual Gene
Expression Range
Particle Swarm Optimaztion
Locate Optimal Target