From 2013.igem.org

(Difference between revisions)

Revision as of 00:32, 28 October 2013


Methodologies

To simulate how a GRN behaves and to analyze how it changes after an exogenous gene is introduced, the software employs several classical methods and more advanced algorithms: the binary tree method, the Needleman-Wunsch algorithm, the decision tree method, the Hill equation and the PSO algorithm.

The methodologies fall into four parts: Database, Operon Theory and Regulatory Model, Forward Analysis and Reverse Analysis.

Database

Abstract

Fetch Regulation

Fetch Gene Info

Fetch Promoter Info

Integration

Our software integrates all the gene information it has extracted and writes it to a file named “all_info” (all information about genes), which the graphical output interface reads. Meanwhile, the array of objects holding this information is kept in memory, which greatly improves the computing speed of our software.

The format of all_info database:
No. promoter_sequence gene_sequence gene_name ID left_position right_position promoter_name description
The fetching module generates three files: old_GRN, all_info and uncertain_database.
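As a rough illustration of how one all_info record could be read back into an object, the sketch below follows the column order of the format line above. The tab delimiter, the `GeneRecord` class, the `parse_record` helper and the sample values are assumptions made for this example, not the software's actual code.

```python
from dataclasses import dataclass

# Column order taken from the all_info format line; the names are illustrative.
FIELDS = ["no", "promoter_sequence", "gene_sequence", "gene_name", "gene_id",
          "left_position", "right_position", "promoter_name", "description"]


@dataclass
class GeneRecord:
    no: int
    promoter_sequence: str
    gene_sequence: str
    gene_name: str
    gene_id: str
    left_position: int
    right_position: int
    promoter_name: str
    description: str


def parse_record(line: str, sep: str = "\t") -> GeneRecord:
    """Split one record into its nine fields (the separator is an assumption)."""
    # Limit the number of splits so a free-text description may contain the separator.
    parts = line.rstrip("\n").split(sep, len(FIELDS) - 1)
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    row = dict(zip(FIELDS, parts))
    for key in ("no", "left_position", "right_position"):
        row[key] = int(row[key])
    return GeneRecord(**row)


# Hypothetical record, invented only to exercise the parser.
sample = "1\tTTGACA\tATGACC\tgeneX\tID0001\t100\t200\tgeneXp\texample description"
record = parse_record(sample)
print(record.gene_name, record.left_position, record.right_position)
```

Keeping the parsed records in a list mirrors the in-memory array of objects described above.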

Operon Theory and Regulatory Model

Operon Theory

In genetics, an operon is a functioning unit of genomic DNA containing a cluster of genes under the control of a single regulatory signal or promoter. The genes contained in the operon are either expressed together or not at all. Several genes must be both cotranscribed and co-regulated to define an operon.

The term "operon" was first proposed in 1960, in a paper published in the Proceedings of the French Academy of Sciences. The lac operon of the model bacterium E. coli provides a typical example of operon function. It consists of a promoter, an operator, three structural genes and a terminator. The operon is regulated by several factors, including the availability of glucose and lactose.

From this paper, the so-called general theory of the operon was developed. According to the theory, all genes are controlled by means of operons through a single feedback regulatory mechanism: repression. The first operon to be described was the lac operon in E. coli. The 1965 Nobel Prize in Physiology or Medicine was awarded to François Jacob, André Michel Lwoff and Jacques Lucien Monod for their discoveries concerning the operon and virus synthesis.

Figure 1. Structure of Operon

An operon is made up of several structural genes arranged under a common promoter and regulated by a common operator. It is defined as a set of adjacent structural genes, plus the adjacent regulatory signals that affect transcription of the structural genes. The regulators of a given operon, including repressors, corepressors and activators, are not necessarily coded for by that operon.

The operon is a unit of transcription: upstream of the structural genes lies a promoter sequence, which provides a site for RNA polymerase to bind and initiate transcription. Close to the promoter lies a section of DNA called an operator.

Operon regulation can be either negative or positive by induction or repression. Negative control involves the binding of a repressor to the operator to prevent transcription. Operons can also be positively controlled. An activator protein binds to DNA, usually at a site other than the operator, to stimulate transcription.

Figure 2. Regulation of Operon. 1: RNA Polymerase, 2: Repressor, 3: Promoter, 4: Operator, 5: Lactose, 6: lacZ, 7: lacY, 8: lacA. Top: The gene is essentially turned off. There is no lactose to inhibit the repressor, so the repressor binds to the operator, which obstructs the RNA polymerase from binding to the promoter and making lactase. Bottom: The gene is turned on. Lactose is inhibiting the repressor, allowing the RNA polymerase to bind with the promoter and express the genes, which synthesize lactase. Eventually, the lactase will digest all of the lactose, until there is none to bind to the repressor. The repressor will then bind to the operator, stopping the manufacture of lactase.

Regulatory Model

Similarity and Homology

Forward Analysis

Construct New GRN

1 User Input

Some regulatory relationships can be obtained from experiments. If users know how the new gene and the existing genes regulate each other, they can set those interactions manually, so that they do not need to be predicted by the model. These regulations will be used in the later simulation.

2 Similarity Analysis

2.1 Sequence

2.1.1 Needleman-Wunsch Algorithm

The Needleman-Wunsch algorithm was first published in 1970 by Saul B. Needleman and Christian D. Wunsch. It performs a global alignment of two sequences and is mostly used in bioinformatics to align protein or nucleotide sequences. Our software applies this algorithm to the alignment of DNA and amino acid sequences.

The Needleman-Wunsch algorithm is a form of dynamic programming, and it was the first application of dynamic programming to biological sequence comparison.

Here is an example of the Needleman-Wunsch algorithm. S(a,b) is the similarity of character a and character b. The scores of the characters are given in a similarity matrix (the matrix figure is not reproduced here).

We use a linear gap penalty, denoted by d; here we set the gap penalty to -5. Then the alignment:

A: AGACTAGTTAC
B: CGA - - - GACGT

would have the following score:

S(A,C)+S(G,G)+S(A,A)+(3×d)+S(G,G)+S(T,A)+S(T,C)+S(A,G)+S(C,T) = -3+7+10-(3×5)+7+(-4)+0+(-1)+0 = 1
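The score above can be checked with a few lines of Python. This is only a sketch: the entries of `SIMILARITY` that appear in the calculation (for example S(A,A) = 10, S(G,G) = 7, S(A,C) = -3) come from the text, while the remaining entries are assumed values from a commonly used textbook matrix, and `alignment_score` is a hypothetical helper name.

```python
# Similarity matrix: entries used in the worked score come from the text;
# the others (e.g. S(C,C), S(T,T), S(G,C), S(G,T)) are assumptions.
SIMILARITY = {
    "A": {"A": 10, "G": -1, "C": -3, "T": -4},
    "G": {"A": -1, "G": 7, "C": -5, "T": -3},
    "C": {"A": -3, "G": -5, "C": 9, "T": 0},
    "T": {"A": -4, "G": -3, "C": 0, "T": 8},
}
GAP = -5  # linear gap penalty d


def alignment_score(row_a, row_b, sim=SIMILARITY, gap=GAP):
    """Sum the column scores of an already gapped alignment ('-' marks a gap)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total += gap          # one gap column costs d
        else:
            total += sim[a][b]    # aligned pair scores S(a, b)
    return total


print(alignment_score("AGACTAGTTAC", "CGA---GACGT"))  # 1
```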

To find the alignment with the highest score, the algorithm allocates a two-dimensional matrix F. The score in row i, column j is denoted by Fij. There is one row for each character in sequence A and one column for each character in sequence B. Therefore, if we align sequences of lengths n and m, the amount of memory used is O(nm).

As the algorithm proceeds, Fij is set to the optimal score according to the following rules:
Basis:

Fi0 = d*i
F0j = d*j

Recursion:

Fij = max(F(i-1,j-1) + S(Ai,Bj), F(i-1,j) + d, F(i,j-1) + d)

The pseudo-code of this algorithm would look like this:

for i = 0 to length(A)
    F(i,0) <-- d*i
for j = 0 to length(B)
    F(0,j) <-- d*j
for i = 1 to length(A)
    for j = 1 to length(B)
    {
        Match <-- F(i-1,j-1) + S(Ai,Bj)
        Delete <-- F(i-1,j) + d
        Insert <-- F(i,j-1) + d
        F(i,j) <-- max(Match, Insert, Delete)
    }

Once the matrix F has been computed, Fnm is the maximum score among all possible alignments.
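The basis, recursion and pseudo-code above can be sketched in Python as follows. The function name `nw_score_matrix` is hypothetical, and the similarity matrix is partly assumed: only the entries that occur in the worked score come from the text.

```python
# Entries used in the worked example come from the text; the rest are assumed.
SIMILARITY = {
    "A": {"A": 10, "G": -1, "C": -3, "T": -4},
    "G": {"A": -1, "G": 7, "C": -5, "T": -3},
    "C": {"A": -3, "G": -5, "C": 9, "T": 0},
    "T": {"A": -4, "G": -3, "C": 0, "T": 8},
}


def nw_score_matrix(a, b, sim, gap):
    """Fill the (len(a)+1) x (len(b)+1) Needleman-Wunsch score matrix F."""
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    # Basis: aligning a prefix against nothing costs one gap per character.
    for i in range(n + 1):
        F[i][0] = gap * i
    for j in range(m + 1):
        F[0][j] = gap * j
    # Recursion: the best of match/mismatch, delete and insert.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = F[i - 1][j - 1] + sim[a[i - 1]][b[j - 1]]
            delete = F[i - 1][j] + gap
            insert = F[i][j - 1] + gap
            F[i][j] = max(match, insert, delete)
    return F


F = nw_score_matrix("AGACTAGTTAC", "CGAGACGT", SIMILARITY, gap=-5)
print(F[-1][-1])  # best global alignment score, F(n, m)
```

Python strings are 0-indexed, so character Ai of the pseudo-code is `a[i - 1]` here.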

To recover the optimal alignment, trace back from Fnm by comparing the three possible sources used in the code above (Match, Insert and Delete). If the source is Match, Ai and Bj are aligned; if it is Delete, Ai is aligned with a gap; and if it is Insert, Bj is aligned with a gap. There may be more than one optimal alignment.

For the example, applying the Needleman-Wunsch algorithm gives the following matrix (the matrix figure is not reproduced here):
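A minimal sketch of the traceback step is shown below. It uses a simple match/mismatch scorer (+1/-1 with a gap penalty of -1) rather than the similarity matrix, purely to keep the example short; the names `needleman_wunsch` and `simple_score` are hypothetical, and when several optimal alignments exist this version returns only one of them.

```python
def simple_score(x, y):
    """Illustrative scorer: +1 for a match, -1 for a mismatch."""
    return 1 if x == y else -1


def needleman_wunsch(a, b, score=simple_score, gap=-1):
    """Return (best score, aligned a, aligned b) for a global alignment."""
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = gap * i
    for j in range(1, m + 1):
        F[0][j] = gap * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + score(a[i - 1], b[j - 1]),
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)

    # Trace back from F(n, m), checking which source produced each cell.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1])      # Match: Ai aligned with Bj
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append(a[i - 1])      # Delete: Ai aligned with a gap
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")           # Insert: Bj aligned with a gap
            out_b.append(b[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


best, row_a, row_b = needleman_wunsch("GATTACA", "GCATGCG")
print(best)
print(row_a)
print(row_b)
```

Preferring Match over Delete over Insert during the traceback is an arbitrary tie-breaking choice; a different order can yield a different alignment with the same score.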

And the optimal alignment would be:

- - AGACTAGTTAC
CGAGAC - - GT - - -

2.1.2 A Supplementary Game

The rows and columns of the GRN matrix can be regarded as vectors containing the regulated or the regulating information. The behavior similarity of two units can be described by the dot product of their two regulated vectors or their two regulating vectors. Biologists usually assume that the more similar two sequences are, the more likely they are to behave similarly. Whether the ratio of genes with similar behaviors is positively correlated with gene similarity is essential to our project. We therefore obtained 1.6 million sets of data by pairwise alignment of all 1748 units in the GRN of K-12. Each set consists of a gene similarity and a behavior similarity. The result is analyzed and plotted in the figure. The linear fit shows that the ratio is positively correlated with the similarity.

Figure 4. Linear fit of ratio-similarity relationship.

Although there are cases in which a slight change in DNA sequence significantly changes the properties of a gene (sickle-cell disease, for example), the influence is usually determined by the location and scale of the mutation. The result is therefore still convincing to some degree.
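The dot-product measure can be illustrated with a small sketch. It assumes that entry `grn[i][j]` holds the regulation exerted on unit i by unit j, so that rows are "regulated" vectors and columns are "regulating" vectors; this orientation, the function names and the toy matrix are assumptions for illustration only.

```python
def dot(u, v):
    """Plain dot product of two equally long vectors."""
    return sum(x * y for x, y in zip(u, v))


def regulated_similarity(grn, i, j):
    """Compare how units i and j are regulated (rows i and j)."""
    return dot(grn[i], grn[j])


def regulating_similarity(grn, i, j):
    """Compare how units i and j regulate others (columns i and j)."""
    col_i = [row[i] for row in grn]
    col_j = [row[j] for row in grn]
    return dot(col_i, col_j)


# Toy 3-unit network: +1 activation, -1 repression, 0 no regulation.
toy_grn = [
    [0, 1, -1],
    [0, 1, -1],
    [1, 0, 0],
]
print(regulated_similarity(toy_grn, 0, 1))   # 2: units 0 and 1 are regulated alike
print(regulating_similarity(toy_grn, 1, 2))  # -2: units 1 and 2 regulate in opposite ways
```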

2.2 Filtering

2.2.1 Random Noise

Normally, the similarity of two sequences will not be zero, even when the sequences are unrelated. We carried out computational experiments to study the similarity of random sequences: we randomly chose a gene in the network, generated 1000 random sequences and aligned them with it. The alignment results indicate that the random similarities follow a Gaussian distribution, which suggests that some similarity values are not statistically significant.

Figure 5. Random similarity distribution

2.2.2 Filter

We want the exogenous gene to interact with the genes that are highly similar to it. The program aligns the exogenous gene (query) with the genes in the network (subject) to obtain the original similarities. To filter out meaningless low values, a number of random sequences are generated for each query-subject alignment; normally 100 is sufficient. Because sequence length influences the alignment result, the random sequences have the same length as the query. The random sequences are then aligned with the subject sequence, and the statistics of these random similarities are used as a threshold.

Threshold = μ + xσ

In the formula, μ is the mean random similarity and σ is its standard deviation. The coefficient x controls the strictness of the filter and is determined by machine learning. If the original similarity is lower than the threshold, it is discarded, which usually means that the original value lacks statistical significance.

An example of filtering and consistency is presented in “Example”.
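The filter can be sketched as follows. To keep the example self-contained, `similarity` is a stand-in based on Python's `difflib.SequenceMatcher` rather than the Needleman-Wunsch score used by the software, the sample standard deviation is an arbitrary choice, and the value of x is supplied by the caller because the text does not give it. All function names are hypothetical.

```python
import random
import statistics
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in similarity in [0, 1]; the software uses alignment scores instead."""
    return SequenceMatcher(None, a, b).ratio()


def random_sequence(length, rng, alphabet="ACGT"):
    """Random sequence with the same length as the query."""
    return "".join(rng.choice(alphabet) for _ in range(length))


def noise_threshold(query, subject, x, n_random=100, rng=None):
    """Threshold = mu + x * sigma over similarities of random sequences."""
    rng = rng or random.Random()
    scores = [similarity(random_sequence(len(query), rng), subject)
              for _ in range(n_random)]
    mu = statistics.mean(scores)
    sigma = statistics.stdev(scores)  # sample standard deviation (an assumption)
    return mu + x * sigma


def filtered_similarity(query, subject, x, n_random=100, rng=None):
    """Return the original similarity, or None if it falls below the threshold."""
    original = similarity(query, subject)
    threshold = noise_threshold(query, subject, x, n_random, rng)
    return original if original >= threshold else None


rng = random.Random(0)
subject = random_sequence(60, rng)
print(filtered_similarity(subject, subject, x=2, rng=rng))                   # identical sequences pass
print(filtered_similarity(random_sequence(60, rng), subject, x=2, rng=rng))  # value, or None if filtered
```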

2.3 Regulation Calculation

Consider a three-unit network whose units interact with each other as shown in the figure. The regulation is described by the GRN matrix.

Figure 6. Example network and its GRN matrix.

If D is the exogenous unit, we can obtain three similarity data sets of D with the units in the original GRN:

Promoter sequence similarity

Gene sequence similarity

Amino acid sequence similarity.

The construction is equivalent to adding a new column and a new row to the original matrix.

Figure 7. Mathematical Equivalence

When filling the column, D is compared with the regulators of the unit in each row. The regulations in the row are considered separately and split into a “positive group” and a “negative group”. The average similarity of each group represents the distance between the exogenous unit and that group. D is assumed to take the regulatory direction (positive or negative) of the group with the larger average similarity. The regulatory intensity is the weighted average regulation of the chosen group, where the weight is the amino acid sequence similarity.

There are two conditions when filling the new row:
1. Some units have the same promoter as the exogenous unit.
2. No unit has the same promoter as the exogenous unit.
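One possible reading of the column-filling rule is sketched below. The function names, the handling of ties and of rows without regulators, and the toy numbers are assumptions; similarities that were removed by the filter are assumed to be passed in as 0.

```python
def weighted_group_regulation(regulations, similarities):
    """Predict D's regulation of one unit from that unit's existing regulators.

    regulations  -- regulation values of the existing regulators (one row)
    similarities -- amino acid similarity of D to each of those regulators
    """
    pairs = list(zip(regulations, similarities))
    positive = [(r, s) for r, s in pairs if r > 0]
    negative = [(r, s) for r, s in pairs if r < 0]

    def mean_similarity(group):
        return sum(s for _, s in group) / len(group) if group else 0.0

    # D takes the direction of the group it resembles more (ties go to positive,
    # which is an arbitrary choice made for this sketch).
    chosen = positive if mean_similarity(positive) >= mean_similarity(negative) else negative
    total_weight = sum(s for _, s in chosen)
    if total_weight == 0:
        return 0.0  # no similar regulator: predict no regulation
    # Intensity: similarity-weighted average regulation of the chosen group.
    return sum(r * s for r, s in chosen) / total_weight


def fill_new_column(grn, similarities):
    """One predicted value per row, i.e. D's regulation of every existing unit."""
    return [weighted_group_regulation(row, similarities) for row in grn]


# Toy row: two activators and one repressor, with D most similar to an activator.
print(weighted_group_regulation([0.8, -0.5, 0.0, 0.4], [0.9, 0.2, 0.7, 0.3]))
```

In the toy row the positive group has mean similarity 0.6 against 0.2 for the negative group, so D is predicted to activate with intensity (0.8·0.9 + 0.4·0.3) / (0.9 + 0.3), about 0.7.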

In condition 1, the units sharing the same promoter with the new member are picked out, and the following steps are the same as the construction of the column. The difference is the similarity used here is the gene sequence similarity. As explained in the regulation model part, the promoter is the main regulatory region, but the following sequence is also considered. Now the promoter is the same, so what we focus on are the gene sequences.

In condition 2, the process is almost the same as constructing the new column. Promoter similarity is used because it is the main region.
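The two conditions can be sketched as a single routine that first selects the comparison units and the similarity type, then applies the same positive/negative grouping as for the column. This is an interpretation of the description rather than the software's actual code: the function names, the data layout (`grn[i][j]` is the regulation of unit i by regulator j) and the toy values are all assumptions.

```python
def weighted_group_regulation(regulations, similarities):
    """Pick the more similar group (positive or negative), then average by weight."""
    pairs = list(zip(regulations, similarities))
    positive = [(r, s) for r, s in pairs if r > 0]
    negative = [(r, s) for r, s in pairs if r < 0]

    def mean_similarity(group):
        return sum(s for _, s in group) / len(group) if group else 0.0

    chosen = positive if mean_similarity(positive) >= mean_similarity(negative) else negative
    total_weight = sum(s for _, s in chosen)
    return sum(r * s for r, s in chosen) / total_weight if total_weight else 0.0


def fill_new_row(grn, promoters, new_promoter, gene_sim, promoter_sim):
    """Predict how each existing regulator acts on the exogenous unit D."""
    same_promoter = [i for i, p in enumerate(promoters) if p == new_promoter]
    if same_promoter:
        # Condition 1: only units sharing D's promoter, weighted by gene similarity.
        candidates, sim = same_promoter, gene_sim
    else:
        # Condition 2: all units, weighted by promoter similarity.
        candidates, sim = list(range(len(grn))), promoter_sim
    new_row = []
    for j in range(len(grn[0])):
        column = [grn[i][j] for i in candidates]
        weights = [sim[i] for i in candidates]
        new_row.append(weighted_group_regulation(column, weights))
    return new_row


toy_grn = [[0.0, 1.0], [0.0, -1.0], [0.5, 0.0]]   # 3 units x 2 regulators
promoters = ["p1", "p1", "p2"]                     # hypothetical promoter names
print(fill_new_row(toy_grn, promoters, "p1", [0.8, 0.2, 0.5], [0.1, 0.6, 0.3]))
print(fill_new_row(toy_grn, promoters, "p9", [0.8, 0.2, 0.5], [0.1, 0.6, 0.3]))
```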

Figure 8. Construct New GRN

3 Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics.

To obtain better regulation predictions, we use the online database DAVID to cluster all the genes in our GRN. To keep the software lightweight, we hope to communicate with DAVID online. After obtaining the clusters, we multiply the similarity of two genes by a factor if they are in the same cluster.

Although the source code for this part is complete, we lack the experimental information needed to set an appropriate factor. All source code has been pushed to our GitHub.
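The cluster adjustment itself is a one-line rule, sketched below. The cluster assignments are assumed to be available as a plain dictionary (retrieving them from DAVID is not shown), and the gene names and the factor 1.2 are placeholders, since no value for the factor has been determined.

```python
def apply_cluster_factor(similarity, cluster_of, gene_a, gene_b, factor):
    """Scale a similarity by `factor` when both genes fall in the same cluster."""
    cluster_a = cluster_of.get(gene_a)
    cluster_b = cluster_of.get(gene_b)
    if cluster_a is not None and cluster_a == cluster_b:
        return similarity * factor
    return similarity


# Hypothetical cluster assignments and factor, for illustration only.
clusters = {"geneA": 1, "geneB": 1, "geneC": 2}
print(apply_cluster_factor(0.5, clusters, "geneA", "geneB", factor=1.2))  # same cluster: scaled
print(apply_cluster_factor(0.5, clusters, "geneA", "geneC", factor=1.2))  # different clusters: unchanged
```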

Network Model

Evaluate Network

Reverse Analysis

Virtual Gene

Expression Range

Particle Swarm Optimization

Locate Optimal Target

@@ Line 29: / Line 29: @@
 <body>
-<!--div id="direction">
+<div id="direction">
          <ul>
-           <li><a href="#abstract" class="button">Abstract</a>
+           <li><a href="#Fetch_Database" class="button">Database</a></li>
-           <li><a href="#Fetch_Database" class="button">Fetch Database</a>
-          <li><a href="#Alignment_Analyze" class="button">Alignment Analyze</a>
+           <li><a href="#Alignment_Analyze" class="button">Operon Theory and Regulatory Model</a></br>
-           <li><a href="#New_Network_Construction" class="button">New Network Construction</a>
+                <a href="#dbabstract" class="button" id="subbutton">Abstract</a></br>
-          <li><a href="#Network_Model" class="button">Network Model</a>
+                <a href="#fetch" class="button" id="subbutton">Fetch Regulation</a></br>
-           <li><a href="#Predict" class="button">Predict</a>
+                <a href="#fgi" class="button" id="subbutton">Fetch Gene Info</a></br>
-           <li><a href="#Database" class="button">Database</a>
+                <a href="#fpi" class="button" id="subbutton">Fetch Promoter Info</a></br>
+                <a href="#Integration" class="button" id="subbutton">Integration</a>
+          </li>
+           <li><a href="#fa" class="button">Forward Analysis</a></br>
+                <a href="#cng" class="button" id="subbutton">Construct New GRN</a></br>
+                <a href="#nm" class="button" id="subbutton">Network Model</a></br>
+                <a href="#en" class="button" id="subbutton">Evaluate Network</a>
+         </li>
+           <li><a href="#ra" class="button">Reverse Analysis</a></br>
+                <a href="#vg" class="button" id="subbutton">Virtual Gene</a></br>
+                <a href="#er" class="button" id="subbutton">Expression Range</a>
+                <a href="#pso" class="button" id="subbutton">Particle Swarm Optimaztion</a>
+                <a href="#lot" class="button" id="subbutton">Locate Optimal Target</a>
+         </li>
+           <li><a href="#main" class="button">Top</a></li>
          </ul>
 </div>
@@ Line 45: / Line 65: @@
          var href = $(this).attr("href");
          var pos = $(href).offset().top - 100;
-         $("html,body").animate({scrollTop: pos}, 1500);
+         $("html,body").animate({scrollTop: pos}, 1500);//the smaller the quicker
          return false;
      });
 });
-</script-->
+</script>
@@ Line 73: / Line 93: @@
 <h2>Database</h2>
 <div id="jobs_container">
-	         <div class="jobs_trigger"><strong>Abstract</strong></div>
+	         <div class="jobs_trigger" id="dbabstract"><strong>Abstract</strong></div>
 		 		<div class="jobs_item" style="display: none;"><p class="bodytext"></p><p align="justify">To simulate and analyze a genetic regulatory network (GRN), we need to build an objects' array to store the complete information of each gene. It contains regulation relationships between genes, sequences of genes, sequences of promoters and so on. However, it's hard to find an appropriate database online containing all information we need in a simple file. RegulonDB has downloadable files about the regulation between transcription factors (TF) and genes. Files about genetic information, transcription unit information and promoter information can also be downloaded from the RegulonDB. All those files have been put into file “source data” in the root directory of our software. They contain all information the simulation needs and we use fetching module to achieve data extraction and integration. There are four steps: fetch regulation relationships, fetch gene information, fetch promoter information and integrate information above.
 </p>
@@ Line 79: / Line 99: @@
    <div id="jobs_container">
-	         <div class="jobs_trigger"><strong>Fetch Regulation</strong></div>
+	         <div class="jobs_trigger" id="fetch"><strong>Fetch Regulation</strong></div>
 		 		<div class="jobs_item" style="display: none;"><p class="bodytext"></p><p align="justify">In GRN, there are two kinds of files: <a class="content" href="http://regulondb.ccg.unam.mx/menu/download/datasets/files/network_tf_tf.txt"> TF to TF</a> and <a class="content" href="http://regulondb.ccg.unam.mx/menu/download/datasets/files/network_tf_gene.txt">TF to Gene</a>. Since the database about the regulation between TFs and Genes contains only one-way interaction, the matrix of GRN is a rectangle.</br></br>
 First of all, read the regulation relationship of TFs. Our software filters the documentation of RegulonDB on the head of all files and then reads the name of regulate and regulated TF, which is also the name of its genes, one by one. In the same time, our software numerates the genes and stores their names into an objects' array of genetic data. </br></br>
@@ Line 97: / Line 117: @@
                  </div>
-				<div class="jobs_trigger"><strong> Fetch Gene Info</strong></div>
+				<div class="jobs_trigger" id="fgi"><strong> Fetch Gene Info</strong></div>
 				<div class="jobs_item" style="display: none;"><p align="justify">
 All gene information has been deposited into a file named gene_info which could be downloaded <a class="content" href="http://regulondb.ccg.unam.mx/menu/download/datasets/files/Gene_sequence.txt">here</a>. In order of picking out the genes in GRN as fast as possible, all genetic information are stored in a “map”. “Map” is just like a dictionary yet its words are names of genes and its descriptions of words are replaced by genetic information. By using binary tree method, it is very fast to search the “word” wanted in the “dictionary”. As tested, the speed of binary tree method built-in “map” function is 720 times faster than traversal method.</br></br>
@@ Line 110: / Line 130: @@
-              <div class="jobs_trigger"> <strong>Fetch Promoter Info</strong></div>
+              <div class="jobs_trigger" id="fpi"> <strong>Fetch Promoter Info</strong></div>
 		        <div class="jobs_item" style="display: none;"><p align="justify">All promoter information has been deposited into a file named promoter_info which could be downloaded <a class="content" href="http://regulondb.ccg.unam.mx/menu/download/datasets/files/PromoterSet.txt">here</a>. But we also need transcription unit information because the information files about promoter do not contain all genes' names backward. “TU Info” file, which can be downloaded <a class="content" href="http://regulondb.ccg.unam.mx/menu/download/datasets/files/TUSet.txt">here</a>, contains the starting position of each TU and its promoter name. Our software picks out the starting position into a integer array. Using the left position picked out in gene info, our software would find out which unit the gene belongs to through dichotomy method and then stores the name of promoter into corresponding object.</br></br>
 &nbsp;&nbsp;The format of TU info database:</br>
@@ Line 124: / Line 144: @@
-				<div class="jobs_trigger"> <strong>Integration</strong></div>
+				<div class="jobs_trigger" id="Integration"> <strong>Integration</strong></div>
 				<div class="jobs_item" style="display: block;"><p align="justify">
 Our software integrates all information we picked out about genes and generates a file named “all_info” —— all information about genes —— for the output graphical interface's reading. In the meanwhile, the array of objects containing all information has been stored in computer memory which greatly improve the computing speed of our software.</br></br>
@@ Line 233: / Line 253: @@
-<h2>Forward Analysis</h2>
+<h2 id="fa">Forward Analysis</h2>
-<div class="jobs_trigger"><strong>Construct New GRN</strong></div>
+<div class="jobs_trigger" id="cng"><strong>Construct New GRN</strong></div>
    <div class="jobs_item" style="display: none;">
      <h3>1 User Input</h3>
@@ Line 374: / Line 394: @@
      </p>
    </div>
-<div class="jobs_trigger"><strong>Network Model</strong></div>
+<div class="jobs_trigger" id="nm"><strong>Network Model</strong></div>
    <div class="jobs_item" style="display: none;">
 <p align="justify">Network analysis includes finding stable condition of network, adding new gene, finding new stable condition and changes from original condition to new condition. We use densities of materials to describe network condition. If all material densities are time-invariant, we can say the network condition is stable.</p>
@@ Line 401: / Line 421: @@
    </div>
-<div class="jobs_trigger"><strong>Evaluate Network</strong></div>
+<div class="jobs_trigger" id="en"><strong>Evaluate Network</strong></div>
    <div class="jobs_item" style="display: none;">
 <p align="justify">Record the original stable condition, set new material density to 0 and this is the new initial density vector. Solve new equations and record density vectors before the new condition is stable and store these data in a text file.</br></br>
@@ Line 417: / Line 437: @@
-<h2>Reverse Analysis</h2>
+<h2 id="ra">Reverse Analysis</h2>
-<div class="jobs_trigger"><strong>Virtual Gene</strong></div>
+<div class="jobs_trigger" id="vg"><strong>Virtual Gene</strong></div>
    <div class="jobs_item" style="display: none;">
 <p align="justify">Before reverse analysis, we use the same idea about constructing a new GRN. So we create a virtual gene which replace the gene what users want to get. In calculation, it means that we add a row and a column to the matrix of GRN.</p>
@@ Line 425: / Line 445: @@
-<div class="jobs_trigger"><strong>Expression Range</strong></div>
+<div class="jobs_trigger" id="er"><strong>Expression Range</strong></div>
    <div class="jobs_item" style="display: none;">
 <p align="justify">Before prediction, the expression of specific genes which the experimenter needs should be input into our software as well as the improvement or depression. The number of target gene is SIX at most.</br></br>
@@ Line 433: / Line 453: @@
-<div class="jobs_trigger"><strong>Particle Swarm Optimaztion</strong></div>
+<div class="jobs_trigger" id="pso"><strong>Particle Swarm Optimaztion</strong></div>
    <div class="jobs_item" style="display: none;">
 <p align="justify">
@@ Line 443: / Line 463: @@
-<div class="jobs_trigger"><strong>Locate Optimal Target</strong></div>
+<div class="jobs_trigger" id="lot"><strong>Locate Optimal Target</strong></div>
    <div class="jobs_item" style="display: none;">
 <p align="justify">To improve the efficiency of choosing a suitable gene after getting a series of regulatory value, our software picks out some obvious regulation. The value of regulation is between -1 to 1 in which -1 means negative effect and 1 means positive effect. As a result, what our software has done is filtering out the absolute value which is lower than 0.9. Because the difference of regulatory intensity lower than 0.1 has very little effect to the stable expression, the final result of regulation is indicated by Boolean value.</br></br>

Team:USTC-Software/Project/Method

Methodologies

Database

Operon Theory and Regulatory Model

Forward Analysis

1 User Input

2 Similarity Analysis

2.1 Sequence

2.1.1 Needleman-Wunsch Algorithm

2.1.2 A Supplementary Game

2.2 Filtering

2.2.1 Random Noise

2.2.2 Filter

2.3 Regulation Calculation

3 Clustering

Reverse Analysis