Team:USTC-Software/Project/Method

From 2013.igem.org

(Difference between revisions)
Line 65: Line 65:
   <div id="jobs_container">
   <div id="jobs_container">
        <div class="jobs_trigger"><strong>Fetching Regulation</strong></div>
        <div class="jobs_trigger"><strong>Fetching Regulation</strong></div>
-
<div class="jobs_item" style="display: none;"><p class="bodytext"></p><p align="justify">In GRN, there are two kinds of files (download here[]): TF to TF and TF to Gene. Since the database about the regulation between TFs and Genes contains only one-way interaction, the matrix of GRN is a rectangle.
+
<div class="jobs_item" style="display: none;"><p class="bodytext"></p><p align="justify">In GRN, there are two kinds of files (download here[]): TF to TF and TF to Gene. Since the database about the regulation between TFs and Genes contains only one-way interaction, the matrix of GRN is a rectangle.</br>
-
First of all, read the regulation relationship of TFs. Our software filters the documentation of RegulonDB on the head of all files and then reads the name of regulate and regulated TF, which is also the name of its genes, one by one. In the same time, our software numerates the genes and stores their names into an objects’ array of genetic data.  
+
First of all, read the regulation relationship of TFs. Our software filters the documentation of RegulonDB on the head of all files and then reads the name of regulate and regulated TF, which is also the name of its genes, one by one. In the same time, our software numerates the genes and stores their names into an objects’ array of genetic data. </br>
-
The format of regulation database:
+
The format of regulation database:</br>
-
TF_name    TF_name    +/-/+-
+
TF_name    TF_name    +/-/+-</br>
-
The regulation of TFs has been put into a square matrix whose row is the regulator and column is the one regulated by. To make our GRN as complete as possible, the regulation between TF and genes has joined into the matrix. The one-way interaction results that we must read the TF in order to fulfill the regulator before completing the TF to gene’s regulation in the same way of TF to TF. The format of regulation database:
+
The regulation of TFs has been put into a square matrix whose row is the regulator and column is the one regulated by. To make our GRN as complete as possible, the regulation between TF and genes has joined into the matrix. The one-way interaction results that we must read the TF in order to fulfill the regulator before completing the TF to gene’s regulation in the same way of TF to TF. The format of regulation database:</br>
-
TF_name    Gene_name    +/-/+-
+
TF_name    Gene_name    +/-/+-</br>
-
At last, a regulatory matrix whose row represents regulate gene (TF) and whose column represents gene regulated by (TF+Gene) has been output into a file called “old_GRN” in root directory. The values in GRN matrix are regulations in which “1” means positive activation, “-1” means repression and “0” means no relationship. There have been some regulations both positive and negative identified regulations are determined by the experimental environment. As a result, our software picks out those uncertain genes and stores them into a file named “uncertain_database”.
+
At last, a regulatory matrix whose row represents regulate gene (TF) and whose column represents gene regulated by (TF+Gene) has been output into a file called “old_GRN” in root directory. The values in GRN matrix are regulations in which “1” means positive activation, “-1” means repression and “0” means no relationship. There have been some regulations both positive and negative identified regulations are determined by the experimental environment. As a result, our software picks out those uncertain genes and stores them into a file named “uncertain_database”.</br>
-
The format of uncertain database:
+
The format of uncertain database:</br>
-
?    Gene_name->Gene_name
+
?    Gene_name->Gene_name</br>
The question mark represents the unknown regulation between regulator and regulated-by whose names presented afterward. Users could replace the question mark with the data known in past experiment. (“+” rep positive, “-” rep negative). Our software will replace the values in matrix automatically. But if not rewrote, our software will regard those regulation as unknown.
The question mark represents the unknown regulation between regulator and regulated-by whose names presented afterward. Users could replace the question mark with the data known in past experiment. (“+” rep positive, “-” rep negative). Our software will replace the values in matrix automatically. But if not rewrote, our software will regard those regulation as unknown.
Line 83: Line 83:
<div class="jobs_trigger"><strong> Fetching Gene Info</strong></div>
<div class="jobs_trigger"><strong> Fetching Gene Info</strong></div>
<div class="jobs_item" style="display: none;"><p align="justify">
<div class="jobs_item" style="display: none;"><p align="justify">
-
All gene information has been deposited into a file named gene_info which could be downloaded here[]. In order of picking out the genes in GRN as fast as possible, all genetic information are stored in a “map”. “Map” is just like a dictionary yet its words are names of genes and its descriptions of words are replaced by genetic information. By using binary tree method, it is very fast to searth the “word” wanted in the “dictionary”. As tested, the speed of binary tree method built-in “map” function is 720 times faster than traversal method.
+
All gene information has been deposited into a file named gene_info which could be downloaded here[]. In order of picking out the genes in GRN as fast as possible, all genetic information are stored in a “map”. “Map” is just like a dictionary yet its words are names of genes and its descriptions of words are replaced by genetic information. By using binary tree method, it is very fast to searth the “word” wanted in the “dictionary”. As tested, the speed of binary tree method built-in “map” function is 720 times faster than traversal method.</br>
-
The format of Gene Info database:
+
The format of Gene Info database:</br>
-
ID_assigned_by_RegulonDB    Gene_name    Left_end_position    Right_end_position    DNA_strand    Product_type    Product_name    Start_codon_sequence    Stop_codon_sequence    Gene_sequence
+
ID_assigned_by_RegulonDB    Gene_name    Left_end_position    Right_end_position    DNA_strand    Product_type    Product_name    Start_codon_sequence    Stop_codon_sequence    Gene_sequence</br>
The label of the map vector is gene name which will be picked out based on the names read in regulation matrix before. It is really fast using the binary tree method to find the specific genetic information and store them into a specific object. Those information includes gene ID, left position, right position, gene description and gene sequence. The gene ID is used to link to RegulonDB’s gene details; The left position is used to find its specific transcription unit; The right position is used to figure out the base amount; The description of genes is used to distinguish the RNA and protein; The sequence is used to predict the regulation by alignment.
The label of the map vector is gene name which will be picked out based on the names read in regulation matrix before. It is really fast using the binary tree method to find the specific genetic information and store them into a specific object. Those information includes gene ID, left position, right position, gene description and gene sequence. The gene ID is used to link to RegulonDB’s gene details; The left position is used to find its specific transcription unit; The right position is used to figure out the base amount; The description of genes is used to distinguish the RNA and protein; The sequence is used to predict the regulation by alignment.
Line 95: Line 95:
                  
                  
             <div class="jobs_trigger"> <strong>Fetching Promoter Info</strong></div>
             <div class="jobs_trigger"> <strong>Fetching Promoter Info</strong></div>
-
        <div class="jobs_item" style="display: none;"><p align="justify">All promoter information has been deposited into a file named promoter_info which could be downloaded here[]. But we also need transcription unit information because the information files about promoter do not contain all genes’ names backward. “TU Info” file, which can be downloaded here[], contains the starting position of each TU and its promoter name. Our software picks out the starting position into a integer array. Using the left position picked out in gene info, our software would find out which unit the gene belongs to through dichotomy method and then stores the name of promoter into corresponding object.
+
        <div class="jobs_item" style="display: none;"><p align="justify">All promoter information has been deposited into a file named promoter_info which could be downloaded here[]. But we also need transcription unit information because the information files about promoter do not contain all genes’ names backward. “TU Info” file, which can be downloaded here[], contains the starting position of each TU and its promoter name. Our software picks out the starting position into a integer array. Using the left position picked out in gene info, our software would find out which unit the gene belongs to through dichotomy method and then stores the name of promoter into corresponding object.</br>
-
The format of TU info database:
+
The format of TU info database:</br>
-
Operon_name    Unit_name    promoter_name    Transcription_start_site ......
+
Operon_name    Unit_name    promoter_name    Transcription_start_site ......</br>
-
The principle of fetching information of promoters is same as fetching genes’s. Our software stores the promoter information from the file named “promoter_info” in a “map” which could be used to pick out the promoter sequence by searching promoter name through binary tree method.
+
The principle of fetching information of promoters is same as fetching genes’s. Our software stores the promoter information from the file named “promoter_info” in a “map” which could be used to pick out the promoter sequence by searching promoter name through binary tree method.</br>
-
The format of Promoter Info database:
+
The format of Promoter Info database:</br>
-
Promoter_ID_assigned_by_RegulonDB    Promoter_name
+
Promoter_ID_assigned_by_RegulonDB    Promoter_name</br>
The sequence of promoter will be used in the alignment method in next module which could make a prediction of exogenous genes’ regulation pattern.
The sequence of promoter will be used in the alignment method in next module which could make a prediction of exogenous genes’ regulation pattern.
Line 111: Line 111:
<div class="jobs_item" style="display: block;"><p align="justify">                     
<div class="jobs_item" style="display: block;"><p align="justify">                     
Our software integrates all information we picked out about genes and generates a file named “all_info” —— all information about genes —— for the output graphical interface’s reading. In the meanwhile, the array of objects containing all information has been stored in computer memory which greatly improve the computing speed of our software.
Our software integrates all information we picked out about genes and generates a file named “all_info” —— all information about genes —— for the output graphical interface’s reading. In the meanwhile, the array of objects containing all information has been stored in computer memory which greatly improve the computing speed of our software.
-
The format of all_info database:
+
The format of all_info database:</br>
-
No.    promoter_sequence    gene_sequence    gene_name    ID    left_position    right_position    promoter_name    description
+
No.    promoter_sequence    gene_sequence    gene_name    ID    left_position    right_position    promoter_name    description</br>
-
The fetching module generates three files: old_GRN, all_info and uncertain_database.
+
The fetching module generates three files: old_GRN, all_info and uncertain_database.</br>
</p>
</p>
           </div>
           </div>
Line 132: Line 132:
   <div id="jobs_container">
   <div id="jobs_container">
        <div class="jobs_trigger"><strong>An example</strong></div>
        <div class="jobs_trigger"><strong>An example</strong></div>
-
<div class="jobs_item" style="display: none;"><p class="bodytext"></p><p align="justify">We would like to start with a simple example. A cell operates like a basketball team. Imagine you are a manager of a team who wants to bring in some talent players making up a “big three” and build a champion-potential team this season. Before you pay the sky-high bills for the “big three”, you can evaluate the effect of the talent introduction. New members‘ records are good reference, but not the whole thing.
+
<div class="jobs_item" style="display: none;"><p class="bodytext"></p><p align="justify">We would like to start with a simple example. A cell operates like a basketball team. Imagine you are a manager of a team who wants to bring in some talent players making up a “big three” and build a champion-potential team this season. Before you pay the sky-high bills for the “big three”, you can evaluate the effect of the talent introduction. New members‘ records are good reference, but not the whole thing.</br>
-
There are various factors influencing the effect of introduction. Let’s carefully choose one of the most profound aspects and focus on the relationship of the members. In the original team, you are familiar with all players’ characteristics, their roles in the team and the coach’s style, i.e. you have the information of the original player interaction network.  
+
There are various factors influencing the effect of introduction. Let’s carefully choose one of the most profound aspects and focus on the relationship of the members. In the original team, you are familiar with all players’ characteristics, their roles in the team and the coach’s style, i.e. you have the information of the original player interaction network. </br>
-
You know Alex is a good shooter and Bob is a strong centre forward. Carl is your target player. Carl is famous for his shooting skills and appears dominant in the court. In other words, Carl shows more obvious similarity with Alex but a low level of similarity with Bob. Then, in the new player interaction network, Carl might play a role 80% like Alex and 20% like Bob. He is similar with Alex and Bob, however different. That’s an analysis and prediction at the global point of view.
+
You know Alex is a good shooter and Bob is a strong centre forward. Carl is your target player. Carl is famous for his shooting skills and appears dominant in the court. In other words, Carl shows more obvious similarity with Alex but a low level of similarity with Bob. Then, in the new player interaction network, Carl might play a role 80% like Alex and 20% like Bob. He is similar with Alex and Bob, however different. That’s an analysis and prediction at the global point of view.</br>
-
[ Pic.1 Alex, Bob and new member Carl]
+
[ Pic.1 Alex, Bob and new member Carl]</br>
-
Just like the basketball team example, researchers often need to insert exogenous genes(new players) into a cell(original team) to achieve a specific goal(win the champion). In the past, the behaviors of exogenous genes are mainly speculated by wet lab experiments. Now we are trying to give an answer before wearing laboratory gloves.  
+
Just like the basketball team example, researchers often need to insert exogenous genes(new players) into a cell(original team) to achieve a specific goal(win the champion). In the past, the behaviors of exogenous genes are mainly speculated by wet lab experiments. Now we are trying to give an answer before wearing laboratory gloves. </br>
</p>
</p>
                 </div>
                 </div>
Line 146: Line 146:
<div class="jobs_trigger"><strong>Models</strong></div>
<div class="jobs_trigger"><strong>Models</strong></div>
<div class="jobs_item" style="display: none;"><p align="justify">
<div class="jobs_item" style="display: none;"><p align="justify">
-
Regulatory Model
+
Regulatory Model</br>
-
Regulation of gene expression includes 4 levels:  
+
Regulation of gene expression includes 4 levels: </br>
-
Level of DNA rearrangement.
+
Level of DNA rearrangement.</br>
-
Level of transcriptional regulation.
+
Level of transcriptional regulation.</br>
-
Level of translation.
+
Level of translation.</br>
-
Level of post-translation
+
Level of post-translation</br>
-
This year we focus on the level of transcriptional regulation both for the importance of the level [FIXME: do i need to describe why it is important?] and model simplification. By carefully examining the lac operon system, which is widely considered as the first discovery of the gene regulation system, we constructed our regulation model with functional units called “Regulation Unit” [FXIME: regulation or regulatory?]. A regulation unit consists of two segments. The first one is a promoter sequence and the second one is a protein coding sequence.
+
This year we focus on the level of transcriptional regulation both for the importance of the level [FIXME: do i need to describe why it is important?] and model simplification. By carefully examining the lac operon system, which is widely considered as the first discovery of the gene regulation system, we constructed our regulation model with functional units called “Regulation Unit” [FXIME: regulation or regulatory?]. A regulation unit consists of two segments. The first one is a promoter sequence and the second one is a protein coding sequence.</br>
-
[Pic. 3 Promoter Sequence, Protein Coding Sequence and Regulation Unit]
+
[Pic. 3 Promoter Sequence, Protein Coding Sequence and Regulation Unit]</br>
-
In our model, the promoter segment is regarded as the main regulated region. A transcription factor is a protein that binds to specific DNA sequences, thereby controlling the flow of genetic information from DNA to mRNA. The binding sites are promoter regions of DNA adjacent to the genes that they regulate. At first, according to the lac operon system, regulation units as a regulatory target with the same promoter are supposed to have same behavior. But we found it is insufficient because there are units with the same promoter showing different properties. Then we took the genes regulated by transcription factors into consideration. The different properties of two units are first owing to their promoters. If they have the same promoter, their protein coding sequences are supposed to make the difference. By taking this method, it turns out that this model works better.
+
In our model, the promoter segment is regarded as the main regulated region. A transcription factor is a protein that binds to specific DNA sequences, thereby controlling the flow of genetic information from DNA to mRNA. The binding sites are promoter regions of DNA adjacent to the genes that they regulate. At first, according to the lac operon system, regulation units as a regulatory target with the same promoter are supposed to have same behavior. But we found it is insufficient because there are units with the same promoter showing different properties. Then we took the genes regulated by transcription factors into consideration. The different properties of two units are first owing to their promoters. If they have the same promoter, their protein coding sequences are supposed to make the difference. By taking this method, it turns out that this model works better.</br>
-
[Pic. 4 Regulated Region]
+
[Pic. 4 Regulated Region]</br>
A unit in the network regulates another through the transcription factor. That is, the product of the protein coding sequence of the unit is a transcription factor and the transcription factor regulates the promoter of the another unit.  
A unit in the network regulates another through the transcription factor. That is, the product of the protein coding sequence of the unit is a transcription factor and the transcription factor regulates the promoter of the another unit.  
Line 167: Line 167:
                  
                  
             <div class="jobs_trigger"> <strong>Prediction Model</strong></div>
             <div class="jobs_trigger"> <strong>Prediction Model</strong></div>
-
        <div class="jobs_item" style="display: none;"><p align="justify">The basic idea behind the prediction model is deceptively simple: the more similar two sequences are, the more likely they have similar behaviors. In fact, it is extremely difficult to predict an exogenous gene’s behavior because of the complexity of the problem, random noise of the system and the coupling of biosystems.  
+
        <div class="jobs_item" style="display: none;"><p align="justify">The basic idea behind the prediction model is deceptively simple: the more similar two sequences are, the more likely they have similar behaviors. In fact, it is extremely difficult to predict an exogenous gene’s behavior because of the complexity of the problem, random noise of the system and the coupling of biosystems. </br>
-
Advanced alignment algorithm is selected to reduce the complexity. Sequences which contain all the information of the species are the entity of the gene regulatory network. Sequence similarity is an essential concept to the prediction model. The selected alignment algorithm can significantly reduce the complexity of the problem and makes it possible to give a reliable prediction from a global point of view.
+
Advanced alignment algorithm is selected to reduce the complexity. Sequences which contain all the information of the species are the entity of the gene regulatory network. Sequence similarity is an essential concept to the prediction model. The selected alignment algorithm can significantly reduce the complexity of the problem and makes it possible to give a reliable prediction from a global point of view.</br>
We designed a random method to filter the noise in sequence alignment. There are no totally different sequences. Even the similarity of any two random sequences is not zero. Filtered results are more significant and reliable to the following steps.
We designed a random method to filter the noise in sequence alignment. There are no totally different sequences. Even the similarity of any two random sequences is not zero. Filtered results are more significant and reliable to the following steps.
-
 
+
</br>
-
Coupling of biosystem is also simulated at some level. When predicting exogenous gene’s behavior, all the units in the original gene regulatory network are taken into consideration.  
+
Coupling of biosystem is also simulated at some level. When predicting exogenous gene’s behavior, all the units in the original gene regulatory network are taken into consideration. </br>
-
Given that the exogenous gene may have never been inserted into E. coli before, all possible reactions in gene regulatory network are reserved to be filtered.  
+
Given that the exogenous gene may have never been inserted into E. coli before, all possible reactions in gene regulatory network are reserved to be filtered. </br>
Using the innovated methods above, we are trying to challenge the difficulties and obtain a global perspective of the relationship between the exogenous gene and the original gene regulatory network.
Using the innovated methods above, we are trying to challenge the difficulties and obtain a global perspective of the relationship between the exogenous gene and the original gene regulatory network.
Line 185: Line 185:
             <div class="jobs_trigger"> <strong>Sequence similarity</strong></div>
             <div class="jobs_trigger"> <strong>Sequence similarity</strong></div>
        <div class="jobs_item" style="display: none;"><p align="justify">The sequence similarity is obtained by sequence alignment based on Needleman-Wunsch algorithm[FIXME: wiki link here]. The Needleman-Wunsch algorithm performs a global alignment on two protein sequences or nucleotide sequences. It was the first application of dynamic programming to biological sequence comparison.
        <div class="jobs_item" style="display: none;"><p align="justify">The sequence similarity is obtained by sequence alignment based on Needleman-Wunsch algorithm[FIXME: wiki link here]. The Needleman-Wunsch algorithm performs a global alignment on two protein sequences or nucleotide sequences. It was the first application of dynamic programming to biological sequence comparison.
 +
</br>
 +
When dynamic programming is applicable, the method takes far less time than naive methods. Using a naive method, many of the subproblems are generated and sovled many times. The dynamic programming approach seeks to solve each subproblem only once. Once the solution to a given subproblem has been computed, it is stored to be looked up next time.</br>
-
When dynamic programming is applicable, the method takes far less time than naive methods. Using a naive method, many of the subproblems are generated and sovled many times. The dynamic programming approach seeks to solve each subproblem only once. Once the solution to a given subproblem has been computed, it is stored to be looked up next time.
+
[Pic. 5 Dynamic programming and naive method]</br>
-
[Pic. 5 Dynamic programming and naive method]
+
Like the Needleman-Wunsch algorithm, of which it is a variation, Smith-Waterman is also a dynamic programming algorithm. But it is a local sequence alignment algorithm. The famous BLAST(Basic Local Alignment Search Tool) is improved from Smith-Waterman algorithm. Although local algorithm has the desirable property that it is guaranteed to find the optimal local alignment, we decided to choose the global one because we regarded the segment sequence as a unit.</br>
-
Like the Needleman-Wunsch algorithm, of which it is a variation, Smith-Waterman is also a dynamic programming algorithm. But it is a local sequence alignment algorithm. The famous BLAST(Basic Local Alignment Search Tool) is improved from Smith-Waterman algorithm. Although local algorithm has the desirable property that it is guaranteed to find the optimal local alignment, we decided to choose the global one because we regarded the segment sequence as a unit.
+
Sequences are aligned with different detailed methods in different situations. In the regulated side, what we care about is the DNA sequence. In the regulating side, it is the amino acid sequence. When it comes to predict the regulated behavior, we use a DNA substitution matrix to align promoter and protein coding sequences. In the prediction of regulating behavior, the substitution matrix BLOSUM_50 is used to align the amino acid sequences translated from protein coding sequences.</br>
-
 
+
-
Sequences are aligned with different detailed methods in different situations. In the regulated side, what we care about is the DNA sequence. In the regulating side, it is the amino acid sequence. When it comes to predict the regulated behavior, we use a DNA substitution matrix to align promoter and protein coding sequences. In the prediction of regulating behavior, the substitution matrix BLOSUM_50 is used to align the amino acid sequences translated from protein coding sequences.
+
The promoter similarities of the query unit and subject units are stored in a vector. The protein coding similarities are stored in another vector. These vectors are prepared to be used in the new network construction.
The promoter similarities of the query unit and subject units are stored in a vector. The protein coding similarities are stored in another vector. These vectors are prepared to be used in the new network construction.
Line 215: Line 215:
<div class="jobs_trigger"><strong>Construct A New Regulated Vector</strong></div>
<div class="jobs_trigger"><strong>Construct A New Regulated Vector</strong></div>
<div class="jobs_item" style="display: none;"><p align="justify">
<div class="jobs_item" style="display: none;"><p align="justify">
-
If there are units in the original network having the same promoter as the exogenous one, the first step is to pick them out. Positive and negative regulations of these units are counted separately and distinguished into “positive group” and “negative group”. Then compare the exogenous one with these units. The similarities have already been calculated and stored in the corresponding positions in the similarity vector. The similarity mentioned here is the similarity of protein coding sequences as explained in the model part. The next step is to calculate the average similarity of each group. The exogenous unit is supposed to have the larger one’s direction(positive or negative). The weighted average regulation value of the chosen group whose weight is the sequence similarity is the new element’s value. It means regulatory intensity.
+
If there are units in the original network having the same promoter as the exogenous one, the first step is to pick them out. Positive and negative regulations of these units are counted separately and distinguished into “positive group” and “negative group”. Then compare the exogenous one with these units. The similarities have already been calculated and stored in the corresponding positions in the similarity vector. The similarity mentioned here is the similarity of protein coding sequences as explained in the model part. The next step is to calculate the average similarity of each group. The exogenous unit is supposed to have the larger one’s direction(positive or negative). The weighted average regulation value of the chosen group whose weight is the sequence similarity is the new element’s value. It means regulatory intensity.</br>
If there is no unit having the same promoter as the exogenous one, given that the promoter is the main regulatory region, the promoter similarity is used as the weight. And the weighted average of the regulation of the whole column is the new element’s value
If there is no unit having the same promoter as the exogenous one, given that the promoter is the main regulatory region, the promoter similarity is used as the weight. And the weighted average of the regulation of the whole column is the new element’s value
Line 231: Line 231:
<div class="jobs_trigger"> <strong>A Supplementary Game: Test of The Model</strong></div>
<div class="jobs_trigger"> <strong>A Supplementary Game: Test of The Model</strong></div>
<div class="jobs_item" style="display: block;"><p align="justify">                     
<div class="jobs_item" style="display: block;"><p align="justify">                     
-
The behavior similarity of two units can be described by the dot product of two regulated vectors or two regulating vectors. A more intuitive way is using the vectorial angle to measured the similarity of two behaviors. But there are some zero vectors in the gene regulatory network which usually means the units either play the role of target or the regulator.
+
The behavior similarity of two units can be described by the dot product of two regulated vectors or two regulating vectors. A more intuitive way is using the vectorial angle to measured the similarity of two behaviors. But there are some zero vectors in the gene regulatory network which usually means the units either play the role of target or the regulator.</br>
-
[Pic. 4 GRN matrix, target vector, regulator vector and their dot product]
+
[Pic. 4 GRN matrix, target vector, regulator vector and their dot product]</br>
We have tested the hypothesis by analyzing all 1748 regulation units of Escherichia coli, K-12, recorded in RegulonDB[FIXME: website link here]. By pairwise comparison of all these units, about 1.6 million sets of data was obtained. Each set of data consists of promoter sequence similarity, protein coding sequence similarity and behavior similarity. We hope to find some structure in the data that supports our hypothesis. And it is lucky enough to find there is a tendency showing the relationship between sequence similarity and behavior similarity(Pic. 2).
We have tested the hypothesis by analyzing all 1748 regulation units of Escherichia coli, K-12, recorded in RegulonDB[FIXME: website link here]. By pairwise comparison of all these units, about 1.6 million sets of data was obtained. Each set of data consists of promoter sequence similarity, protein coding sequence similarity and behavior similarity. We hope to find some structure in the data that supports our hypothesis. And it is lucky enough to find there is a tendency showing the relationship between sequence similarity and behavior similarity(Pic. 2).
-
 
+
</br>
-
[Pic. 2 Sequence similarity and behavior similarity]
+
[Pic. 2 Sequence similarity and behavior similarity]</br>
Sequence similarity is set as x axis and behavior similarity is set as y axis. Obviously sequence similarity is continuous-valued (from 0 to 1) and behavior similarity is discrete-valued. Values of behavior similarity determined by the dimension(N) of the vector are between -N and N. According to the result, promoter sequence similarity mainly distributes[FIXME] from 0.4 to 0.6, protein coding sequence similarity mainly distributes from 0 to 0.7 and behavior similarity mainly distributes from -3 to 5. As it is shown in Picture 4, high behavior similarity is partial to high sequence similarity. Peak value of behavior similarity, 17, appears where sequence similarity is 0.537. When behavior similarity value is fixed, for example, set behavior similarity as 8, it is obvious that the higher the sequence similarity is, the more intensive the dots are.
Sequence similarity is set as x axis and behavior similarity is set as y axis. Obviously sequence similarity is continuous-valued (from 0 to 1) and behavior similarity is discrete-valued. Values of behavior similarity determined by the dimension(N) of the vector are between -N and N. According to the result, promoter sequence similarity mainly distributes[FIXME] from 0.4 to 0.6, protein coding sequence similarity mainly distributes from 0 to 0.7 and behavior similarity mainly distributes from -3 to 5. As it is shown in Picture 4, high behavior similarity is partial to high sequence similarity. Peak value of behavior similarity, 17, appears where sequence similarity is 0.537. When behavior similarity value is fixed, for example, set behavior similarity as 8, it is obvious that the higher the sequence similarity is, the more intensive the dots are.
Line 287: Line 287:
                  
                  
             <div class="jobs_trigger"> <strong>Find Changes From Original Stable Condition to New Condition</strong></div>
             <div class="jobs_trigger"> <strong>Find Changes From Original Stable Condition to New Condition</strong></div>
-
        <div class="jobs_item" style="display: none;"><p align="justify">Record the original stable condition, set new material density to 0 and this is the new initial density vector. Solve new equations and record density vectors before the new condition is stable and store these data in a text file.
+
        <div class="jobs_item" style="display: none;"><p align="justify">Record the original stable condition, set new material density to 0 and this is the new initial density vector. Solve new equations and record density vectors before the new condition is stable and store these data in a text file.</br>
To evaluate the new network, we introduce the grading system.
To evaluate the new network, we introduce the grading system.
Line 295: Line 295:
<br/>
<br/>
-
"xi" and "Xi" are densities of material i, which is not the new material."ny" is the number of materials. The more new densities are close to the original, the less the influence the cell endues. In general, cells close to the original cell are more likely to survive than those who are far different from the original cell. That is the thought of the grading system.
+
"xi" and "Xi" are densities of material i, which is not the new material."ny" is the number of materials. The more new densities are close to the original, the less the influence the cell endues. In general, cells close to the original cell are more likely to survive than those who are far different from the original cell. That is the thought of the grading system.</br>
We did a lot of running and found that the “AbsValue” ranges from 0 to 370, so "ScoreA" ranges from 0 to 4.9.We get the integer part and store it in an array, which has five sections. Generate 100 or 200 matrix M from matrix R and run the original and new network for each M, so we can get 100 or 200 of "ScoreA"s. The section which has maximum "ScoreA"s is the eventual score.
We did a lot of running and found that the “AbsValue” ranges from 0 to 370, so "ScoreA" ranges from 0 to 4.9.We get the integer part and store it in an array, which has five sections. Generate 100 or 200 matrix M from matrix R and run the original and new network for each M, so we can get 100 or 200 of "ScoreA"s. The section which has maximum "ScoreA"s is the eventual score.
</p>           
</p>           
Line 324: Line 324:
        <div class="jobs_trigger"><strong>Input Target</strong></div>
        <div class="jobs_trigger"><strong>Input Target</strong></div>
-
<div class="jobs_item" style="display: none;"><p class="bodytext"></p><p align="justify">Before prediction, the expression of specific genes which the experimenter needs should be input into our software as well as the improvement or depression. The number of target gene is SEVEN at most.
+
<div class="jobs_item" style="display: none;"><p class="bodytext"></p><p align="justify">Before prediction, the expression of specific genes which the experimenter needs should be input into our software as well as the improvement or depression. The number of target gene is SEVEN at most.</br>
It is a must that figuring out the strongest and weakest expression strength before inputting the extreme cases into the target expression. The way to find out the strongest and weakest expression is modeling the GRN’s steady state by a large amount of random regulation from -1 and 1. On the other hand, the expression of genes unpicked by the users should be stable as much as possible. The initial strength of expression is calculated by modeling the original GRN with Hill’s equation.
It is a must that figuring out the strongest and weakest expression strength before inputting the extreme cases into the target expression. The way to find out the strongest and weakest expression is modeling the GRN’s steady state by a large amount of random regulation from -1 and 1. On the other hand, the expression of genes unpicked by the users should be stable as much as possible. The initial strength of expression is calculated by modeling the original GRN with Hill’s equation.
</p>
</p>
Line 331: Line 331:
<div class="jobs_trigger"><strong>Particle Swarm Optimization</strong></div>
<div class="jobs_trigger"><strong>Particle Swarm Optimization</strong></div>
<div class="jobs_item" style="display: none;"><p align="justify">
<div class="jobs_item" style="display: none;"><p align="justify">
-
For getting the best regulation, our software uses PSO algorithm based on 30 particles to simulate the GRN’s changing. First of all, the interactions of regulator and regulated-by have been put into those particles in random so that each particle will have the whole set of regulation. Secondly, the variance between target expressions and stable expression of new GRN have been regarded as the optimize requirements in PSO algorithm. As a result, the minimal variance of 30 particles is the global optimum and the minimal variance of the procession in one particle is the local optimum. Then, taking a step towards global and local optimum as well as considering the inertia and perturbation avoids falling into the sub-optimal condition.
+
For getting the best regulation, our software uses PSO algorithm based on 30 particles to simulate the GRN’s changing. First of all, the interactions of regulator and regulated-by have been put into those particles in random so that each particle will have the whole set of regulation. Secondly, the variance between target expressions and stable expression of new GRN have been regarded as the optimize requirements in PSO algorithm. As a result, the minimal variance of 30 particles is the global optimum and the minimal variance of the procession in one particle is the local optimum. Then, taking a step towards global and local optimum as well as considering the inertia and perturbation avoids falling into the sub-optimal condition.</br>
-
At last, when the variance of expression reaches an acceptable range, our software picks out and saves the best global optimum particle following by the movement of those particles stop.
+
At last, when the variance of expression reaches an acceptable range, our software picks out and saves the best global optimum particle following by the movement of those particles stop.</br>
We constantly revises the factors in PSO algorithm by machine learning method for accurate simulation with a fast PSO particle-motion equation. At the same time, our software also filter the result of regulatory value which is more intuitive.
We constantly revises the factors in PSO algorithm by machine learning method for accurate simulation with a fast PSO particle-motion equation. At the same time, our software also filter the result of regulatory value which is more intuitive.
</p>
</p>
Line 340: Line 340:
                  
                  
             <div class="jobs_trigger"> <strong>Filter</strong></div>
             <div class="jobs_trigger"> <strong>Filter</strong></div>
-
        <div class="jobs_item" style="display: none;"><p align="justify">To improve the efficiency of choosing a suitable gene after getting a series of regulatory value, our software picks out some obvious regulation. The value of regulation is between -1 to 1 in which -1 means negative effect and 1 means positive effect. As a result, what our software has done is filtering out the absolute value which is lower than 0.9. Because the difference of regulatory intensity lower than 0.1 has very little effect to the stable expression, the final result of regulation is indicated by Boolean value.
+
        <div class="jobs_item" style="display: none;"><p align="justify">To improve the efficiency of choosing a suitable gene after getting a series of regulatory value, our software picks out some obvious regulation. The value of regulation is between -1 to 1 in which -1 means negative effect and 1 means positive effect. As a result, what our software has done is filtering out the absolute value which is lower than 0.9. Because the difference of regulatory intensity lower than 0.1 has very little effect to the stable expression, the final result of regulation is indicated by Boolean value.</br>
The format of regulatory prediction in “Result”:
The format of regulatory prediction in “Result”:
Gene_name->Gene_name    regulation(+/-)
Gene_name->Gene_name    regulation(+/-)

Revision as of 06:22, 25 September 2013

Slide

Take a gNAP before wearing your gloves! Genetic Network Analyze and Predict
The sketch and final GUI of gNAP!
We compare the result of our software with gene expression profile in literature.
We are USTC-Software!

Methodologies

Methodologies

To simulate and analyze a genetic regulatory network (GRN), we need to build an objects’ array to store the complete information of each gene. It contains regulation relationships between genes, sequences of genes, sequences of promoters and so on. However, it’s hard to find an appropriate database online containing all information we need in a simple file. RegulonDB has downloadable files about the regulation between transcription factors (TF) and genes. Files about genetic information, transcription unit information and promoter information can also be downloaded from the RegulonDB. All those files have been put into file “source data” in the root directory of our software. They contain all information the simulation needs and we use fetching module to achieve data extraction and integration. There are four steps: fetch regulation relationships, fetch gene information, fetch promoter information and integrate information above.

Fetch Database

Fetching Regulation
Fetching Gene Info
Fetching Promoter Info
Integration

Our software integrates all information we picked out about genes and generates a file named “all_info” —— all information about genes —— for the output graphical interface’s reading. In the meanwhile, the array of objects containing all information has been stored in computer memory which greatly improve the computing speed of our software. The format of all_info database:
No. promoter_sequence gene_sequence gene_name ID left_position right_position promoter_name description
The fetching module generates three files: old_GRN, all_info and uncertain_database.

Alignment Analyze

An example
Models
Prediction Model
Mathematical Description of The Network
Sequence similarity

New Network Construction

Filter
Construct A New Regulated Vector
Construct A New Regulating Vector
A Supplementary Game: Test of The Model

The behavior similarity of two units can be described by the dot product of two regulated vectors or two regulating vectors. A more intuitive way is using the vectorial angle to measured the similarity of two behaviors. But there are some zero vectors in the gene regulatory network which usually means the units either play the role of target or the regulator.
[Pic. 4 GRN matrix, target vector, regulator vector and their dot product]
We have tested the hypothesis by analyzing all 1748 regulation units of Escherichia coli, K-12, recorded in RegulonDB[FIXME: website link here]. By pairwise comparison of all these units, about 1.6 million sets of data was obtained. Each set of data consists of promoter sequence similarity, protein coding sequence similarity and behavior similarity. We hope to find some structure in the data that supports our hypothesis. And it is lucky enough to find there is a tendency showing the relationship between sequence similarity and behavior similarity(Pic. 2).
[Pic. 2 Sequence similarity and behavior similarity]
Sequence similarity is set as x axis and behavior similarity is set as y axis. Obviously sequence similarity is continuous-valued (from 0 to 1) and behavior similarity is discrete-valued. Values of behavior similarity determined by the dimension(N) of the vector are between -N and N. According to the result, promoter sequence similarity mainly distributes[FIXME] from 0.4 to 0.6, protein coding sequence similarity mainly distributes from 0 to 0.7 and behavior similarity mainly distributes from -3 to 5. As it is shown in Picture 4, high behavior similarity is partial to high sequence similarity. Peak value of behavior similarity, 17, appears where sequence similarity is 0.537. When behavior similarity value is fixed, for example, set behavior similarity as 8, it is obvious that the higher the sequence similarity is, the more intensive the dots are.

Network Model

Network Model Abstract

Network analysis includes finding stable condition of network, adding new gene, finding new stable condition and changes from original condition to new condition. We use densities of materials to describe network condition. If all material densities are time-invariant, we can say the network condition is stable.

Hill Equations
Find Stable Network Condition
Find Changes From Original Stable Condition to New Condition

Predict

Predict Abstract

In some cases, importing exogenous gene is for enhancing or suppressing the expression of some specific genes in engineered bacteria itself. But it is hard to choose an appropriate regulatory gene. Our software analyzes the GRN forward as well as simulates by optimization algorithm backward for giving a reference of choosing to the users. Our software not only focused on the direct regulation but also focused on the global GRN. In the same time, controlling the expression of multiple genes in network has been realized by global prediction. What’s more, Particle Swarm Optimization (PSO) Algorithm makes it possible.

Input Target
Particle Swarm Optimization
Filter