|
|
Line 157: |
Line 157: |
| </p> | | </p> |
| | | |
- | <div align="center"><img src="../../method/Figure 1.png" /> | + | <div align="center"><img src="https://static.igem.org/mediawiki/igem.org/7/7d/USTC_Software_Figure_1.png" /> |
| <p align="center"><strong>Figure 1.</strong> Structure of Operon</p></div> | | <p align="center"><strong>Figure 1.</strong> Structure of Operon</p></div> |
| <p align="justify">An operon is made up of several structural genes arranged under a common promoter and | | <p align="justify">An operon is made up of several structural genes arranged under a common promoter and |
Line 172: |
Line 172: |
| site other than the operator, to stimulate transcription. | | site other than the operator, to stimulate transcription. |
| </p> | | </p> |
- | <div align="center"><img style="width:600px;" src="../../method/Figure 2.png"/> | + | <div align="center"><img style="width:600px;" src="https://static.igem.org/mediawiki/igem.org/2/25/USTC_Software_Figure_2.png"/> |
| <p align="justify"><strong>Figure 2.</strong> Regulation of Operon | | <p align="justify"><strong>Figure 2.</strong> Regulation of Operon |
| 1: RNA Polymerase, 2: Repressor, 3: Promoter, 4: Operator, 5: Lactose, 6: lacZ, 7: | | 1: RNA Polymerase, 2: Repressor, 3: Promoter, 4: Operator, 5: Lactose, 6: lacZ, 7: |
Line 188: |
Line 188: |
| <div class="jobs_trigger"><strong>Regulatory Model</strong></div> | | <div class="jobs_trigger"><strong>Regulatory Model</strong></div> |
| <div class="jobs_item" style="display: none;"><p align="justify">Regulation of gene expression includes four levels. We choose the transcriptional level to simulate the regulation both for its significance and model simplification.</p> | | <div class="jobs_item" style="display: none;"><p align="justify">Regulation of gene expression includes four levels. We choose the transcriptional level to simulate the regulation both for its significance and model simplification.</p> |
- | <div align="center"><img style="width:600px; height:auto;"src="../../method/Figure 3.png" /> | + | <div align="center"><img style="width:600px; height:auto;"src="https://static.igem.org/mediawiki/igem.org/8/87/USTC_Software_Figure_3.png" /> |
| <p><strong>Figure 3.</strong>Regulation of gene expression.<br />Our regulation model is built based on the operon theory.<br /> The promoter region is regarded as the main regulatory region.</p></div> | | <p><strong>Figure 3.</strong>Regulation of gene expression.<br />Our regulation model is built based on the operon theory.<br /> The promoter region is regarded as the main regulatory region.</p></div> |
| </div> | | </div> |
Line 241: |
Line 241: |
| that some similarities are out of statistic significance.</p> | | that some similarities are out of statistic significance.</p> |
| <div align="center"> | | <div align="center"> |
- | <img src="../../method/Figure 4.png" /> | + | <img src="https://static.igem.org/mediawiki/igem.org/8/89/USTC_Software_Figure_4.png" /> |
| <p><strong>Figure 4.</strong> Random similarity distribution</p></div> | | <p><strong>Figure 4.</strong> Random similarity distribution</p></div> |
| | | |
Line 269: |
Line 269: |
| <div class="jobs_item" style="display: block;"><p align="justify">If there is a three-unit network and they interact with each other as it is shown in the figure. | | <div class="jobs_item" style="display: block;"><p align="justify">If there is a three-unit network and they interact with each other as it is shown in the figure. |
| The regulation is described by the GRN matrix.</p> | | The regulation is described by the GRN matrix.</p> |
- | <div align="center"><img src="../../method/3.png" /> | + | <div align="center"><img src="https://static.igem.org/mediawiki/igem.org/8/8a/USTC_Software_Figure_5.png" /> |
| <p style="font-size:18px;"><strong>Figure 5.</strong> Example network and its GRN matrix.</p></div> | | <p style="font-size:18px;"><strong>Figure 5.</strong> Example network and its GRN matrix.</p></div> |
| | | |
Line 280: |
Line 280: |
| <p> | | <p> |
| The construction is equivalent to add a new column and a row into the original matrix.</p> | | The construction is equivalent to add a new column and a row into the original matrix.</p> |
- | <div align="center"><img src="../../method/4.png" /> | + | <div align="center"><img src="https://static.igem.org/mediawiki/igem.org/9/97/USTC_Software_Figure_6.png" /> |
| <p><strong>Figure 6.</strong> Mathematical Equivalence</p></div> | | <p><strong>Figure 6.</strong> Mathematical Equivalence</p></div> |
| <p>When filling the column, D is compared with the regulators of the unit in each row. The | | <p>When filling the column, D is compared with the regulators of the unit in each row. The |
Line 299: |
Line 299: |
| similarity is used because it is the main region.</p> | | similarity is used because it is the main region.</p> |
| <div align="center"> | | <div align="center"> |
- | <img src="../../method/5.png" /> | + | <img src="https://static.igem.org/mediawiki/igem.org/c/c5/USTC_Software_Figure_7.png" /> |
| <p><strong>Figure 7.</strong> Construct New GRN</p></div> | | <p><strong>Figure 7.</strong> Construct New GRN</p></div> |
| | | |
Methodologies
In order to simulate the GRN’s working and analyze the changing after exogenous gene imported, some advanced algorithms and classical methods are employed in the software. These algorithms and methods include Binary Tree method, Needle-Wunsch Algorithm, Decision Tree method, Hill Equation and PSO Algorithm.
There are five parts of methodologies: Fetch Database, Alignment Analyze, New Network Construction, Network Model and Predict.
Fetch Database
Fetch Database Abstract
To simulate and analyze a genetic regulatory network (GRN), we need to build an objects' array to store the complete information of each gene. It contains regulation relationships between genes, sequences of genes, sequences of promoters and so on. However, it's hard to find an appropriate database online containing all information we need in a simple file. RegulonDB has downloadable files about the regulation between transcription factors (TF) and genes. Files about genetic information, transcription unit information and promoter information can also be downloaded from the RegulonDB. All those files have been put into file “source data” in the root directory of our software. They contain all information the simulation needs and we use fetching module to achieve data extraction and integration. There are four steps: fetch regulation relationships, fetch gene information, fetch promoter information and integrate information above.
Fetch Regulation
In GRN, there are two kinds of files: TF to TF and TF to Gene. Since the database about the regulation between TFs and Genes contains only one-way interaction, the matrix of GRN is a rectangle.
First of all, read the regulation relationship of TFs. Our software filters the documentation of RegulonDB on the head of all files and then reads the name of regulate and regulated TF, which is also the name of its genes, one by one. In the same time, our software numerates the genes and stores their names into an objects’ array of genetic data.
The format of regulation database:
TF_name TF_name +/-/+-
The regulation of TFs has been put into a square matrix whose row is the regulator and column is the one regulated by. To make our GRN as complete as possible, the regulation between TF and genes has joined into the matrix. The one-way interaction results that we must read the TF in order to fulfill the regulator before completing the TF to gene’s regulation in the same way of TF to TF.
The format of regulation database:
TF_name Gene_name +/-/+-
At last, a regulatory matrix whose row represents regulate gene (TF) and whose column represents gene regulated by (TF+Gene) has been output into a file called “old_GRN” in root directory. The values in GRN matrix are regulations in which “1” means positive activation, “-1” means repression and “0” means no relationship. There have been some regulations both positive and negative identified regulations are determined by the experimental environment. As a result, our software picks out those uncertain genes and stores them into a file named “uncertain_database”.
The format of uncertain database:
? Gene_name->Gene_name
The question mark represents the unknown regulation between regulator and regulated-by whose names presented afterward. Users could replace the question mark with the data known in past experiment. (“+” rep positive, “-” rep negative). Our software will replace the values in matrix automatically. But if not rewrote, our software will regard those regulation as unknown.
Fetch Gene Info
All gene information has been deposited into a file named gene_info which could be downloaded here. In order of picking out the genes in GRN as fast as possible, all genetic information are stored in a “map”. “Map” is just like a dictionary yet its words are names of genes and its descriptions of words are replaced by genetic information. By using binary tree method, it is very fast to search the “word” wanted in the “dictionary”. As tested, the speed of binary tree method built-in “map” function is 720 times faster than traversal method.
The format of Gene Info database:
ID_assigned_by_RegulonDB Gene_name Left_end_position Right_end_position DNA_strand Product_type Product_name Start_codon_sequence Stop_codon_sequence Gene_sequence
The label of the map vector is gene name which will be picked out based on the names read in regulation matrix before. It is really fast using the binary tree method to find the specific genetic information and store them into a specific object. Those information includes gene ID, left position, right position, gene description and gene sequence. The gene ID is used to link to RegulonDB’s gene details; The left position is used to find its specific transcription unit; The right position is used to figure out the base amount; The description of genes is used to distinguish the RNA and protein; The sequence is used to predict the regulation by alignment.
Fetch Promoter Info
All promoter information has been deposited into a file named promoter_info which could be downloaded here. But we also need transcription unit information because the information files about promoter do not contain all genes’ names backward. “TU Info” file, which can be downloaded here, contains the starting position of each TU and its promoter name. Our software picks out the starting position into a integer array. Using the left position picked out in gene info, our software would find out which unit the gene belongs to through dichotomy method and then stores the name of promoter into corresponding object.
The format of TU info database:
Operon_name Unit_name promoter_name Transcription_start_site ......
The principle of fetching information of promoters is same as fetching genes’s. Our software stores the promoter information from the file named “promoter_info” in a “map” which could be used to pick out the promoter sequence by searching promoter name through binary tree method.
The format of Promoter Info database:
Promoter_ID_assigned_by_RegulonDB Promoter_name
The sequence of promoter will be used in the alignment method in next module which could make a prediction of exogenous genes’ regulation pattern.
Integration
Our software integrates all information we picked out about genes and generates a file named “all_info” —— all information about genes —— for the output graphical interface’s reading. In the meanwhile, the array of objects containing all information has been stored in computer memory which greatly improve the computing speed of our software.
The format of all_info database:
No. promoter_sequence gene_sequence gene_name ID left_position right_position promoter_name description
The fetching module generates three files: old_GRN, all_info and uncertain_database.
Operon Theory and Regulatory Model
Operon Theory
In genetics, an operon is a functioning unit of genomic DNA containing a cluster of genes
under the control of a single regulatory signal or promoter.
The genes contained in the
operon are either expressed together or not at all.
Several genes must be both cotranscribed
and co-regulated to define an operon.
The first time “operon” was proposed is in a paper of French Academic Science, 1960.
The lac operon of the model bacterium E. coli was discovered and provides a typical
example of operon function. It consists a promoter, an operator, three structural genes and
a terminator. The operon is regulated by several factors including the availability of glucose
and lactose.
From this paper, the so-called general theory of the operon was developed. According to
the theory, all genes are controlled by means of operons through a single feedback
regulatory mechanism-repression. The first operon to be described was the lac operon in
E. coli. The 1965 Nobel Prize in Physiology and Medicine was awarded to François Jacob,
André Michel Lwoff and Jacques Lucien Monod for their discoveries concerning the operon and virus synthesis.
Figure 1. Structure of Operon
An operon is made up of several structural genes arranged under a common promoter and
regulated by a common operator. It is defined as a set of adjacent structural genes, plus
the adjacent regulatory signals that affect transcription of the structural genes. The
regulators of a given operon, including repressors, corepressors and activators, are not
necessarily coded for by that operon.
As a unit of transcription, upstream of the structural genes lies a promoter sequence which
provides a site for RNA polymerase to bind and initiate transcription. Close to the promoter
lies a section of DNA called an operator.
Operon regulation can be either negative or positive by induction or repression. Negative
control involves the binding of a repressor to the operator to prevent transcription.
Operons can also be positively controlled. An activator protein binds to DNA, usually at a
site other than the operator, to stimulate transcription.
Figure 2. Regulation of Operon
1: RNA Polymerase, 2: Repressor, 3: Promoter, 4: Operator, 5: Lactose, 6: lacZ, 7:
lacY, 8: lacA. Top: The gene is essentially turned off. There is no lactose to inhibit the
repressor, so the repressor binds to the operator, which obstructs the RNA polymerase
from binding to the promoter and making lactase.Bottom: The gene is turned on.Lactose
is inhibiting the repressor, allowing the RNA polymerase to bind with the promoter, and
express the genes, which synthesize lactase. Eventually, the lactase will digest all of the
lactose, until there is none to bind to the repressor. The repressor will then bind to the
operator, stopping the manufacture of lactase.
Regulatory Model
Regulation of gene expression includes four levels. We choose the transcriptional level to simulate the regulation both for its significance and model simplification.
Figure 3.Regulation of gene expression.
Our regulation model is built based on the operon theory.
The promoter region is regarded as the main regulatory region.
Similarity and homology
The sequence similarity is obtained by sequence alignment. It is defined as the proportion of the common subsequence in the aligned sequence. Any two sequences share a certain
similarity. It should be noted that similarity and homology are two different concepts.
As with anatomical structures, homology between protein or DNA sequences is defined in
terms of shared ancestry. Two segments of DNA can have shared ancestry because of
either a speciation event or a duplication event. The terms “percent homology” and
“sequence similarity” are often used interchangeably. As with anatomical structures, high
sequence similarity might occur because of convergent evolution, or, as with shorter
sequences, because of chance. Such sequences are similar but not homologous.
Sequence regions that homologous are also called conserved.
In our project, we use similarity to connect the exogenous gene with the original network.
Because there is a good chance that the exogenous gene is not homologous with the
genes in the network.
The GRN matrix is the mathematical description of gene regulatory network in which “1” represents “enhance”, “-1” represents “repress” and “0” represents “no regulatory relationship”. The units(RU) in x-axis regulate the units in y-axis. A row can be seen as a vector containing all the information of the target(corresponding unit in the y-axis). Similarly, a column can be seen as a vector containing all the information of the regulator(corresponding unit in the x-axis).
The sequence similarity is obtained by sequence alignment based on Needleman-Wunsch algorithm[FIXME: wiki link here]. The Needleman-Wunsch algorithm performs a global alignment on two protein sequences or nucleotide sequences. It was the first application of dynamic programming to biological sequence comparison.
When dynamic programming is applicable, the method takes far less time than naive methods. Using a naive method, many of the subproblems are generated and sovled many times. The dynamic programming approach seeks to solve each subproblem only once. Once the solution to a given subproblem has been computed, it is stored to be looked up next time.
[Pic. 5 Dynamic programming and naive method]
Like the Needleman-Wunsch algorithm, of which it is a variation, Smith-Waterman is also a dynamic programming algorithm. But it is a local sequence alignment algorithm. The famous BLAST(Basic Local Alignment Search Tool) is improved from Smith-Waterman algorithm. Although local algorithm has the desirable property that it is guaranteed to find the optimal local alignment, we decided to choose the global one because we regarded the segment sequence as a unit.
Sequences are aligned with different detailed methods in different situations. In the regulated side, what we care about is the DNA sequence. In the regulating side, it is the amino acid sequence. When it comes to predict the regulated behavior, we use a DNA substitution matrix to align promoter and protein coding sequences. In the prediction of regulating behavior, the substitution matrix BLOSUM_50 is used to align the amino acid sequences translated from protein coding sequences.
The promoter similarities of the query unit and subject units are stored in a vector. The protein coding similarities are stored in another vector. These vectors are prepared to be used in the new network construction.
New Network Construction
Random Noise
Normally, the similarity of two sequences will not be zero. Some computational
experiments were carried out to study the random sequence similarities. We randomly
chose a gene in the network and generated 1000 random sequences. The alignment result
indicates that the random sequence similarities are Gauss distributed. The result suggests
that some similarities are out of statistic significance.
Figure 4. Random similarity distribution
Filter
We need the genes highly similar to the exogenous one to interact with it. The program will
align the exogenous gene(query) with genes in the network(subject) and get the original
similarities. In order to filter meaningless low values, a certain amount of random
sequences are generated for each query-subject alignment. Normally, 100 is sufficient.
Because the sequence length will influence alignment result, random sequences are fixed
at the same length as the query one. Then align random sequences with the subject
sequence. The statistic result of these random similarities is used as a threshold.
Threshold = μ + xσ
In the formula, μ is the average random similarity. σ is the standard deviation. x is used to
control the filter determined by machine learning. If the original similarity is lower than the
threshold, it is abandoned. It is usually means the original value is usually short of
statistical significance.
An example about filtring and consistency is presented in “Example”.
Construct new GRN
If there is a three-unit network and they interact with each other as it is shown in the figure.
The regulation is described by the GRN matrix.
Figure 5. Example network and its GRN matrix.
If D is the exogenous unit, we can obtain three similarity data sets of D with the units in the
original GRN:
Promoter sequence similarity
Gene sequence similarity
Amino acid sequence similarity.
The construction is equivalent to add a new column and a row into the original matrix.
Figure 6. Mathematical Equivalence
When filling the column, D is compared with the regulators of the unit in each row. The
regulations in the row are consider separately and marked as “positive group” and
“negative group”. The average similarity of each group represents the distance between
the exogenous unit and the group. D is supposed to have the larger one’s regulatory
direction(positive or negative). The regulatory intensity is the weight average regulation of
the chose group. The weight here is the amino acid sequence similarity.
There are two conditions when fill the new row:
1. There are units having the same promoter as the exogenous unit.
2. There is no units having the same promoter as the exogenous unit.
In condition 1, the units sharing the same promoter with the new member are picked out,
and the following steps are the same as the construction of the column. The difference is
the similarity used here is the gene sequence similarity. As explained in the regulation
model part, the promoter is the main regulatory region, but the following sequence is also
considered. Now the promoter is the same, so what we focus on are the gene sequences.
In condition 2, the process is almost the same as constructing the new column. Promoter
similarity is used because it is the main region.
Figure 7. Construct New GRN
Network Model
Network Model Abstract
Network analysis includes finding stable condition of network, adding new gene, finding new stable condition and changes from original condition to new condition. We use densities of materials to describe network condition. If all material densities are time-invariant, we can say the network condition is stable.
Hill Equations
Regulation relationship in genetic network includes positive regulation, negative regulation, positive-or-negative regulation and no regulation. We store regulation relationship in matrix R. Rji means the unit in line j and row i. For the material of original network, Rji=1 means material i enhance material j, Rji=-1 means material i repress material j, Rji=0 means material i has no influence on material j, Rji=2 means material i enhance or repress material j. For the new material, Rji ranges from -1 to 1. Rji<0 means the possibility of positive regulation is Rji; Rji>0 means the possibility of negative regulation is –Rji; Rji=0 means there is no regulation from i to j.
We use Hill equations to describe intensity of regulation. Equations are like following:
The left side of the equation is the derivative x(density) on t(time).”qi”,”pi”,”ri”,”mi”,”ni” are parameters, which determine the intensity of regulation."ri" is degradation rate. Mji is exponent. M is a matrix whose dimensions are equivalent to R's. Mji is 0 or ranges from 0.5 to 1.2 or ranges from -1.2 to 0.5. For the material of original network, if Rji=1,Mji ranges from 0.5 to 1.2;if Rji=-1, Mji ranges from -1.2 to -0.5; if Rji=2;Mji ranges from -1.2 to 0.5 or 0.5 to 1. These Mjis’ absolute values are given randomly by program. If Rji=0, Mji=0.
For the new material,
Find Stable Network Condition
Stable condition is the condition in which densities are time-invariant. We store material densities in a vector and solve the differential equations with Euler’s formula, which is like below
We know the network will be stable at last, so every material density has a limitation.
Find Changes From Original Stable Condition to New Condition
Record the original stable condition, set new material density to 0 and this is the new initial density vector. Solve new equations and record density vectors before the new condition is stable and store these data in a text file.
To evaluate the new network, we introduce the grading system.
"xi" and "Xi" are densities of material i, which is not the new material."ny" is the number of materials. The more new densities are close to the original, the less the influence the cell endues. In general, cells close to the original cell are more likely to survive than those who are far different from the original cell. That is the thought of the grading system.
We did a lot of running and found that the “AbsValue” ranges from 0 to 370, so "ScoreA" ranges from 0 to 4.9.We get the integer part and store it in an array, which has five sections. Generate 100 or 200 matrix M from matrix R and run the original and new network for each M, so we can get 100 or 200 of "ScoreA"s. The section which has maximum "ScoreA"s is the eventual score.
Predict
Predict Abstract
In some cases, importing exogenous gene is for enhancing or suppressing the expression of some specific genes in engineered bacteria itself. But it is hard to choose an appropriate regulatory gene. Our software analyzes the GRN forward as well as simulates by optimization algorithm backward for giving a reference of choosing to the users. Our software not only focused on the direct regulation but also focused on the global GRN. In the same time, controlling the expression of multiple genes in network has been realized by global prediction. What’s more, Particle Swarm Optimization (PSO) Algorithm makes it possible.
Input Target
Before prediction, the expression of specific genes which the experimenter needs should be input into our software as well as the improvement or depression. The number of target gene is SEVEN at most.
It is a must that figuring out the strongest and weakest expression strength before inputting the extreme cases into the target expression. The way to find out the strongest and weakest expression is modeling the GRN’s steady state by a large amount of random regulation from -1 and 1. On the other hand, the expression of genes unpicked by the users should be stable as much as possible. The initial strength of expression is calculated by modeling the original GRN with Hill’s equation.
Particle Swarm Optimization
For getting the best regulation, our software uses PSO algorithm based on 30 particles to simulate the GRN’s changing. First of all, the interactions of regulator and regulated-by have been put into those particles in random so that each particle will have the whole set of regulation. Secondly, the variance between target expressions and stable expression of new GRN have been regarded as the optimize requirements in PSO algorithm. As a result, the minimal variance of 30 particles is the global optimum and the minimal variance of the procession in one particle is the local optimum. Then, taking a step towards global and local optimum as well as considering the inertia and perturbation avoids falling into the sub-optimal condition.
At last, when the variance of expression reaches an acceptable range, our software picks out and saves the best global optimum particle following by the movement of those particles stop.
We constantly revises the factors in PSO algorithm by machine learning method for accurate simulation with a fast PSO particle-motion equation. At the same time, our software also filter the result of regulatory value which is more intuitive.
Filter
To improve the efficiency of choosing a suitable gene after getting a series of regulatory value, our software picks out some obvious regulation. The value of regulation is between -1 to 1 in which -1 means negative effect and 1 means positive effect. As a result, what our software has done is filtering out the absolute value which is lower than 0.9. Because the difference of regulatory intensity lower than 0.1 has very little effect to the stable expression, the final result of regulation is indicated by Boolean value.
The format of regulatory prediction in “Result”:
Gene_name->Gene_name regulation(+/-)
Database
TF-TF
This file contains the regulation between Transcription Factors.
TF-Gene
This file contains the regulation between Transcription Factors and Genes
Gene Info
This file contains the information about all genes in E-coli K-12
Promoter Info
This file contains the information about all promoters in E-coli K-12
TU Info
This file contains the information about all Transcription Units in E-coli K-12