1 User Input
Some genes' regulation could be get from experiment. So, if users could get the unknow regulation between new gene and old ones, they could manually set the interactions which do not need model. Those regulations will be used in later simulation.
2 Simalarity Analysis
2.1 Sequence
2.1.1 Needleman-Wunsch Algorithm
The Needleman-Wunsch algorithm was first published in1970 by Saul B. Needleman and Christian D. Wunsch. It performs a global alignment of two sequences and is mostly used in bioinformatics to align protein or nucleotide sequence. Our software applied this algorithm in the alignment of DNA and amino acid sequences.
The Needleman-Wunsch algorithm is one kind of dynamic programming and It was the first attempt in biological sequence comparison of dynamic programming.
Here is an example of Needleman-Wunsch algorithm. S(a,b) is the similarity of character a and character b. The scores of characters are shown in the similarity matrix. We assume this matrix was
And we uses linear gap penalty, denoted by d, here, we set the gap penalty as -5.Then the alignment:
A: AGACTAGTTAC
B: CGA - - - GACGT
would have the following score:
S(A,C)+S(G,C)+S(A,A)+(3)+S(G,G)+S(T,A)+S(T,C)+S(A,G)+S(C,T) = -3+7+10-(3x5)+7+(-4)+0+(-1)+0 = 1
To find the highest score of alignment, in this algorithm, a two dimensional matrix F with sequences and scores was allocated. The score in row i, column j is denoted by Fij. There is one column for each character in sequence A and one row for each character in sequence B. Therefore, if we align sequences with sizes of n and m, the amount of memory taken up here is O(n,m).
As the algorithm going on, Fij was calculated to be the optimal score by the principle as following:
Basis:
Fi0 = d*i
F0j = d*j
Recursion:
Fij = max(F(i-1,j-1) + S(Ai,Bj), F(i-1,j) + d, F(i,j-1) + d)
The pseudo-code of this algorithm would look like this:
for i = 0 to length(A)
F(i,0) <-- d*i
for j = 0 to length(B)
F(0,j) <-- d*j
for i = 0 to length(A)
for j = 0 to length(B)
{
Match <-- F(i-1,j-1) + S(Ai,Bj)
Delete <-- F(i-1,j) + d
Insert <-- F(i,j-1) + d
F(i,j) <-- max(Match, Insert, Delete)
}
After the matrix F was computed, Fnm would be the maximum score among all possible alignment.
If you want to see the optimal alignment, you can trace back from Fnm by comparing three possible sources mentioned in the above code (Match, Insert and Delete). If Match, then Aj and Bi are aligned, if Insert, Bi was aligned with a gap and if Delete, then Aj and a gap are aligned. Also, you may find there are not only one optimal alignment.
As for the example, we would get the following matrix by applying Needleman Wunsch algorithm:
And the optimal alignment would be:
- - AGACTAGTTAC
CGAGAC - - GT - - -
2.1.2 A Supplementary Game
The rows and columns in the GRN matrix can be regarded as vectors containing the regulated or the regulating information. The behavior similarity of two units can be described by the dot product of two regulated vectors or two regulating vectors. Biologists usually think the more similar two sequences are, the more likely they have similar behaviors. Whether the ratio of genes with similar behaviors is positively correlated with gene similarity is essential to our project. So we obtained 1.6 million sets of data by pairwise alignment of all the 1748 units in the GRN of K-12. Each set of data consists of gene similarity and behavior similarity. The result is analyzed and plotted in the figure. The linear fit shows that the ratio is positively correlated with the similarity.
Figure 4.Linear fit of ratio-similarity relationship.
Although there are examples that a slight change in DNA sequence will significantly change the property of the gene, for example, sickle-cell disease, the influence is usually determined by the location and scale of the mutation. So the result is still convincing to some degree.
2.2 Filtering
2.2.1 Random Noise
Normally, the similarity of two sequences will not be zero. Some computational
experiments were carried out to study the random sequence similarities. We randomly
chose a gene in the network and generated 1000 random sequences. The alignment result
indicates that the random sequence similarities are Gauss distributed. The result suggests
that some similarities are out of statistic significance.
Figure 5. Random similarity distribution
2.2.2 Filter
We need the genes highly similar to the exogenous one to interact with it. The program will
align the exogenous gene(query) with genes in the network(subject) and get the original
similarities. In order to filter meaningless low values, a certain amount of random
sequences are generated for each query-subject alignment. Normally, 100 is sufficient.
Because the sequence length will influence alignment result, random sequences are fixed
at the same length as the query one. Then align random sequences with the subject
sequence. The statistic result of these random similarities is used as a threshold.
Threshold = μ + xσ
In the formula, μ is the average random similarity. σ is the standard deviation. x is used to
control the filter determined by machine learning. If the original similarity is lower than the
threshold, it is abandoned. It is usually means the original value is usually short of
statistical significance.
An example about filtring and consistency is presented in “Example”.
2.3 Regulation Calculation
If there is a three-unit network and they interact with each other as it is shown in the figure.
The regulation is described by the GRN matrix.
Figure 6. Example network and its GRN matrix.
If D is the exogenous unit, we can obtain three similarity data sets of D with the units in the
original GRN:
Promoter sequence similarity
Gene sequence similarity
Amino acid sequence similarity.
The construction is equivalent to add a new column and a row into the original matrix.
Figure 7. Mathematical Equivalence
When filling the column, D is compared with the regulators of the unit in each row. The
regulations in the row are consider separately and marked as “positive group” and
“negative group”. The average similarity of each group represents the distance between
the exogenous unit and the group. D is supposed to have the larger one's regulatory
direction(positive or negative). The regulatory intensity is the weight average regulation of
the chose group. The weight here is the amino acid sequence similarity.
There are two conditions when fill the new row:
1. There are units having the same promoter as the exogenous unit.
2. There is no units having the same promoter as the exogenous unit.
In condition 1, the units sharing the same promoter with the new member are picked out,
and the following steps are the same as the construction of the column. The difference is
the similarity used here is the gene sequence similarity. As explained in the regulation
model part, the promoter is the main regulatory region, but the following sequence is also
considered. Now the promoter is the same, so what we focus on are the gene sequences.
In condition 2, the process is almost the same as constructing the new column. Promoter
similarity is used because it is the main region.
Figure 8. Construct New GRN
3 Clustering
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics.
For get a better regulation, we use online database DAVID to cluster all the genes in our whole GRN. Avoid of supersoftless, we hope to create an online communication with DAVID. After getting the cluster of our genes, we multiply the genes simalarity with a factor if they are in the same cluster.
Though the source code of this part has already done, we lack the experiment information to set a propriate factor. All source code were pushed up to our github.
Network analysis includes finding stable condition of network, adding new gene, finding new stable condition and changes from original condition to new condition. We use densities of materials to describe network condition. If all material densities are time-invariant, we can say the network condition is stable.
Regulation relationship in genetic network includes positive regulation, negative regulation, positive-or-negative regulation and no regulation. We store regulation relationship in matrix R. Rji means the unit in line j and row i. For the material of original network, Rji=1 means material i enhance material j, Rji=-1 means material i repress material j, Rji=0 means material i has no influence on material j, Rji=2 means material i enhance or repress material j. For the new material, Rji ranges from -1 to 1. Rji<0 means the possibility of positive regulation is Rji; Rji>0 means the possibility of negative regulation is –Rji; Rji=0 means there is no regulation from i to j.
We use Hill equations to describe intensity of regulation. Equations are like following:
The left side of the equation is the derivative x(density) on t(time).”qi”,”pi”,”ri”,”mi”,”ni” are parameters, which determine the intensity of regulation."ri" is degradation rate. Mji is exponent. M is a matrix whose dimensions are equivalent to R's. Mji is 0 or ranges from 0.5 to 1.2 or ranges from -1.2 to 0.5. For the material of original network, if Rji=1,Mji ranges from 0.5 to 1.2;if Rji=-1, Mji ranges from -1.2 to -0.5; if Rji=2;Mji ranges from -1.2 to 0.5 or 0.5 to 1. These Mjis' absolute values are given randomly by program. If Rji=0, Mji=0.
For the new material,
Stable condition is the condition in which densities are time-invariant. We store material densities in a vector and solve the differential equations with Euler's formula, which is like below
We know the network will be stable at last, so every material density has a limitation.