Operon Theory
In genetics, an operon is a functioning unit of genomic DNA containing a cluster of genes
under the control of a single regulatory signal or promoter.
The genes contained in the
operon are either expressed together or not at all.
Several genes must be both cotranscribed
and co-regulated to define an operon.
The first time “operon” was proposed is in a paper of French Academic Science, 1960.
The lac operon of the model bacterium E. coli was discovered and provides a typical
example of operon function. It consists a promoter, an operator, three structural genes and
a terminator. The operon is regulated by several factors including the availability of glucose
and lactose.
From this paper, the so-called general theory of the operon was developed. According to
the theory, all genes are controlled by means of operons through a single feedback
regulatory mechanism-repression. The first operon to be described was the lac operon in
E. coli. The 1965 Nobel Prize in Physiology and Medicine was awarded to François Jacob,
André Michel Lwoff and Jacques Lucien Monod for their discoveries concerning the operon and virus synthesis.
Figure 1. Structure of Operon
An operon is made up of several structural genes arranged under a common promoter and
regulated by a common operator. It is defined as a set of adjacent structural genes, plus
the adjacent regulatory signals that affect transcription of the structural genes. The
regulators of a given operon, including repressors, corepressors and activators, are not
necessarily coded for by that operon.
As a unit of transcription, upstream of the structural genes lies a promoter sequence which
provides a site for RNA polymerase to bind and initiate transcription. Close to the promoter
lies a section of DNA called an operator.
Operon regulation can be either negative or positive by induction or repression. Negative
control involves the binding of a repressor to the operator to prevent transcription.
Operons can also be positively controlled. An activator protein binds to DNA, usually at a
site other than the operator, to stimulate transcription.
Figure 2. Regulation of Operon
1: RNA Polymerase, 2: Repressor, 3: Promoter, 4: Operator, 5: Lactose, 6: lacZ, 7:
lacY, 8: lacA. Top: The gene is essentially turned off. There is no lactose to inhibit the
repressor, so the repressor binds to the operator, which obstructs the RNA polymerase
from binding to the promoter and making lactase.Bottom: The gene is turned on.Lactose
is inhibiting the repressor, allowing the RNA polymerase to bind with the promoter, and
express the genes, which synthesize lactase. Eventually, the lactase will digest all of the
lactose, until there is none to bind to the repressor. The repressor will then bind to the
operator, stopping the manufacture of lactase.
Regulatory Model
Regulation of gene expression includes four levels. We choose the transcriptional level to simulate the regulation both for its significance and model simplification.
Figure 3.Regulation of gene expression.
Our regulation model is built based on the operon theory.
The promoter region is regarded as the main regulatory region.
Similarity and Homology
The sequence similarity is obtained by sequence alignment. It is defined as the proportion of the common subsequence in the aligned sequence. Any two sequences share a certain
similarity. It should be noted that similarity and homology are two different concepts.
As with anatomical structures, homology between protein or DNA sequences is defined in
terms of shared ancestry. Two segments of DNA can have shared ancestry because of
either a speciation event or a duplication event. The terms “percent homology” and
“sequence similarity” are often used interchangeably. As with anatomical structures, high
sequence similarity might occur because of convergent evolution, or, as with shorter
sequences, because of chance. Such sequences are similar but not homologous.
Sequence regions that homologous are also called conserved.
In our project, we use similarity to connect the exogenous gene with the original network.
Because there is a good chance that the exogenous gene is not homologous with the
genes in the network.
The GRN matrix is the mathematical description of gene regulatory network in which “1” represents “enhance”, “-1” represents “repress” and “0” represents “no regulatory relationship”. The units(RU) in x-axis regulate the units in y-axis. A row can be seen as a vector containing all the information of the target(corresponding unit in the y-axis). Similarly, a column can be seen as a vector containing all the information of the regulator(corresponding unit in the x-axis).
The sequence similarity is obtained by sequence alignment based on Needleman-Wunsch algorithm[FIXME: wiki link here]. The Needleman-Wunsch algorithm performs a global alignment on two protein sequences or nucleotide sequences. It was the first application of dynamic programming to biological sequence comparison.
When dynamic programming is applicable, the method takes far less time than naive methods. Using a naive method, many of the subproblems are generated and sovled many times. The dynamic programming approach seeks to solve each subproblem only once. Once the solution to a given subproblem has been computed, it is stored to be looked up next time.
[Pic. 5 Dynamic programming and naive method]
Like the Needleman-Wunsch algorithm, of which it is a variation, Smith-Waterman is also a dynamic programming algorithm. But it is a local sequence alignment algorithm. The famous BLAST(Basic Local Alignment Search Tool) is improved from Smith-Waterman algorithm. Although local algorithm has the desirable property that it is guaranteed to find the optimal local alignment, we decided to choose the global one because we regarded the segment sequence as a unit.
Sequences are aligned with different detailed methods in different situations. In the regulated side, what we care about is the DNA sequence. In the regulating side, it is the amino acid sequence. When it comes to predict the regulated behavior, we use a DNA substitution matrix to align promoter and protein coding sequences. In the prediction of regulating behavior, the substitution matrix BLOSUM_50 is used to align the amino acid sequences translated from protein coding sequences.
The promoter similarities of the query unit and subject units are stored in a vector. The protein coding similarities are stored in another vector. These vectors are prepared to be used in the new network construction.
The Needleman-Wunsch algorithm was first published in1970 by Saul B. Needleman and Christian D. Wunsch. It performs a global alignment of two sequences and is mostly used in bioinformatics to align protein or nucleotide sequence. Our software applied this algorithm in the alignment of DNA and amino acid sequences.
The Needleman-Wunsch algorithm is one kind of dynamic programming and It was the first attempt in biological sequence comparison of dynamic programming.
Here is an example of Needleman-Wunsch algorithm. S(a,b) is the similarity of character a and character b. The scores of characters are shown in the similarity matrix. We assume this matrix was
And we uses linear gap penalty, denoted by d, here, we set the gap penalty as -5.Then the alignment:
A: AGACTAGTTAC
B: CGA - - - GACGT
would have the following score:
S(A,C)+S(G,C)+S(A,A)+(3)+S(G,G)+S(T,A)+S(T,C)+S(A,G)+S(C,T) = -3+7+10-(3x5)+7+(-4)+0+(-1)+0 = 1
To find the highest score of alignment, in this algorithm, a two dimensional matrix F with sequences and scores was allocated. The score in row i, column j is denoted by Fij. There is one column for each character in sequence A and one row for each character in sequence B. Therefore, if we align sequences with sizes of n and m, the amount of memory taken up here is O(n,m).
As the algorithm going on, Fij was calculated to be the optimal score by the principle as following:
Basis:
Fi0 = d*i
F0j = d*j
Recursion:
Fij = max(F(i-1,j-1) + S(Ai,Bj), F(i-1,j) + d, F(i,j-1) + d)
The pseudo-code of this algorithm would look like this:
for i = 0 to length(A)
F(i,0) <-- d*i
for j = 0 to length(B)
F(0,j) <-- d*j
for i = 0 to length(A)
for j = 0 to length(B)
{
Match <-- F(i-1,j-1) + S(Ai,Bj)
Delete <-- F(i-1,j) + d
Insert <-- F(i,j-1) + d
F(i,j) <-- max(Match, Insert, Delete)
After the matrix F was computed, Fnm would be the maximum score among all possible alignment.
If you want to see the optimal alignment, you can trace back from Fnm by comparing three possible sources mentioned in the above code (Match, Insert and Delete). If Match, then Aj and Bi are aligned, if Insert, Bi was aligned with a gap and if Delete, then Aj and a gap are aligned. Also, you may find there are not only one optimal alignment.
As for the example, we would get the following matrix by applying Needleman Wunsch algorithm:
And the optimal alignment would be:
- - AGACTAGTTAC
CGAGAC - - GT - - -
The rows and columns in the GRN matrix can be regarded as vectors containing the regulated or the regulating information. The behavior similarity of two units can be described by the dot product of two regulated vectors or two regulating vectors. Biologists usually think the more similar two sequences are, the more likely they have similar behaviors. Whether the ratio of genes with similar behaviors is positively correlated with gene similarity is essential to our project. So we obtained 1.6 million sets of data by pairwise alignment of all the 1748 units in the GRN of K-12. Each set of data consists of gene similarity and behavior similarity. The result is analyzed and plotted in the figure. The linear fit shows that the ratio is positively correlated with the similarity.
Figure 9.Linear fit of ratio-similarity relationship.
Although there are examples that a slight change in DNA sequence will significantly change the property of the gene, for example, sickle-cell disease, the influence is usually determined by the location and scale of the mutation. So the result is still convincing to some degree.