Team:XMU Software/Project
From 2013.igem.org
Promoter-decoder
Revision as of 10:29, 23 October 2013
Abstract
In a promoter sequence, the sigma factor binding site and other transcription factor binding sites significantly affect the strength of binding. Existing promoter-annotation software mostly focuses on predicting either other transcription factors or one particular type of sigma factor, and does not analyze a promoter with respect to both sigma factors and other transcription factors.1-2 To solve this problem, we designed a module of our software that can analyze and evaluate promoters.
Our software uses the PWM method to calculate the similarity between promoter sequences and the position frequency matrices of transcription factor binding sites (TFBS), both to locate the TFBS and to predict the relative strength of the promoter. Promoter-Decoder goes beyond its counterparts by combining all-round analysis with the prediction of promoter strength. It enables users to determine promoter types, predict promoter strength, change it by mutating the key sites, and even change the properties of a promoter by adding new TFBS to the promoter sequence.
Background
Sigma Factors
Bacteria encode several thousand different proteins, which are necessary for normal cell functions or for adaptation to environmental changes.3 These proteins are not required at the same time or in the same amount. Regulation of gene expression enables the cell to control the production of proteins needed for its life cycle or for adaptation to extracellular changes. The various steps during transcription and translation are therefore subject to different regulatory mechanisms.4
The most prominent step in gene regulation is the initiation of transcription in which the DNA-dependent RNA polymerase (RNAP) is the key enzyme. The RNAP or the RNAP core enzyme is the catalytic machinery for the synthesis of RNA from a DNA template. However, RNAP cannot initiate transcription by itself. Initiation of transcription requires an additional polypeptide known as a sigma-factor.5 Sigma-factors are a family of relatively small proteins that can associate in a reversible way with the RNAP core enzyme. Together, the sigma-factor and the RNAP core enzyme form an initiation-specific enzyme, the RNAP holoenzyme.
The sigma-factor directs RNA polymerase to a specific class of promoter sequences. Most bacterial species synthesize several different sigma-factors that recognize different consensus sequences.6
This variety in sigma-factors provides bacteria with the opportunity to maintain basal gene expression as well as to regulate gene expression in response to altered environmental or developmental signals.
The frequency at which the RNAP holoenzyme initiates transcription, also known as the strength of a promoter, is influenced by the promoter sequence and the conformation of the DNA in the promoter region. The sigma-factors recognize two conserved sequences in the promoter region, known as the promoter consensus sequences. Sigma-factors or fragments of sigma-factors bind specifically to promoter DNA sequences, as indicated by specific base pair and amino acid substitutions in the promoter consensus sequences or sigma factors.
The identification of bacterial promoters is an essential step in the elucidation of gene regulation.7
As a general rule, the more complex the life cycle and environmental niche of a bacterium, the greater the number of sigma factors with corresponding promoter types. Typically, however, the most common promoter type is the one that regulates the housekeeping genes, and the corresponding major sigma-factor is shared by all bacteria (sigma 70 in the well-studied E. coli, and its homologues in other species). The binding site for the sigma70 family of promoters is defined by two consensus hexamers, TTGACA and TATAAT, located at approximately −35 and −10, respectively, relative to the transcript start site (TSS) and spaced 15–21 base pairs (bp) apart.2 The RNA polymerase core enzyme associates with the major sigma-factor to form the holoenzyme, which in turn binds to its cognate promoters to initiate transcription.
In prokaryotes, the minimum requirement for RNA polymerase binding is recognition of the promoter by the sigma factor. In general, prokaryotic RNA polymerases can interchange a number of sigma factors which bind and initiate different groups of genes.3
Transcription Factors
Sigma factors are essential for transcription initiation in E. coli.10
However, promoter strength is not determined purely by the binding of the sigma factor. Other transcription factors can bind specific sequences surrounding or overlapping the promoter to either activate or repress transcription.4 Mechanistically, transcriptional activators increase, and repressors decrease, the accessibility of the DNA to the RNA polymerase.12
These transcription-regulating proteins bind to specific binding sites in the regulatory regions (e.g. promoters, enhancers) of genes, thereby activating or repressing them.
Computational methods of predicting TF binding sites in DNA are very important for understanding the molecular mechanisms of gene regulation.
The binding sites of the same transcription factor show significant sequence conservation, which is often summarized as a short (5–20 bases long) common pattern called a transcription factor binding site (TFBS) or binding consensus. Our software aims to identify the possible TFBS in promoters and locate them precisely, so that the user knows the exact sites that play a role in regulating transcription.
In prokaryotes (lower organisms without nuclei), there are fewer TFs, their motifs tend to be relatively long, and the strength of regulation for a particular gene often depends on how closely a particular site matches the consensus for the motif. The more mismatches to the consensus in a binding site, the less often the TF will bind and therefore the less control it will exert on the target gene. Our software therefore calculates the similarity between the possible TFBS in the promoter and the standard motifs, so that the user knows to what extent the transcription factor will control transcription from the promoter.
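The idea that binding weakens as a site accumulates mismatches can be illustrated with a plain mismatch count. This is only a minimal sketch, not the scoring used by the software (which relies on the PWM method described under Algorithm); the function names and the example sequence are hypothetical, and the consensus used is the sigma 70 −10 hexamer TATAAT quoted above.

```python
def mismatches(site: str, consensus: str) -> int:
    """Count the positions where a candidate site differs from the consensus."""
    if len(site) != len(consensus):
        raise ValueError("site and consensus must have the same length")
    return sum(a != b for a, b in zip(site.upper(), consensus.upper()))


def best_match(sequence: str, consensus: str):
    """Return (start, site, mismatches) for the window closest to the consensus."""
    k = len(consensus)
    candidates = (
        (start, sequence[start:start + k], mismatches(sequence[start:start + k], consensus))
        for start in range(len(sequence) - k + 1)
    )
    # min() keeps the first window among ties, i.e. the leftmost best match.
    return min(candidates, key=lambda hit: hit[2])


# Hypothetical example sequence, for illustration only.
print(best_match("GGCTATACTGG", "TATAAT"))  # (3, 'TATACT', 1)
```

A mismatch count treats every position equally; the PWM approach instead weights each position by how strongly it is conserved.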
Primer Design
To facilitate the design of PCR primers for various promoters, we have developed an additional function in this part of our program: primer design. After the promoter sequence is entered, the software determines the most suitable primers based on the method of Thomas Kämpke, Markus Kieninger, and Michael Mecklenburg.13
Data Source
RegulonDB
Genes and operons that are under control of the same TF are members of that TF's regulon. Although methods for the prediction of regulons have been substantially improved, they are still far from perfect.
Comparative genomics tools can be used to predict regulons in bacterial genomes but the procedure can lead to incorrect regulon calling. Despite this drawback, several regulon databases are available that are based on comparative genomics methods and lack experimental evidence.
Probably the most extensive and accurate database of regulons for E. coli is RegulonDB, which provides the data source for our program.
Algorithm
Experimental results show that promoters matching the consensus sequence are the strongest promoters characterized in vitro so far, which confirms the hypothesis that the consensus promoter sequence is "best". To calculate the similarity between a promoter sequence and the best sequence, we implement the PWM method6 in our program.
PWM (Position Weight Matrix)
Molecular techniques for the identification of promoters are both costly and time-consuming, hence in silico methods are an attractive and well-explored alternative. The most common in silico method to identify sigma 70 promoters uses position weight matrices (PWMs) and depends on the relative conservation of the transcription factor binding site (TFBS, or motif).
The algorithm is divided into two parts, reflecting the difference between the motifs of sigma factors and those of other transcription factors.
Part 1: the recognition of other transcription factors.7
Other transcription factors are proteins that bind to specific DNA sequences (motifs) and regulate the promoter's transcription. To recognize possible motifs in a given promoter sequence, we calculate the Matrix Similarity Score (MSS) of every possible site in the promoter sequence using the position frequency matrices of 86 transcription factors published by RegulonDB. The algorithm reports only those matches of a matrix whose MSS is higher than the set threshold. The MSS for a subsequence x of length L is calculated using the following quantities:
$f_{i,B}$: frequency of nucleotide $B$ at position $i$ of the matrix ($B \in \{A, T, G, C\}$)
$f_i^{\min}$: frequency of the rarest nucleotide at position $i$ of the matrix
$f_i^{\max}$: highest frequency at position $i$
The information vector describes the conservation of each position $i$ in a matrix. Multiplying the frequencies by the information vector leads to a higher acceptance of mismatches in less conserved regions, whereas mismatches in highly conserved regions are strongly penalized. This gives better performance in recognizing TF binding sites than methods that do not use the information vector.
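The score formulas themselves did not survive on this page, so the following is only a sketch under a stated assumption: that the MSS follows the widely used MATCH-style formulation, in which the information vector is I(i) = Σ_B f(i,B)·ln(4·f(i,B)) and MSS = (Current − Min) / (Max − Min), where Current, Min and Max are the I(i)-weighted sums of the observed, rarest and most frequent nucleotide frequencies. The function names and the toy matrix are hypothetical and are not RegulonDB data.

```python
import math


def information_vector(pfm):
    """I(i) = sum over B of f(i,B) * ln(4 * f(i,B)); zero frequencies add nothing."""
    return [sum(f * math.log(4 * f) for f in col.values() if f > 0) for col in pfm]


def matrix_similarity_score(site, pfm):
    """MSS of `site` against `pfm`, given as a list of {base: frequency} columns."""
    if len(site) != len(pfm):
        raise ValueError("site length must equal the matrix length")
    info = information_vector(pfm)
    current = sum(w * col[base] for w, col, base in zip(info, pfm, site))
    lowest = sum(w * min(col.values()) for w, col in zip(info, pfm))
    highest = sum(w * max(col.values()) for w, col in zip(info, pfm))
    if highest == lowest:
        raise ValueError("matrix carries no information (all columns uniform)")
    return (current - lowest) / (highest - lowest)


# Hypothetical 3-column matrix, for illustration only.
TOY_PFM = [
    {"A": 0.70, "T": 0.10, "G": 0.10, "C": 0.10},
    {"A": 0.10, "T": 0.80, "G": 0.05, "C": 0.05},
    {"A": 0.25, "T": 0.25, "G": 0.40, "C": 0.10},
]

print(matrix_similarity_score("ATG", TOY_PFM))  # most frequent base everywhere: 1.0
print(matrix_similarity_score("GGC", TOY_PFM))  # rarest base everywhere: 0.0
```

Under this assumption the score is normalized to [0, 1], and a mismatch in the weakly conserved third column costs less than one in the first two.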
To determine the best threshold of the motif finding algorithm, we test various threshold values and analyze the true negative and false positive rate of each threshold value. The ideal threshold is supposed to have both the least true negative and false positive rates.
Threshold | 0.5977 | 0.598 | 0.69 | 0.7 | 0.73 | 0.76 | 0.0778 | 0.84 | 0.85 | 0.86 | 0.9 |
True negative | 0.1 | 0.11 | 0.21 | 0.23 | 0.3 | 0.45 | 0.5 | 0.63 | 0.7 | 0.72 | 0.77 |
False positive | 56.4778 | 57.01124 | 31.56962 | 29.15584 | 23.58571 | 20.92727 | 17.90796 | 9.945946 | 10.66667 | 10.17857 | 6.608696 |
The table above shows part of our test results. To keep both the true negative and false positive rates at a reasonable level, we adopt three threshold values: low (0.5977), median (0.0778) and high (0.85), with true negative rates of 0.1, 0.5 and 0.7, respectively. For more flexibility, we also allow users to set their own thresholds.
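How a threshold filters candidate sites can be sketched as a sliding-window scan. This is a minimal illustration, assuming the MATCH-style MSS (information-weighted and normalized to [0, 1]); the `scan` function, the 4-column matrix and the example sequence are hypothetical. Only the low (0.5977) and high (0.85) thresholds reported above are used.

```python
import math


def mss(site, pfm):
    """Matrix Similarity Score, assuming the MATCH-style definition."""
    info = [sum(f * math.log(4 * f) for f in col.values() if f > 0) for col in pfm]
    cur = sum(w * col[b] for w, col, b in zip(info, pfm, site))
    low = sum(w * min(col.values()) for w, col in zip(info, pfm))
    high = sum(w * max(col.values()) for w, col in zip(info, pfm))
    return (cur - low) / (high - low)


def scan(sequence, pfm, threshold):
    """Return (start, site, score) for every window scoring above the threshold."""
    k = len(pfm)
    hits = []
    for start in range(len(sequence) - k + 1):
        site = sequence[start:start + k]
        score = mss(site, pfm)
        if score > threshold:
            hits.append((start, site, score))
    return hits


# Hypothetical matrix whose most frequent bases spell TGCA (not RegulonDB data).
TOY_PFM = [
    {"T": 0.70, "A": 0.10, "G": 0.10, "C": 0.10},
    {"G": 0.80, "A": 0.10, "T": 0.05, "C": 0.05},
    {"C": 0.60, "A": 0.20, "G": 0.10, "T": 0.10},
    {"A": 0.75, "T": 0.15, "G": 0.05, "C": 0.05},
]

sequence = "AATGCAAA"
high_hits = scan(sequence, TOY_PFM, 0.85)    # "high" threshold
low_hits = scan(sequence, TOY_PFM, 0.5977)   # "low" threshold
print(high_hits)
print(len(low_hits) >= len(high_hits))
```

Every hit at the high threshold is also reported at the low one, so lowering the threshold can only add candidate sites.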
Part 2: the recognition of sigma factor motif and the evaluation of relative promoter strength.
In the case of sigma 70 factors, the motifs are the −35 and −10 hexamers, which enclose a spacer of length 15–19 bp.
Given a promoter sequence, the −10 and −35 hexamers are located by the total MSS of the two hexamers, calculated with the position frequency matrices of the sigma factor binding sites, which are derived from RegulonDB. The calculation is subject to two constraints:
1. The spacer length (the number of base pairs between the −35 hexamer and the −10 hexamer) should lie in the range 15–20.
2. The total MSS is the sum of the scores for the −10 and −35 hexamers and therefore lies in the interval [0, 2], with a score of 2 corresponding to the joint consensus TTGACA (−35) and TATAAT (−10).
Score(Promoter)=score(-10 box)+score(-35 box)+score(spacer between -10 & -35 boxes)
The score of the spacer length is calculated by the algorithm proposed by Ryan K. Shultzaberger et al.
In prokaryotes, the strength of sigma factor regulation for a particular gene often depends on how closely a particular site matches the consensus for the motif. The more mismatches to the consensus in a binding site, the less often the sigma factor will bind and therefore the weaker the promoter will be. Experiments have confirmed the hypothesis that the consensus promoter sequence is "best". We set the best promoter strength to 100% and calculate the relative strength of a given promoter from Score(Promoter).
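The search over hexamer pairs and the conversion to a relative strength can be sketched as follows. Several parts are assumptions made only for illustration: the matrices are built from the consensus hexamers rather than taken from RegulonDB, the MSS is the MATCH-style score, the three terms are summed without weights, and `spacer_score` is a simple placeholder peaking at 17 bp, not the method of Shultzaberger et al.

```python
import math


def pfm_from_consensus(consensus, p=0.85):
    """Hypothetical matrix: the consensus base gets p, the other three share 1 - p."""
    q = (1 - p) / 3
    return [{b: (p if b == c else q) for b in "ATGC"} for c in consensus]


def mss(site, pfm):
    """Matrix Similarity Score, assuming the MATCH-style definition."""
    info = [sum(f * math.log(4 * f) for f in col.values() if f > 0) for col in pfm]
    cur = sum(w * col[b] for w, col, b in zip(info, pfm, site))
    low = sum(w * min(col.values()) for w, col in zip(info, pfm))
    high = sum(w * max(col.values()) for w, col in zip(info, pfm))
    return (cur - low) / (high - low)


def spacer_score(length, optimum=17, tolerance=4):
    """Placeholder spacer term in [0, 1]; the optimum and shape are assumptions."""
    return max(0.0, 1 - abs(length - optimum) / tolerance)


def best_promoter(seq, pfm35, pfm10, spacers=range(15, 21)):
    """Return (score, start of the -35 hexamer, spacer length) of the best pair."""
    best = None
    for i in range(len(seq) - 6 + 1):
        score35 = mss(seq[i:i + 6], pfm35)
        for gap in spacers:
            j = i + 6 + gap  # start of the -10 hexamer
            if j + 6 > len(seq):
                break
            total = score35 + mss(seq[j:j + 6], pfm10) + spacer_score(gap)
            if best is None or total > best[0]:
                best = (total, i, gap)
    return best


def relative_strength(score, best_possible=3.0):
    """Express a score as a percentage of the best attainable score."""
    return 100 * score / best_possible


PFM35 = pfm_from_consensus("TTGACA")
PFM10 = pfm_from_consensus("TATAAT")
# Hypothetical promoter: both consensus hexamers separated by a 17 bp spacer.
promoter = "GG" + "TTGACA" + "C" * 17 + "TATAAT" + "GG"
score, start35, gap = best_promoter(promoter, PFM35, PFM10)
print(start35, gap, relative_strength(score))
```

With a real spacer model and fitted weights the maximum attainable score would change, so `best_possible` is left as a parameter.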
Primer design
A primer pair (p, q) is assigned the scoring vector
$$\mathrm{sc}(p, q) = \big(|p|, |q|, \mathrm{GC}(p), \mathrm{GC}(q), T_m(p), T_m(q), \mathrm{sa}(p), \mathrm{sa}(q), \mathrm{sea}(p), \mathrm{sea}(q), \mathrm{pa}(p, q), \mathrm{pea}(p, q)\big)^{T} \in \mathbb{R}^{12}.$$
All primers are designed to have ideal values of length, GC content, and melting temperature, which are specified externally by the designer of the hybridization experiment. These ideal values are specified for both forward and reverse primers. The ideal score vector, or reference vector, for the primer pair is
$$\mathrm{sc}_{\mathrm{ideal}} = \big(\mathrm{length}_f, \mathrm{length}_r, \mathrm{GC}_f, \mathrm{GC}_r, T_{m,f}, T_{m,r}, 0, 0, 0, 0, 0, 0\big)^{T}.$$
All ideal annealing values are set to zero, and typically $T_{m,f} = T_{m,r}$ as well as $\mathrm{GC}_f = \mathrm{GC}_r$. The final assessment of a primer pair (p, q) can be its deviation from the reference in terms of the $l_1$-distance
$$\lVert \mathrm{sc}(p, q) - \mathrm{sc}_{\mathrm{ideal}} \rVert_1 = \sum_{i=1}^{12} \big|\mathrm{sc}_i(p, q) - \mathrm{sc}_{\mathrm{ideal},i}\big|.$$
Here, we employ a weighted distance
$$\sum_{i=1}^{12} w_i \, \big|\mathrm{sc}_i(p, q) - \mathrm{sc}_{\mathrm{ideal},i}\big|$$
with weights given in the following table.
The formulas for calculating the quantities above are provided in Efficient primer design algorithms.13
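As a rough illustration of how a primer pair could be ranked by its weighted distance from the reference vector, consider the sketch below. It makes several assumptions: the melting temperature uses the simple Wallace rule (2 °C per A/T, 4 °C per G/C) rather than the formulas of the cited paper, the six annealing values are passed in instead of being computed, and the weights and primer sequences are hypothetical placeholders.

```python
def gc_content(primer):
    """GC content in percent."""
    return 100 * (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer):
    """Wallace-rule melting temperature; a stand-in for the paper's Tm formula."""
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def score_vector(p, q, annealing=(0, 0, 0, 0, 0, 0)):
    """12-component vector: lengths, GC contents, Tm values, six annealing values."""
    if len(annealing) != 6:
        raise ValueError("expected sa(p), sa(q), sea(p), sea(q), pa, pea")
    return [len(p), len(q), gc_content(p), gc_content(q),
            wallace_tm(p), wallace_tm(q), *annealing]


def weighted_l1(score, ideal, weights):
    """Weighted l1-distance between a score vector and the reference vector."""
    if not (len(score) == len(ideal) == len(weights)):
        raise ValueError("vectors must have the same length")
    return sum(w * abs(s - t) for w, s, t in zip(weights, score, ideal))


# Hypothetical reference: 20-mers, 50 % GC, Tm 60, all annealing values zero.
IDEAL = [20, 20, 50, 50, 60, 60, 0, 0, 0, 0, 0, 0]
WEIGHTS = [1.0] * 12  # placeholder weights, not the values used by the software

forward = "ATGC" * 5         # 20 nt, 50 % GC
reverse = "ATGC" * 4 + "AT"  # 18 nt, lower GC and Tm
print(weighted_l1(score_vector(forward, forward), IDEAL, WEIGHTS))
print(weighted_l1(score_vector(forward, reverse), IDEAL, WEIGHTS))
```

The candidate pair with the smallest distance would be preferred; changing the weights shifts how strongly length, GC content, melting temperature and annealing are traded off.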
Results
Sigma factors recognition
Our program correctly recognizes the sigma factor type in 56% of cases. We ran our program on 100 promoter sequences whose types have already been confirmed experimentally, and it recognized 56 of them correctly. For sigma 70 promoters, which are the most prevalent, the recognition rate reached 92%. The results are shown below.
TFBS Location
We then tested the reliability of our software regarding TFBS location; the results show that the correct site prediction rate is 64%. We used the sigma70 promoter sequences with annotated −35 and −10 regions provided by RegulonDB to test the correct prediction rate for the binding site of a specific transcription factor. We input 89 sigma70 promoter sequences and ran our program to precisely locate the sigma factor binding site.
The test results are as follows. The numbers represent the site of the actual −35 motif, the actual spacer length, the predicted site and the predicted spacer length, respectively.
Promoter strength correlation & experiments
To test our prediction of promoter strength, our team has done a considerable amount of lab work. First, we located the −10 region of the pBAD promoter (BBa_K206000) and mutated it to obtain BBa_K1070002 and BBa_K1070003. The sequences of these promoters are given below (−10 regions are highlighted):
PbadStrong (BBa_K206000): acattgattatttgcacggcgtcacactttgctatgccatagcaagatagtccataagattagcggatcctacctgacgctttttatcgcaactctctactgtttctccataccgtttttttgggctagc
BBa_K1070002: acattgattatttgcacggcgtcacactttgctatgccatagcaatatagtccataagattagcggatcctacctgacgctttttatcgcaactctctactgtttctccataccgtttttttgggctagc
BBa_K1070003: acattgattatttgcacggcgtcacactttgctatgccatagcaagataatccataagattagcggatcctacctgacgctttttatcgcaactctctactgtttctccataccgtttttttgggctagc
Subsequently, we measured the fluorescence intensity of these promoters and related it to the actual promoter strength. The experimental results are shown in Figure 8.
We then fitted the actual strength against the predicted strength. As can be seen in Figure 9, the coefficient of determination is 0.8924.
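For readers who want to reproduce this kind of comparison, the coefficient of determination of a least-squares line can be computed as sketched below. The numbers are illustrative placeholders, not the team's measurements, and the function names are hypothetical.

```python
def linear_fit(x, y):
    """Ordinary least-squares line y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(x, y):
    """Coefficient of determination of the least-squares line through (x, y)."""
    slope, intercept = linear_fit(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1 - ss_res / ss_tot


# Placeholder values: predicted relative strength vs. measured strength.
predicted = [20.0, 40.0, 60.0, 80.0, 100.0]
measured = [25.0, 35.0, 66.0, 75.0, 104.0]
print(round(r_squared(predicted, measured), 4))
```

A value close to 1 means the predicted strengths explain most of the variance in the measured strengths.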
Future work
Apply our algorithms to more species. Pro-decoder is currently designed expressly for the prediction and evaluation of E. coli promoters. In the near future we will study the transcription regulation mechanisms of other species and try to apply our algorithms to an extended range of species.
Enhance promoter strength prediction accuracy. Because our experimental data are so limited, the weights of the spacer length and the motif similarity are only roughly determined, which weakens the correlation between predicted and actual promoter strength. In the future we hope to obtain more experimental data on the effect of spacer length and motif similarity on promoter strength, so that we can revise the weight coefficients of the two factors and get more reliable results.
The next version of this part of our program will be able to analyze not only the promoters of E. coli but also those of other species such as Bacillus subtilis. We will integrate the transcription factor binding site data of more species into our database and use the PWM algorithm to predict the TFBS in the promoters.
References
Abstract
The efficiency of translation in bacteria is greatly influenced by the binding affinity between the ribosome and the RBS, which can be measured as RBS strength. Determining the strength of an RBS sequence experimentally can be very laborious, whereas our software can estimate it easily. RBS-decoder is a software tool for evaluating RBS strength and locating SD sequences. The program uses the same PWM method to calculate the similarity between the RBS sequence and the position frequency matrix of SD sequences, and transforms the similarity into the relative strength of the RBS sequence.
Background
Translational efficiency in Escherichia coli is generally determined at the stage of initiation. Several principal mRNA sequence elements can affect the kinetics of ternary initiation complex formation (30S-mRNA-fMet-tRNA), in particular the SD sequence and the start codon (ATG). The SD sequence base-pairs with an RNA molecule that forms part of the bacterial ribosome (the 16S rRNA), while the start codon base-pairs with the initiator tRNA, which is bound to the ribosome. Besides the SD sequence and the start codon, the spacer between them also influences RBS strength: the two sequences need to be positioned approximately 6-7 nucleotides apart so that both can make contact with the appropriate parts of the ribosome complex.1
Introduction
How do bacterial Ribosome Binding Sites work?
The bacterial ribosome binds to particular sequences on an mRNA, primarily the SD sequence and the start codon (ATG). The SD sequence base-pairs with an RNA molecule that forms part of the bacterial ribosome (the 16S rRNA), while the start codon base-pairs with the initiator tRNA, which is bound to the ribosome. In addition to the SD sequence and the start codon being important, these two sequences need to be positioned approximately 6-7 nucleotides apart so that both can make contact with the appropriate parts of the ribosome complex.1
The Shine-Dalgarno sequence
The end of the 16S rRNA that is free to bind with the mRNA includes the sequence 5′–ACCUCC–3′. The complementary sequence, 5′–GGAGGU–3′, named the Shine-Dalgarno sequence, can be found in whole or in part in many bacterial mRNAs. Very roughly speaking, ribosome binding sites with purine-rich sequences (A's and G's) close to the Shine-Dalgarno sequence will lead to high rates of translation initiation, whereas sequences that are very different from the Shine-Dalgarno sequence will lead to low or negligible translation rates. You can see how common the sequence is by looking at the RBS sequence logo on the right (where the height of a letter indicates the frequency of the letter at that location).
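The relationship between the two sequences can be checked directly: the Shine-Dalgarno sequence is the reverse complement of the free end of the 16S rRNA. A minimal sketch (the function name is hypothetical):

```python
# Watson-Crick pairing for RNA: A-U and G-C.
RNA_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement_rna(sequence: str) -> str:
    """Return the reverse complement of an RNA sequence, written 5' to 3'."""
    return sequence.upper().translate(RNA_COMPLEMENT)[::-1]


print(reverse_complement_rna("ACCUCC"))  # GGAGGU
```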
Algorithms
RBS strength is greatly influenced by the SD sequence, to which the 16S rRNA of the ribosome binds, so in principle the strength can be determined from the binding free energy between the SD sequence and the 16S rRNA. We therefore designed a program to calculate the binding free energy, but the results show that the correlation between the free energy and RBS strength is rather weak (R2 = 0.5517), so we decided to look for other algorithms with better accuracy.
We were inspired by the strength prediction algorithm used in the promoter part, in which the similarity to the sigma factor's PWM is closely linked to the binding affinity between the protein and the DNA sequence. We obtained the position frequency matrix of SD sequences of E. coli and use the PWM method (described in detail in the promoter part) to calculate the similarity between the RBS sequence and the position frequency matrix. Unlike in the promoter part, the spacer length between the SD sequence and the start codon, and the start codon itself, both act as constraints in locating the SD sequence: the spacer is confined to 3-16 bp and the start codon to ATG/TTG/GTG. As in the prediction of promoter strength, the spacer length between the SD sequence and the start codon also contributes to RBS strength. The optimal spacer length is 7 bp, and the spacer score is calculated using the same algorithm applied in the promoter part.2 The weight of the spacer's influence on the strength is derived from the algorithm that predicts promoter strength, in which the weights of the total MSS and the spacer are 0.29:0.71. Since in a promoter the total MSS is the sum of two motifs while the SD sequence is only one motif, the weights of the MSS (SD sequence) and the spacer are 0.29:0.355.
Position | 1 | 2 | 3 | 4 | 5 |
T | 0.161 | 0.050 | 0.012 | 0.071 | 0.115 |
C | 0.077 | 0.037 | 0.012 | 0.025 | 0.046 |
A | 0.681 | 0.105 | 0.105 | 0.861 | 0.164 |
G | 0.077 | 0.808 | 0.960 | 0.043 | 0.659 |
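Putting the pieces together, the search for the SD sequence and the weighted score can be sketched as below, using the position frequency matrix from the table above, the 3-16 bp spacer range, the three start codons and the 0.29:0.355 weights. Two parts are assumptions made only for illustration: the MSS follows the MATCH-style definition, and `spacer_score` is a simple placeholder peaking at the 7 bp optimum, not the spacer algorithm referenced in the text. The example sequence is hypothetical.

```python
import math

# Position frequency matrix of the SD sequence, copied from the table above.
SD_PFM = [
    {"T": 0.161, "C": 0.077, "A": 0.681, "G": 0.077},
    {"T": 0.050, "C": 0.037, "A": 0.105, "G": 0.808},
    {"T": 0.012, "C": 0.012, "A": 0.105, "G": 0.960},
    {"T": 0.071, "C": 0.025, "A": 0.861, "G": 0.043},
    {"T": 0.115, "C": 0.046, "A": 0.164, "G": 0.659},
]
START_CODONS = {"ATG", "TTG", "GTG"}
W_MSS, W_SPACER = 0.29, 0.355  # weights reported in the text


def mss(site, pfm=SD_PFM):
    """Matrix Similarity Score, assuming the MATCH-style definition."""
    info = [sum(f * math.log(4 * f) for f in col.values() if f > 0) for col in pfm]
    cur = sum(w * col[b] for w, col, b in zip(info, pfm, site))
    low = sum(w * min(col.values()) for w, col in zip(info, pfm))
    high = sum(w * max(col.values()) for w, col in zip(info, pfm))
    return (cur - low) / (high - low)


def spacer_score(length, optimum=7, tolerance=9):
    """Placeholder spacer term in [0, 1]; the shape is an assumption."""
    return max(0.0, 1 - abs(length - optimum) / tolerance)


def rbs_strength(seq):
    """Return (relative strength in %, SD start, spacer length) or None."""
    k = len(SD_PFM)
    best = None
    for i in range(len(seq) - k + 1):
        for gap in range(3, 17):  # spacer confined to 3-16 bp
            j = i + k + gap       # position of the candidate start codon
            if seq[j:j + 3] not in START_CODONS:
                continue
            score = W_MSS * mss(seq[i:i + k]) + W_SPACER * spacer_score(gap)
            relative = 100 * score / (W_MSS + W_SPACER)
            if best is None or relative > best[0]:
                best = (relative, i, gap)
    return best


# Hypothetical RBS: the most frequent bases (AGGAG), a 7 bp spacer, then ATG.
example = "TTT" + "AGGAG" + "CCCCCCC" + "ATG" + "CCC"
print(rbs_strength(example))
```

The function returns `None` when no start codon lies within the allowed distance of any candidate site, which mirrors the constraints described in the text.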
Results
We used the RBS sequences listed in the iGEM Registry with experimentally determined relative strengths.4 The correlation between the RBS strength predicted by our software and the actual relative strength is strong, with a coefficient of determination of 0.8039.
Future work
Due to the scarcity of experimental data, the relative weights of the SD sequence and the spacer length currently in use are only roughly determined, which may undermine the accuracy of RBS strength prediction. To improve our program further, we will try to obtain more reliable experimental data to determine the weights used in our algorithm accurately and, we hope, raise the accuracy of RBS-decoder.
In the next version of RBS-decoder, the software will display the secondary structure of the RBS sequence, and we will also include the SD sequence data of other species in order to predict RBS strength for a larger range of species.