Abstract
Pro-decoder is the part of our software designed to analyze and evaluate promoters. It uses the position weight matrix (PWM) method to calculate the similarity between a promoter sequence and the position frequency matrices of transcription factor binding sites (TFBS), in order to locate the TFBS and to predict the relative strength of the promoter. This gives the user a better understanding of the promoter’s regulation mechanism and of the key sites that most influence the promoter’s performance.
Background
Sigma factors
Bacteria encode several thousand different proteins, which are necessary for normal cell functions or for adaptation to environmental changes. These proteins are not required at the same time or in the same amount. Regulation of gene expression therefore enables the cell to control the production of proteins needed for its life cycle or for adaptation to extracellular changes. This regulation in turn makes it possible for the bacterium to adapt adequately to rapid changes in the environment. The various steps during transcription and translation are therefore subject to different regulatory mechanisms. The most prominent step in gene regulation is the initiation of transcription, in which the DNA-dependent RNA polymerase (RNAP) is the key enzyme. The RNAP, or RNAP core enzyme, is the catalytic machinery for the synthesis of RNA from a DNA template. However, RNAP cannot initiate transcription by itself. Initiation of transcription requires an additional polypeptide known as a sigma-factor. Sigma-factors are a family of relatively small proteins that can associate reversibly with the RNAP core enzyme. Together, the sigma-factor and the RNAP core enzyme form an initiation-specific enzyme, the RNAP holoenzyme.
The sigma-factor directs RNA polymerase to a specific class of promoter sequences. Most bacterial species synthesize several different sigma-factors that recognize different consensus sequences. This variety in sigma-factors allows bacteria both to maintain basal gene expression and to regulate gene expression in response to altered environmental or developmental signals.
The frequency at which the RNAP holoenzyme initiates transcription, also known as the strength of a promoter, is influenced by the promoter sequence and by the conformation of the DNA in the promoter region. Sigma-factors recognize two conserved sequences in the promoter region, known as the promoter consensus sequences. This has been demonstrated by the specific binding of sigma-factors, or fragments of sigma-factors, to promoter DNA and by specific base pair and amino acid substitutions in the promoter consensus sequences or the sigma-factors.
The identification of bacterial promoters is an essential step in the elucidation of gene regulation. As a general rule, the more complex the life cycle and environmental niche of a bacterium, the greater the number of sigma factors with corresponding promoter types. Typically, however, the most common promoter type is the one that regulates the housekeeping genes, and the corresponding major sigma-factor is shared by all bacteria (sigma 70 in the well-studied Escherichia coli, and its homologues in other species). The binding site for the sigma 70 family of promoters is defined by two consensus hexamers, TTGACA and TATAAT, located at approximately −35 and −10, respectively, relative to the transcription start site (TSS) and spaced 15–21 base pairs (bp) apart [2]. The RNA polymerase core enzyme associates with the major sigma-factor to form the holoenzyme, which in turn binds to its cognate promoters to initiate transcription.
In prokaryotes, the minimum requirement for RNA polymerase binding is recognition of the promoter by the sigma factor. In general, prokaryotic RNA polymerases can interchange a number of sigma factors, each of which directs binding and transcription initiation at a different group of genes [3].
Transcription Factors
Sigma factors are essential for transcription initiation in Escherichia coli. However, promoter strength is not determined purely by the binding of the sigma factor. Other transcription factors can bind specific sequences surrounding or overlapping the promoter to either activate or repress transcription [4]. Transcriptional activators increase, and repressors decrease, the accessibility of the DNA to the RNA polymerase. These transcription-regulating proteins bind to specific binding sites in the regulatory regions of genes (e.g. promoters, enhancers), thereby activating or repressing them.
Computational methods of predicting TF binding sites in DNA are very important for understanding the molecular mechanisms of gene regulation.
The binding sites of the same transcription factor show significant sequence conservation, which is often summarized as a short (5–20 bases long) common pattern called a transcription factor binding site (TFBS) or binding consensus. Our software aims to identify the possible TFBS in a promoter and to locate them precisely, so that the user knows the exact sites that play a role in regulating transcription.
In prokaryotes (lower organisms without nuclei), there are fewer TFs, their motifs tend to be relatively long, and the strength of regulation for a particular gene often depends on how closely a particular site matches the consensus for the motif. The more mismatches to the consensus in a binding site, the less often the TF will bind and therefore the less control it will exert on the target gene. Our software therefore calculates the similarity between each possible TFBS in the promoter and the standard motif, so that the user knows to what extent the transcription factor controls transcription from the promoter.
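The idea of counting mismatches against a consensus can be illustrated with a small sketch. It is not the scoring used by Pro-decoder (which relies on matrix scores, described later); the function names are hypothetical, and the example uses the −10 consensus hexamer TATAAT quoted earlier.

```python
def count_mismatches(site: str, consensus: str) -> int:
    """Number of positions where `site` differs from `consensus`."""
    if len(site) != len(consensus):
        raise ValueError("site and consensus must have the same length")
    return sum(a != b for a, b in zip(site.upper(), consensus.upper()))


def fraction_identity(site: str, consensus: str) -> float:
    """Fraction of positions matching the consensus (1.0 is a perfect match)."""
    return 1.0 - count_mismatches(site, consensus) / len(consensus)


# A candidate -10 site that differs from TATAAT at its fourth base.
mismatches = count_mismatches("TATGAT", "TATAAT")
identity = fraction_identity("TATGAT", "TATAAT")
print(mismatches, round(identity, 3))
```

A plain mismatch count treats every position equally; the matrix-based score used by the software weights positions by how conserved they are.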
Primer design
To facilitate the design of PCR primers for various promoters, we have developed an additional function, primer design, in this part of our program. After the promoter sequence is entered, the software identifies the most suitable primers based on the method of Thomas Kämpke, Markus Kieninger, and Michael Mecklenburg [7].
Data source
RegulonDB
Genes and operons that are under control of the same TF are members of that TF’s regulon. Although methods for the prediction of regulons have been substantially improved, they are still far from perfect.
Comparative genomics tools can be used to predict regulons in bacterial genomes but the procedure can lead to incorrect regulon calling. Despite this drawback, several regulon databases are available that are based on comparative genomics methods and lack experimental evidence.
Probably the most extensive and accurate database of regulons for E. coli is RegulonDB, which provides the data source for our program.
Algorithm
Experimental results show that promoters matching the consensus sequence are the strongest promoters characterized in vitro so far, which confirms the hypothesis that the consensus promoter sequence is “best.” To calculate the similarity between a promoter sequence and this best sequence, we implement the PWM method [5].
PWM (Position Weight Matrix)
Molecular techniques for the identification of promoters are both costly and time consuming, hence in silico methods are an attractive and well-explored alternative. The most common in silico method to identify σ70 promoters uses position weight matrices (PWMs) and depends on the relative conservation of the transcription factor binding sites (TFBS, or motifs).
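The matrix that such methods start from can be sketched as follows. This is a minimal illustration, assuming a set of aligned binding sites of equal length; the function name and the toy sites are hypothetical and are not taken from RegulonDB.

```python
from collections import Counter

BASES = "ACGT"


def position_frequency_matrix(sites):
    """Return one {base: frequency} dict per column of the aligned sites."""
    if not sites:
        raise ValueError("need at least one site")
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    matrix = []
    # zip(*sites) walks the alignment column by column.
    for column in zip(*(s.upper() for s in sites)):
        counts = Counter(column)
        matrix.append({b: counts.get(b, 0) / len(sites) for b in BASES})
    return matrix


# Hypothetical aligned sites, for illustration only.
toy_sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
pfm = position_frequency_matrix(toy_sites)
for i, column in enumerate(pfm, start=1):
    print(i, column)
```

Columns in which one base dominates correspond to conserved positions of the motif; columns with mixed frequencies tolerate mismatches.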
The algorithm is divided into two parts, reflecting the difference between the motifs of sigma factors and those of other transcription factors.
Part 1: the recognition of other transcription factors [6]
Other transcription factors are proteins that bind to specific DNA sequences (motifs) and regulate the promoter’s transcription. To recognize possible motifs in a given promoter sequence, we calculate the Matrix Similarity Score (MSS) of every possible site in the promoter sequence, using the position frequency matrices of 86 transcription factors published by RegulonDB. The algorithm reports only those matches whose MSS is higher than the chosen threshold. Following MATCH [6], the MSS for a subsequence $x = b_1 b_2 \dots b_L$ of length $L$ is calculated as
$$\mathrm{MSS}(x) = \frac{\mathrm{Current} - \mathrm{Min}}{\mathrm{Max} - \mathrm{Min}},$$
$$\mathrm{Current} = \sum_{i=1}^{L} I(i)\, f_{i,b_i}, \qquad \mathrm{Min} = \sum_{i=1}^{L} I(i)\, f_i^{\min}, \qquad \mathrm{Max} = \sum_{i=1}^{L} I(i)\, f_i^{\max},$$
where
$f_{i,B}$ is the frequency with which nucleotide $B$ occurs at position $i$ of the matrix ($B \in \{A, T, G, C\}$),
$f_i^{\min}$ is the frequency of the nucleotide that is rarest at position $i$ of the matrix, and
$f_i^{\max}$ is the highest frequency at position $i$.
The information vector
$$I(i) = \sum_{B \in \{A, T, G, C\}} f_{i,B} \ln\left(4 f_{i,B}\right), \qquad i = 1, 2, \dots, L$$
describes the conservation of position $i$ in a matrix. Multiplying the frequencies by the information vector leads to a higher acceptance of mismatches in less conserved regions, whereas mismatches in highly conserved regions are strongly penalized. This gives better performance in the recognition of TF binding sites than methods that do not use the information vector.
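The scoring described above can be sketched in a few lines. This is an illustration rather than the Pro-decoder source: it assumes the MATCH-style definitions from [6], namely I(i) = sum over B of f(i,B) * ln(4 * f(i,B)) and MSS = (Current - Min) / (Max - Min), and it uses a hypothetical three-column toy matrix instead of a RegulonDB matrix.

```python
import math


def information_vector(pfm):
    """I(i) = sum_B f_iB * ln(4 * f_iB); terms with f_iB = 0 contribute 0."""
    return [sum(f * math.log(4 * f) for f in col.values() if f > 0) for col in pfm]


def matrix_similarity_score(subseq, pfm):
    """MSS = (Current - Min) / (Max - Min) for a window as long as the matrix."""
    if len(subseq) != len(pfm):
        raise ValueError("window and matrix must have the same length")
    info = information_vector(pfm)
    current = sum(w * col.get(b, 0.0) for w, col, b in zip(info, pfm, subseq.upper()))
    lowest = sum(w * min(col.values()) for w, col in zip(info, pfm))
    highest = sum(w * max(col.values()) for w, col in zip(info, pfm))
    if highest == lowest:
        raise ValueError("matrix carries no information")
    return (current - lowest) / (highest - lowest)


def scan(sequence, pfm, threshold):
    """Return (start, MSS) for every window whose MSS is above the threshold."""
    width = len(pfm)
    hits = []
    for start in range(len(sequence) - width + 1):
        score = matrix_similarity_score(sequence[start:start + width], pfm)
        if score > threshold:
            hits.append((start, score))
    return hits


# Hypothetical toy matrix: conserved T, mostly A, then an uninformative column.
toy_pfm = [
    {"A": 0.0, "C": 0.0, "G": 0.0, "T": 1.0},
    {"A": 0.75, "C": 0.0, "G": 0.0, "T": 0.25},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]
hits = scan("GGTAGCC", toy_pfm, threshold=0.9)
print(hits)
```

The third column has I(i) = 0, so any base is accepted there without penalty, whereas a mismatch in the fully conserved first column removes most of the score.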
To determine the best threshold for the motif-finding algorithm, we tested various threshold values and analyzed the false negative and false positive rates at each one. The ideal threshold should keep both the false negative rate and the false positive rate as low as possible.
The picture above shows part of our test results. To keep both rates at a reasonable level, we adopt three threshold values, namely low (0.5977), median (0.0778) and high (0.85), with false negative rates of 0.1, 0.5 and 0.7 respectively. For more flexibility, we also allow users to set their own thresholds.
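The trade-off behind the choice of threshold can be sketched as below. The score lists are made-up placeholders, not our test data, and the function name is hypothetical; the sketch only shows how the fraction of known sites that are missed and the fraction of background windows that are reported change with the threshold.

```python
def error_rates(true_site_scores, background_scores, threshold):
    """Return (missed fraction of true sites, reported fraction of background).

    A window is reported only if its score is strictly above the threshold.
    """
    missed = sum(s <= threshold for s in true_site_scores) / len(true_site_scores)
    reported = sum(s > threshold for s in background_scores) / len(background_scores)
    return missed, reported


# Placeholder scores, for illustration only.
true_site_scores = [0.9, 0.8, 0.7, 0.6]
background_scores = [0.2, 0.4, 0.65, 0.75]

for threshold in (0.5, 0.7, 0.85):
    print(threshold, error_rates(true_site_scores, background_scores, threshold))
```

Raising the threshold lowers the number of false hits but misses more true sites, which is why several preset thresholds and a user-defined one are offered.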
Part 2: the recognition of the sigma factor motif and the evaluation of relative promoter strength
In the case of sigma 70 factors, the motifs are the −35 and −10 hexamers, enclosing a spacer of length 15–19 bp.
Given a known or predicted TSS location, the corresponding predictions for the −10 and −35 hexamers are located using a combination of two PWMs derived from the literature. For any known or putative TSS, the −35 and −10 hexamers are located upstream of the TSS by searching for the highest combination of PWM scores, subject to two constraints:
(i) the spacer length (the number of base pairs between the −35 hexamer and the −10 hexamer) should lie in the range {14–20};
(ii) the total MSS (our results are the sum of the scores for the −10 and −35 hexamers and therefore lie in the interval [0, 2], with a score of 2 corresponding to the joint consensus TTGACA (−35) and TATAAT (−10)).
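The search over hexamer positions and spacer lengths can be sketched as follows. To stay self-contained, the sketch replaces the two literature-derived PWMs with a simple stand-in score (the fraction of bases identical to the consensus hexamer), so each hexamer scores in [0, 1] and the total lies in [0, 2]. It also ignores the distance to the TSS. All names are hypothetical.

```python
MINUS35 = "TTGACA"
MINUS10 = "TATAAT"
HEX = 6


def identity(site, consensus):
    """Stand-in for a PWM score: fraction of bases equal to the consensus."""
    return sum(a == b for a, b in zip(site, consensus)) / len(consensus)


def find_hexamers(upstream, spacer_lengths=range(14, 21)):
    """Return (total, pos35, spacer, pos10) with the highest total score.

    Positions are 0-based offsets into `upstream`; `spacer_lengths` must be
    in ascending order. Returns None if no placement fits.
    """
    seq = upstream.upper()
    best = None
    for pos35 in range(len(seq) - HEX + 1):
        for spacer in spacer_lengths:
            pos10 = pos35 + HEX + spacer
            if pos10 + HEX > len(seq):
                break  # longer spacers would run past the end as well
            total = (identity(seq[pos35:pos35 + HEX], MINUS35)
                     + identity(seq[pos10:pos10 + HEX], MINUS10))
            if best is None or total > best[0]:
                best = (total, pos35, spacer, pos10)
    return best


# Constructed example: both consensus hexamers separated by a 17 bp spacer.
example = "GGC" + "TTGACA" + "GCTAGCTAGCTAGCTAG" + "TATAAT" + "GCGCGCA"
print(find_hexamers(example))
```

In the software itself the stand-in `identity` would be replaced by the matrix score of each hexamer, and the search window would be restricted to the region upstream of the given TSS.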
In prokaryotes, the strength of sigma factor regulation for a particular gene often depends on how closely a particular site matches the consensus for the motif. The more mismatches to the consensus in a binding site, the less often the sigma factor will bind and therefore the weaker the promoter will be. Experiments have confirmed the hypothesis that the consensus promoter sequence is “best”. We set the best promoter strength to 100% and calculate the relative strength of a given promoter from its score, Score(promoter).
Primer design
A primer pair (p, q) is assigned the scoring vector
$$sc(p, q) = \left(|p|, |q|, GC(p), GC(q), T_m(p), T_m(q), sa(p), sa(q), sea(p), sea(q), pa(p, q), pea(p, q)\right)^T \in \mathbb{R}^{12}.$$
All primers are designed to have ideal values of length, GC content, and melting temperature, which are specified externally by the designer of the hybridization experiment. These ideal values are specified for the forward and reverse primers. The ideal score vector, or reference vector, for the primer pair is
$$sc_{ideal} = \left(length_f, length_r, GC_f, GC_r, T_{m,f}, T_{m,r}, 0, 0, 0, 0, 0, 0\right)^T.$$
All ideal annealing values are set to zero, and typically $T_{m,f} = T_{m,r}$ as well as $GC_f = GC_r$. The final assessment of a primer pair (p, q) can be its deviation from the reference in terms of the $l_1$-distance
$$\left\| sc(p, q) - sc_{ideal} \right\|_1 = \sum_{i=1}^{12} \left| sc_i(p, q) - sc_{ideal,i} \right|.$$
Here, we employ a weighted distance
$$\sum_{i=1}^{12} w_i \left| sc_i(p, q) - sc_{ideal,i} \right|$$
with weights $w_i$ given in the following table.
The formulas for calculating the quantities above are provided in [7].
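The comparison of a primer pair with the reference vector can be sketched as a weighted l1-distance. The sketch takes the 12 components as given (the formulas for melting temperature and the annealing values are in [7] and are not reproduced here); the numbers and the unit weights are placeholders, not the weights used by our program.

```python
def weighted_l1(score, ideal, weights):
    """Weighted l1-distance between a 12-component score vector and the reference."""
    if not (len(score) == len(ideal) == len(weights) == 12):
        raise ValueError("score, ideal and weights must each have 12 components")
    return sum(w * abs(s - t) for w, s, t in zip(weights, score, ideal))


# Component order: |p|, |q|, GC(p), GC(q), Tm(p), Tm(q),
#                  sa(p), sa(q), sea(p), sea(q), pa(p,q), pea(p,q)
# Placeholder values (GC in percent, Tm in degrees Celsius).
ideal = [20, 20, 50, 50, 60, 60, 0, 0, 0, 0, 0, 0]
candidate = [20, 21, 50, 55, 60, 61, 2, 0, 1, 0, 3, 0]
unit_weights = [1] * 12  # hypothetical; real weights would come from the table

print(weighted_l1(candidate, ideal, unit_weights))
```

Among several candidate pairs, the pair with the smallest distance to the reference is preferred; changing the weights shifts the balance between, for example, melting temperature deviations and annealing values.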
Results
Sigma factor recognition
To test the ability of our program to recognize the type of a promoter, i.e., to determine which sigma factor regulates it, we ran our program on 100 promoter sequences whose types had already been confirmed experimentally. The program recognized 56 of them correctly. The results are shown below.
TFBS Location
Specifically, we used sigma 70 promoter sequences with annotated −35 and −10 regions provided by RegulonDB to test how often the program correctly predicts the binding site of a specific transcription factor. We input 89 sigma 70 promoter sequences and ran our program to locate the sigma factor binding site. The results show that the correct site prediction rate of our program is 64%.
The test results are as follows. The numbers represent the position of the actual −35 motif, the actual spacer length, the predicted position and the predicted spacer length, respectively.
Promoter strength correlation
Predicting promoter strength from motif similarity and spacer length is one of the highlights of our software. We input XX promoter sequences with experimentally determined strength data, and the results show that the coefficient of determination is X.
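For reference, the coefficient of determination of a simple linear fit between predicted and measured strengths equals the squared Pearson correlation, which can be computed as sketched below. The function name is hypothetical and the numbers are placeholders chosen for easy checking; they are not our measurements.

```python
def r_squared(predicted, measured):
    """Squared Pearson correlation between two equally long lists of numbers."""
    if len(predicted) != len(measured) or len(predicted) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    n = len(predicted)
    mean_x = sum(predicted) / n
    mean_y = sum(measured) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(predicted, measured))
    sxx = sum((x - mean_x) ** 2 for x in predicted)
    syy = sum((y - mean_y) ** 2 for y in measured)
    if sxx == 0 or syy == 0:
        raise ValueError("both lists must contain some variation")
    return sxy * sxy / (sxx * syy)


# Placeholder values, for illustration only.
print(r_squared([1, 2, 3, 4], [2, 4, 6, 8]))  # perfectly linear relationship
print(r_squared([1, 2, 3], [1, 3, 2]))        # weaker relationship
```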
Future work
Apply our algorithms to more species. Pro-decoder is currently designed specifically for the prediction and evaluation of E. coli promoters. In the near future we will study the transcription regulation mechanisms of other species and try to apply our algorithms to an extended range of species.
Enhance promoter strength prediction accuracy. Because our experimental data are so limited, the weights of the spacer length and the motif similarity are only roughly determined, which leads to a weak correlation between the predicted and the measured promoter strength. In the future we hope to obtain more experimental data on the effect of spacer length and motif similarity on promoter strength, so that we can revise the weight coefficients of the two factors and get more reliable results.
The next version of this part of our program will be able to analyze the promoters not only of E. coli but also of other species such as Bacillus subtilis. We will integrate the transcription factor binding site data of more species into our database and use the PWM algorithm to predict the TFBS in their promoters.
References
1. Wösten, M., Eubacterial sigma‐factors. FEMS microbiology reviews 1998, 22 (3), 127-150.
2. Shultzaberger, R. K.; Chen, Z.; Lewis, K. A.; Schneider, T. D., Anatomy of Escherichia coli σ70 promoters. Nucleic acids research 2007, 35 (3), 771-788.
3. Paget, M.; Helmann, J. D., The sigma70 family of sigma factors. Genome Biol 2003, 4 (1), 203.
4. Jensen, S. T.; Liu, X. S.; Zhou, Q.; Liu, J. S., Computational discovery of gene regulatory binding motifs: a Bayesian perspective. Statistical Science 2004, 19 (1), 188-204.
5. (a) Rhodius, V. A.; Mutalik, V. K., Predicting strength and function for promoters of the Escherichia coli alternative sigma factor, σE. Proceedings of the National Academy of Sciences 2010, 107 (7), 2854-2859; (b) Mulligan, M. E.; Brosius, J.; McClure, W. R., Characterization in vitro of the effect of spacer length on the activity of Escherichia coli RNA polymerase at the TAC promoter. Journal of Biological Chemistry 1985, 260 (6), 3529-3538; (c) Qureshi, S. A.; Jackson, S. P., Sequence-Specific DNA Binding by the S. shibatae TFIIB Homolog, TFB, and Its Effect on Promoter Strength. Molecular cell 1998, 1 (3), 389-400.
6. Kel, A. E.; Gößling, E.; Reuter, I.; Cheremushkin, E.; Kel-Margoulis, O. V.; Wingender, E., MATCH™: a tool for searching transcription factor binding sites in DNA sequences. Nucleic acids research 2003, 31 (13), 3576-3579.
7. Kämpke, T.; Kieninger, M.; Mecklenburg, M., Efficient primer design algorithms. Bioinformatics 2001, 17 (3), 214-225.