Team:XMU Software/Project/promoter

From 2013.igem.org

PROJECT
Our project includes two independent software tools: the brick worker and E' NOTE. The former is a software suite for the evaluation and optimization of biobricks, i.e., promoters, RBSs, protein coding sequences and terminators. E' NOTE is a web application that serves as an assistant for experiments. Its functions, such as experiment recording and experimental template customization, make the experimental process easier and more enjoyable.
Promoter-decoder


Abstract

In a promoter sequence, the sigma factor binding site and other transcription factor binding sites significantly affect the strength of binding. Several programs have been developed for annotating promoters, but most focus on predicting either other transcription factors or one particular type of sigma factor, and fail to analyze promoters with both sigma factors and other transcription factors.1-2 To solve this problem, we designed a module of our software that can analyze and evaluate promoters.

Our software uses the PWM method to calculate the similarity between promoter sequences and the position frequency matrices of transcription factor binding sites (TFBS), in order to locate the TFBS and to predict the relative strength of the promoter. Promoter-Decoder stands out from its counterparts by combining all-round analysis with promoter strength prediction. It enables users to identify promoter types, predict promoter strength, change it by mutating key sites, and even change the properties of a promoter by adding new TFBS to its sequence.


Background

Sigma Factors

Bacteria encode several thousands of different proteins, which are necessary for normal cell functions or for adaptation to environmental changes.3 These proteins are not required at the same time or in the same amount. Regulation of gene expression therefore enables the cell to control the production of proteins needed for its life cycle or for adaptation to extracellular changes. The various steps during transcription and translation are therefore subject to different regulatory mechanisms.4

The most prominent step in gene regulation is the initiation of transcription, in which the DNA-dependent RNA polymerase (RNAP) is the key enzyme. The RNAP, or RNAP core enzyme, is the catalytic machinery for the synthesis of RNA from a DNA template. However, RNAP cannot initiate transcription by itself. Initiation of transcription requires an additional polypeptide known as a sigma factor.5 Sigma factors are a family of relatively small proteins that can associate reversibly with the RNAP core enzyme. Together, the sigma factor and the RNAP core enzyme form an initiation-specific enzyme, the RNAP holoenzyme.

Figure 1 The initiation of transcription

The sigma factor directs RNA polymerase to a specific class of promoter sequences. Most bacterial species synthesize several different sigma factors that recognize different consensus sequences.6

This variety in sigma factors gives bacteria the opportunity to maintain basal gene expression as well as to regulate gene expression in response to altered environmental or developmental signals.

The frequency at which the RNAP holoenzyme initiates transcription, also known as the strength of a promoter, is influenced by the promoter sequence and the conformation of the DNA in the promoter region. The sigma factors recognize two conserved sequences in the promoter region, known as the promoter consensus sequences. Sigma factors, or fragments of sigma factors, bind specifically to promoter DNA sequences, as shown by specific base pair substitutions in the promoter consensus sequences and amino acid substitutions in the sigma factors.

The identification of bacterial promoters is an essential step in the elucidation of gene regulation.7

As a general rule, the more complex the life cycle and environmental niche of a bacterium, the greater the number of sigma factors with corresponding promoter types. Typically, however, the most common promoter type is the one that regulates the housekeeping genes, and the corresponding major sigma factor is shared by all bacteria (sigma 70 in the well-studied E. coli, and its homologues in other species). The binding site for the sigma70 family of promoters is defined by two consensus hexamers, TTGACA and TATAAT, located at approximately −35 and −10, respectively, relative to the transcript start site (TSS) and spaced 15–21 base pairs (bp) apart.2 The RNA polymerase core enzyme associates with the major sigma factor to form the holoenzyme, which in turn binds to its cognate promoters to initiate transcription.

Figure 2 The RNA polymerase

Figure 3 Consensus sequences of sigma 70 factor

In prokaryotes, the minimum requirement for RNA polymerase binding is recognition of the promoter by the sigma factor. In general, prokaryotic RNA polymerases can interchange a number of sigma factors, which bind different promoters and initiate transcription of different groups of genes.3

Transcription Factors

Figure 4 Transcription factor binding site

Sigma factors are essential for transcription initiation in E. coli.10

In addition, promoter strength is not determined purely by the binding of the sigma factor. Other transcription factors can bind specific sequences surrounding or overlapping the promoter to either activate or repress transcription.4 Mechanistically, transcriptional activators increase, and repressors decrease, the accessibility of the DNA to the RNA polymerase.12

These transcription-regulating nuclear proteins bind to specific binding sites in the regulatory regions (e.g. promoters, enhancers) of genes, thereby activating or repressing them.

Figure 5 Transcription factor binding site

Computational methods of predicting TF binding sites in DNA are very important for understanding the molecular mechanisms of gene regulation.

The binding sites of the same transcription factor show significant sequence conservation, which is often summarized as a short (5–20 bases long) common pattern called a transcription factor binding site (TFBS) or binding consensus. Our software aims to identify the possible TFBS in promoters and locate them precisely, so that the user knows the exact sites that play a role in regulating transcription.

In prokaryotes (lower organisms without nuclei), there are fewer TFs, their motifs tend to be relatively long, and the strength of regulation for a particular gene often depends on how closely a particular site matches the consensus for the motif. The more mismatches to the consensus in a binding site, the less often the TF will bind and therefore the less control it will exert on the target gene. Our software therefore calculates the similarity between the possible TFBS in the promoter and the standard motifs, so that the user knows to what extent the transcription factor will control transcription from the promoter.

Primer Design

To facilitate the design of PCR primers for various promoters, we have developed an additional function, primer design, in this part of our program. After the promoter sequence is entered, the software determines the most suitable primers based on the method of Thomas Kämpke, Markus Kieninger, and Michael Mecklenburg.13


Data Source

RegulonDB

Genes and operons that are under control of the same TF are members of that TF's regulon. Although methods for the prediction of regulons have been substantially improved, they are still far from perfect.

Comparative genomics tools can be used to predict regulons in bacterial genomes but the procedure can lead to incorrect regulon calling. Despite this drawback, several regulon databases are available that are based on comparative genomics methods and lack experimental evidence.

Probably the most extensive and accurate database of regulons for E. coli is RegulonDB, which provides the data source for our program.


Algorithm

Experimental results show that promoters matching the consensus sequence are the strongest that have been characterized in vitro so far, which confirms the hypothesis that the consensus promoter sequence is "best". To calculate the similarity between a promoter sequence and the best sequence, we implement the PWM method6 in our program.

PWM (Position Weight Matrix)

Molecular techniques for the identification of promoters are both costly and time-consuming, hence in silico methods are an attractive and well-explored alternative. The most common in silico method to identify sigma 70 promoters uses position weight matrices (PWMs) and depends on the relative conservation of the transcription factor binding sites (TFBS, or motifs).

The algorithm can be divided into two parts, reflecting the difference between the motifs of sigma factors and those of other transcription factors.

Figure 6 The consensus sequences, the position frequency matrix and the frequency logo
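As a minimal sketch of how the three views in Figure 6 relate, the snippet below builds a position frequency matrix from a handful of aligned sites and reads the consensus off it. The example sites are invented for illustration (they are not RegulonDB data), and the function names are hypothetical.

```python
from collections import Counter

BASES = "ACGT"

def position_frequency_matrix(sites):
    """Return one {base: frequency} dict per column of equal-length aligned sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    matrix = []
    for i in range(length):
        counts = Counter(site[i] for site in sites)
        matrix.append({base: counts[base] / len(sites) for base in BASES})
    return matrix

def consensus(matrix):
    """Pick the most frequent base in every column."""
    return "".join(max(BASES, key=lambda base: column[base]) for column in matrix)

# Illustrative -10-like sites, made up for this example.
example_sites = ["TATAAT", "TATGAT", "TACAAT", "TATACT", "GATAAT"]
pfm = position_frequency_matrix(example_sites)
print(consensus(pfm))
```

A frequency logo is the same matrix drawn with letter heights scaled by how conserved each column is.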

Part 1: the recognition of other transcription factors.7

Other transcription factors are proteins that bind to specific DNA sequences (motifs) and regulate transcription from the promoter. To recognize these possible motifs in a given promoter sequence, we calculate the Matrix Similarity Score (MSS) of every possible site in the promoter sequence using the position frequency matrices of 86 transcription factors published by RegulonDB. The algorithm reports only those matches whose MSS is higher than the chosen threshold. The MSS for a subsequence x of length L is calculated from the following quantities:

$f_{i,B_i}$: frequency of nucleotide $B_i$ occurring at position $i$ of the matrix ($B_i \in \{A, T, G, C\}$)

$f_i^{\min}$: frequency of the nucleotide that is rarest at position $i$ of the matrix

$f_i^{\max}$: highest frequency at position $i$.
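The formulas themselves are not reproduced on this page. Assuming the module follows the MATCH scoring scheme cited for this part, which is defined over exactly these three frequency terms, the score of a subsequence $x = B_1 B_2 \dots B_L$ would read as follows. The symbols $I(i)$, Current, Min and Max come from that scheme and are assumptions here.

```latex
\[
\mathrm{MSS}(x) = \frac{\mathrm{Current} - \mathrm{Min}}{\mathrm{Max} - \mathrm{Min}},
\qquad
\mathrm{Current} = \sum_{i=1}^{L} I(i)\, f_{i,B_i},
\quad
\mathrm{Min} = \sum_{i=1}^{L} I(i)\, f_i^{\min},
\quad
\mathrm{Max} = \sum_{i=1}^{L} I(i)\, f_i^{\max}
\]
\[
I(i) = \sum_{B \in \{A,T,G,C\}} f_{i,B}\, \ln\!\left(4 f_{i,B}\right)
\]
```

Under this reading the MSS ranges from 0 to 1, with 1 for a site that carries the most frequent nucleotide at every position.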

The information vector $I(i)$ describes the conservation of position $i$ in a matrix. Multiplying the frequencies by the information vector leads to a higher acceptance of mismatches in less conserved regions, whereas mismatches in highly conserved regions are strongly penalized. This leads to better performance in recognizing TF binding sites compared with methods that do not use the information vector.
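To make the scoring concrete, here is a small sketch of a MATCH-style matrix similarity score and a sliding-window scan. It assumes the information vector is I(i) = Σ_B f(i,B) ln(4 f(i,B)) and that MSS = (Current − Min) / (Max − Min), as in the MATCH paper cited for this part. The toy matrix and the function names are illustrative and are not taken from the software.

```python
import math

def information_vector(pfm):
    """I(i) = sum over bases of f * ln(4 * f); zero frequencies contribute 0."""
    return [sum(f * math.log(4 * f) for f in column.values() if f > 0)
            for column in pfm]

def mss(pfm, site):
    """Matrix similarity score of `site` against `pfm`, scaled to [0, 1]."""
    if len(site) != len(pfm):
        raise ValueError("site and matrix lengths differ")
    info = information_vector(pfm)
    current = sum(i * column[base] for i, column, base in zip(info, pfm, site))
    lowest = sum(i * min(column.values()) for i, column in zip(info, pfm))
    highest = sum(i * max(column.values()) for i, column in zip(info, pfm))
    if highest == lowest:
        return 1.0  # assumption: a matrix with no information matches anything
    return (current - lowest) / (highest - lowest)

def scan(pfm, sequence, threshold):
    """Yield (start, window, score) for every window scoring >= threshold."""
    width = len(pfm)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = mss(pfm, window)
        if score >= threshold:
            yield start, window, score

# Toy 4-column matrix, invented for illustration (each column sums to 1).
toy_pfm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
hits = list(scan(toy_pfm, "GGATACGG", threshold=0.85))
print(hits)
```

The third column is uniform, so its information value is zero and a mismatch there costs nothing, which is the behaviour the paragraph above describes for poorly conserved positions.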

To determine the best threshold for the motif-finding algorithm, we tested a range of threshold values and analyzed the false negative and false positive rates at each one. The ideal threshold keeps both rates as low as possible.

Table 1 The threshold setting data
Threshold    False negative    False positive
0.5977       0.1               56.4778
0.598        0.11              57.01124
0.69         0.21              31.56962
0.7          0.23              29.15584
0.73         0.3               23.58571
0.76         0.45              20.92727
0.778        0.5               17.90796
0.84         0.63              9.945946
0.85         0.7               10.66667
0.86         0.72              10.17857
0.9          0.77              6.608696

Table 1 shows part of our test results. To keep both the false negative and false positive rates at a reasonable level, we adopt three threshold values, namely low (0.5977), medium (0.778) and high (0.85), with false negative rates of 0.1, 0.5 and 0.7, respectively. For more flexibility, users can also set their own thresholds.
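Choosing the three presets amounts to a lookup in Table 1. The sketch below picks, for each target miss rate (0.1, 0.5, 0.7), the tabulated threshold whose rate is closest. It reads the first error-rate series of Table 1 as the fraction of known sites that are missed, and the seventh threshold as 0.778, which is where it falls between 0.76 and 0.84. Both readings are assumptions, as is the helper name.

```python
# Rows of Table 1: (threshold, miss rate, false positive value).
TABLE_1 = [
    (0.5977, 0.10, 56.4778),
    (0.598, 0.11, 57.01124),
    (0.69, 0.21, 31.56962),
    (0.70, 0.23, 29.15584),
    (0.73, 0.30, 23.58571),
    (0.76, 0.45, 20.92727),
    (0.778, 0.50, 17.90796),  # assumed reading of the seventh threshold
    (0.84, 0.63, 9.945946),
    (0.85, 0.70, 10.66667),
    (0.86, 0.72, 10.17857),
    (0.90, 0.77, 6.608696),
]

def threshold_for(target_miss_rate, table=TABLE_1):
    """Return the tabulated threshold whose miss rate is closest to the target."""
    return min(table, key=lambda row: abs(row[1] - target_miss_rate))[0]

presets = {name: threshold_for(rate)
           for name, rate in [("low", 0.1), ("medium", 0.5), ("high", 0.7)]}
print(presets)
```

A user-defined threshold simply bypasses this lookup and is passed to the scan directly.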

Part 2: the recognition of sigma factor motif and the evaluation of relative promoter strength.

In the case of sigma 70 factors, the motifs are the −35 and −10 hexamers, which enclose a spacer of 15–19 bp.

Given a promoter sequence, the −10 and −35 hexamers are located by maximizing the total MSS of the two hexamers, calculated with the position frequency matrices of the sigma factor binding sites derived from RegulonDB. The calculation is subject to two constraints:

1. The spacer length (the number of base pairs between the −35 hexamer and the −10 hexamer) should lie in the range 15–20 bp;

Figure 7 The consensus sequences of sigma 70 factor binding site

2. The total MSS is the sum of the scores for the −10 and −35 hexamers and therefore lies in the interval [0, 2], with a score of 2 corresponding to the joint consensus TTGACA (−35) and TATAAT (−10).
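The search implied by these two constraints can be sketched as an exhaustive scan over every −35 position and every allowed spacer length. To stay self-contained, the example scores each hexamer by its fraction of matches to the consensus, which is only a stand-in for the MSS computed from the RegulonDB matrices. The function names and the synthetic test sequence are illustrative.

```python
CONSENSUS_35 = "TTGACA"
CONSENSUS_10 = "TATAAT"

def identity(site, consensus):
    """Stand-in for MSS: fraction of positions matching the consensus (0..1)."""
    return sum(a == b for a, b in zip(site, consensus)) / len(consensus)

def locate_hexamers(seq, min_spacer=15, max_spacer=20):
    """Return (total, start_35, spacer, start_10) for the best pair, or None."""
    seq = seq.upper()
    best = None
    for start_35 in range(len(seq) - 6 + 1):
        score_35 = identity(seq[start_35:start_35 + 6], CONSENSUS_35)
        for spacer in range(min_spacer, max_spacer + 1):
            start_10 = start_35 + 6 + spacer
            if start_10 + 6 > len(seq):
                break  # longer spacers would run past the end as well
            score_10 = identity(seq[start_10:start_10 + 6], CONSENSUS_10)
            total = score_35 + score_10  # lies in [0, 2]
            if best is None or total > best[0]:
                best = (total, start_35, spacer, start_10)
    return best

# Synthetic promoter: both consensus boxes around a 17 bp spacer.
demo = "GGCC" + "TTGACA" + "GCGCGCGCGCGCGCGCG" + "TATAAT" + "GGCCGC"
print(locate_hexamers(demo))
```

Positions are 0-based offsets into the input string. Ties keep the first pair found, which is a design choice of this sketch rather than something stated in the text.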

Score(Promoter)=score(-10 box)+score(-35 box)+score(spacer between -10 & -35 boxes)

The score of the spacer length is calculated by the algorithm proposed by Ryan K. Shultzaberger et al. for E. coli sigma70 promoters.8 However, because experimental promoter strength data covering both different motifs and different spacer lengths are scarce, the weights of the total MSS and the spacer score are only roughly determined. Currently the weights are fitted to promoter strength data from the literature16 so that the promoter score best matches the measured promoter strength. The relative weight of the total MSS of the two motifs to the spacer score is now 0.29:0.71.

In prokaryotes, the strength of sigma factor regulation for a particular gene often depends on how closely a particular site matches the consensus for the motif. The more mismatches to the consensus in a binding site, the less often the sigma factor will bind and therefore the weaker the promoter will be. Experiments have confirmed the hypothesis that the consensus promoter sequence is "best". We set the strength of the best promoter to 100% and calculate the relative strength of a given promoter from Score(Promoter).
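One way to read the weighting and the 100% normalization is sketched below. It is an interpretation, not the software's code: it assumes the 0.29:0.71 weights multiply the total motif MSS and the spacer score directly, and that the spacer score is scaled to [0, 1] with 1 for the optimal spacer. The actual spacer score follows Shultzaberger et al. and is not reproduced here, so it is passed in as a plain number.

```python
W_MOTIF, W_SPACER = 0.29, 0.71  # relative weights quoted in the text

def promoter_score(mss_35, mss_10, spacer_score):
    """Weighted sum of the total motif MSS (0..2) and a spacer score (assumed 0..1)."""
    return W_MOTIF * (mss_35 + mss_10) + W_SPACER * spacer_score

def relative_strength(mss_35, mss_10, spacer_score):
    """Score as a percentage of the score of the consensus promoter."""
    best = promoter_score(1.0, 1.0, 1.0)  # consensus boxes, optimal spacer
    return 100.0 * promoter_score(mss_35, mss_10, spacer_score) / best

print(round(relative_strength(0.8, 0.9, 1.0), 2))
```

With these assumptions a promoter whose boxes and spacer all match the consensus scores 100%, and every mismatch lowers the percentage in proportion to its weight.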

Primer design

A primer pair (p, q) is assigned the scoring vector

$$sc(p, q) = (|p|, |q|, GC(p), GC(q), T_m(p), T_m(q), sa(p), sa(q), sea(p), sea(q), pa(p, q), pea(p, q))^T \in \mathbb{R}^{12}.$$

All primers are designed to have ideal values of length, GC content, and melting temperature, which are specified externally by the designer of the hybridization experiment. These ideal values are specified for both forward and reverse primers. The ideal score vector, or reference vector, for the primer pair is

$$sc_{ideal} = (length_f, length_r, GC_f, GC_r, T_{m,f}, T_{m,r}, 0, 0, 0, 0, 0, 0)^T.$$

All ideal annealing values are set to zero, and typically $T_{m,f} = T_{m,r}$ as well as $GC_f = GC_r$. The final assessment of a primer pair (p, q) can be its deviation from the reference in terms of the $l_1$-distance

$$\lVert sc(p, q) - sc_{ideal} \rVert_1 = \sum_{i=1}^{12} \left| sc_i(p, q) - sc_{ideal,i} \right|.$$

Here, we employ a weighted distance

$$\sum_{i=1}^{12} w_i \left| sc_i(p, q) - sc_{ideal,i} \right|$$

with weights $w_i$ given in the following table.

The formulas for calculating the variations above are provided in Efficient primer design algorithms.13
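The weighted distance can be sketched in a few lines. The ideal values, the weights and the annealing numbers in this example are placeholders chosen for illustration, since the weight table is not reproduced here; only the GC content is actually computed from a sequence. The function names are hypothetical.

```python
def gc_content(primer):
    """Fraction of G and C bases in the primer."""
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)

def weighted_l1(score, ideal, weights):
    """Weighted l1 distance between a score vector and the reference vector."""
    if not (len(score) == len(ideal) == len(weights)):
        raise ValueError("vectors must have the same length")
    return sum(w * abs(s - t) for s, t, w in zip(score, ideal, weights))

# 12 components: two lengths, two GC contents, two melting temperatures,
# then the six annealing values, whose ideal is zero.
ideal = [20, 20, 0.5, 0.5, 60.0, 60.0, 0, 0, 0, 0, 0, 0]   # placeholder targets
weights = [1.0] * 12                                        # placeholder weights
forward = "ACGTGCATGCATGCAAGCTTGA"                          # made-up 22-mer
candidate = [len(forward), 19, gc_content(forward), 0.5,
             61.0, 60.0, 0, 0, 0, 0, 2, 1]                  # made-up scores
print(weighted_l1(candidate, ideal, weights))
```

Among all candidate pairs, the one with the smallest distance to the reference vector would be preferred. Raising a weight makes deviations in that component more costly.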


Results

Sigma factors recognition

Our program has an accuracy of 56% in recognizing the various types of sigma factors. We ran our program on 100 promoter sequences whose types have already been confirmed experimentally, and it recognized 56 of them correctly. For sigma 70 promoters specifically, which are the most prevalent, the recognition accuracy reaches 92%. The results are shown below.

Link to the page of results


TFBS Location

We then tested the reliability of our software regarding TFBS location; the results show that the correct site prediction rate is 64%. We used the sigma70 promoter sequences with annotated -35 and -10 regions provided by RegulonDB to test how often the binding site of a specific transcription factor is predicted correctly. We input 89 sigma70 promoter sequences and ran our program to locate the sigma factor binding sites.

The test results are as follows. The numbers represent the position of the actual -35 motif, the actual spacer length, the predicted position and the predicted spacer length, respectively.

Link to the page of results


Promoter strength correlation & experiments

To test our prediction of promoter strength, our team did a considerable amount of lab work. First, we located the -10 region of the pBAD promoter (BBa_K206000) and mutated it to obtain BBa_K1070002 and BBa_K1070003. The sequences of these promoters are given below (-10 regions are highlighted):

pBAD Strong (BBa_K206000): a c a t t g a t t a t t t g c a c g g c g t c a c a c t t t g c t a t g c c a t a g c a a g a t a g t c c a t a a g a t t a g c g g a t c c t a c c t g a c g c t t t t t a t c g c a a c t c t c t a c t g t t t c t c c a t a c c g t t t t t t t g g g c t a g c

BBa_K1070002: a c a t t g a t t a t t t g c a c g g c g t c a c a c t t t g c t a t g c c a t a g c a a t a t a g t c c a t a a g a t t a g c g g a t c c t a c c t g a c g c t t t t t a t c g c a a c t c t c t a c t g t t t c t c c a t a c c g t t t t t t t g g g c t a g c

BBa_K1070003: a c a t t g a t t a t t t g c a c g g c g t c a c a c t t t g c t a t g c c a t a g c a a g a t a a t c c a t a a g a t t a g c g g a t c c t a c c t g a c g c t t t t t a t c g c a a c t c t c t a c t g t t t c t c c a t a c c g t t t t t t t g g g c t a g c
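Because highlighting does not survive in plain text, the mutated positions can be recovered by comparing the three sequences directly. The sketch below does that with the sequences exactly as listed above. The function names are illustrative, and indices are 0-based from the first base listed.

```python
K206000 = "a c a t t g a t t a t t t g c a c g g c g t c a c a c t t t g c t a t g c c a t a g c a a g a t a g t c c a t a a g a t t a g c g g a t c c t a c c t g a c g c t t t t t a t c g c a a c t c t c t a c t g t t t c t c c a t a c c g t t t t t t t g g g c t a g c"
K1070002 = "a c a t t g a t t a t t t g c a c g g c g t c a c a c t t t g c t a t g c c a t a g c a a t a t a g t c c a t a a g a t t a g c g g a t c c t a c c t g a c g c t t t t t a t c g c a a c t c t c t a c t g t t t c t c c a t a c c g t t t t t t t g g g c t a g c"
K1070003 = "a c a t t g a t t a t t t g c a c g g c g t c a c a c t t t g c t a t g c c a t a g c a a g a t a a t c c a t a a g a t t a g c g g a t c c t a c c t g a c g c t t t t t a t c g c a a c t c t c t a c t g t t t c t c c a t a c c g t t t t t t t g g g c t a g c"

def clean(seq):
    """Drop the spaces between bases and upper-case the sequence."""
    return seq.replace(" ", "").upper()

def point_differences(reference, variant):
    """Return (0-based index, reference base, variant base) for each mismatch."""
    if len(reference) != len(variant):
        raise ValueError("sequences must have the same length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(reference, variant)) if a != b]

wild_type = clean(K206000)
for name, seq in [("BBa_K1070002", K1070002), ("BBa_K1070003", K1070003)]:
    for index, old, new in point_differences(wild_type, clean(seq)):
        print(f"{name}: {old}->{new} at index {index}")
```

Each mutant differs from BBa_K206000 by a single substitution inside the hexamer GATAGT: the first base becomes T in BBa_K1070002 and the fifth becomes A in BBa_K1070003. Both changes move the hexamer closer to the TATAAT consensus.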

Subsequently, we measured the fluorescence intensity of these promoters and related it to the actual promoter strength. The experimental results are shown in Figure 8.

Figure 8 The fluorescence intensity reached a stable stage after 60 min. The fluorescence intensity was measured with the inducer L-arabinose at 1 mM, at 37℃ and 200 rpm. The promoter strength is expressed as the fluorescence intensity relative to the control group (K206000 without L-arabinose induction).

We then fitted the predicted strength against the measured strength. As can be seen in Figure 9, the coefficient of determination is 0.8924.

Figure 9 The correlation between experimental data and the strength predicted by our program.

Future work

Apply our algorithms to more species. Pro-decoder is currently designed expressly for the prediction and evaluation of E. coli promoters. In the near future we will study the transcription regulation mechanisms of other species and try to apply our algorithms to a wider range of species.

Enhance promoter strength prediction accuracy. Because our experimental data are so limited, the weights of the spacer length and the motif similarity are only roughly determined, which leads to a weaker correlation between the predicted and the measured promoter strength. In the future we hope to obtain more experimental data on the effect of spacer length and motif similarity on promoter strength, so that we can revise the weight coefficients of the two factors and get more reliable results.

The next version of this part of our program will be able to analyze the promoters not only of E. coli but also of other species such as Bacillus subtilis. We will integrate the transcription factor binding site data of more species into our database and use the PWM algorithm to predict the TFBS in their promoters.


References

[1] Wösten, M., Eubacterial sigma-factors. FEMS microbiology reviews 1998, 22 (3), 127-150.
[2] Shultzaberger, R. K.; Chen, Z.; Lewis, K. A.; Schneider, T. D., Anatomy of E. coli σ70 promoters. Nucleic acids research 2007, 35 (3), 771-788.
[3] Paget, M.; Helmann, J. D., The sigma70 family of sigma factors. Genome Biol 2003, 4 (1), 203.
[4] Jensen, S. T.; Liu, X. S.; Zhou, Q.; Liu, J. S., Computational discovery of gene regulatory binding motifs: a Bayesian perspective. Statistical Science 2004, 19 (1), 188-204.
[5] Kämpke, T.; Kieninger, M.; Mecklenburg, M., Efficient primer design algorithms. Bioinformatics 2001, 17 (3), 214-225.
[6] (a) Rhodius, V. A.; Mutalik, V. K., Predicting strength and function for promoters of the E. coli alternative sigma factor, σE. Proceedings of the National Academy of Sciences 2010, 107 (7), 2854-2859; (b) Mulligan, M. E.; Brosius, J.; McClure, W. R., Characterization in vitro of the effect of spacer length on the activity of E. coli RNA polymerase at the TAC promoter. Journal of Biological Chemistry 1985, 260 (6), 3529-3538; (c) Qureshi, S. A.; Jackson, S. P., Sequence-Specific DNA Binding by the S. shibatae TFIIB Homolog, TFB, and Its Effect on Promoter Strength. Molecular cell 1998, 1 (3), 389-400.
[7] Kel, A. E.; Gößling, E.; Reuter, I.; Cheremushkin, E.; Kel-Margoulis, O. V.; Wingender, E., MATCH™: a tool for searching transcription factor binding sites in DNA sequences. Nucleic acids research 2003, 31 (13), 3576-3579.
[8] Deuschle, U.; Kammerer, W.; Gentz, R.; Bujard, H., Promoters of E. coli: a hierarchy of in vivo strength indicates alternate structures. The EMBO journal 1986, 5 (11), 2987.