Team:XMU Software/Project

From 2013.igem.org

Revision as of 05:08, 27 October 2013 by LiKaiQiu

PROJECT
Our project includes two independent software tools: the Brick Worker and E' NOTE. The former is a software suite for the evaluation and optimization of BioBricks, i.e., promoters, RBSs, protein-coding sequences and terminators. E' NOTE is a web application that serves as an experiment assistant. Its functions, such as experiment recording and customizable experiment templates, make the experimental process easier and more enjoyable.
Promoter-decoder


Abstract

In a promoter sequence, the sigma factor binding site and other transcription factor binding sites significantly affect binding strength. Some software has been developed for annotating promoters, but it mostly focuses on predicting either other transcription factors or one particular type of sigma factor, and fails to analyze promoters with both sigma factors and other transcription factors [1-2]. To solve this problem, we designed a module of our software that can analyze and evaluate promoters.

Our software uses the PWM method to calculate the similarity between promoter sequences and the position frequency matrices of transcription factor binding sites (TFBS), both to locate the TFBS and to predict the relative strength of the promoter. Promoter-decoder stands out from its counterparts by combining all-round analysis with promoter strength prediction. It enables users to identify promoter types, predict promoter strength, change it by mutating key sites, and even change the properties of a promoter by adding new TFBS to the promoter sequence.


Background

Sigma Factors

Bacteria encode several thousand different proteins, which are necessary for normal cell functions or for adaptation to environmental changes [3]. These proteins are not required at the same time or in the same amount. Regulation of gene expression therefore enables the cell to control the production of proteins needed for its life cycle or for adaptation to extracellular changes. The various steps of transcription and translation are subject to different regulatory mechanisms [4].

The most prominent step in gene regulation is the initiation of transcription in which the DNA-dependent RNA polymerase (RNAP) is the key enzyme. The RNAP or the RNAP core enzyme is the catalytic machinery for the synthesis of RNA from a DNA template. However, RNAP cannot initiate transcription by itself. Initiation of transcription requires an additional polypeptide known as a sigma-factor [5]. Sigma-factors are a family of relatively small proteins that can associate in a reversible way with the RNAP core enzyme. Together, the sigma-factor and the RNAP core enzyme form an initiation-specific enzyme, the RNAP holoenzyme.

Figure 1 The initiation of transcription

The sigma-factor directs RNA polymerase to a specific class of promoter sequences. Most bacterial species synthesize several different sigma-factors that recognize different consensus sequences [6].

This variety of sigma-factors allows bacteria to maintain basal gene expression and to regulate gene expression in response to altered environmental or developmental signals.

The frequency at which the RNAP holoenzyme initiates transcription, also known as the strength of a promoter, is influenced by the promoter sequence and the conformation of the DNA in the promoter region. The sigma-factors recognize two conserved sequences in the promoter region, known as the promoter consensus sequences. The role of sigma-factors in promoter recognition is supported by the specific binding of sigma-factors, or fragments of them, to promoter DNA sequences, and by the effects of specific base pair and amino acid substitutions in the promoter consensus sequences or sigma factors.

The identification of bacterial promoters is an essential step in the elucidation of gene regulation [7].

As a general rule, the more complex the life cycle and environmental niche of a bacterium, the greater the number of sigma factors with corresponding promoter types. Typically, however, the most common promoter type is the one that regulates the housekeeping genes, and the corresponding major sigma-factor is shared by all bacteria (sigma 70 in the well-studied E. coli, and its homologues in other species). The binding site for the sigma70 family of promoters is defined by two consensus hexamers, TTGACA and TATAAT, located at approximately −35 and −10, respectively, relative to the transcription start site (TSS) and spaced 15–21 base pairs (bp) apart [2]. The RNA polymerase core enzyme associates with the major sigma-factor to form the holoenzyme, which in turn binds to its cognate promoters to initiate transcription.

Figure 2 The RNA polymerase

Figure 3 Consensus sequences of sigma 70 factor

In prokaryotes, the minimum requirement for RNA polymerase binding is recognition of the promoter by the sigma factor. In general, prokaryotic RNA polymerases can interchange a number of sigma factors, which bind different groups of promoters and initiate transcription of different groups of genes [3].

Transcription Factors

Figure 4 Transcription factor binding site

Sigma factors are essential for transcription initiation in E. coli [10].

In addition, promoter strength is not determined purely by the binding of the sigma factor. Other transcription factors can bind specific sequences surrounding or overlapping the promoter to either activate or repress transcription [4]. Transcriptional activators and repressors act by increasing or decreasing the accessibility of the DNA to RNA polymerase [12].

These transcription-regulating nuclear proteins bind to specific binding sites in the regulatory regions (e.g. promoters, enhancers) of the genes thus providing their activation or repression.

Figure 5 Transcription factor binding site

Computational methods of predicting TF binding sites in DNA are very important for understanding the molecular mechanisms of gene regulation.

The binding sites of the same transcription factor show significant sequence conservation, which is often summarized as a short (5–20 bases long) common pattern called a transcription factor binding site (TFBS) or binding consensus. Our software aims to identify the possible TFBS in promoters and locate them precisely, so that the user knows the exact sites that play a role in regulating transcription.

In prokaryotes (lower organisms without nuclei), there are fewer TFs, their motifs tend to be relatively long, and the strength of regulation for a particular gene often depends on how closely a particular site matches the consensus for the motif. The more mismatches to the consensus in a binding site, the less often the TF will bind and therefore the less control it will exert on the target gene. Our software therefore calculates the similarity between each possible TFBS in the promoter and the standard motif, so that the user knows to what extent the transcription factor controls transcription from the promoter.

Primer Design

To facilitate the design of PCR primers for various promoters, we have developed an additional function in this part of our program: primer design. After the promoter sequence is entered, the software determines the most suitable primers based on the method of Thomas Kämpke, Markus Kieninger, and Michael Mecklenburg [13].


Data Source

RegulonDB

Genes and operons that are under control of the same TF are members of that TF's regulon. Although methods for the prediction of regulons have been substantially improved, they are still far from perfect.

Comparative genomics tools can be used to predict regulons in bacterial genomes but the procedure can lead to incorrect regulon calling. Despite this drawback, several regulon databases are available that are based on comparative genomics methods and lack experimental evidence.

Probably the most extensive and accurate database of regulons for E. coli is RegulonDB, which provides the data source for our program.


Algorithm

Experimental results show that promoters matching the consensus sequence are the strongest promoters characterized in vitro so far, which confirms the hypothesis that the consensus promoter sequence is "best". To calculate the similarity between a promoter sequence and the best sequence, we implement the PWM method [6] in our program.

PWM (Position Weight Matrix)

Molecular techniques for the identification of promoters are both costly and time-consuming; hence in silico methods are an attractive and well-explored alternative. The most common in silico method to identify sigma 70 promoters uses position weight matrices (PWMs) and depends on the relative conservation of the transcription factor binding sites (TFBS, or motifs).

The algorithm is divided into two parts, reflecting the difference between the motifs of sigma factors and those of other transcription factors.

Figure 6 The consensus sequences, the position frequency matrix and the frequency logo

Part 1: the recognition of other transcription factors [7].

Other transcription factors are proteins that bind to specific DNA sequences (motifs) and regulate transcription from the promoter. To recognize possible motifs in a given promoter sequence, we calculate the Matrix Similarity Score (MSS) of every possible site in the promoter sequence using the position frequency matrices of 86 transcription factors published by RegulonDB. The algorithm reports only those matches whose MSS is higher than the set threshold. The MSS for a subsequence x of length L is calculated from the following quantities:

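$f_{i,B}$: frequency of nucleotide $B$ at position $i$ of the matrix ($B \in \{A, T, G, C\}$);

$f_i^{\min}$: frequency of the nucleotide that is rarest at position $i$ of the matrix;

$f_i^{\max}$: highest frequency at position $i$.

For a subsequence $x = b_1 b_2 \dots b_L$, the score as defined in MATCH [7] is

$$\mathrm{MSS} = \frac{\mathrm{Current} - \mathrm{Min}}{\mathrm{Max} - \mathrm{Min}}, \qquad \mathrm{Current} = \sum_{i=1}^{L} I(i)\, f_{i,b_i}, \quad \mathrm{Min} = \sum_{i=1}^{L} I(i)\, f_i^{\min}, \quad \mathrm{Max} = \sum_{i=1}^{L} I(i)\, f_i^{\max}.$$

The information vector

$$I(i) = \sum_{B \in \{A,T,G,C\}} f_{i,B} \ln\left(4 f_{i,B}\right), \qquad i = 1, \dots, L,$$

describes the conservation of the positions i in a matrix. Multiplication of the frequencies with the information vector leads to a higher acceptance of mismatches in less conserved regions, whereas mismatches in highly conserved regions are very much discouraged. This leads to better performance in the recognition of TF binding sites compared with methods that do not use the information vector.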
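The scoring described above can be sketched in a few lines of Python. This is a minimal illustration that assumes the MATCH-style definition of the MSS (current score rescaled between the minimum and maximum achievable scores, each weighted by the information vector). The matrix `TOY_PFM` and the function names are invented for the example and are not RegulonDB data or the actual Promoter-decoder code.

```python
import math

def information_vector(pfm):
    """I(i) = sum over B of f_iB * ln(4 * f_iB); zero frequencies contribute 0."""
    return [sum(f * math.log(4 * f) for f in col.values() if f > 0) for col in pfm]

def mss(site, pfm):
    """Matrix similarity score of one candidate site (same length as the matrix)."""
    info = information_vector(pfm)
    current = sum(i * col[b] for i, col, b in zip(info, pfm, site))
    lowest = sum(i * min(col.values()) for i, col in zip(info, pfm))
    highest = sum(i * max(col.values()) for i, col in zip(info, pfm))
    return (current - lowest) / (highest - lowest)

def scan(sequence, pfm, threshold):
    """Return (start, score) for every window whose MSS exceeds the threshold."""
    width = len(pfm)
    hits = []
    for start in range(len(sequence) - width + 1):
        score = mss(sequence[start:start + width], pfm)
        if score > threshold:
            hits.append((start, score))
    return hits

# Hypothetical 4-position matrix: one dict of base frequencies per position.
TOY_PFM = [
    {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10},
    {"A": 0.10, "C": 0.10, "G": 0.70, "T": 0.10},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},  # unconserved: I(i) = 0
    {"A": 0.10, "C": 0.70, "G": 0.10, "T": 0.10},
]

hits = scan("TTAGTCAA", TOY_PFM, threshold=0.9)
```

The unconserved third position has an information value of zero, so a mismatch there does not change the score, which is the behaviour the information vector is meant to produce.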

To determine the best threshold of the motif finding algorithm, we test various threshold values and analyze the true negative and false positive rate of each threshold value. The ideal threshold is supposed to have both the least true negative and false positive rates.

Table 1 The threshold setting data

| Threshold | True negative | False positive |
|---|---|---|
| 0.5977 | 0.1 | 56.4778 |
| 0.598 | 0.11 | 57.01124 |
| 0.69 | 0.21 | 31.56962 |
| 0.7 | 0.23 | 29.15584 |
| 0.73 | 0.3 | 23.58571 |
| 0.76 | 0.45 | 20.92727 |
| 0.0778 | 0.5 | 17.90796 |
| 0.84 | 0.63 | 9.945946 |
| 0.85 | 0.7 | 10.66667 |
| 0.86 | 0.72 | 10.17857 |
| 0.9 | 0.77 | 6.608696 |

The table above shows part of our test results. To keep both the true negative and false positive rates at a reasonable level, we adopt three threshold values, namely low (0.5977), medium (0.0778) and high (0.85), with true negative rates of 0.1, 0.5 and 0.7, respectively. For more flexibility, we also allow users to set their own thresholds.

Part 2: the recognition of sigma factor motif and the evaluation of relative promoter strength.

In the case of sigma 70 factors, the motifs are the −35 and −10 hexamers, which enclose a spacer of length 15–19 bp.

Given a promoter sequence, the -10 and -35 hexamers are located using the total MSS of the two hexamers, calculated with the position frequency matrices of the sigma factor binding sites derived from RegulonDB. The calculation is subject to two constraints:

1. The spacer length (the number of base pairs between the −35 hexamer and the −10 hexamer) should lie in the range 15–20.

2. The total MSS is the sum of the scores for the −10 and −35 hexamers and therefore lies in the interval [0, 2], with a score of 2 corresponding to the joint consensus TTGACA (−35) and TATAAT (−10).

Figure 7 The consensus sequences of sigma 70 factor binding site
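The constrained search described above can be sketched as follows. To keep the example self-contained, the per-hexamer score is the fraction of positions that match the consensus, which is only a stand-in for the matrix-based MSS. The function names and the default spacer range of 15–20 bp (taken from constraint 1) are assumptions of this sketch, not the actual implementation.

```python
CONSENSUS_35 = "TTGACA"
CONSENSUS_10 = "TATAAT"

def identity(site, consensus):
    """Fraction of matching positions; a simple stand-in for the PFM-based MSS."""
    return sum(a == b for a, b in zip(site, consensus)) / len(consensus)

def locate_hexamers(seq, min_spacer=15, max_spacer=20):
    """Return (total_score, pos35, pos10, spacer) for the best-scoring hexamer pair.

    Positions are 0-based indices into seq; None is returned if seq is too short.
    """
    seq = seq.upper()
    best = None
    for p35 in range(len(seq) - 12 - min_spacer + 1):
        s35 = identity(seq[p35:p35 + 6], CONSENSUS_35)
        for spacer in range(min_spacer, max_spacer + 1):
            p10 = p35 + 6 + spacer
            if p10 + 6 > len(seq):
                break  # the -10 hexamer would run past the end of the sequence
            total = s35 + identity(seq[p10:p10 + 6], CONSENSUS_10)
            if best is None or total > best[0]:
                best = (total, p35, p10, spacer)
    return best

# Synthetic promoter: both consensus hexamers separated by a 17 bp spacer.
example = "GG" + "TTGACA" + "C" * 17 + "TATAAT" + "GG"
result = locate_hexamers(example)
```

Because every hexamer score lies in [0, 1], the total lies in [0, 2], and only the joint consensus reaches 2.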

Score(Promoter)=score(-10 box)+score(-35 box)+score(spacer between -10 & -35 boxes)

The spacer length score is calculated with the algorithm proposed by Ryan K. Shultzaberger et al. for E. coli sigma70 promoters [8]. However, because experimental data on the strength of promoters that vary in both motifs and spacer length are scarce, the relative weight of the total MSS and the spacer score could only be determined very roughly. Currently the weight is fitted to the promoter strength data from the literature [16] so that the promoter score best matches the promoter strength. The relative weight of the total MSS of the two motifs to the spacer score is now 0.29:0.71.

In prokaryotes, the strength of sigma factor regulation for a particular gene often depends on how closely a particular site matches the consensus for the motif. The more mismatches to the consensus in a binding site, the less often the sigma factor will bind and therefore the less strength the promoter will have. Experiments have confirmed the hypothesis that the consensus promoter sequence is "best". We set the strength of the best promoter to 100% and calculate the relative strength of a given promoter from Score(Promoter).
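One way to read the weighting above is sketched below. The exact way the weights enter the score is not spelled out here, so this example makes two assumptions of its own: the total MSS is first rescaled from [0, 2] to [0, 1], and the spacer score is taken to lie in [0, 1] with 1 for the optimal spacer. Under those assumptions the consensus promoter scores 100%. The function name is hypothetical.

```python
W_MSS = 0.29     # weight of the total MSS of the two hexamers (from the text)
W_SPACER = 0.71  # weight of the spacer score (from the text)

def relative_strength(total_mss, spacer_score):
    """Relative promoter strength in percent.

    Assumptions of this sketch: total_mss lies in [0, 2] and is rescaled to
    [0, 1]; spacer_score lies in [0, 1], with 1 for the optimal spacer length.
    """
    if not 0.0 <= total_mss <= 2.0:
        raise ValueError("total_mss must lie in [0, 2]")
    if not 0.0 <= spacer_score <= 1.0:
        raise ValueError("spacer_score must lie in [0, 1]")
    combined = W_MSS * (total_mss / 2.0) + W_SPACER * spacer_score
    return 100.0 * combined

best = relative_strength(2.0, 1.0)      # consensus hexamers, optimal spacer
weaker = relative_strength(1.5, 0.8)    # partial match, sub-optimal spacer
```

With these weights the spacer score dominates: the second call gives 100 × (0.29 × 0.75 + 0.71 × 0.8), which is roughly 78.6%.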

Primer design

A primer pair (p, q) is assigned the scoring vector

$$sc(p, q) = \left(|p|, |q|, GC(p), GC(q), T_m(p), T_m(q), sa(p), sa(q), sea(p), sea(q), pa(p, q), pea(p, q)\right)^T \in \mathbb{R}^{12}.$$

All primers are designed to have ideal values of length, GC content, and melting temperature which are specified externally by the designer of the hybridization experiment. These ideal values are to be specified for forward and reverse primers. The ideal score vector or reference vector for the primer pair is

$$sc_{ideal} = \left(length_f, length_r, GC_f, GC_r, T_{m,f}, T_{m,r}, 0, 0, 0, 0, 0, 0\right)^T.$$

All ideal annealing values are set to zero, and typically $T_{m,f} = T_{m,r}$ as well as $GC_f = GC_r$. The final assessment of a primer pair (p, q) can be its deviation from the reference in terms of the $l_1$-distance

$$\left\| sc(p, q) - sc_{ideal} \right\|_1 = \sum_{i=1}^{12} \left| sc_i(p, q) - sc_{ideal,i} \right|.$$

Here, we employ a weighted distance

$$\sum_{i=1}^{12} w_i \left| sc_i(p, q) - sc_{ideal,i} \right|$$

with weights given in the following table.

The formulas for calculating the variables above are provided in Efficient primer design algorithms [13].
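A small sketch of the ranking step follows: each candidate primer pair is reduced to its 12-component scoring vector and the pair with the smallest weighted l1 distance to the reference is chosen. The weights, the ideal values and the two candidate vectors are made-up numbers for illustration, since the weight table is not reproduced here.

```python
def weighted_l1(score, ideal, weights):
    """Weighted l1 distance between a scoring vector and the reference vector."""
    if not (len(score) == len(ideal) == len(weights)):
        raise ValueError("all three vectors must have the same length")
    return sum(w * abs(s - t) for w, s, t in zip(weights, score, ideal))

# Order of components: |p|, |q|, GC(p), GC(q), Tm(p), Tm(q),
#                      sa(p), sa(q), sea(p), sea(q), pa(p,q), pea(p,q)
IDEAL = (20, 20, 50, 50, 60, 60, 0, 0, 0, 0, 0, 0)                  # hypothetical
WEIGHTS = (0.5, 0.5, 1, 1, 1, 1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1)      # hypothetical

candidates = {
    "pair_a": (20, 21, 50, 55, 60, 62, 0, 0, 0, 0, 2, 0),
    "pair_b": (24, 24, 40, 45, 55, 66, 4, 0, 0, 0, 6, 2),
}

distances = {name: weighted_l1(vec, IDEAL, WEIGHTS) for name, vec in candidates.items()}
best_pair = min(distances, key=distances.get)
```

Setting the annealing targets to zero means that any self-annealing or pair-annealing score counts purely as a penalty.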


Results

Sigma factors recognition

Our program correctly recognizes the sigma factor type of 56% of the promoters tested. We ran our program on 100 promoter sequences whose types had already been confirmed experimentally, and 56 of them were recognized correctly. For sigma 70 promoters, which are the most prevalent, the recognition accuracy reached 92%. The results are shown below.

Link to the page of results


TFBS Location

We then tested the reliability of our software in locating TFBS; the results show a correct site prediction rate of 64%. We used sigma70 promoter sequences with annotated -35 and -10 regions provided by RegulonDB to test how often the binding site of a specific transcription factor is predicted correctly. We input 89 sigma70 promoter sequences and ran our program to locate the sigma factor binding site.

The test results are as follows. The numbers represent the position of the actual -35 motif, the actual spacer length, the predicted position and the predicted spacer length, respectively.

Link to the page of results


Promoter strength correlation & experiments

To test our prediction of promoter strength, our team did a considerable amount of lab work. First, we located the -10 region of the pBAD promoter (BBa_K206000) and mutated it to obtain BBa_K1070002 and BBa_K1070003. The sequences of these promoters are given below; they differ only within the -10 region:

PbadStrong (BBa_K206000): acattgattatttgcacggcgtcacactttgctatgccatagcaagatagtccataagattagcggatcctacctgacgctttttatcgcaactctctactgtttctccataccgtttttttgggctagc

BBa_K1070002: acattgattatttgcacggcgtcacactttgctatgccatagcaatatagtccataagattagcggatcctacctgacgctttttatcgcaactctctactgtttctccataccgtttttttgggctagc

BBa_K1070003: acattgattatttgcacggcgtcacactttgctatgccatagcaagataatccataagattagcggatcctacctgacgctttttatcgcaactctctactgtttctccataccgtttttttgggctagc

Subsequently, we measured the fluorescence intensity driven by these promoters and related it to the actual promoter strength. The experimental results are shown in Figure 8.

Figure 8 The fluorescence intensity reached a stable stage after 60 min. The fluorescence intensity was measured with the inducer L-arabinose at 1 mM. The promoter strength is expressed as the fluorescence intensity relative to the control group (K206000 without L-arabinose induction).

Then we fitted the actual strength against the predicted strength. As can be seen in Figure 9, the coefficient of determination is 0.8924.

Figure 9 The correlation between experimentally determined strength and the strength predicted by our program.

Future work

Apply our algorithms to more species. Promoter-decoder is currently designed expressly for the prediction and evaluation of E. coli promoters. In the near future we will study the transcription regulation mechanisms of other species and try to apply our algorithms to a wider range of species.

Enhance promoter strength prediction accuracy. Because our experimental data are so limited, the weights of the spacer length and the motif similarity are only roughly determined, which weakens the correlation between predicted and actual promoter strength. In the future we hope to obtain more experimental data on the effect of spacer length and motif similarity on promoter strength, so that we can revise the weight coefficients of the two factors and get more reliable results.

The next version of this part of our program will be able to analyze not only the promoters of E. coli but also those of other species such as Bacillus subtilis. We will integrate the transcription factor binding site data of more species into our database and use the PWM algorithm to predict the TFBS in their promoters.


References

[1] Wösten, M., Eubacterial sigma-factors. FEMS microbiology reviews 1998, 22 (3), 127-150.
[2] Shultzaberger, R. K.; Chen, Z.; Lewis, K. A.; Schneider, T. D., Anatomy of E. coli σ70 promoters. Nucleic acids research 2007, 35 (3), 771-788.
[3] Paget, M.; Helmann, J. D., The sigma70 family of sigma factors. Genome Biol 2003, 4 (1), 203.
[4] Jensen, S. T.; Liu, X. S.; Zhou, Q.; Liu, J. S., Computational discovery of gene regulatory binding motifs: a Bayesian perspective. Statistical Science 2004, 19 (1), 188-204.
[5] Kämpke, T.; Kieninger, M.; Mecklenburg, M., Efficient primer design algorithms. Bioinformatics 2001, 17 (3), 214-225.
[6] (a) Rhodius, V. A.; Mutalik, V. K., Predicting strength and function for promoters of the E. coli alternative sigma factor, σE. Proceedings of the National Academy of Sciences 2010, 107 (7), 2854-2859; (b) Mulligan, M. E.; Brosius, J.; McClure, W. R., Characterization in vitro of the effect of spacer length on the activity of E. coli RNA polymerase at the TAC promoter. Journal of Biological Chemistry 1985, 260 (6), 3529-3538; (c) Qureshi, S. A.; Jackson, S. P., Sequence-Specific DNA Binding by the S. shibatae TFIIB Homolog, TFB, and Its Effect on Promoter Strength. Molecular cell 1998, 1 (3), 389-400.
[7] Kel, A. E.; Gößling, E.; Reuter, I.; Cheremushkin, E.; Kel-Margoulis, O. V.; Wingender, E., MATCH™: a tool for searching transcription factor binding sites in DNA sequences. Nucleic acids research 2003, 31 (13), 3576-3579.
[8] Deuschle, U.; Kammerer, W.; Gentz, R.; Bujard, H., Promoters of E. coli: a hierarchy of in vivo strength indicates alternate structures. The EMBO journal 1986, 5 (11), 2987.








RBS-decoder


Abstract

The efficiency of translation in bacteria is greatly influenced by the binding affinity between the ribosome and the RBS, which can be measured by RBS strength. Determining the strength of an RBS sequence experimentally can be very laborious, whereas our software can estimate it easily. RBS-decoder is a software tool for evaluating RBS strength and locating SD sequences. The program uses the same PWM method to calculate the similarity between the RBS sequence and the position frequency matrix of SD sequences, and transforms the similarity into the relative strength of the RBS sequence.


Background

Translational efficiency in Escherichia coli is generally determined at the stage of initiation. Several principal mRNA sequence elements can affect the kinetics of ternary initiation complex formation (30S-mRNA-fMet-tRNA), notably the SD sequence and the start codon (ATG). The SD sequence base-pairs with an RNA molecule that forms part of the bacterial ribosome (the 16S rRNA), while the start codon base-pairs with the initiator tRNA, which is bound to the ribosome. Besides the SD sequence and the start codon, the spacer between them also influences the RBS strength: these two sequences need to be positioned approximately 6-7 nucleotides apart so that both can make contact with the appropriate parts of the ribosome complex [1].


Introduction

How do bacterial Ribosome Binding Sites work?

The bacterial ribosome binds to particular sequences on an mRNA, primarily the SD sequence and the start codon (ATG). The SD sequence base-pairs with an RNA molecule that forms part of the bacterial ribosome (the 16S rRNA), while the start codon base-pairs with the initiator tRNA, which is bound to the ribosome. In addition to the SD sequence and the start codon being important, these two sequences need to be positioned approximately 6-7 nucleotides apart so that both can make contact with the appropriate parts of the ribosome complex [1].

The Shine-Dalgarno sequence

Figure 1 The RBS sequence logo representing the sequences of 149 RBS from E. coli. The height of each letter represents the frequency of the base at that location. From Tom Schneider, "A Gallery of Sequence Logos".

The end of the 16S rRNA that is free to bind with the mRNA includes the sequence 5′–ACCUCC–3′. The complementary sequence, 5′–GGAGGU–3′, named the Shine-Dalgarno sequence, can be found in whole or in part in many bacterial mRNAs. Very roughly speaking, ribosome binding sites with purine-rich sequences (A's and G's) close to the Shine-Dalgarno sequence lead to high rates of translation initiation, whereas sequences that are very different from the Shine-Dalgarno sequence lead to low or negligible translation rates. The RBS sequence logo in Figure 1 (where the height of a letter indicates the frequency of the letter at that location) shows how common the sequence is.


Algorithms

As we know, the RBS strength is greatly influenced by the SD sequence, to which the 16S rRNA of the ribosome binds, so the strength can be estimated from the binding free energy between the SD sequence and the 16S rRNA. We therefore designed a program to calculate the binding free energy, but the results show that the correlation between the free energy and the RBS strength is rather weak (R² = 0.5517). We therefore decided to look for other algorithms with better accuracy.

We were inspired by the strength prediction algorithm used in the promoter part, in which the similarity to the sigma factor's PWM is linked to the binding affinity between the protein and the DNA sequence. We obtained the position frequency matrix of the SD sequence of E. coli and use the PWM method (described in detail in the promoter part) to calculate the similarity between the RBS sequence and the position frequency matrix. Unlike in the promoter part, the spacer between the SD sequence and the start codon and the start codon itself both act as constraints in locating the SD sequence: the spacer is confined to 3-16 bp and the start codon to ATG/TTG/GTG. As in the prediction of promoter strength, the length of the spacer between the SD sequence and the start codon also contributes to the RBS strength. The optimal spacer length is 7 bp, and the spacer score is calculated using the same algorithm applied in the promoter part [2]. The weight of the spacer's influence on the strength is derived from the promoter strength algorithm, in which the weight ratio of the total MSS to the spacer is 0.29:0.71. Since in the promoter the total MSS is the sum over two motifs while the SD sequence is only one motif, the weight ratio between the MSS (SD sequence) and the spacer is 0.29:0.355.

Nucleotide frequencies for the RBS model

| Nucleotide | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| T | 0.161 | 0.050 | 0.012 | 0.071 | 0.115 |
| C | 0.077 | 0.037 | 0.012 | 0.025 | 0.046 |
| A | 0.681 | 0.105 | 0.105 | 0.861 | 0.164 |
| G | 0.077 | 0.808 | 0.960 | 0.043 | 0.659 |

Figure 2 The RBS nucleotide position frequency matrix [3].
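As an illustration, the matrix above can be used to locate the SD sequence upstream of a given start codon. The sketch assumes the same MATCH-style similarity score as in the promoter part and applies the 3-16 bp spacer and ATG/TTG/GTG start codon constraints described earlier. The function names and the example mRNA are invented for the example, and the spacer-length contribution to the strength is left out.

```python
import math

# Position frequency matrix copied from the table above (positions 1-5).
RBS_PFM = {
    "T": [0.161, 0.050, 0.012, 0.071, 0.115],
    "C": [0.077, 0.037, 0.012, 0.025, 0.046],
    "A": [0.681, 0.105, 0.105, 0.861, 0.164],
    "G": [0.077, 0.808, 0.960, 0.043, 0.659],
}
START_CODONS = {"ATG", "TTG", "GTG"}
WIDTH = 5
COLUMNS = [{b: RBS_PFM[b][i] for b in "ACGT"} for i in range(WIDTH)]
INFO = [sum(f * math.log(4 * f) for f in col.values() if f > 0) for col in COLUMNS]

def sd_mss(site):
    """Similarity of a 5-nt site to the matrix, rescaled to [0, 1]."""
    current = sum(i * col[b] for i, col, b in zip(INFO, COLUMNS, site))
    lowest = sum(i * min(col.values()) for i, col in zip(INFO, COLUMNS))
    highest = sum(i * max(col.values()) for i, col in zip(INFO, COLUMNS))
    return (current - lowest) / (highest - lowest)

def locate_sd(mrna, start_pos, min_spacer=3, max_spacer=16):
    """Return (score, sd_start, spacer) of the best SD candidate upstream of
    the start codon at start_pos (0-based), or None if none fits."""
    mrna = mrna.upper().replace("U", "T")
    if mrna[start_pos:start_pos + 3] not in START_CODONS:
        raise ValueError("no ATG/TTG/GTG start codon at start_pos")
    best = None
    for spacer in range(min_spacer, max_spacer + 1):
        sd_start = start_pos - spacer - WIDTH
        if sd_start < 0:
            break  # candidate would start before the beginning of the sequence
        score = sd_mss(mrna[sd_start:sd_start + WIDTH])
        if best is None or score > best[0]:
            best = (score, sd_start, spacer)
    return best

# Made-up mRNA fragment: AGGAG, a 7-nt spacer, then the ATG start codon.
example = "TTTTT" + "AGGAG" + "CCCCCCC" + "ATG" + "GCT"
score, sd_start, spacer = locate_sd(example, start_pos=17)
```

The most frequent base at each position of the matrix spells AGGAG, so that site receives the maximum similarity.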

Results

We used the RBS sequences listed in the iGEM Registry with experimentally determined relative strengths [4]. The correlation between the RBS strength predicted by our software and the actual relative strength is strong, with a coefficient of determination of 0.8039.

Figure 3 The correlation between actual RBS strength and predicted strength

Future work

Due to the scarcity of experimental data, the relative weight of the SD sequence and the spacer length used currently is only roughly determined, which may undermine the accuracy of RBS strength prediction. To improve our program further, we will try to obtain more reliable experimental data to determine the weight used in our algorithm accurately and, we hope, raise the accuracy of RBS-decoder.

In the next version of RBS-decoder, the secondary structure of the RBS sequence will be shown in the software, and we will also include the SD sequence data of other species in order to predict RBS strength for a larger range of species.


Reference

[1] Ma, J.; Campbell, A.; Karlin, S., Correlations between Shine-Dalgarno sequences and gene features such as predicted expression levels and operon structures. Journal of bacteriology 2002, 184 (20), 5733-5745.
[2] Noguchi, H.; Taniguchi, T.; Itoh, T., MetaGeneAnnotator: detecting species-specific patterns of ribosomal binding site for precise gene prediction in anonymous prokaryotic and phage genomes. DNA research 2008, 15 (6), 387-396.
[3] Alexander V. Lukashin, Mark B, GeneMark.hmm: new solutions for gene finding, Nucleic Acids Research, 1998, 1107–1115.
[4] http://parts.igem.org/Ribosome_Binding_Sites/Prokaryotic/Constitutive/Community_Collection.








Terminator


Background

Gene expression in both prokaryotes and eukaryotes is frequently controlled at the level of transcription. This process can be represented as a cycle consisting of four major steps: (1) promoter binding; (2) RNA chain initiation; (3) RNA chain elongation; and (4) termination. Since regulatory controls are exerted at each step, an understanding of the mechanism of each step is of general importance in understanding gene expression.

In the promoter part of our program, we discussed the mechanism of the promoter binding step and how it affects the transcription level. To complete our BioBrick evaluation program and to better understand the transcription process, we integrated the software developed by the 2012 iGEM team SUSTC-Shenzhen-B to predict transcription termination efficiency.


Introduction

Termination, the last step of the transcription cycle, occurs when the RNA polymerase releases the RNA transcript and dissociates from the DNA template. It is important that transcription is imperfectly terminated at some terminators, so that the ratio between the amount of mRNA transcribed upstream and downstream of the terminator is controlled. This regulation is quantified by the termination efficiency (%T).

Two mechanisms of transcription termination and two classes of termination signals have been described in bacteria: rho-dependent and rho-independent.

Rho-independent (also known as intrinsic) terminators are sequence motifs found in many prokaryotes that cause the transcription of DNA to RNA to stop. These termination signals typically consist of a short, often GC-rich hairpin followed by a sequence enriched in thymine residues.

The conventional model of transcriptional termination is that the stem loop causes RNA polymerase to pause and transcription of the poly-A tail causes the RNA:DNA duplex to unwind and dissociate from RNA polymerase.


Algorithm

In 2012, iGEM team SUSTC-Shenzhen-B developed TTEC, a software tool to predict terminator efficiency. It takes DNA sequences as input and returns the terminator efficiency value.

The algorithm calculates the terminator efficiency in 3 steps:

1. Use an RNA folding algorithm to predict the secondary structure of the terminator and recognize the A tail, stemloop and T tail.

2. From the secondary structure, calculate the free energy of the stemloop, and generate a score from the stemloop free energy and the T tail.

3. From the score, predict the terminator efficiency based on the score-efficiency equation.

The prediction of the secondary structure and the recognition of the A tail, stemloop and T tail are achieved with the Kingsford scoring system.


Kingsford Scoring System

In 2007, Carleton L. Kingsford et al. described TransTermHP [1], a new computational method to rapidly and accurately detect Rho-independent transcription terminators.

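They put forward an algorithm to predict Rho-independent terminators. The first 15 bases of the potential tail sequence are scored using the function

$$T = -\sum_{n=1}^{15} x_n,$$

where

$$x_n = \begin{cases} 0.9\, x_{n-1} & \text{if the } n\text{th nucleotide is a thymine,} \\ 0.6\, x_{n-1} & \text{otherwise,} \end{cases}$$

for n = 1, 2, ..., 15 and $x_0 = 1$.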
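A minimal sketch of this tail score is given below. It assumes the decaying recurrence used in the TransTermHP description: starting from x_0 = 1, each thymine multiplies the running term by 0.9 and any other base by 0.6, and the score is the negated sum of the terms over at most 15 bases. The function name is hypothetical.

```python
def tail_score(tail):
    """Score of up to the first 15 bases of a candidate tail.

    Assumed recurrence: x_n = 0.9 * x_(n-1) for a thymine (T or U),
    0.6 * x_(n-1) otherwise, with x_0 = 1; the score is -sum(x_n).
    """
    x = 1.0
    total = 0.0
    for base in tail.upper()[:15]:
        x *= 0.9 if base in ("T", "U") else 0.6
        total += x
    return -total

all_t = tail_score("T" * 15)
no_t = tail_score("A" * 15)
interrupted = tail_score("TTTATTTTTTTTTTT")
```

Because each term depends on the previous one, a non-T base near the start of the tail lowers every later term, so early interruptions of the T stretch cost more than late ones. More negative scores indicate a more T-rich tail.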

The energy of potential hairpin configurations adjacent to a reference position can be found efficiently with a dynamic programming algorithm. The table entry hairpin_score[i,j] gives the cost of the best hairpin structure for which the base of the 5' stem is at nucleotide position i and the base of the 3' stem is at position j. The entry hairpin_score[i,j] can be computed recursively as follows:

The function energy(i,j) gives the cost of pairing the nucleotide at i with that at j, and loop_pen(n) gives the cost of a hairpin loop of length n. The hairpin's loop is forced to have a length between 3 and 13 nt, inclusive, by setting loop_pen(n) to a large constant for any n outside that range. The constant 'gap' gives the cost of not pairing a base with some base on the opposite stem and thus introducing a gap on one side of the hairpin stem.
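The dynamic program can be sketched as follows. The recursion used here is one plausible form consistent with the description above and is an assumption of this example: the region from i to j is either closed off as the loop, extended by pairing i with j, or extended by leaving one base unpaired at the cost of a gap. The pairing energies, gap cost and loop penalty are the values listed in Table 1; the function names and the constant `LARGE` are invented for the sketch.

```python
from functools import lru_cache

PAIR_ENERGY = {
    frozenset("GC"): -2.3,
    frozenset("AT"): -0.9,
    frozenset("GT"): 1.3,
}
MISMATCH = 3.5
GAP = 6.0
LARGE = 1000.0  # stands in for the "large constant" outside the allowed loop range

def energy(a, b):
    """Cost of pairing base a with base b."""
    return PAIR_ENERGY.get(frozenset((a, b)), MISMATCH)

def loop_pen(n):
    """Cost of a hairpin loop of n nucleotides; only 3..13 nt are allowed."""
    return 1.0 * (n - 2) if 3 <= n <= 13 else LARGE

def best_hairpin(seq):
    """Lowest cost of a hairpin spanning the whole of seq (assumed recursion)."""
    seq = seq.upper().replace("U", "T")

    @lru_cache(maxsize=None)
    def hairpin_score(i, j):
        n = j - i + 1
        if n < 3:
            return LARGE  # too short to hold a loop
        return min(
            loop_pen(n),                                          # i..j is the loop
            energy(seq[i], seq[j]) + hairpin_score(i + 1, j - 1),  # pair i with j
            GAP + hairpin_score(i + 1, j),                        # skip base i
            GAP + hairpin_score(i, j - 1),                        # skip base j
        )

    return hairpin_score(0, len(seq) - 1)

# Five G-C pairs around a 4-nt loop: 5 * (-2.3) + loop_pen(4) = -9.5
score = best_hairpin("GCGCG" + "AAAA" + "CGCGC")
```

Memoising on (i, j) keeps the work quadratic in the length of the candidate region, which is what makes scanning many reference positions feasible.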

Table 1 Parameters used to evaluate hairpins

| Pairing | Energy |
|---|---|
| G-C | -2.3 |
| A-T | -0.9 |
| G-T | 1.3 |
| Mismatch | 3.5 |
| Gap | 6.0 |
| Loop_pen(n) | 1•(n - 2) |

Parameters used to evaluate the energy of a potential hairpin, where n is the length of the hairpin loop.

The D score is calculated by Carafa Scoring System.


Carafa Scoring System

Scoring System 2 is based on the model created by d'Aubenton Carafa [2]. The terminator score consists of two parts: the free energy of the stemloop and the score of the 15 nt poly-T tail. The free energy of the stemloop is calculated using Loop Dependent Energy Rules [3]. The minimization of the free energy also determines the secondary structure of the stemloop. The T tail score is calculated by the formula given by d'Aubenton Carafa.

Detailed Calculation of Score

1. Some definitions [3]

i. Closing Base Pair

We number an RNA sequence from 5' to 3'. If i < j and nucleotides r_i and r_j form a base pair, we denote it by i.j. We call a base r_i' or a base pair i'.j' accessible from i.j if i < i' (< j') < j and there is no other base pair k.l such that i < k < i' (< j') < l < j. We denote the collection of bases and base pairs accessible from i.j by L(i,j). Then i.j is the closing base pair. Here "L" means loop.

ii. n-loop

If a loop contains n – 1 base pairs, we call it an n-loop. (Because there is also a closing base pair, we call it an n-loop even though the closing base pair is not included in the loop.)

Here we can divide loops which may be formed in the terminator secondary structure into two kinds.

1-loop : Hairpin loop(size of loop shouldn't be smaller than 3)

2-loop : Interior Loop(right strand size and left strand size are both bigger than 0.)

Buldge(Size of one strand is bigger than 0 and that of another strand is 0.)Stack(size of the loop is 0.)

2. Calculation of the Minimum Free Energy Change of Stemloop Formation4

Assume i.j is the closing base pair of the loop.

G(i,j) = min { GH(i,j), GS(i,j) + G(i+1,j–1), GBI(i,j) }

GBI(i,j) = min { gbi(i,j,k,l) + G(k,l) } for all i < k < l < j with 0 < k – i + j – l – 2 < max_size

G(i,j) is the minimum free energy change of stemloop formation. GH is the free energy change to form a hairpin loop. GS is the free energy change to form a stack. GBI is the minimum free energy change of a structure containing a 2-loop (bulge or interior loop). gbi(i,j,k,l) is the free energy change to form the 2-loop closed by i.j and k.l, and k – i + j – l – 2 is the number of unpaired nucleotides in that loop.
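The recursion above can be sketched with memoization. This is a minimal illustration, not the program's implementation: the real GH, GS and gbi values come from the loop-dependent energy rules, whereas the three `toy_*` functions below are hypothetical stand-ins chosen only so that the example runs. The pairing set and the name `min_stemloop_energy` are also assumptions.

```python
import math
from functools import lru_cache

# Watson-Crick and G-U wobble pairs (assumed pairing rule for this sketch).
PAIRS = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}


def toy_gh(i, j):
    """Hypothetical hairpin-loop energy: grows with the number of loop bases."""
    return 4.0 + 0.2 * (j - i - 1)


def toy_gs(i, j):
    """Hypothetical stacking energy (constant)."""
    return -2.0


def toy_gbi(i, j, k, l):
    """Hypothetical bulge/interior-loop energy: grows with unpaired bases."""
    return 2.0 + 0.5 * ((k - i - 1) + (j - l - 1))


def min_stemloop_energy(seq, gh=toy_gh, gs=toy_gs, gbi=toy_gbi, max_size=10):
    """Minimum free energy G(0, len(seq)-1) with the ends as closing pair."""

    @lru_cache(maxsize=None)
    def G(i, j):
        # i.j must be a valid closing pair enclosing at least 3 nucleotides.
        if j - i - 1 < 3 or (seq[i], seq[j]) not in PAIRS:
            return math.inf
        best = min(gh(i, j), gs(i, j) + G(i + 1, j - 1))
        for k in range(i + 1, j):
            for l in range(k + 1, j):
                size = (k - i - 1) + (j - l - 1)  # unpaired bases in the 2-loop
                if 0 < size < max_size:
                    best = min(best, gbi(i, j, k, l) + G(k, l))
        return best

    return G(0, len(seq) - 1)
```

With the toy energies, GGGAAAACCC folds into three stacked G-C pairs around a 4 nt loop: 4.8 for the hairpin plus two stacks of -2.0 gives 0.8.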

3. Calculation of T Tail Score

Here we consider the 15 nucleotides downstream of the stemloop. The T tail score nT is calculated as follows:

In our program, if the length of the T tail (n) is less than 15, we only consider n nucleotides. If n is more than 15, we only consider 15 nucleotides.

4. Calculation of Score

Score = nT * 18.16 + deltaG / LH * 96.59 – 116.87

Here nT is the T tail score, deltaG is the minimum free energy change of stemloop formation, and LH is the length of the stemloop.5,6
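As a worked illustration of the score formula, the sketch below combines an already computed T tail score with the stemloop energy. The function names are hypothetical, and the T tail formula itself is not reproduced here, so nT is taken as an input rather than computed.

```python
def tail_window(downstream, max_len=15):
    """Return the part of the downstream sequence used for the T tail score.

    At most 15 nucleotides are considered; a shorter tail is used as is.
    """
    return downstream[:max_len]


def terminator_score(n_t, delta_g, stemloop_length):
    """Score = nT * 18.16 + deltaG / LH * 96.59 - 116.87."""
    if stemloop_length <= 0:
        raise ValueError("stemloop length must be positive")
    return n_t * 18.16 + delta_g / stemloop_length * 96.59 - 116.87
```

For example, with nT = 4.0, deltaG = -12.0 and LH = 20, the score is 72.64 - 57.954 - 116.87 = -102.184.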


References

[1] Kingsford, C. L.; Ayanbule, K.; Salzberg, S. L., Rapid, accurate, computational discovery of Rho-independent transcription terminators illuminates their relationship to DNA uptake. Genome biology 2007, 8 (2), R22.
[2] d'Aubenton Carafa, Y.; Brody, E.; Thermes, C., Prediction of rho-independent E. coli transcription terminators: A statistical analysis of their RNA stem-loop structures. Journal of molecular biology 1990, 216 (4), 835-858.
[3] Manual of Mfold Version 3.5.
[4] http://unafold.math.rpi.edu/lectures/old_RNAfold/node2.html.
[5] Lesnik, E. A.; Sampath, R.; Levene, H. B.; Henderson, T. J.; McNeil, J. A.; Ecker, D. J., Prediction of rho-independent transcriptional terminators in E. coli. Nucleic acids research 3583-3594.
[6] Sugimoto, N.; Nakano, S.-i.; Katoh, M.; Matsumura, A.; Nakamuta, H.; Ohmichi, T.; Yoneyama, M.; Sasaki, M., Thermodynamic parameters to predict stability of RNA/DNA hybrid duplexes. Biochemistry 1995, 34 (35),11211-11216.








SynoProteiner


Abstract

Our software is programmed with two methods that use the NSGA-II algorithm. It evaluates the optimization of both single codons and codon pairs, and from these determines the fittest optimized sequence for expression in a heterologous host cell.

Apart from the optimization, we provide two additional functions. One is statistical analysis, which reports the numbers and proportions of the codons in the original and optimized sequences and so makes the optimization easier to understand. The other is the prediction of the protein folding rate, which aims to capture the general trend of the folding rate and to compute a relatively accurate folding rate value for the optimized sequence.


Background

Synonymous codons and the efficiency

Except for methionine and tryptophan, all amino acids can be encoded by two to six synonymous codons, a result of the degeneracy of the genetic code.1 However, unequal use of the synonymous codons leads to the phenomenon of codon usage bias, which is mainly due to natural selection, mutation and genetic drift.2 According to related studies, codon usage bias is connected with gene expression level.3 The larger the value of codon usage bias is, the higher gene expression will be. The problem to be addressed is therefore how to substitute synonymous codons so as to raise the efficiency of gene expression and thus increase the production of recombinant protein in a heterologous host cell.

Protein folding rate

Proteins are an important class of biological macromolecules. They are the main bearers of life activities and occupy a special position in vivo. Each protein has its own unique amino acid composition and sequence. Only when the amino acid chain is folded into the correct three-dimensional structure does the protein have its normal biological function. Misfolded proteins not only lose their biological function but can also cause diseases such as mad cow disease and Alzheimer's disease. The protein folding problem, an important biological question that the central dogma of molecular biology has not yet solved, has been listed as an important topic of the twenty-first century. Understanding the folding mechanism of proteins is a challenging task, one part of which is to determine the factors that influence the folding rate. Although the answer can be found with a variety of biological experiments, such as spectroscopy, mass spectrometry and nuclear magnetic resonance, these methods are time-consuming and costly. With the development of physics and mathematics, and especially the progress of computer technology, the question of how to apply a fast and accurate calculation method to predict the protein folding rate attracts more and more attention.4


Introduction

Balance with single codon and codon pair

Individual codon usage optimization has received much attention, for example in Codon optimizer,5 Gene Designer6 and OPTIMIZER.7 Subsequently, people found that gene expression cannot be fully optimized by single codon optimization alone. The codon pair, namely the pair of the k-th and (k+1)-th codons from the 5' to the 3' end, is another crucial factor. Due to potential tRNA-tRNA steric interaction within the ribosomes,8 the usage of rare codon pairs, which correlates with translation elongation, decreases protein translation rates.9 Optimizing individual codons changes the corresponding codon pairs, which may leave the codon pair usage less than optimal. In the same way, optimizing codon pairs alone may leave the single codon usage less than optimal. It is therefore challenging to apply a method that weighs the effects of single codon and codon pair optimization together and makes the result best as a whole.

Our team focuses on evaluating the optimization of both single codons and codon pairs and thus selecting the best sequence for expression in a heterologous host cell.
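To make the definition of a codon pair concrete, the following sketch splits a coding sequence into codons and lists the (k-th, (k+1)-th) pairs from the 5' to the 3' end. The function names are hypothetical and the example sequence is made up.

```python
def codons(seq):
    """Split a coding sequence into consecutive codons, 5' to 3'."""
    if len(seq) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def codon_pairs(seq):
    """Return the list of (k-th, (k+1)-th) codon pairs of the sequence."""
    cs = codons(seq)
    return list(zip(cs, cs[1:]))
```

A sequence of n codons has n - 1 codon pairs, and each inner codon belongs to two pairs, which is why changing a single codon also changes the neighbouring pairs.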

Host Cell

Considering that E. coli and S. cerevisiae are the ideal hosts for recombinant protein expression, and that the Gram-positive bacterium L. lactis and the methylotrophic yeast P. pastoris are also promising candidates for expressing recombinant proteins,10 we selected these four organisms as host cells for optimizing the sequences.

Method of prediction

In recent years, many researchers have made great efforts to explore the determinants of the folding rate, and various prediction methods have been proposed. The existing prediction methods can be roughly divided into three categories.11-12 The first is based on the tertiary structure.13-19 However, acquiring tertiary structure information takes many molecular experiments, which are expensive and slow, so this approach fails to meet the demand for rapid prediction. The second category is based on the secondary structure.20-24 This kind of method requires secondary structure information, obtained either by molecular experiments or by prediction from the primary sequence, in which case it is limited by the accuracy of the secondary structure prediction method. The last category is based on the primary structure,25-34 and predicts the folding rate from amino acid sequences without most structural information.4 Our prediction of the protein folding rate uses this last approach.


Algorithm

Part I—Method I: MOCO35

Basic Table

Based on the table below, we calculate the function of single codon (ICU), the function of codon pair (CCO) and the function of multi-objective codon optimization (MOCO). MOCO aims to make the optimization best as a whole by calculating the relative effect of ICU and CCO.

Amino acid abbreviation and synonymous codons.11

Amino Acid Abbreviation Synonymous Codon(s)
Methionine M AUG
Tryptophan W UGG
Cysteine C UGC, UGU
Aspartate D GAC, GAU
Glutamate E GAA, GAG
Phenylalanine F UUC, UUU
Histidine H CAC, CAU
Lysine K AAA, AAG
Asparagine N AAC, AAU
Glutamine Q CAA, CAG
Tyrosine Y UAC, UAU
Isoleucine I AUA, AUC, AUU
Alanine A GCA, GCC, GCG, GCU
Glycine G GGA, GGC, GGG, GGU
Proline P CCA, CCC, CCG, CCU
Threonine T ACA, ACC, ACG, ACU
Valine V GUA, GUC, GUG, GUU
Leucine L CUA, CUC, CUG, CUU, UUA, UUG
Arginine R AGA, AGG, CGA, CGC, CGG, CGU
Serine S AGC, AGU, UCA, UCG, UCC, UCU
(Stop) * UAA, UAG, UGA
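The table above maps directly onto a lookup structure, and the codon statistics mentioned in the abstract (counts and proportions of codons in a sequence) follow from it. The sketch below is an illustration under that reading; `codon_statistics` is a hypothetical name, and the proportion is assumed to be taken within each synonymous codon set.

```python
from collections import Counter

# Synonymous codons per amino acid, copied from the table above.
SYNONYMOUS = {
    "M": ["AUG"], "W": ["UGG"],
    "C": ["UGC", "UGU"], "D": ["GAC", "GAU"], "E": ["GAA", "GAG"],
    "F": ["UUC", "UUU"], "H": ["CAC", "CAU"], "K": ["AAA", "AAG"],
    "N": ["AAC", "AAU"], "Q": ["CAA", "CAG"], "Y": ["UAC", "UAU"],
    "I": ["AUA", "AUC", "AUU"],
    "A": ["GCA", "GCC", "GCG", "GCU"], "G": ["GGA", "GGC", "GGG", "GGU"],
    "P": ["CCA", "CCC", "CCG", "CCU"], "T": ["ACA", "ACC", "ACG", "ACU"],
    "V": ["GUA", "GUC", "GUG", "GUU"],
    "L": ["CUA", "CUC", "CUG", "CUU", "UUA", "UUG"],
    "R": ["AGA", "AGG", "CGA", "CGC", "CGG", "CGU"],
    "S": ["AGC", "AGU", "UCA", "UCG", "UCC", "UCU"],
    "*": ["UAA", "UAG", "UGA"],
}
CODON_TO_AA = {codon: aa for aa, cs in SYNONYMOUS.items() for codon in cs}


def codon_statistics(seq):
    """Return (counts, proportions) for the codons of an RNA coding sequence.

    The proportion of a codon is its count divided by the total count of
    all codons that encode the same amino acid in this sequence.
    """
    if len(seq) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    counts = Counter(seq[i:i + 3] for i in range(0, len(seq), 3))
    proportions = {}
    for codon, n in counts.items():
        family = SYNONYMOUS[CODON_TO_AA[codon]]
        proportions[codon] = n / sum(counts[c] for c in family)
    return counts, proportions
```

For GCAGCAGCCAUG, alanine is encoded twice by GCA and once by GCC, so their proportions are 2/3 and 1/3, while AUG has proportion 1.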

Calculation of ICU

max

s.t.

Calculation of CC

max

s.t.

In the function,

Kronecker Delta symbol

Calculation of MOCO

The MOCO calculation is as follows (NSGA-II algorithm applied):

1. Randomly initialize a population of coding sequences for the target protein.

2. Evaluate the ICU and CC fitness of each sequence in the population.

3. Group the sequences into nondominated sets and rank the sets.

4. Check the termination criterion.

5. If the termination criterion is not satisfied, select the "fittest" sequences (top 50% of the population) as the parents for creating offspring via recombination and mutation.

6. Combine the parents and offspring to form a new population.

7. Repeat steps 2 to 6 until the termination criterion is satisfied.

The identification and ranking of nondominated sets in step 3 is performed via pair-wise comparison of the sequences' ICU and CC fitness. For a given pair of sequences with fitness values expressed as (ICU1, CC1) and (ICU2, CC2), the domination status can be evaluated as follows:

• If ICU1 > ICU2 and CC1 ≥ CC2, sequence 1 dominates sequence 2.

• If ICU1 ≥ ICU2 and CC1 > CC2, sequence 1 dominates sequence 2.

• If ICU1 < ICU2 and CC1 ≤ CC2, sequence 2 dominates sequence 1.

• If ICU1 ≤ ICU2 and CC1 < CC2, sequence 2 dominates sequence 1.
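The pair-wise comparison and the grouping into ranked nondominated sets (step 3) can be sketched as follows. This is a simple illustration, assuming both fitness values are maximized and using the usual Pareto domination rule; it is the straightforward quadratic procedure rather than the faster bookkeeping NSGA-II normally uses, and the function names are hypothetical.

```python
def dominates(a, b):
    """True if fitness pair a = (ICU, CC) dominates b.

    a dominates b when it is at least as good in both objectives and
    strictly better in at least one (both objectives are maximized).
    """
    return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])


def nondominated_sets(fitness):
    """Group sequence indices into ranked nondominated sets.

    The first set contains the sequences dominated by no other sequence;
    removing them and repeating gives the second set, and so on.
    """
    remaining = set(range(len(fitness)))
    fronts = []
    while remaining:
        front = sorted(
            i for i in remaining
            if not any(dominates(fitness[j], fitness[i])
                       for j in remaining if j != i)
        )
        fronts.append(front)
        remaining -= set(front)
    return fronts
```

With five made-up fitness pairs, `nondominated_sets([(0.9, 0.2), (0.5, 0.5), (0.2, 0.9), (0.4, 0.4), (0.1, 0.1)])` ranks the first three sequences together, since none of them beats another in both objectives.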

The process is shown in the figure below:

Figure 1

Multi-objective codon optimization solution. The optimal solutions generated by MOCO lie on the Pareto front (region in yellow).11

At first, we decided on this method and programmed the software with the MOCO method. However, we found two problems. One is that the calculation of ICU and CC fitness is based only on mathematical rationality and lacks enough experimental data to support the result. The other is that we would like a fitness function that weighs both aspects as a whole. Therefore, to tackle these problems, we chose Method II below as the recommended one.


Part II—Method II: Fitness36

Fitness function:

In the function,

cpi is a value larger than zero, ranging from 10^-4 to 0.5; fitcp(g) is the fitness function of the codon pair; fitsc(g) is the fitness function of the single codon; w(c(k), c(k+1)) is the weight of a codon pair in sequence g; |g| is the length of the encoding sequence; c(k) is the k-th codon in the sequence; and the two ratio terms are the target ratio of the k-th codon and the actual ratio of the k-th codon in the sequence. In the software, the best value of cpi is 0.2.

In the function, the target ratio of the k-th codon can be approximated by the equation below:

In the function, the weight can be calculated by the equation below:

In these equations, one term stands for the ratio of the single codon ck in the complete genome, and another is the number of occurrences of the pair (ci, cj) in high-expression genes, where high-expression genes are genes whose mRNA can be detected at 20 or more copies per cell.

syn(ck) stands for the synonymous codon set related to ck, and the remaining term equals the number of amino acids encoded by ci in the whole protein set.

With this method, there are enough experimental data to show that the optimized sequences work. The optimized genes for xylose isomerase from Bacillus stearothermophilus, xylose isomerase from Streptomyces olivochromogenes and L-arabinose isomerase from Thermoanaerobacter mathranii were all highly expressed in Bacillus subtilis. In addition, the activity of the optimized Aspergillus niger fungal amylase was enhanced to 400% compared with the original sequence in A. niger.36

Part III—Prediction of protein folding rate

To describe the protein folding rate quantitatively, we took the folding rates of 60 proteins from the literature and a database37 as an experimental data set; the sequence information comes from PDB and NCBI.

protein Ln(kf) protein Ln(kf) protein Ln(kf)
2PDD 9.8 1FKB 1.5 1RA9 -2.5
2ABD 6.6 2CI2 3.9 1QOP -6.9
256B 12.2 1URN 5.8 1PHP 2.3
1IMQ 7.3 1APS -1.5 1PHP -3.5
1LMB 8.5 1RIS 5.9 1BNI 2.6
1WIT 0.4 1POH 2.7 2LZM 4.1
1TEN 1.1 1DIV 6.1 1UBQ 5.9
1SHG 1.4 2VIK 6.8 1SCE 4.2
1SRL 4 1A6N 1.1 1YCC 9.62
1PNJ -1.1 1CEI 5.8 1VII 11.52
1SHF 4.5 2CRO 3.7 1NYF 4.54
1PSF 3.2 2A5E 3.5 2AIT 4.2
1CSP 7 1IFC 3.4 1PIN 9.44
1C9O 7.2 1EAL 1.3 1C8C 6.91
1G6P 6.3 1OPA 1.4 1BRS 3.4
1MJC 5.3 1CBI -3.2 1UBQ 5.9
1LOP 6.6 1QOP -2.5 3CHY 1
1C8C 7 1BRS 3.4 1BIN 2.6
1HZ6 4.1 3CHY 1 1SCE 4.2
1PGB 6 2RN2 0.1 1GXT 4.38

Ln(kf) is the logarithm of the folding rate.

So that the characteristic factors of the folding rate can be extracted from protein sequences, we introduced Chou's pseudo amino acid composition concept.38 According to the pseudo amino acid composition principle, the position information of protein sequences can be reflected, to some extent, by a group of serial correlation factors θ1, θ2, θ3, …, θn, which are defined as follows:

In the function, θ1 is called the first-tier correlation factor, which reflects the sequence order correlation between all the most contiguous residues along a protein chain (Fig. 2a); θ2 is the second-tier correlation factor, which reflects the sequence order correlation between all the second most contiguous residues (Fig. 2b); θ3 is the third-tier correlation factor, which reflects the sequence order correlation between all the third most contiguous residues (Fig. 2c); and so forth.38

Figure 2

The correlation function is given by4:

Θ(Ri,Rj)=|H(Rj)-H(Ri)|

where H(Ri) stands for H1(Ri), H2(Ri), or M(Ri), which are, respectively, the hydrophobicity value, hydrophilicity value, and side-chain mass of residue Ri in Chou's definition.38 Studies have shown that λ=10 gives the best predictor.39 However, considering all possible cases (the 30 factors) requires a large amount of calculation, so we should select the factors that give the best prediction accuracy with the least calculation. For that reason, we followed the literature4 and used Monte Carlo simulation to obtain 14 optimal characteristic factors. Other studies have indicated that the logarithm of the sequence length correlates well with the folding rate, so Ln(L) is the fifteenth factor. We used SPSS to calculate the coefficients of the 15 factors by multivariate linear regression, which gives the forecast formula for the protein folding rate. We compared the experimental data and the predicted data, and the results are as follows:

Through the test, our software succeeded in showing a relatively accurate folding rate value.
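The serial correlation factors used in this part can be sketched in a few lines. The defining equation is not reproduced in this text, so the sketch assumes the usual form from Chou's pseudo amino acid composition, θk = (1/(N-k)) Σ Θ(Ri, Ri+k), together with the correlation function Θ(Ri,Rj)=|H(Rj)-H(Ri)| given above. The Kyte-Doolittle hydropathy scale is used only as a stand-in for the hydrophobicity values, and the function names are hypothetical; the regression coefficients are not included.

```python
import math

# Kyte-Doolittle hydropathy values, a stand-in for the hydrophobicity scale
# actually used (which is not listed in the text).
HYDROPHOBICITY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def correlation_factor(seq, k, scale=HYDROPHOBICITY):
    """k-th tier correlation factor: mean of |H(R_{i+k}) - H(R_i)|."""
    n = len(seq)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < len(seq)")
    total = sum(abs(scale[seq[i + k]] - scale[seq[i]]) for i in range(n - k))
    return total / (n - k)


def folding_features(seq, lam=10, scale=HYDROPHOBICITY):
    """Tiers 1..lam for one property, followed by Ln(L)."""
    factors = [correlation_factor(seq, k, scale) for k in range(1, lam + 1)]
    return factors + [math.log(len(seq))]
```

For the tripeptide AIA, the first-tier factor is (|4.5 - 1.8| + |1.8 - 4.5|) / 2 = 2.7 and the second-tier factor is 0, because the two residues two positions apart are identical.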


Future work

First of all, we will improve the program and its framework to strengthen its concurrent computation and shorten the computing time.

Secondly, to accelerate the calculation, we may simplify it by neglecting some terms in our equations. However, since the program already takes very little time to run, we will pay more attention to modifying the equations to increase accuracy, which may considerably improve the optimization result.

Thirdly, enriching the database is another way to improve our software. Following the time-space tradeoff, we could pre-process a set of commonly used sequences into optimized ones and save the results in our database. By accessing our data, investigators could select the optimized sequences for their synthesis. Users would then be asked to feed back their results. Once enough information is collected, our app will learn users' bias and modify our optimizing function accordingly, by methods such as the NSGA-II algorithm.

The specific points are listed as follows:

1. Shortening the computing time of the software.

2. Expanding the range of the host cells.

3. Improving the bacterium's resistance to toxic molecules.

4. Improving existing pathways of synthetic biology by this method.

5. Designing new pathways of synthetic biology by this method.

6. Increasing the output of recombinant protein.

7. Predicting the expression of heterologous gene in a new host cell.

8. Considering more factors that influence the folding rate, such as helical structure in folding, and thereby obtaining a more accurate predicted rate.

9. Providing a set of software tools for protein folding, especially in molecular dynamics simulation of protein folding.


References

[1] Grantham, R.; Gautier, C.; Gouy, M.; Mercier, R.; Pave, A., Codon catalog usage and the genome hypothesis. Nucleic acids research 1980, 8 (1), 197-197.
[2] Hershberg, R.; Petrov, D. A., Selection on codon bias. Annual review of genetics 2008, 42, 287-299.
[3] Gouy, M.; Gautier, C., Codon usage in bacteria: correlation with gene expressivity. Nucleic acids research 1982, 10 (22), 7055-7074.
[4] Guo J X, Rao N N, Liu G X, Li J, Wang Y H. Predicting protein folding rate from amino acid sequence. Progress in Biochemistry and Biophysics 2010, 37(12): 1331-1338
[5] Fuglsang, A., Codon optimizer: a freeware tool for codon optimization. Protein expression and purification 2003, 31 (2), 247-249.
[6] Villalobos, A.; Ness, J. E.; Gustafsson, C.; Minshull, J.; Govindarajan, S., Gene Designer: a synthetic biology tool for constructing artificial DNA segments. Bmc Bioinformatics 2006, 7 (1), 285.
[7] Puigbò, P.; Guzmán, E.; Romeu, A.; Garcia-Vallvé, S., OPTIMIZER: a web server for optimizing the codon usage of DNA sequences. Nucleic acids research 2007, 35 (suppl 2), W126-W131.
[8] Smith, D.; Yarus, M., tRNA-tRNA interactions within cellular ribosomes. Proceedings of the National Academy of Sciences 1989, 86 (12),4397-4401.
[9] Coleman, J. R.; Papamichail, D.; Skiena, S.; Futcher, B.; Wimmer, E.; Mueller, S., Virus attenuation by genome-scale changes in codon pair bias. Science 2008, 320 (5884), 1784-1787.
[10] (a) Wildt, S.; Gerngross, T. U., The humanization of N-glycosylation pathways in yeast. Nature Reviews Microbiology 2005, 3 (2), 119-128; (b) Morello, E.; Bermudez-Humaran, L.; Llull, D.; Sole, V.; Miraglio, N.; Langella, P.; Poquet, I., Lactococcus lactis, an efficient cell factory for recombinant protein production and secretion. Journal of molecular microbiology and biotechnology 2007, 14 (1-3), 48-58.
[11] Guo J X, Ma B G, Zhang H Y. Research progress in protein folding rate prediction. Acta Biophys Sin, 2006, 22 (2): 89-95.
[12] Gromiha M M, Selvaraj S. Bioinformatics approaches for understanding and predicting protein folding rates. Current Bioinformatics, 2008, 3(1): 1-9
[13] Plaxco K W, Simons K T, Baker D. Contact order, transition state placement and the refolding rates of single domain proteins. J Mol Biol, 1998, 277(4): 985-994
[14] Gromiha M M, Selvaraj S. Comparison between long-range interactions and contact order in determining the folding rate of two-state proteins: application of long-range order to folding rate prediction. J Mol Biol, 2001, 310(1): 27-32
[15] Zhou H, Zhou Y. Folding rate prediction using total contact distance. Biophys J, 2002, 82(1): 458-463
[16] Nölting B, Schälike W, Hampel P, et al. Structural determinants of the rate of protein folding. J Theor Biol, 2003, 223(3): 299-307
[17] Weikl T R, Dill K A. Folding kinetics of two-state proteins: Effect of circularization, permutation, and crosslinks. J Mol Biol, 2003,332(4): 953-963
[18] Ivankov D N, Garbuzynskiy S O, Alm E, et al. Contact order revisited: influence of protein size on the folding rate. Protein Sci,2003, 12(9): 2057-2062
[19] Mirny L, Shakhnovich E. Protein folding theory: from lattice to all-atom models. Annu Rev Biophys Biomol Struct, 2001, 30 (1):361-396
[20] Gong H, Isom D G, Srinivasan R, et al. Local secondary structure content predicts folding rates for simple two-state proteins. J Mol Biol, 2003, 327(5): 1149-1154
[21] Ivankov D N, Finkelstein A V. Prediction of protein folding rates from the amino acid sequence-predicted secondary structure. Proc Nat Acad Sci USA, 2004, 101(24): 8942-8944
[22] Fleming P J, Gong H P, Rose G D. Secondary structure determines protein topology. Protein Sci, 2006, 15(8): 1829-1834
[23] Huang J T, Cheng J P, Chen H. Secondary structure length as a determinant of folding rate of proteins with two- and three-state kinetics. Proteins, 2007, 67(1): 12-17
[24] Prabhu N P, Bhuyan A K. Prediction of folding rates of small proteins: empirical relations based on length, secondary structure content, residue type, and stability. Biochemistry, 2006, 45 (11):3805-3812
[25] Shao H, Peng Y, Zeng Z H. A simple parameter relating sequences with folding rates of small helical proteins. Protein Pept Lett, 2003, 10(3): 277-280
[26] Galzitskaya O V, Garbuzynskiy S O, Ivankov D N, et al. Chain length is the main determinant of the folding rate for proteins with three-state folding kinetics. Proteins, 2003, 51(2): 162-166
[27] Huang J T, Jing T. Amino acid sequence predicts folding rate for middle-size two-state proteins. Proteins, 2006, 63(3): 551-554
[28] Gromiha M M. A statistical model for predicting protein folding rates from amino acid sequence with structural class information. J Chem Inf Model, 2005, 45(2): 494-501
[29] Ma B G, Guo J X, Zhang H Y. Direct correlation between proteins' folding rates and their amino acid compositions: an ab initio folding rate prediction. Proteins, 2006, 65(2): 362-372
[30] Gromiha M M, Thangakani A M, Selvaraj S. FOLD-RATE: prediction of protein folding rates from amino acid sequence. Nucleic Acids Res, 2006, 34(suppl_2): 70-74.
[31] OuYang Z, Liang J. Predicting protein folding rates from geometric contact and amino acid sequence. Protein Sci, 2008, 17(7): 1256-1263
[32] Huang L T, Gromiha M M. Analysis and prediction of protein folding rates using quadratic response surface models. J Comput Chem, 2008, 29(10): 1675-1683
[33] Shen H B, Song J N, Chou K C. Prediction of protein folding rates from primary sequence by fusing multiple sequential features. J Biomedical Science and Engineering, 2009, 2(3): 136-143
[34] Jiang Y, Iglinski P, Kurgan L. Prediction of protein folding rates from primary sequences using hybrid sequence representation. J Comput Chem, 2009, 30(5): 772-783
[35] Chung, B.; Lee, D.-Y., Computational codon optimization of synthetic gene for protein expression. BMC systems biology 2012, 6 (1), 134.
[36] DSM IP Assets B.V. Method for achieving improved polypeptide expression: China, 200780024670.5 [P]. 2009-07-22
[37] Gromiha, M. M.; Thangakani, A. M.; Selvaraj, S., FOLD-RATE: prediction of protein folding rates from amino acid sequence. Nucleic acids research 2006, 34 (suppl 2), W70-W74.
[38] Chou, K. C., Prediction of protein cellular attributes using pseudo‐amino acid composition. Proteins: Structure, Function, and Bioinformatics 2001, 43 (3), 246-255.
[39] Galzitskaya, O. V.; Garbuzynskiy, S. O.; Ivankov, D. N.; Finkelstein, A. V., Chain length is the main determinant of the folding rate for proteins with three‐state folding kinetics. Proteins: Structure, Function, and Bioinformatics 2003, 51 (2), 162-166.








E' NOTE


Abstract

E' NOTE is an experiment recording tool specifically designed for iGEMers. It allows users to take notes, draw tables, and upload images, and it provides a series of templates. The templates can perform some basic calculations, such as setting up enzyme digestion and ligation systems, which significantly eases the burden of experiment recording. E' NOTE also provides a plasmid library that users can extend. In addition to basic information, the library can store the times of purification and strain preservation, to avoid the death of strains caused by freezing for too long. The library is linked to the templates, so data recorded in the library are presented in the templates.

E' NOTE also offers a calculation board for basic calculations during solution preparation and a built-in e-mail sender to facilitate communication among team members. Of course, the tools provided by E' NOTE are far from complete, so E' NOTE contains a software integration board, which integrates software tools on the internet that are useful for synthetic biology experiments. Users can easily find their desired software there.

Many iGEMers struggle with uploading experiment notes to the wiki, either as a PDF file or by constructing a webpage. Unfortunately, simple experiment notes cannot show the experimental method and thus hinder communication. If you use E' NOTE to record your experiments, simply input your data according to the specification and the experiment data will be transferred to the wiki conveniently, letting E' NOTE present your experiment notes.




Future work

In the future, E' NOTE will integrate more online tools for synthetic biology experiments and classify them to make searching easier (Figure 1). In addition, it will provide a port for users to contribute software they think might help with synthetic biology experiments; the software will appear on the software integration board after verification (Figure 2). Furthermore, E' NOTE will attempt to integrate software tools offline. We have found that this can be done through Python, and we provide a draft for it (Figure 3). E' NOTE is far more than an experiment recording tool for iGEMers; it is a platform for iGEMers to start and develop their projects. Experiment recording is just a start.








Introduction for using

1.Click here to learn how to use E' NOTE: Tutorial of E' note.

2.Output the record to create a wiki page: Data output. (xml.css)


Demo



Achievement


Brick Worker

























To read the Brick-Worker's source code, please click here:    Tutorial of Brick Worker

To download the Brick-Worker's source code, please click here:          Brick-Worker





TEAM