Team:XMU Software/Project/protein


Our project includes 2 independent software tools-the brick worker and E' NOTE. The former is a software suit for the evaluation and optimization of biobricks, i.e., promoter, RBS, protein coding sequences and terminator. E' NOTE is a web application serving as an assistant for experiments. Its useful functions such as experiments recording and experimental template customization make experimental process easier and more enjoyable.


Our team mainly focuses on programming the software by genetic algorithm, evaluating both optimization of single codon and codon pair and hence determining the fittest optimized sequences for expression in heterologous host cell.

In addition, Synoproteiner protects enzyme cutting site. Users can choose corresponding cutting sites, not substituted during the optimization, so that the power of chosen enzyme cutting site won’t fail.

Apart from the optimization, we have two additional functions. One is the statistics analysis, which provides the numbers and the proportion of the codon in the original and optimized sequences, making the optimization easier to understand. The other is the prediction of the protein folding rate. The purpose of the prediction is to seek the law of the folding rate in general, computing a relatively accurate folding rate value of the optimized sequences for the users.


Synonymous codons and the efficiency

Except methionine and tryptophan, all amino acids can be encoded by two to six synonymous codons, resulting from the degeneracy of the genetic code.1 However, unequal utilization of the synonymous condons leads to the phenomenon of codon usage bias, which is mainly due to natural selection, mutation and genetic drift.2 According to related studies, codon usage bias has certain connection with gene expression level.3 The larger the value of codon usage bias is, the higher gene expression will be. So the problem, how to substitute the synonymous codons aimed at raising the efficiency of gene expression and thus increasing the production of recombination protein in heterologous host cell, is expected to be addressed.

Protein folding rate

Protein is an important class of biological macromolecules. It is the main bearer of life activities and occupies a special position in vivo. Each protein has its own unique amino acid composition and sequences. Only when the amino acid chain is folded into the correct three-dimensional structure, will the protein have normal biological functions. Misfolded ones will not only lose its biological function but also even cause diseases such as mad cow disease, Alzheimer's syndrome, etc. The protein folding problem, an important biological question that the central dogma of molecular biology has not solved yet, has been listed as an important topic in twenty-first century. The folding mechanism of the protein is a challenging task, one of which is to determine factor influencing the folding rate. Although the answer can be found in a variety of biological experiments, such as various spectroscopy, mass spectrometry and nuclear magnetic resonance, these methods are time-consuming and costly. With the development of physics, mathematics, especially the progress of computer technology, how to apply a fast and accurate calculation method to predict protein folding rate attracts more and more attention.4


Balance with single codon and codon pair

Individual codon usage optimization has been attached importance to, taking Codon optimizer,5 Gene Designer,6 OPTIMIZER7 for example. Subsequently, people found the effect of gene expression optimization cannot be perfect just by single codon optimization. Codon pair, namely the pair of k-th and (k+1)-th codons from the 5’ to 3’ end, is another crucial factor. Due to potential tRNA-tRNA steric interaction within the ribosomes,8 the usage of rare condon pairs, which correlate with translation elongation, decrease protein translation rates.9 Optimization of individual codon has an influence on the corresponding codon pair resulting in maybe-not-the-best codon pair optimization. In the same way, optimizing codon pair merely contributes to maybe-not-the-best single codon optimization. Therefore, it is a challenging way for us to apply a method considering and weighing the effects of single codon and codon pair optimization and thus make the whole best.

Our team focuses on evaluating both optimization of single codon and codon pair and thus selecting the best sequences for expression in heterologous host cell.

Host Cell

Considering E. coli and S. cerevisiae are the ideal hosts for recombinant proteinexpression, and Gram-positive bacterium L. lactis and methylotrophic yeast P. pastoris are also promising candidates for expressing recombinant proteins,10 we attached importance to selecting these four kinds of bacterium as host cell to optimize the sequences.

Enzyme cutting site protection

To avoid deconstruction of enzyme cutting site and therefore incapability during substitution of the base, Synoproteiner succeeds in exchanging 14 kinds of commonly used cutting sites such as EcoRI, XbaI and so on.

Users can multi-select the enzyme cutting site that they want to stay unaltered and hence get an optimized sequence without changing selected region.

Method of prediction

In recent years, many researchers have made great efforts to explore the determinants of the folding rate, and various forecasting methods have been proposed. The existed prediction methods can be roughly divided into three categories.11-12 The first one is based on the tertiary structure.13-19 However, it takes lots of molecular experiments, expensive and in long period, to acquire the information of the tertiary structure, which fails to meet the demand of rapid prediction. The second category is based on the secondary structure.20-24 This kind of method requires information of the secondary structure, similarly obtained by molecular experiments, or from the primary sequences prediction, but it will be limited by accuracy of the secondary structure prediction method. The last one is based on the primary structure,25-34 which predicts the folding rate from amino acid sequences without most structure information.4 And our prediction of the protein folding rate focuses on the last method.


Part I—Optimization

Basic Table

Based on the synonymous codons table, we calculate, we calculate function of single codon, function of codon pair and the function of multi-objective codon optimization. The method aims at make the optimization of whole best by calculating the relative effect of single codon and codon pair.

Fitness calculation36

Fitness function:

In the function,

cpi is a value larger than zero, ranging from 10-4 to 0.5,fitcp (g) is the fitness function of the codon pair,fitsc (g) is the fitness function of the single codon,w ( (c (k),c (k+1)) is the weight of codon pairs in sequences g,|g| is the length of encoding sequences, c (k) is k-th codon in the sequences, is the target ratio of k-th codon, is the actual ratio of k-th codon in the sequences,the best value of cpi is 0.2 in the software.

In the function, the target ratio of k-th codon can be approximated by the equation below:

In the function, weight can be calculated by the equation below:

stands for the ratio of single codon ck in the complete genome'is the number of pair ( ci,cj ) in high-expression genes, and high-expression genes are genes whose copy numbers of mRNA can be detected at least 20 per cell.

syn (ck) stands for the synonymous codon set related to ck,equals to the number of amino acid encoded by ci in the whole protein set.

Calculation of Fitness

NSGA-II algorithm applied35:

1. Randomly initialize a population of coding sequences for target protein.

2. Evaluate fitsc (g) and fitsp (g) fitness of each sequences in the population.

3. Group the sequences into nondominated sets and rank the sets.

4. Check termination criterion.

5. If termination criterion is not satisfied, select the “fittest” sequences (top 50% of the population) as the parents for creation of offsprings via recombination and mutation.

6. Combine the parents and offsprings to form a new population.

7. Repeat steps 2 to 5 until termination criterion is satisfied.

The identification and ranking of nondominated sets in step 3 is performed via pair-wise comparison of the sequences' fitsc (g) and fitsp (g) fitness. For a given pair of sequences with fitness values expressed as (fit1cp (g),fit1sc (g)) and (fit2cp (g),fit2sc (g)), the domination status can be evaluated as follows:

• If (fit1sc (g)>fit2sc (g))and (fit1cp (g)>=fit2cp (g)), sequences 1 dominates sequences 2.

• If (fit1sc (g)>=fit2sc (g)) and (fit1cp (g)>fit2cp (g)), sequences 1 dominates sequences 2.

• If (fit1sc (g)<fit2sc (g))and (fit1cp (g)<=fit2cp (g)), sequences 2 dominates sequences 1.

• If (fit1sc (g)<=fit2sc (g))and (fit1cp (g)<fit2cp (g)), sequences 2 dominates sequences 1.

The process is showed in the figure below:

Figure 1


By this method, there are enough experimental data to prove the sequences optimized works. Xylose isomerase in Bacillus stearothermophilus, Xylose isomerase in Streptomyces olivochromogenes and L-arabinose isomerase in Thermoanaerobacter mathranii all, the optimized ones, were highly expressed in Bacillus subtilis. In addition, the activity of the optimized Aspergillusniger fungal amylase was enhanced to 400% compared with the original sequences in A. niger.36

Link to the page of Example of optimization

Part II—Prediction of protein folding rate

In order to illustrate protein folding rate quantitatively, we determine the folding rate of 60 kinds of proteins as an experimental data set from literature and database37, and information of the sequences comes from PBD and NCBI.

   protein    Logarithm of the folding rate Ln(kf)    protein    Logarithm of the folding rate Ln(kf)    protein    Logarithm of the folding rate Ln(kf)
2PDD 9.8 1FKB 1.5 1RA9 -2.5

In order that the characteristic factors of the folding rate can be extracted from protein sequences, we introduced the Chou's pseudo amino acid composition concept.38 According to the pseudo amino acid composition principle, the position information of protein sequences can be, to some extent, reflected by a group of serial correlation factors θ1,θ2 ,θ3……,θn ,which is defined as follows:

in the function, θ1 is called the first-tier correlation factor that reflects the sequences order correlation between all the most contiguous residues along a protein chain (Fig. 2a), θ2 the second-tier correlation factor that reflects the sequences order correlation between all the second most contiguous residues (Fig.2b), θ3 the third-tier correlation factor that reflects the sequences order correlation between all the 3rd most contiguous residues (Fig.2c), and so forth.38

Figure 2

the correlation function is given by4:


where H1(Ri)), H2(Ri), and M(Ri) are, respectively, the hydrophobicity value. Studies have shown that λ=10 will be the best predictor.39 But there will be a large amount of calculation considering all possible situations—the 30 factors. We should select factors that can obtain the best prediction accuracy in least calculation. For that reason, we drew lessons from the literature4 by using the method of Monte Carlo simulation and then 14 optimal characteristic factor were obtained. Other studies have indicated that the logarithm of the sequences length has a good correlation with folding rate, so Ln (L) will be the fifteenth factors. We apply SPSS software to calculate the coefficient of 15 factor by multivariate linear regression, and this will be the forecast formula of the rate of protein folding. We compared the experimental data and the predicted data and the results are as follows:

Figure 3 The comparision of the experimental and the predicted results

Through the test, our software succeeded in showing a relatively accurate folding rate value.

Future work

First of all, we will modify our software by advancing the program and the framework to improve its ability of concurrent computation and shorten the computing time.

Secondly, to accelerate the calculation, we may simplify the function of calculation by neglecting some term in our equations. However, considering the time spent on running program was extremely little, we will pay more attention on how to modify the equations for increasing the accuracy which maybe dramatically progress optimization result.

Thirdly, enriching the database is other way to improve our software. According to time-space tradeoff law, we could pre-process a bunch of sequences in common use to optimized one and save the result into our database. By assessing our data, investigators could select the optimized sequences for their synthesis. Then, users are required to feedback their result. When it collects enough information, our app will learn users’ bias therefore modify our optimizing function by some methods, like NSGA-II algorithm.

The specific points are listed as following:

1. Shortening the computing time of the software.

2. Expanding the range of the host cells.

3. Improving bacterium's resistance to toxic molecule.

4. Advancing existed paths of synthetic biology by the method.

5. Designing new paths of synthetic biology by the method.

6. Increasing the output of recombinant protein.

7. Predicting the expression of heterologous gene in a new host cell.

8. Considering more factors such as spiral structure in folding which influence the folding rate and thereby obtaining more accurate prediction rate.

9. Providing a set of software tools for protein folding, especially in molecular dynamics simulation of protein folding.


[1] Grantham, R.; Gautier, C.; Gouy, M.; Mercier, R.; Pave, A., Codon catalog usage and the genome hypothesis. Nucleic acids research 1980, 8 (1), 197-197.
[2] Hershberg, R.; Petrov, D. A., Selection on codon bias. Annual review of genetics 2008, 42, 287-299.
[3] Gouy, M.; Gautier, C., Codon usage in bacteria: correlation with gene expressivity. Nucleic acids research 1982, 10 (22), 7055-7074.
[4] 郭建秀,饶妮妮, 刘广雄, 李杰, & 王云鹤. 从氨基酸序列预测蛋白质折叠速率. 生物化学与生物物理进展 Progress in Biochemistry and Biophysics 2010, 37(12): 1331~1338
[5] Fuglsang, A., Codon optimizer: a freeware tool for codon optimization. Protein expression and purification 2003, 31 (2), 247-249.
[6] Villalobos, A.; Ness, J. E.; Gustafsson, C.; Minshull, J.; Govindarajan, S., Gene Designer: a synthetic biology tool for constructing artificial DNA segments. Bmc Bioinformatics 2006, 7 (1), 285.
[7] Puigbò, P.; Guzmán, E.; Romeu, A.; Garcia-Vallvé, S., OPTIMIZER: a web server for optimizing the codon usage of DNA sequences. Nucleic acids research 2007, 35 (suppl 2), W126-W131.
[8] Smith, D.; Yarus, M., tRNA-tRNA interactions within cellular ribosomes. Proceedings of the National Academy of Sciences 1989, 86 (12),4397-4401.
[9] Coleman, J. R.; Papamichail, D.; Skiena, S.; Futcher, B.; Wimmer, E.; Mueller, S., Virus attenuation by genome-scale changes in codon pair bias. Science 2008, 320 (5884), 1784-1787.
[10] (a) Wildt, S.; Gerngross, T. U., The humanization of N-glycosylation pathways in yeast. Nature Reviews Microbiology 2005, 3 (2), 119-128; (b) Morello, E.; Bermudez-Humaran, L.; Llull, D.; Sole, V.; Miraglio, N.; Langella, P.; Poquet, I., Lactococcus lactis, an efficient cell factory for recombinant protein production and secretion. Journal of molecular microbiology and biotechnology 2007, 14 (1-3), 48-58.
[11] 郭建秀, 马彬广, 张红雨. 蛋白质折叠速率预测研究进展. 生物物理学报, 2006, 22(2):89-95 Guo J X, Ma B G, Zhang H Y. Acta Biophys Sin, 2006, 22 (2):89-95.
[12] Gromiha M M, Selvaraj S. Bioinformatics approaches for understanding and predicting protein folding rates. Current Bioinformatics, 2008, 3(1): 1-9
[13] Plaxco K W, Simons K T, Baker D. Contact order, transition state placement and the refolding rates of single domain proteins. J MolBiol, 1998, 277(4): 985-994
[14] Gromiha M M, Selvaraj S. Comparison between long-range interactions and contact order in determining the folding rate of two-state proteins: application of long-range order to folding rate prediction. J Mol Biol, 2001, 310(1): 27-32
[15] Zhou H, Zhou Y. Folding rate prediction using total contact distance. Biophys J, 2002, 82(1): 458-463
[16] Nölting B, Schälike W, Hampel P, et al. Structural determinants of the rate of protein folding. J Theor Biol, 2003, 223(3): 299-307
[17] Weikl T R, Dill K A. Folding kinetics of two-state proteins: Effect of circularization, permutation, and crosslinks. J Mol Biol, 2003,332(4): 953-963
[18] Ivankov D N, Garbuzynskiy S O, Alm E, et al. Contact order revisited: influence of protein size on the folding rate. Protein Sci,2003, 12(9): 2057-2062
[19] Mirny L, Shakhnovich E. Protein folding theory: from lattice to all-atom models. Annu Rev Biophys Biomol Struct, 2001, 30 (1):361-396
[20] Gong H, Isom D G, Srinivasan R, et al. Local secondary structure content predicts folding rates for simple two-state proteins. J MolBiol, 2003, 327(5): 1149-1154
[21] Ivankov D N, Finkelstein A V. Prediction of protein folding rates from the amino acid sequences-predicted secondary structure. Proc Nat Acad Sci USA, 2004, 101(24): 8942-8944
[22] Fleming P J, Gong H P, Rose G D. Secondary structure determines protein topology. Protein Sci, 2006, 15(8): 1829-1834
[23] Huang J T, Cheng J P, Chen H. Secondary structure length as a determinant of folding rate of proteins with two- and three-state kinetics. Proteins, 2007, 67(1): 12-17
[24] Prabhu N P, Bhuyan A K. Prediction of folding rates of small proteins: empirical relations based on length, secondary structure content, residue type, and stability. Biochemistry, 2006, 45 (11):3805-3812
[25] Shao H, Peng Y, Zeng Z H. A simple parameter relating sequenceswith folding rates of small helical proteins. Protein Pept Lett, 2003,10(3): 277-280
[26] Galzitskaya O V, Garbuzynskiy S O, Ivankov D N, et al. Chainlength is the main determinant of the folding rate for proteins withthree-state folding kinetics. Proteins, 2003, 51(2): 162-166
[27] Huang J T, Jing T. Amino acid sequences predicts folding rate for middle-size two-state proteins. Proteins, 2006, 63(3): 551-554
[28] Gromiha M M. A statistical model for predicting protein folding rates from amino acid sequences with structural class information.J Chem Inf Model, 2005, 45(2): 494-501
[29] Ma B G, Guo J X, Zhang H Y. Direct correlation between proteins'folding rates and their amino acid compositions: an ab initio foldingrate prediction. Proteins, 2006, 65(2): 362-372
[30] Gromiha M M, Thangakani A M, Selvaraj S. FOLD-RATE:prediction of protein folding rates from amino acid sequences.Nucleic Acids Res, 2006, 34(suppl_2): 70-74.
[31] OuYang Z, Liang J. Predicting protein folding rates from geometric contact and amino acid sequences. Protein Sci, 2008, 17(7): 1256-1263
[32] Huang L T, Gromiha M M. Analysis and prediction of proteinfolding rates using quadratic responde surface models. J ComputChem, 2008, 29(10): 1675-1683
[33] Shen H B, Song J N, Chou K C. Prediction of protein folding ratesfrom primary sequences by fusing multiple sequential features.J Biomedical Science and Engineering, 2009, 2(3): 136-143
[34] Jiang Y, Iglinski P, Kurgan L. Prediction of protein folding ratesfrom primary sequences using hybrid sequences representation.J Comput Chem, 2009, 30(5): 772-783
[35] Chung, B.; Lee, D.-Y., Computational codon optimization of synthetic gene for protein expression. BMC systems biology 2012, 6 (1), 134.
[36] 帝斯曼知识产权资产管理有限. 公司实现改进的多肽表达的方法: 中国, 200780024670.5[P]. 2009-07-22
[37] Gromiha, M. M.; Thangakani, A. M.; Selvaraj, S., FOLD-RATE: prediction of protein folding rates from amino acid sequences. Nucleic acids research 2006, 34 (suppl 2), W70-W74.
[38] Chou, K. C., Prediction of protein cellular attributes using pseudo‐amino acid composition. Proteins: Structure, Function, and Bioinformatics 2001, 43 (3), 246-255.
[39] Galzitskaya, O. V.; Garbuzynskiy, S. O.; Ivankov, D. N.; Finkelstein, A. V., Chain length is the main determinant of the folding rate for proteins with three‐state folding kinetics. Proteins: Structure, Function, and Bioinformatics 2003, 51 (2), 162-166.