# Team:XMU Software/Project/protein

### From 2013.igem.org

# Abstract

Our team mainly focuses on programming the software by genetic algorithm, evaluating both optimization of single codon and codon pair and hence determining the fittest optimized sequences for expression in heterologous host cell.

In addition, Synoproteiner protects enzyme cutting site. Users can choose corresponding cutting sites, not substituted during the optimization, so that the power of chosen enzyme cutting site won’t fail.

Apart from the optimization, we have two additional functions. One is the statistics analysis, which provides the numbers and the proportion of the codon in the original and optimized sequences, making the optimization easier to understand. The other is the prediction of the protein folding rate. The purpose of the prediction is to seek the law of the folding rate in general, computing a relatively accurate folding rate value of the optimized sequences for the users.

# Background

#### Synonymous codons and the efficiency

Except methionine and tryptophan, all amino acids can be encoded by two to six synonymous codons, resulting from the degeneracy of the genetic code.^{1} However, unequal utilization of the synonymous condons leads to the phenomenon of codon usage bias, which is mainly due to natural selection, mutation and genetic drift.^{2} According to related studies, codon usage bias has certain connection with gene expression level.^{3} The larger the value of codon usage bias is, the higher gene expression will be. So the problem, how to substitute the synonymous codons aimed at raising the efficiency of gene expression and thus increasing the production of recombination protein in heterologous host cell, is expected to be addressed.

#### Protein folding rate

Protein is an important class of biological macromolecules. It is the main bearer of life activities and occupies a special position in vivo. Each protein has its own unique amino acid composition and sequences. Only when the amino acid chain is folded into the correct three-dimensional structure, will the protein have normal biological functions. Misfolded ones will not only lose its biological function but also even cause diseases such as mad cow disease, Alzheimer's syndrome, etc. The protein folding problem, an important biological question that the central dogma of molecular biology has not solved yet, has been listed as an important topic in twenty-first century. The folding mechanism of the protein is a challenging task, one of which is to determine factor influencing the folding rate. Although the answer can be found in a variety of biological experiments, such as various spectroscopy, mass spectrometry and nuclear magnetic resonance, these methods are time-consuming and costly. With the development of physics, mathematics, especially the progress of computer technology, how to apply a fast and accurate calculation method to predict protein folding rate attracts more and more attention.^{4}

# Introduction

#### Balance with single codon and codon pair

Individual codon usage optimization has been attached importance to, taking Codon optimizer,^{5} Gene Designer,^{6} OPTIMIZER^{7} for example. Subsequently, people found the effect of gene expression optimization cannot be perfect just by single codon optimization. Codon pair, namely the pair of k-th and (k+1)-th codons from the 5’ to 3’ end, is another crucial factor. Due to potential tRNA-tRNA steric interaction within the ribosomes,^{8} the usage of rare condon pairs, which correlate with translation elongation, decrease protein translation rates.^{9} Optimization of individual codon has an influence on the corresponding codon pair resulting in maybe-not-the-best codon pair optimization. In the same way, optimizing codon pair merely contributes to maybe-not-the-best single codon optimization. Therefore, it is a challenging way for us to apply a method considering and weighing the effects of single codon and codon pair optimization and thus make the whole best.

Our team focuses on evaluating both optimization of single codon and codon pair and thus selecting the best sequences for expression in heterologous host cell.

#### Host Cell

Considering *E. coli* and *S. cerevisiae* are the ideal hosts for recombinant proteinexpression, and Gram-positive bacterium *L. lactis* and methylotrophic yeast *P. pastoris* are also promising candidates for expressing recombinant proteins,^{10} we attached importance to selecting these four kinds of bacterium as host cell to optimize the sequences.

#### Enzyme cutting site protection

To avoid deconstruction of enzyme cutting site and therefore incapability during substitution of the base, Synoproteiner succeeds in exchanging 14 kinds of commonly used cutting sites such as EcoRI, XbaI and so on.

Users can multi-select the enzyme cutting site that they want to stay unaltered and hence get an optimized sequence without changing selected region.

#### Method of prediction

In recent years, many researchers have made great efforts to explore the determinants of the folding rate, and various forecasting methods have been proposed. The existed prediction methods can be roughly divided into three categories.^{11-12} The first one is based on the tertiary structure.^{13-19} However, it takes lots of molecular experiments, expensive and in long period, to acquire the information of the tertiary structure, which fails to meet the demand of rapid prediction. The second category is based on the secondary structure.^{20-24} This kind of method requires information of the secondary structure, similarly obtained by molecular experiments, or from the primary sequences prediction, but it will be limited by accuracy of the secondary structure prediction method. The last one is based on the primary structure,^{25-34} which predicts the folding rate from amino acid sequences without most structure information.^{4} And our prediction of the protein folding rate focuses on the last method.

# Algorithm

#### Part I—Optimization

### Basic Table

Based on the synonymous codons table, we calculate, we calculate function of single codon, function of codon pair and the function of multi-objective codon optimization. The method aims at make the optimization of whole best by calculating the relative effect of single codon and codon pair.

#### Fitness calculation^{36}

### Fitness function：

In the function,

*cpi* is a value larger than zero, ranging from 10^{-4} to 0.5,*fit _{cp}* (

*g*) is the fitness function of the codon pair，

*fit*(

_{sc}*g*) is the fitness function of the single codon，

*w*( (

*c*(

*k*),

*c*(

*k+1*)) is the weight of codon pairs in sequences

*g*,|

*g*| is the length of encoding sequences,

*c*(

*k*) is k-th codon in the sequences, is the target ratio of k-th codon, is the actual ratio of k-th codon in the sequences，the best value of

*cpi*is 0.2 in the software.

In the function, the target ratio of k-th codon can be approximated by the equation below:

In the function, weight can be calculated by the equation below:

stands for the ratio of single codon c_{k} in the complete genome'is the number of pair ( *c _{i},c_{j}* ) in high-expression genes, and high-expression genes are genes whose copy numbers of mRNA can be detected at least 20 per cell.

syn (c_{k}) stands for the synonymous codon set related to *c _{k}*,equals to the number of amino acid encoded by

*c*in the whole protein set.

_{i}### Calculation of Fitness

NSGA-II algorithm applied^{35}:

1. Randomly initialize a population of coding sequences for target protein.

2. Evaluate *fit _{sc}* (

*g*) and

*fit*(

_{sp}*g*) fitness of each sequences in the population.

3. Group the sequences into nondominated sets and rank the sets.

4. Check termination criterion.

5. If termination criterion is not satisfied, select the “fittest” sequences (top 50% of the population) as the parents for creation of offsprings via recombination and mutation.

6. Combine the parents and offsprings to form a new population.

7. Repeat steps 2 to 5 until termination criterion is satisfied.

The identification and ranking of nondominated sets in step 3 is performed via pair-wise comparison of the sequences' *fit _{sc}* (

*g*) and

*fit*(

_{sp}*g*) fitness. For a given pair of sequences with fitness values expressed as (

*fit*(

^{1}_{cp}*g*),

*fit*(

^{1}_{sc}*g*)) and (

*fit*(

^{2}_{cp}*g*),

*fit*(

^{2}_{sc}*g*)), the domination status can be evaluated as follows:

• If (*fit ^{1}_{sc}* (

*g*)>

*fit*(

^{2}_{sc}*g*))and (

*fit*(

^{1}_{cp}*g*)>=

*fit*(

^{2}_{cp}*g*)), sequences 1 dominates sequences 2.

• If (*fit ^{1}_{sc}* (

*g*)>=

*fit*(

^{2}_{sc}*g*)) and (

*fit*(

^{1}_{cp}*g*)>

*fit*(

^{2}_{cp}*g*)), sequences 1 dominates sequences 2.

• If (*fit ^{1}_{sc}* (

*g*)<

*fit*(

^{2}_{sc}*g*))and (

*fit*(

^{1}_{cp}*g*)<=

*fit*(

^{2}_{cp}*g*)), sequences 2 dominates sequences 1.

• If (*fit ^{1}_{sc}* (

*g*)<=

*fit*(

^{2}_{sc}*g*))and (

*fit*(

^{1}_{cp}*g*)<

*fit*(

^{2}_{cp}*g*)), sequences 2 dominates sequences 1.

The process is showed in the figure below:

### Result:

By this method, there are enough experimental data to prove the sequences optimized works. Xylose isomerase in *Bacillus stearothermophilus*, Xylose isomerase in *Streptomyces olivochromogenes* and L-arabinose isomerase in *Thermoanaerobacter mathranii* all, the optimized ones, were highly expressed in *Bacillus subtilis*. In addition, the activity of the optimized Aspergillusniger fungal amylase was enhanced to 400% compared with the original sequences in *A. niger*.^{36}

Link to the page of Example of optimization

#### Part II—Prediction of protein folding rate

In order to illustrate protein folding rate quantitatively, we determine the folding rate of 60 kinds of proteins as an experimental data set from literature and database^{37}, and information of the sequences comes from PBD and NCBI.

protein | Logarithm of the folding rate Ln(k_{f}) |
protein | Logarithm of the folding rate Ln(k_{f}) |
protein | Logarithm of the folding rate Ln(k_{f}) |
---|---|---|---|---|---|

2PDD | 9.8 | 1FKB | 1.5 | 1RA9 | -2.5 |

2ABD | 6.6 | 2CI2 | 3.9 | 1QOP | -6.9 |

256B | 12.2 | 1URN | 5.8 | 1PHP | 2.3 |

1IMQ | 7.3 | 1APS | -1.5 | 1PHP | -3.5 |

1LMB | 8.5 | 1RIS | 5.9 | 1BNI | 2.6 |

1WIT | 0.4 | 1POH | 2.7 | 2LZM | 4.1 |

1TEN | 1.1 | 1DIV | 6.1 | 1UBQ | 5.9 |

1SHG | 1.4 | 2VIK | 6.8 | 1SCE | 4.2 |

1SRL | 4 | 1A6N | 1.1 | 1YCC | 9.62 |

1PNJ | -1.1 | 1CEI | 5.8 | 1VII | 11.52 |

1SHF | 4.5 | 2CRO | 3.7 | 1NYF | 4.54 |

1PSF | 3.2 | 2A5E | 3.5 | 2AIT | 4.2 |

1CSP | 7 | 1IFC | 3.4 | 1PIN | 9.44 |

1C9O | 7.2 | 1EAL | 1.3 | 1C8C | 6.91 |

1G6P | 6.3 | 1OPA | 1.4 | 1BRS | 3.4 |

1MJC | 5.3 | 1CBI | -3.2 | 1UBQ | 5.9 |

1LOP | 6.6 | 1QOP | -2.5 | 3CHY | 1 |

1C8C | 7 | 1BRS | 3.4 | 1BIN | 2.6 |

1HZ6 | 4.1 | 3CHY | 1 | 1SCE | 4.2 |

1PGB | 6 | 2RN2 | 0.1 | 1GXT | 4.38 |

In order that the characteristic factors of the folding rate can be extracted from protein sequences, we introduced the Chou's pseudo amino acid composition concept.^{38} According to the pseudo amino acid composition principle, the position information of protein sequences can be, to some extent, reflected by a group of serial correlation factors *θ _{1}，θ_{2} ，θ_{3}……，θ_{n}* ,which is defined as follows:

in the function, *θ _{1}* is called the first-tier correlation factor that reflects the sequences order correlation between all the most contiguous residues along a protein chain (Fig. 2a),

*θ*the second-tier correlation factor that reflects the sequences order correlation between all the second most contiguous residues (Fig.2b),

_{2}*θ*the third-tier correlation factor that reflects the sequences order correlation between all the 3rd most contiguous residues (Fig.2c), and so forth.

_{3}^{38}

the correlation function is given by^{4}:

_{i},R

_{j})=|

*H*(

*R*)-

_{j}*H*(

*R*)|

_{i}where *H1(R _{i})), H2(R_{i}), and M(R_{i})* are, respectively, the hydrophobicity value. Studies have shown that λ=10 will be the best predictor.

^{39}But there will be a large amount of calculation considering all possible situations—the 30 factors. We should select factors that can obtain the best prediction accuracy in least calculation. For that reason, we drew lessons from the literature

^{4}by using the method of Monte Carlo simulation and then 14 optimal characteristic factor were obtained. Other studies have indicated that the logarithm of the sequences length has a good correlation with folding rate, so Ln (L) will be the fifteenth factors. We apply SPSS software to calculate the coefficient of 15 factor by multivariate linear regression, and this will be the forecast formula of the rate of protein folding. We compared the experimental data and the predicted data and the results are as follows:

Through the test, our software succeeded in showing a relatively accurate folding rate value.

# Future work

First of all, we will modify our software by advancing the program and the framework to improve its ability of concurrent computation and shorten the computing time.

Secondly, to accelerate the calculation, we may simplify the function of calculation by neglecting some term in our equations. However, considering the time spent on running program was extremely little, we will pay more attention on how to modify the equations for increasing the accuracy which maybe dramatically progress optimization result.

Thirdly, enriching the database is other way to improve our software. According to time-space tradeoff law, we could pre-process a bunch of sequences in common use to optimized one and save the result into our database. By assessing our data, investigators could select the optimized sequences for their synthesis. Then, users are required to feedback their result. When it collects enough information, our app will learn users’ bias therefore modify our optimizing function by some methods, like NSGA-II algorithm.

The specific points are listed as following:

1. Shortening the computing time of the software.

2. Expanding the range of the host cells.

3. Improving bacterium's resistance to toxic molecule.

4. Advancing existed paths of synthetic biology by the method.

5. Designing new paths of synthetic biology by the method.

6. Increasing the output of recombinant protein.

7. Predicting the expression of heterologous gene in a new host cell.

8. Considering more factors such as spiral structure in folding which influence the folding rate and thereby obtaining more accurate prediction rate.

9. Providing a set of software tools for protein folding, especially in molecular dynamics simulation of protein folding.

# References

*Nucleic acids research*

**1980**,

*8*(1), 197-197.

*Annual review of genetics*

**2008**,

*42,*287-299.

*Nucleic acids research*

**1982**,

*10*(22), 7055-7074.

*Progress in Biochemistry and Biophysics*

**2010**,

*37*(12): 1331~1338

*Protein expression and purification*

**2003**,

*31*(2), 247-249.

*Bmc Bioinformatics*

**2006**,

*7*(1), 285.

*Nucleic acids research*

**2007**,

*35*(suppl 2), W126-W131.

*Proceedings of the National Academy of Sciences*

**1989**,

*86*(12),4397-4401.

*Science*

**2008**,

*320*(5884), 1784-1787.

**2005**,

*3*(2), 119-128; (b) Morello, E.; Bermudez-Humaran, L.; Llull, D.; Sole, V.; Miraglio, N.; Langella, P.; Poquet, I., Lactococcus lactis, an efficient cell factory for recombinant protein production and secretion.

*Journal of molecular microbiology and biotechnology*

**2007**,

*14*(1-3), 48-58.

**2006**,

*22*(2):89-95 Guo J X, Ma B G, Zhang H Y.

*Acta Biophys Sin*,

**2006**,

*22*(2):89-95.

*Current Bioinformatics*,

**2008**,

*3*(1): 1-9

*J MolBiol*,

**1998**,

*277*(4): 985-994

*J Mol Biol*,

**2001**,

*310*(1): 27-32

*Biophys J*,

**2002**,

*82*(1): 458-463

*et al*. Structural determinants of the rate of protein folding.

*J Theor Biol*,

**2003**,

*223*(3): 299-307

*J Mol Biol*,

**2003**,

*332*(4): 953-963

*Protein Sci*,

**2003**,

*12*(9): 2057-2062

*Annu Rev Biophys Biomol Struct*,

**2001**,

*30*(1):361-396

*J MolBiol*,

**2003**,

*327*(5): 1149-1154

*Proc Nat Acad Sci USA*,

**2004**,

*101*(24): 8942-8944

*Protein Sci*,

**2006**,

*15*(8): 1829-1834

*Proteins*,

**2007**,

*67*(1): 12-17

*Biochemistry*,

**2006**,

*45*(11):3805-3812

*Protein Pept Lett*,

**2003**,

*10*(3): 277-280

*Proteins*,

**2003**,

*51*(2): 162-166

*Proteins*,

**2006**,

*63*(3): 551-554

*J Chem Inf Model*,

**2005**,

*45*(2): 494-501

*Proteins*,

**2006**,

*65*(2): 362-372

*Nucleic Acids Res*,

**2006**,

*34*(suppl_2): 70-74.

*Protein Sci*,

**2008**,

*17*(7): 1256-1263

*J ComputChem*,

**2008**,

*29*(10): 1675-1683

*J Biomedical Science and Engineering*,

**2009**,

*2*(3): 136-143

*J Comput Chem*,

**2009**,

*30*(5): 772-783

*BMC systems biology*

**2012**,

*6*(1), 134.

*Nucleic acids research*

**2006**,

*34*(suppl 2), W70-W74.

*Proteins: Structure, Function, and Bioinformatics*

**2001**,

*43*(3), 246-255.

*Proteins: Structure, Function, and Bioinformatics*

**2003**,

*51*(2), 162-166.