Team:TU-Munich/Results/Software

From 2013.igem.org

Revision as of 12:47, 24 September 2013 by ChristopherW (Talk | contribs)


The AutoAnnotator

Introduction to the Idea behind our AutoAnnotator


The Parts Registry contains a wide range of interesting protein-coding BioBricks, but there is no standardized way of presenting basic information about them. This is a pity, because once the open reading frame has been identified, many parameters of the protein can be computed automatically, e.g. its molecular mass, theoretical pI or codon quality for different organisms. We have developed a tool that identifies the open reading frame of a BioBrick, analyzes the sequence and the encoded protein, and exports the results as a single table that can easily be integrated into the part description (and the team wikis).
This lets users see basic information about a BioBrick at a glance in a standardized table, which saves time, makes BioBricks easier to compare and improves the annotation of the parts. The AutoAnnotator can also be used for planning new BioBricks, since it computes the relevant parameters in a single place rather than requiring the user to gather the information from several different websites.
Try it out: The AutoAnnotator!

Overview

The AutoAnnotator is a web-based tool compiling information about encoded proteins from the DNA sequence. It performs the following steps:

Input: When a BioBrick number is entered, the AutoAnnotator imports the nucleotide sequence directly from the Registry database.
Alternatively, a nucleotide sequence can be entered directly. This is required for new BioBricks that are not in the Registry yet, and is also helpful for planning new BioBricks.
Finding the Open Reading Frame: To determine the open reading frame (ORF), the algorithm first tries to determine which BioBrick assembly standard the BioBrick is in. If necessary (e.g. for an [http://parts.igem.org/Assembly_standard_25 RFC 25] BioBrick), nucleotides are added to the sequence. The ORF is then taken to run from the first start codon to the first in-frame stop codon following it.
Computation of Parameters: The codon usage for different organisms, i.e. whether the preferred codons are used (which contributes to the level of gene expression), is computed directly from the nucleotide sequence. After the DNA sequence has been translated into its amino acid sequence, several parameters of the encoded protein are determined: the amino acid composition, the number of charged amino acids, the atomic composition, the molecular mass, the isoelectric point (pI) and the extinction coefficient. Moving averages of the hydrophobicity and charge of the residues are also calculated. For more information on each of these see below. In addition, the sequence is compared against a list of sequence features, such as binding sites or cleavage sites.
Alignments and Predictions: The amino acid sequence is also sent to the servers of [http://predictprotein.org PredictProtein.org], a site created and operated by the [http://rostlab.org Rost group], who kindly granted us access to their online resources. The servers provide alignments for the entered protein as well as predictions of the secondary structure, the solvent accessibility, the sub-cellular localization, the gene ontology, transmembrane regions and disulfide bridges. Proteins that are not among the more than 30 million stored there are computed and added, and the results should be ready within a few hours.
Presentation of the Computed Data: The data is then assembled into a concise, structured HTML table and displayed to the user. Sequence features and predictions are additionally shown in an optional plot as part of the table. The code producing the table is displayed underneath it, so the table can be integrated into any wiki, part description or other website with a single copy and paste.
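
The moving averages mentioned in the parameter step can be illustrated with a short Python sketch. This is not the AutoAnnotator's own code (which is written in JavaScript): the hydrophobicity scale used by the tool is not stated on this page, so the Kyte-Doolittle scale is assumed here, and the function names and the window size of 5 are hypothetical.

```python
# Kyte-Doolittle hydropathy values (assumed scale; the tool's actual scale is not stated here).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def moving_average(values, window=5):
    """Return the averages of all full windows of the given size."""
    if window < 1 or window > len(values):
        return []
    total = sum(values[:window])
    averages = [total / window]
    # Slide the window: add the entering value, drop the leaving one.
    for i in range(window, len(values)):
        total += values[i] - values[i - window]
        averages.append(total / window)
    return averages


def hydrophobicity_profile(protein, window=5):
    """Moving average of per-residue hydrophobicity along the protein."""
    return moving_average([KYTE_DOOLITTLE[aa] for aa in protein], window)
```

A profile for charge could be obtained the same way by replacing the scale with per-residue charges.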

Import of BioBrick Sequences

Upon entering a BioBrick number, the AutoAnnotator uses the [http://parts.igem.org/DAS_-_Distributed_Annotation_System Registry DAS] interface to load the nucleotide sequence from the Registry database. Most browsers block such cross-domain requests for security reasons, so an [http://james.padolsey.com/javascript/cross-domain-requests-with-jquery/ extension] to the .ajax() method in jQuery, written by James Padolsey, is used. It relies on [http://developer.yahoo.com/yql/ YQL] (Yahoo! Query Language), a service by Yahoo!, to route the request via Yahoo!'s servers, which circumvents the browser restriction and allows the AutoAnnotator to read the information from the Registry.

Determination of the Open Reading Frame

The first step is to work out the assembly standard of the BioBrick, since parts of the coding sequence may lie in the prefix or suffix. As of version 1.0, the most common standards, [http://parts.igem.org/Assembly_standard_10 RFC 10] and [http://parts.igem.org/Assembly_standard_25 RFC 25], are supported. The ORF is then taken to start at the first start codon ATG and to end at the first in-frame stop codon following it.
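
The ORF search described here can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the AutoAnnotator's JavaScript implementation: the names `find_orf` and `translate` are hypothetical, the standard genetic code is assumed, and the completion of RFC 25 prefix and suffix nucleotides is left out.

```python
# Standard genetic code in the usual compact TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orf(dna):
    """Return (start, end) of the ORF as 0-based, end-exclusive indices.

    The ORF begins at the first ATG and ends just before the first
    in-frame stop codon. Returns None if either is missing.
    """
    dna = dna.upper()
    start = dna.find("ATG")
    if start == -1:
        return None
    for pos in range(start, len(dna) - 2, 3):
        if dna[pos:pos + 3] in STOP_CODONS:
            return start, pos
    return None


def translate(orf):
    """Translate a nucleotide sequence codon by codon."""
    return "".join(CODON_TABLE[orf[i:i + 3]] for i in range(0, len(orf) - 2, 3))
```

For the shortened sequence `GTACACAATGCGTCGTTTCGAAAAATAA`, built from the two ends of the example further down, `find_orf` returns `(7, 25)`, i.e. the ORF starts at nucleotide position 8 in 1-based counting, and `translate` gives `MRRFEK`.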

Recognising Sequence Features

Several useful building blocks are frequently integrated into BioBricks, such as tags for analytical purposes or cleavage and docking sites for protein interaction. We have put together a list of such common sequence features. The AutoAnnotator automatically looks for these, lists the features it finds and marks them in the amino acid sequence. For the currently supported features please see the Feature List. If you have any suggestions for other interesting features, please get in touch and we will add them.
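
A feature search of this kind can be sketched as a plain substring search on the amino acid sequence. The sketch below is only an illustration: `FEATURES` and `find_features` are hypothetical names, and the dictionary holds a single entry, the Strep-tag II motif WSHPQFEK, rather than the tool's full feature list.

```python
# Hypothetical, shortened feature list: name -> amino acid motif.
FEATURES = {"Strep-tag II": "WSHPQFEK"}


def find_features(protein, features=FEATURES):
    """Return (name, first, last) tuples with 1-based inclusive positions."""
    hits = []
    for name, motif in features.items():
        pos = protein.find(motif)
        while pos != -1:
            hits.append((name, pos + 1, pos + len(motif)))
            # Continue searching after the start of the current hit.
            pos = protein.find(motif, pos + 1)
    return hits
```

Applied to the C-terminal end `KGTGAWSHPQFEK` of the example protein, this reports the Strep-tag II at positions 6 to 13 of that fragment.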

Computation of Parameters

Amino acid counting, atomic composition and molecular weight

Counting the amino acids is straightforward given the amino acid sequence. From the amino acid composition, the atomic composition can then be calculated by summing the atomic composition of each residue (e.g. given [http://www.matrixscience.com/help/aa_help.html here]) and adding one water molecule (for the two ends of the chain). Similarly, the molecular weight is obtained by adding the individual residue weights (using the [http://web.expasy.org/findmod/findmod_masses.html#AA average isotopic masses]) and again adding the molecular weight of one water molecule.
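
As a rough sketch of this calculation in Python (the function names are hypothetical, and the table holds the commonly tabulated average residue masses, which are assumed here to match the linked ExPASy values):

```python
from collections import Counter

# Average isotopic masses of the amino acid residues in Da (assumed values).
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01524  # one water molecule for the free N- and C-terminus


def amino_acid_counts(protein):
    """Count how often each amino acid occurs."""
    return Counter(protein)


def molecular_mass(protein):
    """Sum of the average residue masses plus one water molecule."""
    counts = amino_acid_counts(protein)
    return sum(AVERAGE_RESIDUE_MASS[aa] * n for aa, n in counts.items()) + WATER_MASS
```

With the amino acid composition of the example table further down (566 residues), this sketch gives about 66105.3 Da, in agreement with the value listed there.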

Theoretical pI

The theoretical pI is the isoelectric point of the protein, ignoring effects due to folding, which cannot be computed reliably. By definition, the isoelectric point of a protein is the pH value at which the overall charge of the protein is zero, so we need to relate the pH value to the charges of the amino acids. For an acid group HA this is done with the Henderson-Hasselbalch equation, where pK_a is the negative decadic logarithm of the acid dissociation constant:

$$\mathrm{pH} = \mathrm{p}K_a + \log_{10}\frac{[\mathrm{A}^-]}{[\mathrm{HA}]}$$

We can rearrange this to get the fraction of molecules that are deprotonated and therefore negatively charged:

$$\frac{[\mathrm{A}^-]}{[\mathrm{HA}] + [\mathrm{A}^-]} = \frac{1}{1 + 10^{\mathrm{p}K_a - \mathrm{pH}}}$$

Analogously, by regarding HB⁺ as an acid, where B is a base, we obtain the fraction of positively charged molecules:

$$\frac{[\mathrm{HB}^+]}{[\mathrm{HB}^+] + [\mathrm{B}]} = \frac{1}{1 + 10^{\mathrm{pH} - \mathrm{p}K_a}}$$

These fractions can also be regarded as a "fractional charge", because they give the average charge over all molecules of this type. By adding up the fractional charges of all amino acids (those with neither basic nor acidic side chains contribute no charge) and those of the N- and C-terminal groups, we can determine the charge of the protein at a specific pH. The dissociation constants were taken from [http://www.ncbi.nlm.nih.gov/pubmed/8125050 Bjellqvist et al., 1993] and [http://www.ncbi.nlm.nih.gov/pubmed/8055880 Bjellqvist et al., 1994]; they are the same values used by the [http://web.expasy.org/protparam/ ExPASy ProtParam Tool] and are shown in the tables below:

Positively charged groups
Group	pK_a
Lysine residue	10.00
Arginine residue	12.00
Histidine residue	5.98
N-terminal -NH₂ (unless specified otherwise)	7.50
N-terminal -NH₂ on Alanine	7.59
N-terminal -NH₂ on Methionine	7.00
N-terminal -NH₂ on Serine	6.93
N-terminal -NH₂ on Proline	8.36
N-terminal -NH₂ on Threonine	6.82
N-terminal -NH₂ on Valine	7.44
N-terminal -NH₂ on Glutamic acid	7.70

Negatively charged groups
Group	pK_a
C-terminal -COOH	3.55
Aspartic acid residue	4.05
Glutamic acid residue	4.45
Cysteine residue	9.00
Tyrosine residue	10.00

Now all that remains is to find the pH at which the total charge is zero. Because the total charge decreases monotonically with increasing pH, this is most easily done by the bisection method: start with pH = 7.0 and determine the charge there. If it is positive, the pI must be greater than 7.0, so only the upper half of the pH range is considered further. If it is negative, we continue with the lower half. In the chosen subinterval we again evaluate the charge at the midpoint and select a subinterval accordingly. Each iteration halves the remaining range of pH values, so the theoretical isoelectric point can be determined up to the required precision by continuing until the remaining range is smaller than that precision. Note, however, that the pK_a values are only estimates that depend on the experimental procedure (so many different values can be found in the literature), and that modifications to the protein and the formation of disulfide bridges affect the isoelectric point significantly. It therefore does not make sense to choose a precision finer than 0.01.
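
The charge calculation and the bisection search can be put together in a short Python sketch. It uses the pK_a values from the two tables above; the function names and the search interval from pH 0 to 14 (whose midpoint is the starting value 7.0) are assumptions made for this illustration and are not taken from the AutoAnnotator source.

```python
POSITIVE_PKA = {"K": 10.00, "R": 12.00, "H": 5.98}
NEGATIVE_PKA = {"D": 4.05, "E": 4.45, "C": 9.00, "Y": 10.00}
# The pK_a of the N-terminal amino group depends on the first residue.
N_TERMINAL_PKA = {"A": 7.59, "M": 7.00, "S": 6.93, "P": 8.36,
                  "T": 6.82, "V": 7.44, "E": 7.70}
DEFAULT_N_TERMINAL_PKA = 7.50
C_TERMINAL_PKA = 3.55


def net_charge(protein, ph):
    """Sum of the fractional charges of all ionizable groups at the given pH."""
    if not protein:
        raise ValueError("empty sequence")
    n_term = N_TERMINAL_PKA.get(protein[0], DEFAULT_N_TERMINAL_PKA)
    charge = 1.0 / (1.0 + 10 ** (ph - n_term))            # N-terminus (+)
    charge -= 1.0 / (1.0 + 10 ** (C_TERMINAL_PKA - ph))   # C-terminus (-)
    for aa in protein:
        if aa in POSITIVE_PKA:
            charge += 1.0 / (1.0 + 10 ** (ph - POSITIVE_PKA[aa]))
        elif aa in NEGATIVE_PKA:
            charge -= 1.0 / (1.0 + 10 ** (NEGATIVE_PKA[aa] - ph))
    return charge


def theoretical_pi(protein, precision=0.01):
    """Bisection on [0, 14]: the first midpoint evaluated is pH 7.0."""
    low, high = 0.0, 14.0
    while high - low > precision:
        mid = (low + high) / 2
        if net_charge(protein, mid) > 0:
            low = mid    # still positively charged: the pI lies above mid
        else:
            high = mid   # negatively charged: the pI lies below mid
    return (low + high) / 2
```

For a sequence starting with Met and containing the charged residues of the example table further down (36 Lys, 25 Arg, 18 His, 34 Asp, 48 Glu, 4 Cys, 27 Tyr), the sketch converges close to the listed value of 5.38.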

Extinction coefficient at 280 nm

The calculation of the extinction coefficient of a protein at 280 nm from its amino acid composition is straightforward ([http://www.ncbi.nlm.nih.gov/pubmed/2610349 Gill and von Hippel, 1989]). The only residues absorbing at this wavelength are those of Tyrosine, Tryptophan and Cystine (which consists of two Cysteines forming a disulfide bridge). The extinction coefficient is then given by the following sum, where the molar coefficients 1490, 5500 and 125 M^-1 cm^-1 are the ones that reproduce the values in the example table below:

$$\varepsilon_{280} = \mathrm{Numb(Tyr)} \cdot 1490 + \mathrm{Numb(Trp)} \cdot 5500 + \mathrm{Numb(Cystine)} \cdot 125 \quad [\mathrm{M^{-1}\,cm^{-1}}]$$

where Numb(amino acid) is the number of occurrences of that amino acid in the protein. Since the number of disulfide bridges actually formed cannot be calculated from the sequence, two values are reported: one under the assumption that all Cysteines are reduced, i.e. that there are no disulfide bridges, and one assuming that every Cysteine is oxidized and hence part of a disulfide bridge.
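
The two values can be computed as in the following Python sketch. The function name is hypothetical, and the coefficients 1490, 5500 and 125 M^-1 cm^-1 are assumed; they were chosen because they reproduce the values in the example table further down. With an odd number of Cysteines, the sketch counts only complete pairs as Cystines.

```python
TYR_COEFFICIENT = 1490      # M^-1 cm^-1 per Tyrosine (assumed)
TRP_COEFFICIENT = 5500      # M^-1 cm^-1 per Tryptophan (assumed)
CYSTINE_COEFFICIENT = 125   # M^-1 cm^-1 per disulfide bridge (assumed)


def extinction_coefficients(protein):
    """Return (all Cys reduced, all Cys oxidized) coefficients at 280 nm."""
    reduced = (TYR_COEFFICIENT * protein.count("Y")
               + TRP_COEFFICIENT * protein.count("W"))
    # Every pair of Cysteines is assumed to form one Cystine.
    oxidized = reduced + CYSTINE_COEFFICIENT * (protein.count("C") // 2)
    return reduced, oxidized
```

For 27 Tyr, 14 Trp and 4 Cys, as in the example protein, this yields 117230 and 117480.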

Codon Adaptation Index (CAI)

The Codon Adaptation Index (CAI) was introduced by [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC340524/ Sharp and Li, 1987].
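
In the definition by Sharp and Li, each codon receives a relative adaptiveness w, its frequency in a reference set of highly expressed genes divided by the frequency of the most frequent synonymous codon, and the CAI of a gene is the geometric mean of w over its codons. The Python sketch below illustrates this; the function names are hypothetical, codons without a weight (for example Met and Trp, which have no synonyms) are simply skipped, and the reference counts in the usage note are toy numbers, not real codon usage data.

```python
import math


def relative_adaptiveness(reference_counts, synonymous_groups):
    """Compute w = count / (count of the most frequent synonymous codon)."""
    weights = {}
    for codons in synonymous_groups:
        best = max(reference_counts.get(codon, 0) for codon in codons)
        if best == 0:
            continue  # no reference data for this amino acid
        for codon in codons:
            weights[codon] = reference_counts.get(codon, 0) / best
    return weights


def codon_adaptation_index(orf, weights):
    """Geometric mean of the weights of all codons that have a nonzero weight."""
    logs = []
    for i in range(0, len(orf) - 2, 3):
        w = weights.get(orf[i:i + 3])
        if w:  # skips missing and zero weights in this simplified sketch
            logs.append(math.log(w))
    return math.exp(sum(logs) / len(logs)) if logs else 0.0
```

With toy counts `{"AAA": 30, "AAG": 10}` for the two lysine codons, the ORF `AAAAAA` scores 1.0, while `AAAAAG` scores the square root of 1/3, about 0.58.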

Alignments and Predictions

This information is provided to us by the [http://rostlab.org Rost group], who kindly granted us access to the servers and databases of their prediction site [http://predictprotein.org PredictProtein.org] (see Attributions). The sequence is aligned ([http://www.ncbi.nlm.nih.gov/pubmed/9254694 Altschul et al., 1997]) against the [http://web.expasy.org/docs/swiss-prot_guideline.html Swiss-Prot] (manual annotation and review) and [http://web.expasy.org/docs/userman.html#what_is_trembl TrEMBL] (automatic annotation, extensive) protein databases, as well as against the [http://www.rcsb.org/pdb/static.do?p=general_information/about_pdb/index.html Protein Data Bank (PDB)] for 3D structures. For each of these, the two alignments with the highest identity and all those with an identity above 97% are given, together with the number of amino acids that were aligned. For a fusion protein, for example, it is perfectly possible to get two completely different alignments with 100% identity on different subparts of the amino acid sequence.

The server also produces various predictions. For more information about how these are obtained, please see the corresponding papers by the [http://rostlab.org Rost group]:

for secondary structure and solvent accessibility: [http://www.ncbi.nlm.nih.gov/pubmed/15215403 Rost et al., 2004]
for transmembrane helices: [http://www.ncbi.nlm.nih.gov/pubmed/8844859 Rost et al., 1996]
for disulfide bridges: [http://nar.oxfordjournals.org/content/34/suppl_2/W177.full Frasconi et al., 2006]
for sub-cellular localization: [http://www.ncbi.nlm.nih.gov/pubmed/22962467 Rost et al., 2012]
for gene ontology: [http://www.biomedcentral.com/1471-2105/14/S3/S7 Rost et al., 2013]

These results are stored on the server for more than 30 million amino acid sequences and are available instantly for those. If a sequence is entered that has not been calculated yet, the computations are started and the results should be ready in a few hours. In this case the user is informed and asked to return and rerun the AutoAnnotator later.

Export of the Computed Parameters

The results are then combined in a standardized HTML table, which is presented to the user. The HTML code of the table is also given, allowing a quick and easy copy and paste into any wiki page or part description. Here is an example of the produced table:

Protein data table for BioBrick BBa_K801060 automatically created by the BioBrick-AutoAnnotator version 1.0

Nucleotide sequence in RFC 10: (underlined part encodes the protein)
GTACACAATGCGTCGT ... TTCGAAAAATAA
ORF from nucleotide position 8 to 1705 (excluding stop-codon)

Amino acid sequence: (RFC 25 scars shown in bold, other sequence features underlined; both given below)

1	MRRSANYQPSIWDHDFLQSLNSNYTDEAYKRRAEELRGKVKIAIKDVIEPLDQLDLIDNLQRLGLAHRFETEIRNILNNIYNNNKDYNWRKENLYATSLE
101	FRLLRQHGYPVSQEVFNGFKDDQGGFICDDFKGILSLHEASYYSLEGESIMEEAWQFTSKHLKEVMISKNMEEDVFVAEQAKRALELPLHWKVPMLEARW
201	FIHIYERREDKNHLLLELAKMEFNTLQAIYQEELKEISGWWKDTGLGEKLSFARNRLVASFLWSMGIAFEPQFAYCRRVLTISIALITVIDDIYDVYGTL
301	DELEIFTDAVERWDINYALKHLPGYMKMCFLALYNFVNEFAYYVLKQQDFDLLLSIKNAWLGLIQAYLVEAKWYHSKYTPKLEEYLENGLVSITGPLIIT
401	ISYLSGTNPIIKKELEFLESNPDIVHWSSKIFRLQDDLGTSSDEIQRGDVPKSIQCYMHETGASEEVARQHIKDMMRQMWKKVNAYTADKDSPLTGTTTE
501	FLLNLVRMSHFMYLHGDGHGVQNQETIDVGFTLLFQPIPLEDKHMAFTASPGTKGTGAWSHPQFEK*

Sequence features: (with their position in the amino acid sequence, see the list of supported features)

	RFC25 scar (shown in bold):	556 to 557
	Strep-tag II:	559 to 566

Amino acid composition:

Ala (A)	33 (5.8%)
Arg (R)	25 (4.4%)
Asn (N)	27 (4.8%)
Asp (D)	34 (6.0%)

Cys (C)	4 (0.7%)
Gln (Q)	24 (4.2%)
Glu (E)	48 (8.5%)
Gly (G)	29 (5.1%)

His (H)	18 (3.2%)
Ile (I)	39 (6.9%)
Leu (L)	64 (11.3%)
Lys (K)	36 (6.4%)

Met (M)	16 (2.8%)
Phe (F)	28 (4.9%)
Pro (P)	17 (3.0%)
Ser (S)	33 (5.8%)

Thr (T)	26 (4.6%)
Trp (W)	14 (2.5%)
Tyr (Y)	27 (4.8%)
Val (V)	24 (4.2%)

Amino acid counting

	Total number:	566
	Positively charged (Arg+Lys):	61 (10.8%)
	Negatively charged (Asp+Glu):	82 (14.5%)
	Aromatic (Phe+His+Trp+Tyr):	87 (15.4%)

Biochemical parameters

	Atomic composition:	C₃₀₀₂H₄₅₈₆N₇₇₈O₈₆₈S₂₀
	Molecular mass [Da]:	66105.3
	Theoretical pI:	5.38
	Extinction coefficient at 280 nm [M^-1 cm^-1]:	117230 / 117480 (all Cys red/ox)

Codon usage

	Organism:	E. coli	B. subtilis	S. cerevisiae	A. thaliana	P. patens	Mammals
	Codon quality (CAI):	good (0.71)	good (0.75)	good (0.69)	good (0.78)	codon usage physco	good (0.68)

The BioBrick-AutoAnnotator was created by TU-Munich 2013 iGEM team. For more information please see the documentation.
If you have any questions, comments or suggestions, please leave us a comment.

How to use the BioBrick-AutoAnnotator

Text

Programming of the BioBrick-Autoannotator

The Annotator is a JavaScript program using the jQuery library (version 1.10.0) and an [http://james.padolsey.com/javascript/cross-domain-requests-with-jquery/ extension] to the .ajax() method in jQuery written by James Padolsey (also see Import of BioBrick Sequences above). The output is HTML-code including style markup.

Source code of the BioBrick-Autoannotator version 1.0

Application of our Software-tool

Annotation by TU-Munich 2013 Team

Annotation by other Teams

References:

[http://www.ncbi.nlm.nih.gov/pubmed/6327079 Edens et al., 1984]
[http://www.ncbi.nlm.nih.gov/pubmed/8125050 Bjellqvist et al., 1993] Bjellqvist, B., Hughes, G.J., Pasquali, Ch., Paquet, N., Ravier, F., Sanchez, J.-Ch., Frutiger, S. and Hochstrasser, D.F. (1993). The focusing positions of polypeptides in immobilized pH gradients can be predicted from their amino acid sequences. Electrophoresis 14:1023-1031.
[http://www.ncbi.nlm.nih.gov/pubmed/8055880 Bjellqvist et al., 1994] Bjellqvist, B., Basse, B., Olsen, E. and Celis, J.E. (1994). Reference points for comparisons of two-dimensional maps of proteins from different human cell types defined in a pH scale where isoelectric points correlate with polypeptide compositions. Electrophoresis 15:529-539.
[http://www.ncbi.nlm.nih.gov/pubmed/2610349 Gill and von Hippel, 1989] Gill, S.C. and von Hippel, P.H. (1989). Calculation of protein extinction coefficients from amino acid sequence data. Anal. Biochem. 182:319-326.
[http://www.ncbi.nlm.nih.gov/pmc/articles/PMC340524/ Sharp and Li, 1987] Sharp, P.M. and Li, W.H. (1987). The Codon Adaptation Index - a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 15(3):1281-1295.
[http://www.ncbi.nlm.nih.gov/pubmed/15215403 Rost et al., 2004] Rost, B., Yachdav, G. and Liu, J. (2004). The PredictProtein server. Nucleic Acids Res. 32:321-326.
[http://www.ncbi.nlm.nih.gov/pubmed/9254694 Altschul et al., 1997] Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402.
[http://www.ncbi.nlm.nih.gov/pubmed/8844859 Rost et al., 1996] Rost, B., Fariselli, P. and Casadio, R. (1996). Topology prediction for helical transmembrane proteins at 86% accuracy. Protein Sci. 5:1704-1718.
[http://nar.oxfordjournals.org/content/34/suppl_2/W177.full Frasconi et al., 2006] Ceroni, A., Passerini, A., Vullo, A. and Frasconi, P. (2006). DISULFIND: a disulfide bonding state and cysteine connectivity prediction server. Nucleic Acids Res. 34:177-181.
[http://www.biomedcentral.com/1471-2105/14/S3/S7 Rost et al., 2013] Hamp, T., Kassner, R., Seemayer, S., Vicedo, E., Schaefer, C., Achten, D., Auer, F., Boehm, A., Braun, T., Hecht, M., Heron, M., Honigschmid, P., Hopf, T.A., Kaufmann, S., Kiening, M., Krompass, D., Landerer, C., Mahlich, Y., Roos, M. and Rost, B. (2013). Homology-based inference sets the bar high for protein function prediction. BMC Bioinformatics 14(Suppl. 3):S7.
[http://www.ncbi.nlm.nih.gov/pubmed/22962467 Rost et al., 2012] Goldberg, T., Hamp, T. and Rost, B. (2012). LocTree2 predicts localization for all domains of life. Bioinformatics 28:i458-i465.