Team:TU-Delft/NovelPeptides

From 2013.igem.org

Revision as of 14:11, 2 October 2013 by Dimitra

Novel Peptides

The field of antimicrobial peptides (AMPs) is growing rapidly in response to the demand for novel antimicrobial agents. In particular, AMPs are promising candidates in the fight against antibiotic-resistant pathogens because of their low toxicity and broad range of activity. AMPs are generally between 12 and 50 amino acids long. They include two or more positively charged residues, provided by arginine, lysine or, in acidic environments, histidine, and a large proportion of hydrophobic residues.

Because AMPs are an active research area, both the knowledge and the experimentally validated data are growing rapidly. We decided to use these data to create novel peptides that are highly toxic for S. aureus but have low toxicity for E. coli. The method we developed is described in the following sections.

Data and Feature extraction

The necessary data were acquired from CAMP, the Collection of Anti-Microbial Peptides database. The database contains 3789 records with MIC values, but only the experimentally validated records that target both E. coli and S. aureus were taken into account. The acquired records were separated into 4 classes based on their MIC values:

class 0: Toxic for both S. aureus and E. coli
class 1: Toxic for S. aureus but not for E. coli
class 2: Toxic for E. coli but not for S. aureus
class 3: Non-toxic for both E. coli and S. aureus
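The class assignment can be sketched in a few lines of Python. The page does not state the MIC cutoff that separates "toxic" from "non-toxic", so the `threshold` parameter, its default value and the function name are assumptions made purely for illustration.

```python
def assign_class(mic_s_aureus, mic_e_coli, threshold=32.0):
    """Return the class label (0-3) for one peptide.

    A peptide counts as toxic for an organism when its MIC is at or
    below `threshold`. The cutoff is hypothetical; the original page
    does not report the value that was used.
    """
    toxic_sa = mic_s_aureus <= threshold
    toxic_ec = mic_e_coli <= threshold
    if toxic_sa and toxic_ec:
        return 0  # toxic for both
    if toxic_sa:
        return 1  # toxic for S. aureus only (the class of interest)
    if toxic_ec:
        return 2  # toxic for E. coli only
    return 3      # non-toxic for both


if __name__ == "__main__":
    # A peptide with a low MIC for S. aureus and a high MIC for E. coli.
    print(assign_class(mic_s_aureus=8, mic_e_coli=128))
```

Both MIC values must be in the same unit as the cutoff.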

The next step was feature extraction for each of the collected peptides. This yields 21 features per sequence [1][2][3]. The attributes of each peptide are either general, such as the length of the sequence, or specific to AMP properties. They are listed below:

length
charge
proline frequency
glycine frequency
occurrence of hydrophobic residues
hydropathy
C terminus
N terminus
polarity

Because the peptides differ in length, only the first 3 and last 3 positions were examined for the N and C terminus.
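A minimal sketch of how some of these features could be computed from a sequence is given below. It is not the team's actual extraction code: the Kyte-Doolittle hydropathy scale, the simple net-charge count (K and R positive, D and E negative, histidine ignored) and the dictionary keys are all assumptions, and only a subset of the 21 features is covered.

```python
# Kyte-Doolittle hydropathy values (assumed scale; the page does not
# say which hydropathy scale was used).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def extract_features(sequence, k=3):
    """Compute a subset of the listed features for one peptide."""
    seq = sequence.upper()
    length = len(seq)
    positive = seq.count("K") + seq.count("R")
    negative = seq.count("D") + seq.count("E")
    return {
        "length": length,
        "charge": positive - negative,           # simplified net charge
        "proline_freq": seq.count("P") / length,
        "glycine_freq": seq.count("G") / length,
        # Mean hydropathy over the whole sequence (GRAVY-style average).
        "hydropathy": sum(KYTE_DOOLITTLE[aa] for aa in seq) / length,
        "n_terminus": seq[:k],                   # first k residues
        "c_terminus": seq[-k:],                  # last k residues
    }


if __name__ == "__main__":
    for name, value in extract_features("GFGLCKNKAFGLL").items():
        print(name, value)
```

Fixing `k` at 3 keeps the terminus features comparable across peptides of different lengths.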

Rule Learning

After creating the final data set, we used WEKA, a machine learning toolkit that contains a collection of machine learning algorithms for data mining tasks. In our case, we chose the NNge algorithm to perform association rule mining [4].

Association rule mining is a method for discovering interesting relations between variables in data sets. In this way, it is possible to discover rules that represent the class of interest and to use them to create our novel peptides. Some of the identified rules are shown in Figure 1.

Figure 1: Some rules discovered by NNge

Based on the rules discovered, one can conclude, for example, that a peptide belongs to class 1 if it has a charge of 1-4 or 10 and a hydropathy between a minimum of -0.37 and a maximum of 1.82 (dependent on the peptide length). Moreover, its first 3 amino acids are either FLP or GLL and its last 3 amino acids are RLL, GLL or FGL. The amino acid sequence between the N and C terminus has to contain 0-2 prolines, 0-4 or 7 prolines and different frequencies for specific hydrophobic residues. Last but not least, the remaining amino acids can be included in the sequence between the N and C terminus, as their occurrence is of no importance for AMPs, but the peptide properties (such as hydropathy) must still be satisfied.
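The example rule above can be written as a simple predicate. The sketch below is illustrative only: it takes precomputed features in a dictionary whose keys are hypothetical, and it checks only the charge, hydropathy and terminus conditions. The residue-count conditions are left out because their exact ranges are not fully specified in the text.

```python
def matches_example_rule(features):
    """Check the charge, hydropathy and terminus conditions of the
    class 1 example rule. `features` is a dict with the hypothetical
    keys 'charge', 'hydropathy', 'n_terminus' and 'c_terminus'.
    """
    charge_ok = features["charge"] in {1, 2, 3, 4, 10}
    hydropathy_ok = -0.37 <= features["hydropathy"] <= 1.82
    n_terminus_ok = features["n_terminus"] in {"FLP", "GLL"}
    c_terminus_ok = features["c_terminus"] in {"RLL", "GLL", "FGL"}
    return charge_ok and hydropathy_ok and n_terminus_ok and c_terminus_ok


if __name__ == "__main__":
    # Made-up feature values for a candidate peptide.
    candidate = {
        "charge": 2,
        "hydropathy": 1.0,
        "n_terminus": "FLP",
        "c_terminus": "GLL",
    }
    print(matches_example_rule(candidate))
```

A peptide that fails this predicate may still be covered by another class 1 rule, since Figure 1 shows only some of the rules.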

Model Evaluation

To evaluate the performance of our model, we investigated its ability to correctly predict, or separate, the classes. For that reason, accuracy, precision, recall and F-measure were computed. A brief explanation of each measure is given below.

Accuracy: the overall correctness of the model
Precision: percentage of positive predictions that are correct
Recall: true positive rate (percentage of positive cases that are caught)
F-measure: a measure that combines precision and recall
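For a single class, these measures follow directly from the confusion-matrix counts. The sketch below assumes the balanced form of the F-measure (the harmonic mean of precision and recall); the function name and the counts in the usage example are made up for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f_measure) for one class.

    tp/fp/fn/tn are the true positive, false positive, false negative
    and true negative counts for that class.
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return accuracy, precision, recall, f_measure


if __name__ == "__main__":
    # Illustrative counts, not taken from the model described here.
    print(classification_metrics(tp=8, fp=2, fn=2, tn=8))
```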

Our model achieved the following results:
Accuracy: 94.4149%
Detailed Accuracy by class

	Class	Precision	Recall	F-measure
	1	0.955	0.986	0.97
	2	0.917	0.611	0.733
	3	0.963	0.867	0.912
	4	0.737	0.875	0.8
Weighted Avg.		0.945	0.944	0.942
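As a consistency check, the per-class F-measure can be recomputed from the reported precision and recall, assuming it is the harmonic mean of the two. The values below are copied from the table; the variable and function names are illustrative.

```python
# (precision, recall, reported F-measure) per class, from the table above.
REPORTED = {
    "1": (0.955, 0.986, 0.970),
    "2": (0.917, 0.611, 0.733),
    "3": (0.963, 0.867, 0.912),
    "4": (0.737, 0.875, 0.800),
}


def harmonic_f(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    for label, (p, r, reported_f) in REPORTED.items():
        print(label, round(harmonic_f(p, r), 3), reported_f)
```

The recomputed values agree with the reported ones to within rounding, which supports the harmonic-mean reading.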

Final Created Peptides

The generated rules were taken into consideration to create our final peptides. First, we decided to create peptides that are 13 amino acids long in order to avoid post-translational modification. The next step was to set the amino acids at the N and C terminus, because these proved to be of great importance for the toxicity and selectivity of the peptides. We also set the number of prolines, glycines and specific hydrophobic amino acids to satisfy the rules, since the frequency of these amino acids proved to be of great importance for AMPs. The rest of the amino acids were chosen so as to satisfy the remaining rules. We also took the hydrophobic nature of the peptides into account: we tried to design them so that they satisfy the rules without being highly hydrophobic. In this way we sought to ensure that the peptides are not toxic for humans, as toxicity to humans is directly related to the peptide's hydrophobic nature.

Finally, it was also important to ensure that each synthesized peptide would have a high probability of working. For that reason, after synthesizing the peptides we also checked them against the aforementioned criteria.

The amino acid sequence of each peptide and its properties are shown below.

Peptidor: GFGLCKNKAFGLL

Figure 2: Peptidor properties Figure 3: Peptidor amino acid composition

The Peptidor peptide was found to be similar to the MIRJA antimicrobial peptide (E-value: 6.5). That peptide does not target E. coli but does target Gram-positive bacteria.
We also ran the SVM classifier of the CAMP database to predict the antimicrobial nature of the peptide.

Sequence Id	Class	Probability
Unknown	AMP	0.961

Derpini: FLPILGVARKGLL

Figure 4: Derpini properties Figure 5: Derpini amino acid composition

The Derpini peptide was found to be similar to both Vespid chemotactic peptide 5h and Temporin-1CSb (E-value: 3.6). Temporin is an AMP with MIC = 128 μM for E. coli and MIC = 8 μM for S. aureus. The other AMP is inactive against E. coli but active against S. aureus.
The SVM classifier in CAMP predicted the peptide to be antimicrobial.

Sequence Id	Class	Probability
Unknown	AMP	0.955

Staphycine: FLPLLASLFSRLL

Figure 6: Staphycine properties Figure 7: Staphycine amino acid composition

Staphycine was found to be similar to Temporin-1CSb (E-value: 0.011).
Temporin has MIC = 70 μM for E. coli and MIC = 2 μM for S. aureus.

Sequence Id	Class	Probability
Unknown	AMP	0.862
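The length and amino acid composition of the three designed peptides can be recomputed from the sequences listed above. The sketch below uses only the Python standard library; the dictionary and function names are illustrative.

```python
from collections import Counter

# Sequences as given on this page.
PEPTIDES = {
    "Peptidor": "GFGLCKNKAFGLL",
    "Derpini": "FLPILGVARKGLL",
    "Staphycine": "FLPLLASLFSRLL",
}


def composition(sequence):
    """Return the fraction of each amino acid present in the sequence."""
    counts = Counter(sequence)
    return {aa: n / len(sequence) for aa, n in sorted(counts.items())}


if __name__ == "__main__":
    for name, seq in PEPTIDES.items():
        print(name, len(seq), composition(seq))
```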

Our lab team tested the synthesized peptides. Peptidor worked well and Staphycine worked as expected, whereas Derpini did not work at all. The MICs of the newly synthesized peptides were determined by lab experiments and are presented in the following figures.

Figure 8: MICs of Peptidor

Figure 9: MICs of Staphycine

For more information, check our lab pages!

Discussion

As observed, there is a large set of generated rules, with some rules overlapping between the classes. It is highly probable that one peptide failed to work for this reason. The model has limitations: the experimentally validated data set is small, and the number of samples that belong to the class of interest is limited compared to the other classes. In the future, the model could be improved by better feature selection and/or by using different algorithms. However, this requires that all currently available data be experimentally validated and that more data be included in the current databases.
