Team:TU-Delft/NovelPeptides

From 2013.igem.org

Novel Peptides

The antimicrobial peptide(AMPs) field is growing rapidly in response to the demand for novel antimicrobial agents. In particular AMPs are promising candidates in the fight against antibiotic-resistant pathogents due to their low toxicity, and broad range of activity. Antimicrobial peptides are generally between 12 and 50 amino acids long. These peptides include two or more positively charged residues provided by arginine, lysine or, in acidic environments, histidine, and a large proportion of hydrophobic residues [1].

Due to the fact that AMPs constitute a current research area, both the knowledge and the experimentally validated data are rapidly increasing.It was decided to use these data in order to create novel peptides which will be high toxic for S. aureus but low toxic for E. coli. The method that was developed(Figure 1) is described in the following sections.

Figure 1: Schematic Description of the Method

Data and Feature extraction

The necessary data were acquired from the CAMP: Collection of Anti-Microbial Peptides Database. The database contains 3789 records with MIC values but only the records that target both E. coli and S. aureus(and are experimentally validated) were taken into account. The acquired records were seperated into 4 classes based on the MIC values:

class 0: Toxic for both S. aureus and E. coli
class 1: Toxic for S. aureus but not for E. coli
class 2: Toxic for E. coli but not for S. aureus
class 3: Non Toxic for both E. coli and S. aureus

The next step is related to the feature extraction for each one of the collected peptides.The resulting number of features per sequence is 21 [1][2][3] .In particular, the attributes for each peptide are either general such as the length of the sequence or specific based on AMPs properties.A list of them is presented underneath:

length
charge
prolines' frequency
glycines' frequency
hydrophobic residues appearance
hydropathy
C terminus
N terminus
polarity

The N and C terminus were examined only for 3 positions due to the different size of each peptide.

Rule Learning

After creating the final data set, a machine learning toolkit, WEKA, was used. In particular, WEKA contains a collection of machine learning algorithms for data mining tasks. In our case, it was decided to use nnge algorithm in order to perform association rule mining [4][6](Figure 2).

Figure 2: NNge Description

By the term association rule mining, a method for discovering interesting relations between variables in data sets is described.In that way, it is possible to discover rules that represent the class of interest and create our novel peptides!Some of the rules identified can be seen in Figure 3.

Figure 3: Some rules discovered by NNge

In order to give a better understanding of the design process, the rules that referred to the class of interest(class 1) are presented below.

13-14/23/24/33/37/40/44 aa length
charge of 1-4 or 10
hydropathy with a minimum value of -0.37 and a maximum value of 1.82(dependent on the peptide length)
the 3 first amino acids are either FLP or GLL
the 3 last amino acids are RLL, GLL or FGL
in between N and C terminus amino acid sequence:0-2 prolines
in between N and C terminus amino acid sequence:0-4 or 7 glycines
frequencies for specific hydrophobic residues, like tryptophan: 0 or 1 or 3

Model Evaluation

In order to evaluate the performance of our model, we are interested in investigating the ability of the model to correctly predict or separate the classes. For that reason, the measurements accuracy, precision , recall and F-measure are computed. A brief explanation for each measurement is presented below.

Accuracy: the overall correctness of the model
Precision:percent of positive predictions which are correct
Recall:true positive rate (percent of positive cases that you can catch)
F-measure:a measure that combines precision and recall

In our case, we succeeded in the aforementioned results:
Accuracy: 94.4149 %
Detailed Accuracy by class

	Class	Precision	Recall	F-measure
	1	0.955	0.986	0.97
	2	0.917	0.611	0.733
	3	0.963	0.867	0.912
	4	0.737	0.875	0.8
Weighted Avg.		0.945	0.944	0.942

Final Created Peptides

The rules that generated are taken into consideration in order to create our final peptides.First of all it was decided to create peptides which are 13 amino acids long in order to avoid post translation modification. The next step was to set the amino acids for the N and C terminus because it was proven to be of great importance for the the toxicity and selectivity of the peptides. As it has been already described the 3 first amino acids, N terminus, are fixed and can be either FLP or GLL. Same is the case for the 3 last amino acids, C terminus, which can be only RLL, GLL or FGL for class 1. We also set the number of prolines, glycines and specific hydrophobic amino acids to satisfy the rules due to the fact that the amino acid composition of these specific amino acids proved to be of great importance for the AMPs. The rest of the amino acids were chosen so as to satisfy the remaining rules.

It is also necessary to be mentioned that we designed our peptides by taking into consideration their hydrophobic nature. We tried to design them in a way that they will both satisfy the rules and they will not be highly hydrophobic. In that way we ensured that the peptides will not be toxic for humans as the toxicity to humans is directly related and influenced by the peptide's hydrophobic nature [5].

Finally it was also significant to ensure that the synthesized peptide would have a high probability of working. For that reason after synthesizing the peptides we also checked the aforementioned criteria.

The amino acid sequences for each peptide and their properties are depicted underneath.

peptidor : GFGLCKNKAFGLL

Figure 4: peptidor properties Figure 5: peptidor amino acid composition

The peptidor peptide was also proven to have similarity with the MIRJA antimicrobial peptide(E- Value 6.5). The specific peptide do not target E. coli but it targets Gram positive bacteria.
We also run SVM classifier in CAMP database for predicting the antimicrobial nature of the peptide.

Sequence Id Class Probability

Unknown AMP 0.961
derpini: FLPILGVARKGLL

Figure 6: derpini properties Figure 7: derpini amino acid composition

The derpini peptide was proven to have similarity with both Vespid chemotactic peptide 5h and Temporin-1CSb(E-value: 3.6). Temporin is an AMP which has MIC = 128 μM for E. coli and MIC = 8 μM for S. aureus. The other AMP is inactive against E. coli but active against S. aureus.
After running SVM classifier in CAMP the peptide was predicted as antimicrobial.

Sequence Id Class Probability

Unknown AMP 0.955
staphycine: FLPLLASLFSRLL

Figure 8: staphycine properties Figure 9: staphycine amino acid composition

staphycine was proven to have similarity with Temporin-1CSb(E-value: 0.011).
Temporin has MIC = 70 μM for E.Coli and MIC = 2 μM for S.Aureus.

Sequence Id Class Probability

Unknown AMP 0.862

Our lab people test our synthesized peptides in the lab!!!The peptidor peptide worked well. It was highly toxic against S. delphini with an MIC of 40μΜ, but showed low toxicity against both E. coli and COS-1 cells(Figure 10A, Figure 12A, Figure 13). Staphycine peptide worked as expected, being highly toxic to S.delphini(MIC of 30-40μM) but not toxic to E. coli and COS-1 cells(Figure 10B, Figure 12B, Figure 13). Last but not least, derpini did not work as expected as it was proven not to be toxic for S. delphini .

Figure 10: MICs of peptidor. 10A: peptidor on S. delphini, 10B: peptidor on B. subtilus

Figure 11: MICs of Staphicine. MICs of staphycine 11A:staphycine on S. delphini, 11B: staphycine on B. subtilus

Figure 12: 12A peptidor on E. coli, 12B staphycine on E. coli

Figure 13: Toxicity test of AMPs on COS-1 cells

For more information, check our lab pages!

Discussion

As observed there is a large set of generated rules and some overlapping rules between the classes. It is highly probable that one peptide failed to work due to this reason. The are limitation to the specific model and this is related not only to the fact that the experimentally validated data set is of small size but also to the fact that the number of samples that belong to the class of interest is limited compared to the other classes. In the future, it is possible to improve the model by performing a better feature selection and/or using different algorithms. However, it is necessary for all the data that are currently available to be experimentally validated and more to be included in the current databases.

Sequence Id	Class	Probability
Unknown	AMP	0.961

Sequence Id	Class	Probability
Unknown	AMP	0.955

Sequence Id	Class	Probability
Unknown	AMP	0.862