Team:Stanford-Brown/Projects/De-Extinction

From 2013.igem.org

Revision as of 08:17, 27 September 2013 by Tkalkus (Talk | contribs)

Introduction

We are exploring genes and proteins of the past. We use bioinformatic techniques to predict ancestral genes and have these genes synthesized in the real world. We then test their function against their modern counterparts.

Amino Acids

There are 20 commonly used amino acids that make up most proteins. This alphabet of “standard” amino acids has evolved over time. We are exploring two genes involved in the synthesis of amino acids: HisC, which is involved in histidine biosynthesis, and CysE, which is involved in cysteine biosynthesis. By predicting ancestral genes and testing their function, we hope to better understand their evolution. The ancestral genes can be tested for basic function using “knockout strains” of E. coli.

CRISPR

The CRISPR project is based on using the CRISPR system for specific recognition and delivery applications. We hope to better understand how a specific part of that system, the CASCADE complex, functions. We predicted an ancestral sequence for CasA, the CASCADE subunit responsible for recognizing and binding to foreign DNA. Using biochemical and functional assays, we hope to better understand the exact mechanism by which this occurs and potentially manipulate it to expand the functionality of this unit in the CRISPR project.

Process Evaluation

The De-Extinction projects are based on the idea that we can accurately predict ancestral gene sequences. In order to test our methods, we are collaborating with Dr. Rich Lenski at Michigan State University. Dr. Lenski has conducted an evolutionary experiment on E. coli for the past 25 years, and his data span 50,000 generations. We are using the most recent sequence data from his experiment to test the accuracy of our ancestral predictions.

Protocols

Click here to see protocols.

Lab Notebook

The De-Extinction team recorded experiments and progress throughout the research period in the De-Extinction Lab Notebook.

Data

Evolutionary Modeling

Background:

Our De-Extinction project was based on the concept of using existing protein and nucleotide information to predict ancestral sequences. These predicted sequences would then be synthesized for testing. As a final part of our project, we wanted to evaluate the accuracy of these methods by reconstructing a known ancestral genome. In order to do this, we collaborated with Dr. Rich Lenski’s lab at Michigan State University.

Dr. Lenski has been running a long-term evolutionary experiment on E. coli for the past 25 years. In order to test our methods, we gathered recent data from his different strains. These data came from 40,000 generations after the start of the experiment. Because this is not a very long time in evolutionary terms, we focused on mutations across the entire genome, rather than on individual significant genes.

Methods:

For our Amino Acid and CRISPR related sequences, we began with amino acid sequence data available from the Pfam Database. We chose to use the seed data, which is chosen to be representative of a given gene's sequences across the species in which that gene is known to be present.

A phylogenetic tree for CasA generated using Geneious and PhyML

Using these amino acid sequences, we constructed phylogenetic trees for these specific genes using Geneious with the PhyML extension. We then submitted the sequence and tree data to ProtTest, which we used to determine the most accurate protein substitution model for each gene. In each of our cases the chosen protein model was WAG. Using this data, we ran PAML with the Lazarus suite. We input the amino acid sequence information, phylogenetic tree, and substitution model in order to predict the sequence of the common ancestor of all of the species.
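To make the idea of inferring an ancestor from a tree and aligned sequences concrete, here is a minimal sketch using Fitch parsimony on a toy alignment. This is a deliberately simpler technique than the maximum-likelihood reconstruction performed by PAML with Lazarus, and it ignores the WAG substitution model and branch lengths entirely. The species names, sequences, tree shape, and function names are all hypothetical.

```python
def fitch_root_states(tree, column):
    """Bottom-up Fitch pass: set of most-parsimonious states at the root.

    `tree` is a leaf name (str) or a pair (left, right) of subtrees.
    `column` maps each leaf name to its residue at one alignment position.
    """
    if isinstance(tree, str):
        return {column[tree]}
    left, right = tree
    a = fitch_root_states(left, column)
    b = fitch_root_states(right, column)
    # Keep the states both children agree on; otherwise keep all candidates.
    return (a & b) or (a | b)


def reconstruct_root(tree, alignment):
    """Predict the root sequence column by column; 'X' marks an ambiguous site."""
    length = len(next(iter(alignment.values())))
    if any(len(seq) != length for seq in alignment.values()):
        raise ValueError("all sequences must be aligned to the same length")
    root = []
    for i in range(length):
        column = {name: seq[i] for name, seq in alignment.items()}
        states = fitch_root_states(tree, column)
        root.append(next(iter(states)) if len(states) == 1 else "X")
    return "".join(root)


# Hypothetical four-species toy data (not Pfam seed sequences).
toy_tree = (("sp1", "sp2"), ("sp3", "sp4"))
toy_alignment = {"sp1": "MKV", "sp2": "MKL", "sp3": "MRV", "sp4": "MKV"}
predicted_ancestor = reconstruct_root(toy_tree, toy_alignment)
```

Unlike this sketch, a likelihood-based method weighs each candidate residue by the substitution model and the branch lengths of the tree, which is why the model selection step with ProtTest matters.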

For our real-world evolutionary testing, we altered the process slightly. Dr. Rich Lenski’s lab provided us with whole-genome nucleotide sequence data for their different strains of E. coli.

Because we were examining the entire genome (~4.6 million bases), using a pairwise aligner was not an option. We instead used Geneious with the Mauve extension, which is designed to accurately align entire genomes. We then used the PhyML extension to generate phylogenetic trees of these strains based on their mutations. We took the aligned nucleotide data and phylogenetic trees as input for PAML with Lazarus. Because this was nucleotide data instead of protein data, we used a nucleotide substitution model. We have generated data using the Jukes-Cantor (1969) model. We are also generating data using the HKY85 model for comparison. We expect similar results.
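For reference, the Jukes-Cantor model assumes equal base frequencies and a single substitution rate, which gives a closed-form correction from the observed proportion of differing sites to the estimated number of substitutions per site. The snippet below is a minimal sketch of that correction only, not of the PAML/Lazarus run; the function name and the example numbers are illustrative assumptions.

```python
import math


def jc69_distance(p):
    """Estimated substitutions per site under JC69, given the observed
    proportion p of differing sites between two aligned sequences."""
    if not 0.0 <= p < 0.75:
        raise ValueError("JC69 distance is only defined for 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


# Hypothetical example: 10 differences across 1,000,000 aligned sites.
p_observed = 10 / 1_000_000
d = jc69_distance(p_observed)
```

When sequences have diverged very little, the corrected distance is almost identical to the raw proportion of differences, which is one reason a simple and a richer nucleotide model can be expected to give similar results on closely related genomes.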

Results:

In the case of the Lenski Lab data, we have the actual ancestral sequence from generation 0. We compared the predicted ancestral sequences to the real ancestral sequence as a test of our methods. The predicted sequences were over 90% pairwise identical to the real ancestor. As a basic control, we also generated consensus sequences from the same sequences we used to generate our ancestral predictions. The consensus data was 99.9% identical to the real ancestral sequence.
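The two measurements used here, pairwise identity and a consensus sequence, can be sketched in a few lines. This is an illustrative sketch on made-up toy sequences, not the Geneious workflow we ran; the function names and data are assumptions, and a simple majority rule is assumed for the consensus.

```python
from collections import Counter


def percent_identity(a, b):
    """Percentage of aligned columns in which two sequences match."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        raise ValueError("sequences must not be empty")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


def majority_consensus(seqs):
    """Most common character in each alignment column
    (ties go to the character seen first in that column)."""
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


# Hypothetical toy data: a known ancestor and three evolved descendants,
# two of which each carry one private mutation.
ancestor = "ACGTACGTAC"
evolved = ["ACGTACGTAC", "ACGAACGTAC", "ACGTACGTTC"]

consensus = majority_consensus(evolved)
consensus_identity = percent_identity(consensus, ancestor)
single_strain_identity = percent_identity(evolved[1], ancestor)
```

In this toy case each mutation appears in only one descendant, so the majority vote at every column recovers the ancestral base.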

Discussion:

The preliminary results suggest that the method performs reasonably well: over 90% identity with the real ancestor is a high mark. However, the consensus control came closer to the real ancestor than the predicted ancestral sequence did.

There are several possible reasons for this result. It may indicate a problem with the substitution models that we chose; for this reason, we are continuing to rerun our data using multiple substitution models for comparison. It may reflect a limitation of the Lazarus suite when working over short evolutionary periods: the software was designed to predict ancestral sequences of species that have diverged over thousands of years, so a 20-year experiment may be too short a timeline. It may also indicate a more general shortcoming in the Lazarus software.

Future Aims:

We would like to continue and expand our research into the field of ancestral reconstruction. In the future, we plan to test multiple substitution models within Lazarus to determine whether there is a significant discrepancy in accuracy between them. We also plan to test alternative programs designed for ancestral gene reconstruction. Additionally, we would like to find a scenario where we can test these programs over a longer evolutionary period with known ancestral data. This may demonstrate the strengths and weaknesses of these specific algorithms.

BioBricks

BBa_K1218003 (Modern E. coli CRISPR CasA) CasA works in conjunction with the rest of the CASCADE complex as a part of the CRISPR system. Together, the pieces of CASCADE work to recognize, bind to, and degrade foreign DNA.

BBa_K1218004 (Ancestral CasA) CasA works in conjunction with the rest of the CASCADE complex as a part of the CRISPR system. Together, the pieces of CASCADE work to recognize, bind to, and degrade foreign DNA. This part is a predicted sequence that came from a common ancestor of all the species found in the Pfam seed data for CasA (PF09481).

BBa_K1218005 (Ancestral CysE) CysE is responsible for synthesizing cysteine. This part is a fusion of a predicted sequence that came from a common ancestor of all the species found in the Pfam seed data for CysE N-Terminal (PF06426) and the rest of the gene from wild-type E. coli (K-12).

BBa_K1218006 (HisC E. coli) HisC is responsible for synthesizing histidine.

BBa_K1218007 (Ancestral HisC) HisC is responsible for synthesizing histidine. This part is a predicted sequence that came from a common ancestor of all the species found in the Pfam seed data for HisC (PF00155).

BBa_K1218008 (AroE E. coli) AroE encodes shikimate dehydrogenase, which is essential to the biosynthesis of aromatic amino acids.

BBa_K1218009 (CRISPR CasBCDE E. coli CRISPR) CasBCDE works in conjunction with CasA in the CASCADE complex as a part of the CRISPR system. Together, the pieces of CASCADE work to recognize, bind to, and degrade foreign DNA.

Acknowledgements

We would like to thank:

- Dr. Lynn Rothschild, Dr. Gary Wessel, Dr. Joe Shih, and Dr. Kosuke Fujishima

- Dr. Rich Lenski, the Lenski Lab, and Rohan Maddamsetti, our point of contact