Team:Stanford-Brown/Projects/De-Extinction
Introduction
The Rosetta Stone connected a language well known in modern times to Ancient Egyptian, a mysterious language long forgotten, finally allowing researchers to better understand Ancient Egypt through its writings. Similarly, the De-Extinction Project seeks to use our knowledge today to decode messages from the past. We have used the Phylogenetic Analysis by Maximum Likelihood (PAML) bioinformatics software to predict and reconstruct ancestral genes. This provides a gateway to better understand early life on Earth.
Amino Acids
Nearly all life known today uses a standard 20 amino acid alphabet for protein production. However, seven of these would not have existed naturally before living organisms began to produce them. The genes CysE and HisC each code for a protein important for the production of one such amino acid, cysteine and histidine respectively. However, these proteins are themselves built with the amino acids that they make, creating a chicken-and-egg problem. By reconstructing these ancestral genes and comparing them to the extant genes, we hope to better understand the origin of this 20 amino acid alphabet.
CRISPR
Just as Latin is a dead language that still has a specific use in taxonomy, we realized that reconstructed ancestral proteins can be used to expand the range of modern functions. We decided to apply this by expanding our team's CRISPR project, which utilizes the prokaryotic immune system for specific recognition and delivery applications (see the "Intercellular" page). By reconstructing an ancestral gene that codes for CasA, the protein subunit responsible for identifying a specific DNA motif, we have expanded the number of available CRISPR "tools," allowing a variety of DNA motifs to be targeted. We are continuing to work on biochemical and functional assays to analyze this part's affinity characteristics and effectiveness.
Process Evaluation
The De-Extinction projects are based on the idea that we can accurately predict ancestral gene sequences. To test our methods, we are collaborating with Dr. Rich Lenski at Michigan State University, who has conducted an evolutionary experiment on E. coli for the past 25 years. His data span 50,000 generations, and we are using the most recent sequence data from his experiment to test the accuracy of our ancestral predictions.
Protocols
Our team has compiled our protocols here.
Lab Notebook
The De-Extinction team recorded experiments and progress throughout the research period in the De-Extinction Lab Notebook.
Data
Molecular Clock:
Using data from Dr. Lenski's lab, we can estimate the mutation rate for our genes. Using 12 strains and 25 years of data, we found that each of our target genes demonstrates approximately one mutation every 150 years. This is almost exactly the average mutation rate expected for a one-kilobase-pair gene in E. coli. (Calculation: E. coli have 0.001 mutations in the genome per generation. With 20-minute generations and a gene that is 1/4,600 of the genome, you would expect 0.0057 mutations per year in the gene. The data from the Lenski lab demonstrate 0.0067 mutations per year.) The nucleotide sequence of our ancestral CasA is only 54.3% identical to the extant sequence, meaning 815 nucleotides have been mutated. At one mutation every 150 years, our ancestral gene must be predicting a sequence from at least 120,000 years ago. This calculation does not account for multiple mutations at the same location, which are likely because these non-conserved regions are less important to functionality, so the actual age of the sequence could be millions, or even billions, of years.
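The arithmetic in the calculation above can be reproduced with a short script. This is only a sketch that plugs in the figures quoted in the text; the constant and function names are our own illustrative choices and do not come from the Lenski data set.

```python
# Worked version of the molecular-clock arithmetic quoted above.
# Every number is taken from the text; the names are illustrative only.

MUTATIONS_PER_GENOME_PER_GENERATION = 0.001
GENERATION_MINUTES = 20
GENE_FRACTION_OF_GENOME = 1 / 4600   # a ~1 kb gene in the E. coli genome
OBSERVED_YEARS_PER_MUTATION = 150    # estimate from the Lenski lab data
MUTATED_NUCLEOTIDES = 815            # ancestral vs. extant CasA


def expected_mutations_per_gene_per_year() -> float:
    """Expected mutations per year in one gene, from the per-genome rate."""
    generations_per_year = 365 * 24 * 60 / GENERATION_MINUTES
    per_genome_per_year = MUTATIONS_PER_GENOME_PER_GENERATION * generations_per_year
    return per_genome_per_year * GENE_FRACTION_OF_GENOME


def minimum_age_years() -> int:
    """Lower bound on sequence age, assuming one mutation per differing site."""
    return MUTATED_NUCLEOTIDES * OBSERVED_YEARS_PER_MUTATION


if __name__ == "__main__":
    print(f"expected rate: {expected_mutations_per_gene_per_year():.4f} mutations/year")
    print(f"observed rate: {1 / OBSERVED_YEARS_PER_MUTATION:.4f} mutations/year")
    print(f"minimum age:   {minimum_age_years():,} years")
```

The minimum age is a lower bound because it counts each differing site as exactly one mutation.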
Conserved Regions:
The extant K12 E. coli HisC and CysE amino acid sequences are 39.1% and 53.3% similar to our ancestral HisC and CysE sequences, respectively. The amino acid sequences of the extant DH5alpha E. coli casA gene and our ancestral casA gene show 31.1% similarity. The conserved regions of these proteins are not randomly scattered throughout the protein. Instead, the most conserved regions correspond to binding/interaction sites on the protein. As an extension of this observation, we used Foldit software and Thermus thermophilus as a template to make a physical model of our ancestral casA to physically observe the conserved active sites.
We also noticed that, in both HisC and CysE, the number of histidine and cysteine residues, respectively, was smaller in the ancestral proteins. Only one of each residue was conserved between the respective ancestral and extant proteins. This supports the theory that these proteins developed first and started making cysteine and histidine before they mutated to incorporate them, a possible solution to the chicken-and-egg problem raised in the De-Extinction introduction.
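A residue comparison of this kind could be sketched as follows. The two fragments are invented toy sequences, not the real HisC or CysE sequences, and the helper names are hypothetical. The sketch assumes the ancestral and extant proteins have already been aligned to equal length with `-` marking gaps.

```python
from typing import List


def count_residue(protein: str, residue: str) -> int:
    """Count occurrences of a one-letter amino acid code, ignoring gaps."""
    return protein.replace("-", "").upper().count(residue.upper())


def conserved_positions(aligned_a: str, aligned_b: str, residue: str) -> List[int]:
    """Return 0-based alignment columns where both sequences carry the residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    residue = residue.upper()
    return [
        i
        for i, (a, b) in enumerate(zip(aligned_a.upper(), aligned_b.upper()))
        if a == residue and b == residue
    ]


# Toy aligned fragments for illustration only (NOT real sequence data).
toy_ancestral = "MKA-LHTGE"
toy_extant = "MHAHLHTGE"

if __name__ == "__main__":
    print("H in ancestral:", count_residue(toy_ancestral, "H"))
    print("H in extant:   ", count_residue(toy_extant, "H"))
    print("conserved H columns:", conserved_positions(toy_ancestral, toy_extant, "H"))
```

In the toy pair, the ancestral fragment carries fewer histidines than the extant one and a single histidine column is shared, which mirrors the pattern described above.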
Note: For sequence data, refer to the Biobrick pages in the Registry.
Wet Lab Testing:
Computers, modeling, and bioinformatics are fun for sure, but we were eager to test our genes out as soon as they were Biobricks. We used the pUC19 expression vector to express the amino acid production genes we had created or isolated, in HisC and CysE knockout cells. We grew these cells on plates of M9 minimal media, so that no amino acids could be picked up from the media. This means that the cells would only survive if they picked up the gene to produce the essential amino acid they could no longer make themselves. We Biobricked the extant HisC gene to use as a positive control for our ancestral HisC tests, and for our ancestral CysE, we used the Biobrick from Trento's 2012 team as a positive control. We used an empty pUC19 plasmid as the negative control for both. Below you can see a zoomed-in image of these plates. The colonies on these plates were sequenced to confirm that our ancestral genes successfully rescued their respective cells. It may not be a zombie dinosaur, but we have successfully brought to life something that was extinct long ago!
Negative HisC control plated on M9 minimal media (exhibits a few misshapen colonies)
Negative CysE control plated on M9 minimal media (exhibits no colonies)
Positive HisC control plated on M9 minimal media (exhibits many regularly spaced, healthy colonies)
Evolutionary Modeling
Background:
Our De-Extinction project was based on the concept of using existing protein and nucleotide information to generate ancestral sequence data. This data would then be synthesized in real life for testing. As a final part of our project, we wanted to evaluate the accuracy of these methods through modeling a known ancestral genome. In order to do this, we collaborated with Dr. Rich Lenski’s lab at Michigan State University.
Dr. Lenski has been running a long-term evolutionary experiment on E. coli for the past 25 years. To test our methods, we gathered recent data from his different strains. These data came from 40,000 generations after the start of the experiment. Because this is not a very long time in evolutionary terms, we focused on mutations across the entire genome, rather than individual significant genes.
Methods:
For our Amino Acid and CRISPR related sequences, we began with amino acid sequence data available from the Pfam (Protein family) Database. We chose to use the seed data, which is representative of the currently known and sequenced population with each specific protein.
Using these amino acid sequences, we constructed phylogenetic trees for these specific genes using Geneious with the PhyML extension. We then submitted the sequence and tree data to ProtTest, which we used to determine the most accurate protein substitution model for each protein. In each of our cases the chosen protein model was WAG. Using this data, we ran PAML with the Lazarus suite. We input the amino acid sequence information, phylogenetic tree, and substitution model in order to predict the sequence of the common ancestor of all of the species.
For our real-world evolutionary testing, we altered the process slightly. Dr. Rich Lenski’s lab provided us with whole-genome nucleotide sequence data for their different strains of E. coli.
Because we were examining the entire genome (~4.6 million bases), using a pairwise aligner was not an option. We instead used Geneious with the Mauve extension, which is designed to accurately align entire genomes. We then used the PhyML extension to generate phylogenetic trees of these strains based on their mutations. We took the aligned nucleotide data and phylogenetic trees as input for PAML with Lazarus. Because this was nucleotide data instead of protein data, we used a nucleotide substitution model. We have generated data using the Jukes-Cantor 1969 (JC69) model. We are also generating data using the Hasegawa, Kishino and Yano 1985 (HKY85) model for comparison. We expect similar results.
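For readers unfamiliar with JC69, the model assumes equal base frequencies and a single substitution rate for every kind of change. The sketch below implements the standard textbook JC69 formulas. It is not PAML's or Lazarus's code, and the function names are our own. It shows how the model converts an observed fraction of differing sites into an estimated number of substitutions per site, allowing for repeated mutations at the same position.

```python
import math


def jc69_distance(p_observed: float) -> float:
    """Expected substitutions per site under JC69.

    p_observed is the fraction of aligned sites that differ between two
    sequences; the formula is d = -(3/4) * ln(1 - (4/3) * p).
    """
    if not 0.0 <= p_observed < 0.75:
        raise ValueError("JC69 distance is only defined for 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p_observed)


def jc69_prob_same(d: float) -> float:
    """Probability that a site shows the same base after d substitutions per site."""
    return 0.25 + 0.75 * math.exp(-4.0 * d / 3.0)


if __name__ == "__main__":
    # Example input: 54.3% identity, i.e. 45.7% of sites differ.
    p = 1.0 - 0.543
    d = jc69_distance(p)
    print(f"observed difference: {p:.3f}, JC69 distance: {d:.3f}")
```

The corrected distance is always at least as large as the observed difference, because some sites have mutated more than once.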
Results:
In the case of the Lenski Lab data, we have the actual ancestral sequence from generation 0. We compared the predicted ancestral sequences to the real ancestral sequence as a test of our methods. Most of the predicted sequences were over 90% pairwise identical to the real ancestor; the exceptions in the table below are LCB1 under JC69 with gaps removed (26.4%) and LCB2 with gaps replaced (71.4%). As a basic control, we also generated consensus sequences from the same strain sequences we used to generate our ancestral sequences. The consensus sequences were 98.4% to 100% identical to the real ancestral sequence.
Genome Segment | Model | Gaps | Percent Similarity to Ancestral Sequence |
---|---|---|---|
LCB1 | JC69 | Removed | 26.4% |
LCB1 | JC69 | Replaced | 98.6% |
LCB1 | HKY85 | Removed | 98.6% |
LCB1 | HKY85 | Replaced | 98.6% |
LCB1 | Consensus | n/a | 99.9% |
LCB2 | JC69 | Removed | 99.9% |
LCB2 | JC69 | Replaced | 71.4% |
LCB2 | HKY85 | Removed | 99.9% |
LCB2 | HKY85 | Replaced | 71.4% |
LCB2 | Consensus | n/a | 100% |
LCB3A | JC69 | Removed | 93.7% |
LCB3A | JC69 | Replaced | 94.1% |
LCB3A | Consensus | n/a | 98.4% |
LCB4 | JC69 | Removed | 99.9% |
LCB4 | JC69 | Replaced | 98.8% |
LCB4 | HKY85 | Removed | 99.9% |
LCB4 | HKY85 | Replaced | 98.8% |
LCB4 | Consensus | n/a | 99.9% |
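The two kinds of number in the table, percent identity to the real ancestor and a consensus control, can be illustrated with the sketch below. It is not the Geneious workflow we used. The function names and toy sequences are hypothetical. The `skip_gaps` option is only one assumed way of treating gap columns and does not necessarily correspond to the "Removed" and "Replaced" settings in the table.

```python
from collections import Counter
from typing import List


def percent_identity(a: str, b: str, skip_gaps: bool = True) -> float:
    """Percent of compared alignment columns in which a and b match."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    matches = compared = 0
    for x, y in zip(a.upper(), b.upper()):
        if skip_gaps and (x == "-" or y == "-"):
            continue  # assumption: ignore any column containing a gap
        compared += 1
        matches += x == y
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / compared


def majority_consensus(aligned: List[str]) -> str:
    """Most common character per alignment column (ties go to the first seen)."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("need at least one sequence, all of equal length")
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*aligned))


# Toy data for illustration only (NOT Lenski lab sequences).
toy_ancestor = "ACGTAC"
toy_strains = ["ACGTAC", "ACGTTC", "ACCTAC"]

if __name__ == "__main__":
    consensus = majority_consensus(toy_strains)
    print("consensus:", consensus)
    print("identity to ancestor:", percent_identity(consensus, toy_ancestor))
```

When each strain carries only a few private mutations, a majority vote recovers the ancestor almost perfectly, which is one reason the consensus control is hard to beat over such a short evolutionary period.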
Discussion:
The preliminary results show that the method performs well: over 90% identity with the real ancestor in most cases is a high mark. Compared to the consensus control, however, the predicted ancestral sequences are less accurate.
There are multiple reasons why this result may have occurred. It may indicate a problem with the substitution models that we chose. For this reason, we are continuing to rerun our data using multiple substitution models for comparison. It may indicate that the Lazarus suite does not perform up to expectations over short evolutionary periods: the software was designed to predict ancestral sequences of species that have diverged over thousands of years, so a 25-year experiment may be too short a timeline. It may also indicate a more general shortcoming in the Lazarus software.
Future Aims:
We would like to continue and expand our research into the field of ancestral reconstruction. In the future, we plan to test multiple substitution models within Lazarus to determine whether there is a significant discrepancy in accuracy between them. With access to greater computing power, we could run many more sequences and obtain more robust results. We also plan to evaluate alternative programs designed for ancestral gene reconstruction. Additionally, we would like to come up with a scenario where we can test these programs over a longer evolutionary period with known ancestral data. This may demonstrate the strengths and weaknesses of these specific algorithms.
This only scratches the surface of the evolutionary studies this technology offers. With more time, we could uncover more about the origin of cellular processes (e.g. carbon fixation). We could use cell modeling and ancestral reconstruction in conjunction to understand evolutionary pressures over time. Just like CRISPR, more biological tools could be diversified. The origin of life could become clear. Eventually, if the reconstruction is accurate enough, who is to say we could not use our own genomes to reconstruct a Neanderthal!?*
*Be sure to check out our ethics of DeExtinction paper in the Human Practices section!
BioBricks
[http://parts.igem.org/Part:BBa_K1218003 BBa_K1218003 (Modern E. coli CRISPR CasA)] CasA works in conjunction with the rest of the CASCADE complex as a part of the CRISPR system. Together, the pieces of CASCADE work to recognize, bind to, and degrade foreign DNA.
[http://parts.igem.org/Part:BBa_K1218004 BBa_K1218004 (Ancestral CasA)] CasA works in conjunction with the rest of the CASCADE complex as a part of the CRISPR system. Together, the pieces of CASCADE work to recognize, bind to, and degrade foreign DNA. This part is a predicted sequence that came from a common ancestor of all the species found in the Pfam seed data for CasA (PF09481).
[http://parts.igem.org/Part:BBa_K1218005 BBa_K1218005 (Ancestral CysE)] CysE is responsible for synthesizing cysteine. This part is a fusion of a predicted sequence that came from a common ancestor of all the species found in the Pfam seed data for CysE N-Terminal (PF06426) and the rest of the gene from wild-type E. coli (K12).
[http://parts.igem.org/Part:BBa_K1218006 BBa_K1218006 (HisC E. coli)] HisC is responsible for synthesizing histidine.
[http://parts.igem.org/Part:BBa_K1218007 BBa_K1218007 (Ancestral HisC)] HisC is responsible for synthesizing histidine. This part is a predicted sequence that came from a common ancestor of all the species found in the Pfam seed data for HisC (PF00155).
[http://parts.igem.org/Part:BBa_K1218008 BBa_K1218008 (AroE E. coli)] AroE codes for shikimate dehydrogenase, which is essential to the production of aromatic amino acids.
[http://parts.igem.org/Part:BBa_K1218009 BBa_K1218009 (CRISPR CasBCDE E. coli CRISPR)] CasBCDE works in conjunction with CasA in the CASCADE complex as a part of the CRISPR system. Together, the pieces of CASCADE work to recognize, bind to, and degrade foreign DNA.
Acknowledgements
We would like to thank:
- Dr. Lynn Rothschild, Dr. Gary Wessel, Dr. Joe Shih, and Dr. Kosuke Fujisima
- Dr. Rich Lenski, the Lenski Lab, and Rohan Maddamsetti, our point of contact