Team:WHU-China/templates/standardpage modelingCas9
From 2013.igem.org
(Difference between revisions)
IgnatzZeng (Talk | contribs) |
IgnatzZeng (Talk | contribs) |
||
Line 125: | Line 125: | ||
<div style="text-align:center"> | <div style="text-align:center"> | ||
- | <img src="https://static.igem.org/mediawiki/2013/ | + | <img src="https://static.igem.org/mediawiki/2013/e/ea/WHU2013Refineca5.png" /></div> |
<center><em><b>Figure 2. Model prediction compared with data from Fig.2B of [4]</b></em></br></center></br> | <center><em><b>Figure 2. Model prediction compared with data from Fig.2B of [4]</b></em></br></center></br> | ||
<div style="text-align:center"> | <div style="text-align:center"> | ||
- | <img src="https://static.igem.org/mediawiki/2013/ | + | <img src="https://static.igem.org/mediawiki/2013/3/34/WHU2013Refineca6.png" /></div> |
<center><em><b>Figure 3. Model prediction compared with data from Fig.5CB of [3]</b></em></br></center></br> | <center><em><b>Figure 3. Model prediction compared with data from Fig.5CB of [3]</b></em></br></center></br> | ||
These data are collected from 1’ end truncation or consecutive mutation experiment of gRNA. In Both figure, as the column number grows, the end truncation/end mutations become more serious, and the total energy of DNA-gRNA binding drops. The prediction of the model is near-linear, but the data show great non-lineality. Obvious platforms formed in the 4-8 column of Fig.2 and column 3-9 of Fig.3, which suggest the gRNA-Cas9 complex is not sensitive for the energy loss cause by the continuous mismatch / truncation at these stage.</br></br> | These data are collected from 1’ end truncation or consecutive mutation experiment of gRNA. In Both figure, as the column number grows, the end truncation/end mutations become more serious, and the total energy of DNA-gRNA binding drops. The prediction of the model is near-linear, but the data show great non-lineality. Obvious platforms formed in the 4-8 column of Fig.2 and column 3-9 of Fig.3, which suggest the gRNA-Cas9 complex is not sensitive for the energy loss cause by the continuous mismatch / truncation at these stage.</br></br> |
Latest revision as of 01:27, 29 October 2013
1. Overview
For a pdf version of the tandem promoter modeling part,click here
This model aims at predicting the off-target rate of any Cas9-based system in vivo. It has the following key ideas.
First, The probability of Cas9-DNA binding is majorly determined by the affinity of the gRNA and DNA. A △G’ is assumed to indicate this affinity. The △G’ is determined by △G(i), which is calculated by NN nearest neighbor model of nucleic acid thermodynamics.
Second, by analyzing binding equilibrium, dCas9 inhibition data and aCas9 activation data, the model to predict the possibility of gRNA-d/aCas9 binding to certain target in vivo can be constructed. The fitting result of this model reveals the equation to calculate △G’ from △G(i).
Finally, By analyzing the Cas9 cleaving process, the link between Cas9-DNA binding probability and editing efficiency can be established.
Data from six papers were analysis and/or used for model fitting. The data for Cas9 editing model fitting is generously provided by Vikram Pattanayak and Prof. David Liu, who has published the paper - High-throughput profiling of off-target DNA cleavage reveals RNA- programmed Cas9 nuclease specificity - on Nature Biotechnology, 11 Aug 2013.[1] The data for Cas9 binding model derivation and fitting is extracted from the following figures, Fig 2C, S7B, S7E of [2], Fig 5C of [3], Fig 2AB of [4]. The software used to extract high fidelity data is GetData Graph Digitizer V2.22.
2. Symbol table, Assumption and reasons.
Symbol | |
[ ] | The symbol of concentration, i.e. [A] means the concentration of A |
△G’ | Difference in Modified Gibbs Free Energy. It’s assumed to determine the binding constant between gRNA-Cas9 and DNA |
△G(i) | The calculated △G for the ith position of gRNA-DNA interaction |
a | The input of △G’, a vector consist of △G’(1)-(21) |
b | The constant representing all interaction in the binding process other than the gRNA-DNA interaction. |
ω | The weight vector |
F() | Relation function |
Ka | Association constant of gRNA-Cas9 and DNA |
Kd | Dissociation constant of gRNA-Cas9 and DNA |
[A]0 | the concentration of certain sequence in the pre-selection library |
[A]tot | the concentration of all DNA sequence in the pre-selection library |
[C] | the concentration of certain sequence in the post-selection library |
[C]tot | the concentration of all DNA sequence in the post-selection library |
A’ | the number of certain sequence we sampled from the pre-selection library |
Atot’ | the number of all sequence we sampled from the pre-selection library |
P’ | the number of certain sequence we sampled from the post-selection library |
Ptot’ | the number of all sequence we sampled from the post-selection library |
θ | Cas9 targeting efficiency |
S | Substrate, DNA |
E | Enzyme, gRNA-Cas9 |
P | Product, double strands broken DNA |
A | The intact DNA duplex |
B | the DNA molecule in which one of the two strands has been cleaved at the recognition site for the restriction enzyme |
C | the DNA molecules in which both strands have been cleaved at the recognition site |
ka,kb | The two apparent first-order reaction constant of the two steps of cleaving of Cas9 |
k1,k-1,kcat | Reaction constants |
KM | MM constant |
R | Gas constant |
T | Absolute temperature |
pb | Binding probability |
pc | Cutting probablity |
Abbreviation | |
dCas9 | Deactivated Cas9, Cas9 inhibitor, a Cas9 with two mutations D10A and H841A |
aCas9 | Cas9 activator, a dCas9 that fused with a activator domain like VP64, TAL and omega subunit of RNAP. |
d/aCas9 | Deactivated Cas9, no matter whether it’s an activator or inhibitor |
2. The model does not take the 3D structure of DNA, gRNA and DNA-gRNA complex into consideration. As the data is not sufficient to take these factor into consideration.
3. The model is based on NN nearest neighbor model of base pairing energy[5]. This model was built for thermodynamic energy calculation of DNA strand interaction. But we employ it to model the gRNA-DNA interaction. This will bring in some inherent flaws. The most prominent one will be when the RNA side is a U and the DNA side is a G. In the NN model it’s considered as a T-G pair, which is not as energetically favorable as U-G [6]. However, there is no model available for DNA-RNA interaction energy calculation yet. So it’s assumed that the energy (△G(i)) calculated from the NN model is to some degree consistent with reality. In fact, [7] suggested a rough sort of the tolerance of base mismatch: CC<UC<AG<AA<GA<CA<UG<CT<GG<UT<AC<GT, while the model suggested that CC<AC<TC<AA<TT<GA<GT<GG.
4. We believe by employing a better model of energy prediction, the whole Cas9 off-target model will be improved.
5. We assume △G’ takes up a form of . Where “a” is an 1×21 vector that contain △G(1) to △G(21) as its value, ω is the weight vector. Only the impact of DNA-gRNA interaction (“a”) is counting as a variable, and the △G contributed by other interaction(eg. protein-DNA interaction) are considered as a constant b. This is also why this model cannot predict Cas9 off-target rate of a target without PAM(NGG), which interact with Cas9 rather than gRNA. F() is the function that relate with △G’.
6. Both cleaving steps of Cas9 are assumed as classic Michaelis-Menten enzyme reaction.
7. The dCas9, aCas9 and normal Cas9 are assumed to share a same Ka with DNA, given they are guided by the same gRNA. This is reasonable as the only changes in Cas9 are D10A and H841A.
3. Modeling result
We employ a NN nearest neighbor model to calculate the △G(i) between gRNA and DNA on each NN position. From the first nucleotide of the target area of gRNA to the 20th, △G(i) of totally 21 position are calculated. We first proved the feasibility of our idea by calculating the correlation between △G(i) and cutting efficiency (employing data from [1] CLTA1,2,3).
The result shows that roughly the closer the position to PAM the larger the correlation. This discovery is consistent with previous studies [1,2,3,4,7,8]. Therefore we confirm that △G(i) do influence the targeting efficiency of Cas9.
But the data from [1,2,3,4,8] also revealed that △G(i) is not proportional to targeting efficiency. In most high single mismatch tolerance cases, the correlation between △G(i) and targeting efficiency is not significant. The following table can be concluded.
Sequence | Single Mismatch tolerance | G/C | Ref |
TCATGCTGTTTCATATGATC | low | 7 | [4] |
AACTTTCAGTTTAGCGGUCU | low | 8 | [3] |
TGTGAAGAGCTTCACTGAGT | low | 9 | [1] |
GATGCCGTTCTTCTGCTTGT | low | 10 | [8] |
AGTCCTCATCTCCCTCAAGC | low | 10 | [1] |
GAGATGATCGCCCCTTCTTC | low | 11 | [2] |
CTCCCTCAAGCAGGCCCCGC | low | 15 | [1] |
Ave. G/C 10 | |||
GCAGATGTAGTGTTTCCACA | high | 9 | [1] |
GGTGGTGCAGATGAACTTCA | high | 10 | [8] |
GGGGCCACTAGGGACAGGAT | high | 13 | [2] |
GTCCCCTCCACCCCACAGTG | high | 14 | [2] |
GGGCACGGGCAGCTTGCCGG | high | 16 | [8] |
Ave. G/C 12.4 |
The mismatch tolerance is roughly determined from the data of the references, for details(pdf version) please click here. A low tolerance sequence with single mismatch on at least 7 positions has significant performance drop. A high tolerance sequence with single mismatch at more than 16 position can perform as well as the original sequence in guiding Cas9.
The relationship of G/C frequence with single mismatch tolerance can be explained by the fact that the abundance in G/C make the gRNA binds to DNA more stable, and single mismatch is not strong enough to disturb the binding. This suggests that the F() may be a sigmoid function. But to determine this sigmoid function ( and to determine b) requires more specific experiment data of d/aCas9 binding kinetics, which is not available.
This guess is also supported by later analysis in the comparison of the Cas9 binding model, which assumed a normal proportional relationship between △G(i) and △G’. The results shows that the Cas9-gRNA not only is not sensitive to energy change in DNA-gRNA binding when △G’ surpass some threshold, but also not sensitive to such energy flux when △G’ is lower than some threshold.
A rough ω is calculated from Fig S7C and Fig.5D of [3] (both cases are intolerant to single mismatch).
The performance of these parameters and the model are checked by compared the predicted value with the data from Fig.2B of [4] and Fig.5C of [3]. Noticed that all the following figure using data normalize by the activity of “wildtype” gRNA, thus the b is not required for the prediction.
These data are collected from 1’ end truncation or consecutive mutation experiment of gRNA. In Both figure, as the column number grows, the end truncation/end mutations become more serious, and the total energy of DNA-gRNA binding drops. The prediction of the model is near-linear, but the data show great non-lineality. Obvious platforms formed in the 4-8 column of Fig.2 and column 3-9 of Fig.3, which suggest the gRNA-Cas9 complex is not sensitive for the energy loss cause by the continuous mismatch / truncation at these stage.
On the las part of the model. Kinetic analysis reveals that both concentration and reaction time are important for off-target control.
The curves displaying changes of two different cleaved products. Boundary conditions were set as [A0]=1.0, [B0]=[C0]=0, ka=0.2 min-1,kb=0.1 min-1 for red line; And [A0]=1.0, [B0]=[C0]=0, ka=0.1 min-1,kb=0.05 min-1 for blue line.
4. Model derivation
4.1. Calculation of △G’ of DNA-gRNA binding
The calculation method of △G(i) and △G’ is modified from the NN nearest neighbor model introduced in [2].
Step1. Set up the binding sequence
The input will be the 21nt of the target prior to the GG of the PAM, and the corresponding 21nt of the potential off-target sequence. The reason for why we need a 21nt sequence rather than 20nt is that the NN model using the adjacent 2nt as inputs. In order to completely consider the impact of the 20nt targeting sequence of gRNA, we need to consider the 21st base to make the calculation comprehensive. Hereby we explain our way to process inputs using an example. Mismatch base pairs are highlighted in red.
Example:
Target sequence and gRNA sequence: ATCG.............CCGG (20nt)
Possible off-target sequence: ACCG.............CGGG (20nt)
Change the off-target sequence to its complementary sequence:
TGGC.............GCCC (20nt)
The binding double strand will be:
ATCG.............CCGG (G) gRNA
TGGC.............GCCC (A) potential off-target DNA
The base in brackets is the 21st nucleotide on each chain. Notice that the 21st nucleotide of gRNA is always G.
We then divide the chains into the following form
Step2. Terminal energy calculation
Determine the first mismatch from both direction of the chain. If the mismatch happen within 2nt from end (i.e. at position 1,2), consider the corresponding end as an dangling end. There is a reason for only consider terminal mismatch and dangling end effect on the “1’ end”. These terminal stabilizing effect originate from the fact that if the two chains are not suitable to bind at the terminal, they can simply not bind in the classic way, which is energetically unfavorable, but just floating around. But the Cas9 is “grasping” at the “20’ end” of gRNA and DNA binding, as the protein needs to anchor on the PAM immediately following the “20’ end”. This spatial constraint make the gRNA and DNA has no other way but the “normal” way of binding. Therefore, we use the energetically unfavorable single mismatch table(in later steps) to calculate the energy here, rather than the relatively more stabilizing dangling end table.
Therefore, in the example, we consider the left end as a dangling end, the right end as a normal end.
If a dangling end is determined, determine the first match position following the mismatch position. In the example, this will be position 2 (△G(2)). Set all dangling end position energy as 0, i.e. △G(1)=0, and calculate the first match according to Table 3, i.e. △G(2)=5’TC/G+3’GG/C=-0.58-0.44=-1.02 kcal/mol,
If no dangling end appears. Determine whether the terminal pair is A-T. If yes, add a terminal AT penalty(+0.05) to the △G(i), and calculate all △G(i) according to Table 3.
Step3. Internal energy calculation
Calculate all position except for first match and dangling end position according to Table 4, in our example, this set contains △G(3) to △G(19).
The result will be
△G(3)=-2.17 kcal/mol
△G(17)=0.70 kcal/mol
△G(18)=0.70 kcal/mol
△G(19)=-1.84 kcal/mol
Step4. Further analysis of internal loops and bulges.
We will complete this step in the future. For the model V1.0, the algorithm will skip this step.
Step5. Adjust △G(i) according to ion concentration
Empirical salt correction equations have been derived,
where N is the total number of phosphates in the duplex, and [Na+] is the total concentration of monovalent cations from all sources (the same equation works for sodium, potassium, and ammonium )over a range of monovalent concentration of 0.05 to1M.
Step6. Calculate △G’ We assume △G’ takes up a form of . Where “a” is an 1×19 vector that contain △G(1) to △G(19) as its value, ω is the weight vector. Only the impact of DNA-gRNA interaction (“a”) is counting as a variable, and the △G contributed by other interaction(eg. protein-DNA interaction) are considered as a constant b. This is also why this model cannot predict Cas9 off-target rate of a target without PAM(NGG), which interact with Cas9 rather than gRNA. (Assumption 4)
According to the previous steps, the △G’of our example should be
4.2. Correlations between △G’ and Cas9 targeting efficiency
Vikram Pattanayak et al. used in vitro selection and high-throughput sequencing to determine the propensity of eight guide-RNA:Cas9 complexes to cleave each of 1012 potential off-target DNA sequences. This size is sufficiently large to include tenfold coverage of all sequences with eight or fewer mutations relative to each 22-base-pair target sequence.
The DNA of target sequences and their corresponding potential off-target sites were produced as substrates by PCR and rolling circle amplification. The abundance of each kind of sequence in the pre-selection library will differ from their abundance in post-selection library. This abundance changes reveals the relative targeting efficiency of the Cas9 on certain target.
Let us define the following variant.
[A]0 is the concentration of certain sequence in the pre-selection library,
[A]tot is the concentration of all DNA sequence in the post-selection library
[C] is the concentration of certain sequence in the post-selection library,
[C]tot is the concentration of certain sequence in the post-selection library
A’ is the number of certain sequence we sampled from the pre-selection library
Atot’ is the number of all sequence we sampled from the pre-selection library
P’ is the number of certain sequence we sampled from the post-selection library
Ptot’ is the number of all sequence we sampled from the post-selection library
So we have And the Cas9 targeting efficiency If △G’ really determine the probability of Cas9 digest certain DNA. There must be some kind of correlation between each △G(i) and the Cas9 targeting efficiency θ. We can calculate the △G’ and θ of all sequence contained in the library, and calculate the Pearson's product-moment coefficient of △G’ and θ. So we analyzed CLTA1,2,3 one-mutation pre-selection library and “v2.1 gRNA 100nM Cas9” post-selection library, and get Figure 1.
4.3. Derivation of Cas9 binding model, for off targe prediction of d/aCas9
Cas9 must first binds to DNA to cut them. For d/aCas9,
First, we link △G’ with [S], [E] and [ES] through
These equation can reach steady state quickly when compared with the time scale of any in vivo or in vitro experiment. Because according to the data from [3], Cas9-DNA binding can achieve equilibrium within 0~3 min. The reasoning is as follow.
Notice that, on Figure 5, the RFP started to decrease exponentially 10min after the adding of inducer. This is only possible, when v[mRNA] is hold as a constant. So d[mRNA]/dt=0, which means [TF] is a constant. In this equation, [TF] means the concentration of transcription factor that binding to the promoter, while dCas9 is the only transcription factor in this experiment. According to table 2.1 and 2.2 in [9], the typical mRNA lifetime in E.coli is 2-5 min, the time for protein (Cas9) transcription and translation is 5 min. So the Cas9-DNA binding can achieve equilibrium within (10-5-5~10-5-2) 0~3 min in vivo. So the time needed to achieve equilibrium is much shorter than the experiment time-scale both in vivo and in vitro.
So we can consider the equations are in steady state.
In order to predict the off-target rate of d/aCas9. Following equation can be derived. At equilibrium,
This equation can also be employed to calculate the best enzyme concentration of gRNA-Cas9 for an ideal balance between regulation and off-target.
4.4. Derivation of Cas9 cutting model, for off targe prediction of Cas9<>
Cas9 contains two nuclease domain - a RuvC-like domain and a HNH motif flanked by two RuvC-like domains. Each of them responsible for cutting one of the two nucleotide chains[11]. The kinetic of endonuclease catalyzed DNA double strand break is very complex. But fortunately, experiments have showed that most double strand break process can be approximated by a consecutive first-order reaction as below[12,13,14]. RuvC itself also show enzymatic activity consistent with first-order reactions based prediction[15].
In order to link the apparent first-order rate constant to △G’. We assume both steps of cleaving is classic enzymatic reaction as follow.
One can derive the concentration-time function of C following enzyme kinetic equations. The equation will be like following (derivation details in addendum):
So in addition to control the concentration of Cas9, control the expressing time of Cas9 is also important for off-target rate control. Cas9’s expression can be stoped as soon as possible when acceptable theoretical editing rate is reached, in order to reduce off-target rate.
5. Discussion
There may be three reason for the correlation variation throughout output 1-19.
First, the Cas9 has a mismatch tolerance for the 5’ end of gRNA. This is backed by all studies [1,2,3,4,7,8].
Second, there are flaws in the calculation of terminal energy. As all terminal mismatch of RNA and most for DNA are stabilizing [5,6]. The NN model may fail to catch all these stabilizing effect. So improvement of the energy calculation rules may help to fix the negative correlation.
Three, the NN model is derived from the binding energy database of free binding DNA double strands, while we employed it to calculate Cas9 influenced RNA-DNA binding. We considered this as the prime source of error in our model. And it may contribute to the funny correlation valley in △G(8)~△G(12). Or, maybe the valley means that Cas9 is indeed relatively insensitive to energy changes in these position.
First, the Cas9 has a mismatch tolerance for the 5’ end of gRNA. This is backed by all studies [1,2,3,4,7,8].
Second, there are flaws in the calculation of terminal energy. As all terminal mismatch of RNA and most for DNA are stabilizing [5,6]. The NN model may fail to catch all these stabilizing effect. So improvement of the energy calculation rules may help to fix the negative correlation.
Three, the NN model is derived from the binding energy database of free binding DNA double strands, while we employed it to calculate Cas9 influenced RNA-DNA binding. We considered this as the prime source of error in our model. And it may contribute to the funny correlation valley in △G(8)~△G(12). Or, maybe the valley means that Cas9 is indeed relatively insensitive to energy changes in these position.
6. Addendum
But still, the MM equation remains valid. Because, first, under these conditions, [E] (free E concentration) doesn't change much, because most "enzymes" are in free form and they don't do anything; second, some time after enzyme and substrate are mixed the concentrations of free enzyme sites and of substrate complexed will reach a steady state.[17]
The DNA-Cas9 ratio is of the same order, so it’s reasonable to use the experimental data to predict the Cas9 behavior in vivo.
Reference
1. Pattanayak, Vikram, et al. "High-throughput profiling of off-target DNA cleavage reveals RNA-programmed Cas9 nuclease specificity." Nature biotechnology (2013).
2. Mali, Prashant, et al. "CAS9 transcriptional activators for target specificity screening and paired nickases for cooperative genome engineering." Nature biotechnology 31.9 (2013): 833-838.
3.Qi, Lei S., et al. "Repurposing CRISPR as an RNA-guided platform for sequence-specific control of gene expression." Cell 152.5 (2013): 1173-1183.
4.Bikard, David, et al. "Programmable repression and activation of bacterial gene expression using an engineered CRISPR-Cas system."Nucleic Acids Research (2013).
5.SantaLucia Jr, John, and Donald Hicks. "The thermodynamics of DNA structural motifs." Annu. Rev. Biophys. Biomol. Struct. 33 (2004): 415-440.
6. Mathews, David H., et al. "Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure." Journal of molecular biology 288.5 (1999): 911-940.
7. Fu, Yanfang, et al. "High-frequency off-target mutagenesis induced by CRISPR-Cas nucleases in human cells." Nature biotechnology 31.9 (2013): 822-826.
8.Hsu, Patrick D., et al. "DNA targeting specificity of RNA-guided Cas9 nucleases." Nature biotechnology 31.9 (2013): 827-832.
9. Alon, Uri. Introduction to Systems Biology: And the Design Principles of Biological Networks. Vol. 10. CRC press, 2007. Page 6.
10. Buchler, Nicolas E., Ulrich Gerland, and Terence Hwa. "Nonlinear protein degradation and the function of genetic circuits." Proceedings of the National Academy of Sciences of the United States of America 102.27 (2005): 9559-9564.
11. Jinek, Martin, et al. "A programmable dual-RNA–guided DNA endonuclease in adaptive bacterial immunity." Science 337.6096 (2012): 816-821.
12. Halford, Stephen E., Nicola P. Johnson, and John Grinsted. "The reactions of the EcoRi and other restriction endonucleases." Biochemistry. J 179 (1979): 353-365.
13. Halford, Stephen E., Nicola P. Johnson, and John Grinsted. "The EcoRI restriction endonuclease with bacteriophage lambda DNA. Kinetic studies."Biochemistry. J 191 (1980): 581-592.
14. Fogg, Jonathan M., et al. "Yeast resolving enzyme CCE1 makes sequential cleavages in DNA junctions within the lifetime of the complex." Biochemistry 39.14 (2000): 4082-4089.
15. Fogg, Jonathan M., and David MJ Lilley. "Ensuring productive resolution by the junction-resolving enzyme RuvC: large enhancement of the second-strand cleavage rate." Biochemistry 39.51 (2000): 16125-16134.
16. Alberts, Bruce. Molecular biology of the cell (4th edition). Garland Science, (2000): 191-234
17. Gutfreund, Herbert. Enzymes: physical principles. London: Wiley-interscience, 1972.
Thanks for
the consulting by Kenji Adzuma
the data processing by Yancheng Liu
the data collection by Lei Yang
the debugging by Yao Yang
Otherwise the model can never be done in time.