DNA methylation and demethylation
Learn about DNA methylation (5mC) and the mechanisms of DNA demethylation and techniques used to map DNA modifications 5mC, 5hmC, 5fC, and 5caC.
DNA methylation
Throughout DNA, chemical modifications add a layer of regulation to the expression of genes encoded within the DNA sequence. The most well-studied of these chemical modifications is 5-methylcytosine (5mC), a modification most commonly recognized as a stable, repressive regulator of gene expression. Methylated cytosine makes up approximately 1% of the human genome, making it the most abundant and widespread DNA modification (Moore et al 2012). There are several methods available to sequence 5mC throughout the genome, each with pros and cons that we will discuss later in this guide. These methods include high-resolution approaches, such as whole-genome bisulfite sequencing, and antibody-dependent approaches, such as methylated DNA immunoprecipitation (MeDIP), a form of DNA immunoprecipitation (DIP).
5mC was initially discovered to reside within CpG islands – stretches of DNA commonly found within promoter regions enriched in CpG dinucleotides. It is within these promoter regions that 5mC acts as a stable epigenetic mark repressing gene transcription. Within the mammalian genome, methylated cytosine is initially incorporated into the DNA during early development by the de novo methyltransferase enzymes DNMT3a and DNMT3b (Okano et al 1999). These methylation marks are maintained throughout the genome by an additional methyltransferase, DNMT1, which copies DNA methylation patterns to daughter strands during DNA replication (Vertino et al 1996).
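As a rough illustration of what "enriched in CpG dinucleotides" means in practice, the sketch below computes the GC fraction and the observed/expected CpG ratio of a sequence. The function name `cpg_stats` is hypothetical, and the thresholds mentioned in the comment (GC fraction above 0.5, observed/expected above 0.6, length of at least 200 bp) are commonly used CpG island criteria rather than values taken from this guide.

```python
def cpg_stats(seq: str) -> dict:
    """Return GC fraction and CpG observed/expected ratio for a DNA sequence.

    Commonly used CpG island criteria (an assumption, not from this guide):
    length >= 200 bp, GC fraction > 0.5, observed/expected CpG > 0.6.
    """
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    # "CG" cannot overlap itself, so a plain substring count is exact
    cpg = seq.count("CG")
    gc_fraction = (c + g) / n if n else 0.0
    # Expected CpG count if C and G were placed independently
    expected = (c * g) / n if n else 0.0
    obs_exp = cpg / expected if expected else 0.0
    return {"gc_fraction": gc_fraction, "cpg_count": cpg, "cpg_obs_exp": obs_exp}


if __name__ == "__main__":
    print(cpg_stats("CGCGCGCG"))
    print(cpg_stats("ATATATAT"))
```

In a real analysis you would slide a window of this kind along the genome rather than score a single short sequence.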
DNA demethylation: 5mC, 5hmC, 5fC, and 5caC
Today the notion of 5mC being an entirely stable DNA modification is less concrete. Many methylated cytosines throughout the genome, particularly within gene bodies, undergo DNA demethylation, a process that ultimately converts 5mC back to an unmodified cytosine (C). DNA demethylation can occur in one of two ways: passive DNA demethylation, where methylated cytosine is diluted from the genome over successive rounds of replication in the absence of methylation maintenance enzymes, or active DNA demethylation, which involves the oxidation of 5mC by ten-eleven translocation (TET) enzymes into oxidized derivatives of 5mC (reviewed in Wu et al 2017).
Active DNA demethylation occurs in a cycle, starting with 5mC and finishing with an unmodified C. 5mC is initially oxidized to 5-hydroxymethylcytosine (5hmC), which is further oxidized to 5-formylcytosine (5fC), and finally to 5-carboxylcytosine (5caC). 5fC and 5caC can be removed from DNA by thymine DNA glycosylase (TDG) in combination with base excision repair (BER) to result in an unmodified C (figure 8). 5hmC, 5fC, and 5caC have been the focus of many recent epigenetic studies, and more is being discovered about these marks, including their potential to have stable epigenetic roles of their own. Many sequencing methods have been developed to distinguish these marks throughout the genome, including variations on MeDIP using 5hmC, 5fC, and 5caC antibodies, and variations on bisulfite sequencing such as TET-assisted bisulfite sequencing (TAB-seq). The differences between these methods will be discussed later in this guide.
Figure 8. The cycle of DNA demethylation. Oxidized derivatives of 5mC are removed either by thymine DNA glycosylase (TDG) coupled with base excision repair (BER), termed active modification–active removal (AM–AR), or by replication-dependent dilution of 5hmC, 5fC, or 5caC, termed active modification–passive dilution (AM–PD).
Bisulfite sequencing
It is not possible to detect 5mC using traditional DNA amplification approaches because the mark is not maintained during sample preparation and amplification. Bisulfite conversion is one of the most widely used approaches to convert DNA methylation marks into a suitable template for amplification and downstream analysis. Bisulfite conversion uses the treatment of DNA with NaOH and sodium bisulfite in a chemical reaction that converts cytosine bases into uracil (U), while methylated cytosines are protected from the conversion (figure 9).
During downstream analysis such as PCR or sequencing, unmethylated C bases that were deaminated to U in the bisulfite reaction are read as thymine (T), whereas 5mC bases remain unchanged and are still detected as C. This allows you to determine the locations in the genome containing methylated cytosine (Frommer et al., 1992).
Figure 9. Bisulfite conversion. Treatment of DNA with bisulfite (sulphonation) leads to the deamination of cytosine residues and converts them to uracil, while 5-methylcytosine residues remain the same.
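The conversion logic can be sketched in silico. The example below is a minimal illustration, not a real analysis pipeline: `bisulfite_convert` and `call_methylation` are hypothetical helper names, only a single strand is modeled, and conversion is assumed to be 100% efficient. Note that a C call only means the base was protected, so it could represent 5mC or 5hmC.

```python
from __future__ import annotations


def bisulfite_convert(seq: str, methylated_positions: set[int]) -> str:
    """Simulate how one bisulfite-treated strand reads after amplification.

    Unmethylated C is deaminated to U and read as T; a C whose 0-based index
    is in `methylated_positions` is protected and still reads as C.
    """
    out = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated_positions:
            out.append("T")
        else:
            out.append(base)
    return "".join(out)


def call_methylation(reference: str, converted_read: str) -> dict[int, bool]:
    """For every reference C, report True if the read still shows C (protected)."""
    calls = {}
    for i, (ref, obs) in enumerate(zip(reference.upper(), converted_read.upper())):
        if ref == "C" and obs in ("C", "T"):
            calls[i] = obs == "C"
    return calls


if __name__ == "__main__":
    reference = "ACGTCGCA"
    read = bisulfite_convert(reference, methylated_positions={1})
    print(read)
    print(call_methylation(reference, read))
```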
Bisulfite-based applications
Bisulfite conversion has become the basis for several variations and applications designed for high throughput applications or the investigation of broader, whole genome-scale regions.
Here are some examples of bisulfite-based methods.
Genome-wide DNA methylation analysis
-
Whole-genome bisulfite sequencing (WGBS; Lister et al 2009)
-
Applies next-generation sequencing (NGS) techniques to bisulfite-converted input samples.
-
WGBS produces single-base resolution DNA methylation maps that span the entire genome of an organism.
-
-
Reduced representation bisulfite sequencing (RRBS; Meissner et al., 2005)
-
Combines the single-base resolution of bisulfite conversion and the throughput of next-generation sequencing with restriction enzyme digestion (commonly MspI, which cuts at CCGG sites regardless of CpG methylation) and size selection to enrich samples for regions of high CpG content.
-
Effectively limits sequencing to only the regions of high interest, where DNA methylation exists.
-
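To see how a restriction digest enriches for CpG-rich regions, the sketch below performs an in silico digest and size selection. It assumes MspI (recognition site CCGG, cutting after the first C), an enzyme commonly used for RRBS, and a 40 to 220 bp size window; both the window and the helper names `mspi_fragments` and `size_select` are illustrative assumptions rather than parameters from this guide.

```python
import re


def mspi_fragments(seq: str) -> list:
    """Cut a sequence at every MspI site (C^CGG) and return the fragments."""
    seq = seq.upper()
    # The cut falls between the two Cs, one base after the start of each site
    cut_sites = [m.start() + 1 for m in re.finditer("CCGG", seq)]
    bounds = [0] + cut_sites + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]


def size_select(fragments, min_len=40, max_len=220):
    """Keep fragments within the size window (assumed 40 to 220 bp by default)."""
    return [f for f in fragments if min_len <= len(f) <= max_len]


if __name__ == "__main__":
    frags = mspi_fragments("AACCGGTTTTCCGGAA")
    print(frags)
    # A tiny window is used here only because the toy sequence is short
    print(size_select(frags, min_len=4, max_len=10))
```

Every retained fragment starts and ends at a CCGG site, so each one contributes at least two CpGs to the library, which is why the selected fraction is biased toward CpG-dense sequence.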
Targeted DNA methylation analysis
-
Methylation-specific PCR (MS-PCR; Herman et al., 1996)
- Applies PCR primers specific to bisulfite-converted DNA templates that are either methylated or unmethylated. The differential PCR amplification indicates if DNA methylation modifications are present.
-
Pyrosequencing (Colella et al., 2003; Tost et al., 2007)
- A sequencing-by-synthesis method that can interrogate bisulfite-converted DNA at a specific region of interest. The level of 5mC is determined from the ratio of C to T bases at an individual locus.
-
High resolution melting (HRM) analysis (Wojdacz and Dobrovic, 2007)
- Originally applied to SNP detection, but the process has also been adapted for DNA methylation analysis. The real-time PCR-based protocol measures the melting temperatures of PCR amplicons. The shift in melting temperature, which varies with C/T content, corresponds to the level of DNA methylation in the sample.
-
Methylation-sensitive single-nucleotide primer extension (Ms-SNuPE; Gonzalgo and Jones, 1997)
- Queries a CpG of interest by targeting bisulfite-specific primers to the sequence immediately preceding the CpG. Chain-terminating dideoxynucleotides allow DNA polymerase to extend the primer by a single base, which can then be measured quantitatively to determine the C/T content and therefore the DNA methylation status.
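Several of the targeted methods above come down to the same arithmetic: the methylation level at a locus is the fraction of C signal among all C plus T signal after bisulfite conversion. A minimal sketch follows, where `methylation_level` is a hypothetical helper and the inputs could be read counts or signal intensities, depending on the assay.

```python
def methylation_level(c_signal: float, t_signal: float) -> float:
    """Fraction of methylated molecules at one cytosine: C / (C + T).

    After bisulfite conversion, C signal comes from protected (methylated)
    cytosines and T signal from converted (unmethylated) cytosines.
    """
    if c_signal < 0 or t_signal < 0:
        raise ValueError("signals must be non-negative")
    total = c_signal + t_signal
    if total == 0:
        raise ValueError("no informative signal at this locus")
    return c_signal / total


if __name__ == "__main__":
    # 30 reads showing C and 70 showing T at the same CpG
    print(methylation_level(30, 70))
```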
Bisulfite conversion: technical considerations
Incomplete conversion
Bisulfite conversion is a very powerful method because it is relatively simple to perform, and it can deliver single-base resolution of DNA methylation status. However, the method does have some drawbacks: incomplete conversion (or on occasion, over-conversion) can occur under sub-optimal reaction conditions leading to insufficient DNA denaturation, or when the DNA strands re-anneal before completion of the reaction.
Distinguishing 5hmC
DNA degradation is often a byproduct of the harsh bisulfite conversion reaction conditions, which can make working with smaller samples challenging. Insufficient desulfonation will leave behind residues that can inhibit the DNA polymerases used in PCR. Evidence also indicates that bisulfite conversion does not distinguish between 5mC and 5hmC, as both are protected from conversion and read as C. Finally, because most cytosines are converted to uracil, bisulfite conversion lowers the overall complexity of the DNA sequence. This reduction in sequence complexity can complicate primer design for downstream PCR-based interrogation or introduce challenges when attempting to uniquely map sequencing reads to a reference genome.
DNA immunoprecipitation (DIP)
Another method commonly used to map the location of DNA methylation marks is DIP. DIP relies heavily on having antibodies capable of recognizing the DNA modifications of interest. However, once you have these, DIP is a straightforward and effective method. It is also considerably cheaper and easier to analyze than WGBS, which requires the whole genome to be sequenced, because DIP only requires sequencing of the small sheared DNA fragments pulled down in your IP step.
DIP has been successfully carried out for the most well-characterized DNA modifications: 5mC, 5hmC, 5fC, and 5caC (Pastor et al., 2011, Shen et al., 2013). It has been used in a range of samples, including embryonic stem (ES) cells, brain tissue, and zebrafish embryos. The method is similar to ChIP, but your starting material is raw genomic DNA with no chromatin required. This genomic DNA is sheared to approximately 150–300 bp and then heat denatured. Denaturation is essential, as the antibody can only access the modifications within denatured (open) DNA.
After DNA denaturation, the sheared DNA is incubated with the antibody recognizing your modification of interest, usually overnight. The samples then undergo an IP step to pull down all the DNA bound to the antibody, and any unbound DNA is washed away. We recommend using magnetic beads for this type of IP step. When you carry out DIP, it is important to treat your initial genomic DNA with RNase to remove any RNA from the samples.
Figure 10. DIP methodology. Genomic DNA is sheared, and immunoprecipitation is carried out using antibodies against your DNA modification. Pulldown DNA and input samples can then be used for qPCR, microarray, or NGS.
DIP-based applications
Genome-wide DIP analysis
-
DIP-sequencing (Pastor et al., 2011)
-
DIP is combined with NGS to map the location of DNA modifications across the whole genome.
-
Due to the conservation of these chemical structures between species, it has been easy for researchers to sequence their DNA modification of choice in any organism.
-
The library prep and analysis of DIP-sequencing is very similar to that of ChIP-sequencing.
-
The small DNA fragments pulled down in your IP are used in library prep, and these can be sequenced at a relatively low read depth compared to WGBS as you are more selective about what you sequence, ie only regions bound to your antibody.
-
Targeted DIP analysis
-
DIP-PCR (Pastor et al., 2011)
-
The pulldown DNA is obtained from your IP as described in the DIP-sequencing section above. However, instead of using the sheared DNA for library prep, you use it as template DNA in qPCR.
-
When you design primers for this type of DNA, remember that the template is genomic DNA and will contain both exons and introns. This method can be very effective for comparing levels of a modification across samples.
-
It can be tricky to compare levels of different modifications as many factors including antibody affinity can affect this.
-
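One common way to express DIP-qPCR results is as a percentage of input, borrowed from ChIP-qPCR practice. The sketch below is an assumption about how you might analyze your data rather than a protocol from this guide: `percent_input` is a hypothetical helper, `input_fraction` is the share of the starting DNA that was set aside as input (for example 0.1 for 10%), and amplification efficiency is assumed to be 100% (a doubling per cycle).

```python
import math


def percent_input(ct_ip: float, ct_input: float, input_fraction: float) -> float:
    """Percent of input recovered in the pulldown, from qPCR Ct values.

    The input Ct is first adjusted to what it would be if 100% of the
    starting material had been used, then compared with the IP Ct.
    Assumes perfect PCR efficiency (template doubles every cycle).
    """
    if not 0 < input_fraction <= 1:
        raise ValueError("input_fraction must be in (0, 1]")
    adjusted_input_ct = ct_input - math.log2(1 / input_fraction)
    return 100 * 2 ** (adjusted_input_ct - ct_ip)


if __name__ == "__main__":
    # IP and a 10% input sample cross threshold at the same cycle
    print(round(percent_input(ct_ip=25.0, ct_input=25.0, input_fraction=0.1), 2))
```

Because antibody affinity differs between modifications, percent-input values are best compared across samples for the same antibody rather than between antibodies.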
DIP: technical considerations
Shear your samples appropriately.
Unlike WGBS, DIP does not give single-base resolution. When you are shearing your DNA samples, it is important to get the DNA fragments to a size of 150–300 bp to improve the resolution of your DIP sequencing. Larger fragments mean you will inevitably pull down more DNA that flanks your DNA modification of interest but does not itself carry it. This results in broad, non-specific peaks in your sequencing analysis.
Get good antibodies.
Another problem with DIP is that you need to have an antibody specific for your modification of interest. You need to make sure there is minimal cross-reactivity with similar modifications. For example, if your 5fC antibody also recognizes 5hmC, it is not ideal for mapping the location of 5fC throughout the genome. The use of antibodies for this type of sequencing also has advantages, as you are limited only by the antibodies available to you. If you wanted to investigate a modification not previously characterized in DNA, eg m6A (more commonly associated with RNA), you could do so provided that you have a specific m6A antibody.
Alternate methods to capture 5hmC, 5fC, and 5caC
The biggest drawback of traditional bisulfite sequencing is that it cannot resolve the oxidized derivatives of 5mC: 5hmC is read as C and is therefore indistinguishable from 5mC, while 5fC and 5caC are read as T like unmodified cytosine. Fortunately, there have been many variations on bisulfite sequencing and some entirely new approaches to tackling the problem of sequencing 5hmC, 5fC, and 5caC. Here we look at some of these new methods in more detail.
5hmC mapping
-
Tet-assisted bisulfite sequencing (TAB-seq; Yu et al., 2012)
-
This method relies on the conversion of 5hmC into 5gmC. The addition of glucose in this glucosylation reaction acts to protect the 5hmC.
-
TET enzymes are then added to the genomic DNA to convert all 5mC and 5fC present into 5caC. After bisulfite conversion, the protected 5hmC is read as C.
-
5caC and unmethylated cytosines are all read as T. This method gives a clear differentiation between 5mC and 5hmC.
-
The drawbacks of this method are that the extensive C-to-T conversion can make it difficult to map the resulting reads, and that very deep sequencing is required for full coverage of the genome, which can make it more expensive than other methods.
-
-
Oxidative bisulfite sequencing (oxBS-seq; Booth et al., 2012)
-
This is another method for detecting 5hmC at single-base resolution. It uses potassium perruthenate (KRuO4) to chemically convert 5hmC to 5fC.
-
After this conversion, all 5mC remains unchanged. Subsequent bisulfite treatment and sequencing allows you to distinguish between 5mC and 5hmC sites by comparing the KRuO4 treated and untreated samples.
-
-
hMe-Seal (Song et al., 2011)
-
Similar to TAB-seq, hMe-Seal starts with the glucosylation of 5hmC to 5gmC, but the added glucose molecule is engineered to contain an azide group that can be chemically modified with biotin.
-
5hmC can then be enriched within the genome using the tight binding between biotin and streptavidin to carry out a pull-down for 5hmC.
-
-
Selective chemical labeling with exonuclease (SCL-exo; Sérandour et al., 2016)
-
The initial steps are the same as for hMe-Seal: glucosylation of 5hmC with azide-modified glucose.
-
Reaction of the azide with biotin allows the 5hmC present to be captured on streptavidin. In this method, however, the captured DNA then undergoes exonuclease digestion, which stalls at the biotin-5gmCs.
-
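For the paired-sample approach used by oxBS-seq, the comparison is a per-site subtraction. In a standard bisulfite (BS) library the fraction of C calls reflects 5mC plus 5hmC, whereas in the KRuO4-treated (oxBS) library it reflects 5mC alone, so the difference estimates 5hmC. The sketch below is a simplified illustration with a hypothetical helper name; it ignores sampling error, which real analyses handle statistically, and simply clamps negative differences to zero.

```python
def estimate_5mc_5hmc(bs_level: float, oxbs_level: float) -> tuple:
    """Estimate 5mC and 5hmC fractions at one site from paired BS and oxBS data.

    bs_level:   fraction of C calls in the bisulfite-only library (5mC + 5hmC)
    oxbs_level: fraction of C calls in the oxidized library (5mC only)
    """
    for name, value in (("bs_level", bs_level), ("oxbs_level", oxbs_level)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    five_mc = oxbs_level
    # Sampling noise can make oxBS exceed BS; treat that as no detectable 5hmC
    five_hmc = max(0.0, bs_level - oxbs_level)
    return five_mc, five_hmc


if __name__ == "__main__":
    print(estimate_5mc_5hmc(bs_level=0.8, oxbs_level=0.5))
```

TAB-seq does not need this subtraction for 5hmC, because only 5hmC is read as C in that protocol.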
5fC/5caC mapping
-
M.SssI methylase-assisted bisulfite sequencing (MAB-seq; Wu et al., 2014)
-
This method can quantitatively measure 5fC and 5caC at single-base resolution. This is achieved using M.SssI methyltransferase on your DNA to convert all unmodified cytosine to 5mC.
-
After bisulfite treatment, all newly methylated Cs, 5mC, and 5hmC will be read in the sequencing as C, but all the 5fC and 5caC in the genome will be read as T.
-
If you compare this to sequencing carried out without M.SssI treatment you can determine where the 5fC and 5caC modifications are within the genome.
-
The biggest problem with this method is that it doesn’t differentiate between 5fC and 5caC.
-
-
5fC chemically assisted bisulfite sequencing (fCAB-seq; Song et al., 2013)
-
This technique relies on the chemical protection of 5fC using O-ethylhydroxylamine (EtONH2).
-
This protection prevents bisulfite-mediated deamination of 5fC, and so this will appear as a C in the sequencing results (the same as 5mC and 5hmC).
-
When this is compared to a sample not treated with EtONH2 (where all 5fC modifications would be read as T), you can distinguish all the sites in the genome which have a 5fC.
-
-
5caC chemically assisted bisulfite sequencing (caCAB-seq; Lu et al., 2013)
-
caCAB-seq uses 1-ethyl-3-[3-dimethylaminopropyl] carbodiimide hydrochloride (EDC) to catalyze the formation of amide bonds at 5caC within the genome.
-
This chemical modification prevents deamination of 5caC after bisulfite conversion allowing for it to be distinguished from 5fC in the sequencing.
-
-
Chemical-labeling-enabled C-to-T conversion sequencing (CLEVER-seq; Zhu et al., 2017)
-
CLEVER-seq is not only single-base resolution but can also be used on single cells. It is just for sequencing 5fC distribution and not 5caC.
-
This method uses malononitrile to selectively label 5fC creating a 5fC-M adduct which is read as a T in the sequencing.
-
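The bisulfite-based methods above differ only in which cytosine forms end up being read as C and which as T. One way to keep track of this is a small lookup table, sketched below from the descriptions in this guide. The dictionary and helper names are hypothetical, complete conversion is assumed at every step, and the MAB-seq row applies to cytosines in a CpG context, since M.SssI is a CpG methyltransferase.

```python
# How each cytosine form is read after each treatment (idealized, 100% efficiency).
READOUT = {
    "BS-seq":    {"C": "T", "5mC": "C", "5hmC": "C", "5fC": "T", "5caC": "T"},
    "oxBS-seq":  {"C": "T", "5mC": "C", "5hmC": "T", "5fC": "T", "5caC": "T"},
    "TAB-seq":   {"C": "T", "5mC": "T", "5hmC": "C", "5fC": "T", "5caC": "T"},
    # MAB-seq: unmodified C is methylated by M.SssI first (CpG context only)
    "MAB-seq":   {"C": "C", "5mC": "C", "5hmC": "C", "5fC": "T", "5caC": "T"},
    "fCAB-seq":  {"C": "T", "5mC": "C", "5hmC": "C", "5fC": "C", "5caC": "T"},
    "caCAB-seq": {"C": "T", "5mC": "C", "5hmC": "C", "5fC": "T", "5caC": "C"},
}


def bases_read_as(method: str, call: str) -> set:
    """Cytosine forms that produce the given base call ('C' or 'T') in a method."""
    return {base for base, read in READOUT[method].items() if read == call}


def differing_bases(method_a: str, method_b: str) -> set:
    """Cytosine forms whose readout differs between two methods.

    These are the forms that a side-by-side comparison of the two
    libraries can reveal.
    """
    a, b = READOUT[method_a], READOUT[method_b]
    return {base for base in a if a[base] != b[base]}


if __name__ == "__main__":
    print(differing_bases("BS-seq", "oxBS-seq"))
    print(differing_bases("BS-seq", "fCAB-seq"))
    print(bases_read_as("MAB-seq", "T"))
```

Comparing standard bisulfite sequencing with each variant isolates one modification at a time, which is why most of these methods are run alongside an untreated bisulfite control.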
Comparison of DNA modification sequencing methods
It is important that you choose the best method for detecting DNA modifications that suit your needs. Consider things like whether you need single-base resolution, if you need to be able to quantify the absolute levels of the modification, and how feasible the method will be to use in your model system or sample type. Below you can find a table where we have summarized these key features for some of the available methods for sequencing 5hmC, 5fC, and 5caC.
Table 1: DNA modification sequencing methods
Liquid chromatography-tandem mass spectrometry (LC-MS/MS)
If you have access to LC-MS/MS, then this is the best way to quantify the amount of a DNA modification within total genomic DNA (Le et al 2011 and Fernandez et al., 2018). Using absolute quantification methods, LC-MS/MS gives you parallel quantification of all the DNA modifications found in total DNA from any organism and cell type (Zhang et al 2012). For absolute quantification, you are limited only by which isotopic standards you have available to measure your sample against.
Using this technique combined with DIP (DIP-MS) allows you to determine if your DNA modification antibody is binding to your modification of interest and it will also allow you to see if it binds any other non-specific modifications. If you generate LC-MS/MS data of your DIP input and pull-down samples, you should see an enrichment of your modification of interest in the pulldown sample compared to the input. You can also then check other modifications with these same data to see if anything else came out as enriched in your samples to test for non-specific antibody binding. There is software being developed now that can even help you with this type of analysis.
LC-MS/MS: technical considerations
-
Technically challenging:
- MS equipment is costly and very specialized. The machine itself will require an enormous amount of maintenance and often requires its own technician to keep on top of things. Operating the machine is complicated and requires specialized training, so it may be difficult to obtain this type of MS data on your own. Consider obtaining this data through collaborations or paid services if it is not feasible for you to purchase your own LC-MS/MS equipment.
DNA modification IHC/ICC
It is also possible to carry out IHC/ICC for DNA modifications. This can be done with a few simple additions to your standard IHC/ICC protocol. The most significant difference you will need to consider is that antibodies against DNA modifications cannot access and bind to the modification if it sits within double-stranded DNA. This means that you will need to denature the DNA making it single-stranded and accessible by the antibodies.
The most common method of DNA denaturation is to treat your samples with acid. This is usually 4N hydrochloric acid (HCl) applied directly to your IHC/ICC slides (Yamaguchi et al., 2013 and Kaefer et al., 2016). The best time to add this step to your protocol is before the addition of the primary antibody. Once you have permeabilized your cells or tissues with a detergent (eg PBS 0.1% Triton) you can wash and add 4N HCl to denature the DNA strands. It is important to thoroughly wash the acid off once the step is complete and neutralize the acid with an alkali (eg 100 µM NaOH in PBS). After the acid is washed and neutralized you can proceed with your usual IHC/ICC steps and add the primary antibody.
When carrying out an IHC/ICC for DNA modifications you should also be wary that your antibody may recognize very similar modifications on RNA (eg 5mC on DNA and m5C on RNA). To avoid this problem, you can treat your samples with an RNase step to remove all RNA present. Again, this step should be optimized as leaving your sample in RNase for too long can also cause damage to the DNA present.
DNA modification IHC/ICC: technical considerations
-
Time your acid step
-
It is crucial that you optimize the concentration and timings of the acid step before you start using your experimental samples.
-
Too long in an acid treatment will ruin the samples, but the timing needs to be long enough to denature the DNA fully.
-
Tissue samples will require the acid treatment for longer than cells used for ICC. You should try a range of timings from 10 minutes up to 40 minutes and see how your signal looks after this.
-
ICC should only need 5–10 minutes maximum but again, it is important to test this first and optimize correctly.
-
-
Double IHC/ICC.
-
It can be difficult to carry out double IHC/ICC with a DNA modification given the effects of the acid treatment step. The acid treatment may denature proteins present in the sample or degrade epitopes required for recognition by your second primary antibody.
-
If you want to carry out such a double immunoassay it will require careful optimization. Try to minimize the amount of time your sample spends in the acid treatment step to reduce damage to other proteins. You could also consider doing the primary antibody steps sequentially.
-
For example, after you have applied the first primary antibody, fix this with a formaldehyde-based fixative before the acid treatment step and adding the second primary antibody (eg the DNA modification antibody).
-
-
Choose the right DNA stain.
-
You may find that because of the acid treatment step you cannot use your standard DNA stain. For example, DAPI may not bind as well because it preferentially binds adenine-thymine-rich regions of double-stranded DNA.
-
A good alternative commonly found in most labs is propidium iodide (PI). PI will recognize both double-stranded and single-stranded nucleotide chains.
-
This means it will also pick up any RNA in your samples, so watch out for this. You can also find many commercially available DNA stains which will recognize single-stranded DNA.
-
Methyl binding domain proteins (MBDs)
5mC and its oxidized derivatives play an important role in gene silencing and promoting gene expression after DNA demethylation. It is now known that some of these DNA modifications can act as markers to recruit proteins to specific DNA sites, altering gene expression and acting as epigenetic marks. MBD3 and methyl CpG binding protein 2 (MECP2) have both been shown to bind 5hmC in addition to 5mC. Once bound to 5hmC they play a role in DNA accessibility and activation of transcription (Yildirim et al., 2011 and Mellén et al., 2012).
A common method to screen for binders of a DNA modification is to use a pull-down technique followed by MS to identify any proteins pulled down. This method has been successfully used to find binders of 5mC, 5hmC, and 5fC (Iurlaro et al., 2013 and Spruijt et al., 2013). For this experiment, you need to create a synthetic DNA bait containing the modification you are interested in, as well as baits containing other modifications and unmodified cytosine to act as controls. This DNA bait should be linked to a biotin molecule at one end that can be used to tether the bait to streptavidin-linked magnetic beads. Protein extract from your sample of interest can then be added to the tethered bait, followed by various wash steps to remove any non-specifically bound proteins. After this, you can elute the remaining proteins and carry out MS analysis to find out what your specific binders are.
MBDs: Technical considerations
-
DNA sequence.
-
When you design your synthetic DNA sequence, you may need to consider that the sequence itself may affect which binders you pull down.
-
You may have a sequence in mind that you wish to use as bait, the promoter region of your gene of interest for example.
-
Having a variety of sequences to use in your experiment will help to ensure that it is the modification which is the critical factor and not the DNA sequence.
-
-
The number of modifications.
-
The number of DNA modifications you have within your sequence may also influence the proteins binding to your bait.
-
You should consider having just one modification or multiple modifications in a sequence to see how this is influencing your result.
-
-
Washing.
-
If you want to be sure that the proteins binding to your bait are true binders of your modification it is important to carry out very stringent washing.
-
You can try high-salt washes to ensure you are removing everything non-specifically bound, but run the risk of removing everything, so you need to optimize this step to get the best results.
-
Novel DNA modifications
New DNA modifications could still be out there, just not discovered yet. It has been demonstrated that some modifications traditionally considered to be RNA modifications may also be present within DNA. One good example of this is N6-adenine methylation, known as m6A within RNA and 6mA within DNA. This modification is one of the most famous and abundant RNA modifications, but now it’s known to also reside within DNA. One of the first studies to show this was from John Gurdon’s lab in 2016 (Koziol et al., 2016). They show that 6mA is within Xenopus laevis, mice, and the human genome using an antibody against 6mA to carry out DIP-seq.
Since this study, there have been several more claims that 6mA is present within DNA in zebrafish and pig genomes (Liu et al., 2016), the mouse brain following environmental stress (Yao et al., 2017), and within the Arabidopsis thaliana genome (Liang et al., 2018). One study from 2018 took this one step further and reported the enzymes responsible for 6mA methylation and demethylation, N6AMT1 and ALKBH1, respectively (Xiao et al., 2018). The presence of enzymes actively adding and removing the DNA modification suggests that it has a real purpose and potentially its own epigenetic function.
Novel DNA modifications: technical considerations
-
Antibody availability.
-
Until single-base resolution methods are available for individual modifications, many studies rely heavily on the use of antibody-based pull-down (DIP-seq) to look for novel modifications within DNA.
-
The biggest problem here is that you are then reliant on there being a specific, commercially available antibody for your modification which is quite often not the case. Many RNA modification antibodies will also recognize modifications within DNA, so this is one approach you could take.
-
Treating your samples with RNase will help to ensure that you are targeting just DNA with your antibodies.
-