Recorded December 3, 2014
Watch our on-demand webinar to learn what ChIP-seq datasets should look like and the types of results you can extract. This webinar is an ideal resource for wet-lab biologists with little experience in programming or the Unix/Linux environment. It may also help any bioinformatic beginners, looking to learn ChIP-seq analysis skills for the future.
This webinar will give you a ChIP-seq workflow that you can run with your own dataset. All ChIP-seq data analysis software included has either a web-based interface or a graphical user interface, so no command-line experience is necessary. However, it is highly recommended that wet-lab biologists learn some basics of Linux usage, as the pipeline outlined in this webinar can be followed in just a few single commands.
Please note only sequencing data from the Illumina® platform is covered in this webinar.
Xi Chen completed his undergraduate studies at the Health Science Centre in Peking University, China. He obtained his PhD in the lab of Professor Andy Sharrocks at the University of Manchester, where he investigated the DNA binding specificity of several forkhead transcription factors using both traditional biochemical assays, and state-of-the-art genomic approaches. You can read more about that research here.
Xi is now a postdoctoral fellow in the lab of Dr Sarah Teichmann at EBI and Sanger in Hinxton. His work focuses on understanding how transcription factors control the fate decision of mouse T helper cells by integrating TF binding data and gene expression data.
Miriam Ferrer Ph.D., Product Manager for Cellular Assays, Abcam
Vicky Yang, Marketing Co-Ordinator, Abcam
VY: Hello. Welcome to Abcam's webinar, A Step-by-Step Guide to ChIP-seq data analysis. Today's principal speaker is Xi Chen, a postdoctoral fellow in the lab of Dr Sarah Teichmann at the EBI and Sanger in Hinxton. He did his undergraduate studies at the Health Science Centre in Peking University. He obtained his PhD in Professor Andy Sharrocks's lab at the University of Manchester, where he investigated the DNA binding specificity of several forkhead transcription factors using both traditional biochemical assays and state-of-the-art genomic approaches. Xi's work focuses on understanding how transcription factors control the fate decision of mouse T helper cells by integrating TF binding data and gene expression data.
Joining Xi today will be Miriam Ferrer, Product Manager for cellular assays at Abcam. Miriam completed her Biology degree at the University of Barcelona, and has a PhD from Vrije University in Amsterdam. After completing her PhD she joined the MRC Laboratory of Molecular Biology in Cambridge. I will now hand over to Xi, who will start this webinar. Xi?
XC: Thank you, Vicky, for the introduction and thank you all for coming to the webinar. So this webinar is more like a technical tutorial on how to perform ChIP-seq data analysis, especially for wet-lab biologists who do not have much programming and Linux experience. It is also useful for bioinformatics beginners who need to perform ChIP-seq data analysis in future. In this webinar the starting point is that you've got your raw sequencing read files, most likely FastQ files, and we will only cover sequencing data from the Illumina® platform. I will show you how to align or map the reads to a reference genome; how to perform peak calling to identify enriched regions for the protein of interest; how to find the DNA motifs enriched within the binding regions; and how to assign binding sites to genes and get enriched gene ontology terms to discover potential biological function. After this webinar the audience will be able to get a general idea about what ChIP-seq datasets look like, and what you can expect from ChIP-seq data.
Let me give you an outline of this webinar first. We will cover all the basic ChIP-seq routines by using the software listed in this slide. All the software here either has a web-based interface, like Galaxy and GREAT, or a graphical user interface, like FastQC and seqMINER, so no command line experience is needed here.
The datasets I'm using here are FOXA1 ChIP-seq experiments. FOXA1 is a forkhead transcription factor, and it is one of the first transcription factors that has been ChIPped genome-wide. We will do a side-by-side comparison of a successful FOXA1 ChIP-seq experiment, and a failed one. The successful FOXA1 ChIP-seq data is from the MCF7 breast cancer cell line, published in Nature Genetics from Jason Carroll's lab, shown on the left hand side. In this paper, the authors discovered that FOXA1 is absolutely required for estrogen receptor DNA binding and activity in breast cancer cells.
The failed one is from the LoVo colon cancer cell line, published in Cell from Jussi Taipale's lab, shown on the right hand side. In this paper, the authors found that transcription factor binding tends to occur in a highly clustered manner, and that most clusters contain cohesin. They performed hundreds of transcription factor ChIP-seq experiments, and FOXA1 is one of them. However, the FOXA1 ChIP-seq dataset didn't pass the authors' QC standards, so they labelled this as a failed experiment. This provides a perfect opportunity for us to get an idea about what a failed experiment looks like, which is sometimes even more important than other things; and we thank the authors for making this data publicly available.
Now, we will start with QC of your sequencing reads using software called FastQC. It is developed at the Babraham Institute, and you can download it from the website indicated at the top of the slide. As you can see here, FastQC can be run interactively, so you are able to locate your sequencing read files and load them into FastQC by simply clicking your mouse. Here, we have four files: two for the FOXA1/MCF7 datasets, and the other two for the FOXA1/LoVo datasets. For each dataset we've got a FOXA1 ChIP sample and a control sample. In the MCF7 datasets the control sample is the input control, while in the LoVo datasets the control is the IgG control.
Once you load the files into FastQC you should see something like this in the middle of the slide. It gives you many QC metrics of your sequencing reads. The usage of FastQC is beyond the scope of this webinar, and we won't cover all the metrics in the program. Here, we will only focus on three metrics. The first one is basic statistics, which contains some basic information about your sequencing reads, like read length and number of total reads. Perhaps the most important piece of information here is the encoding: it tells you how the quality scores of the sequencing reads are encoded. The FOXA1/MCF7 experiment is encoded by Sanger/Illumina® 1.9, and the FOXA1/LoVo experiment by Illumina® 1.5. We will need this information later, and I will come back to it.
The second metric is the read quality. In this graph the X-axis shows all the individual base positions in the reads. For every position the Y-axis shows a box plot of the distribution of quality scores of all reads at that particular base. The yellow boxes span the 25th to 75th percentiles, the red line indicates the median, and the blue line indicates the mean of the quality scores. You really want the quality scores to be well above 20 for good bases. In the FOXA1/LoVo data on the right the quality drops rapidly at the end of the reads. In this case, you might want to trim the last few, or even five, bases of your reads before downstream analysis. In this webinar we won't do this, just for simplicity.
The third metric we are looking at is the sequence duplication level. It tells you how unique your reads are within the library, and this is a measure of library complexity. Due to the random nature of sequencing, what you expect to see is that most sequencing reads only occur once, like the FOXA1/MCF7 data on the left. However, in the FOXA1/LoVo data on the right the majority of reads occur twice, and there are also a lot that occur four times. The overall percentage of unique sequencing reads is shown above the graph here. The FOXA1/MCF7 data have more than 70% unique sequences, whilst the FOXA1/LoVo data have only around 44% unique sequencing reads. The total number of reads in the FOXA1/LoVo datasets is more than 48 million, so 44% indicates that there are about 21 million unique reads, which is quite okay, actually. Now, it is worth mentioning that the FastQC results are only indications of the quality of your sequencing, but they cannot tell you whether your ChIP experiments worked or not. That's what we are going to look at next.
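For readers who later move on to scripting, the idea behind the duplication level can be sketched in a few lines of Python. This is only an illustration of the concept on a handful of made-up reads; the function name is hypothetical and FastQC's own calculation may differ in detail.

```python
from collections import Counter

def duplication_summary(sequences):
    """Return (percent of distinct sequences, histogram of duplication levels)."""
    counts = Counter(sequences)        # sequence -> number of times it was seen
    levels = Counter(counts.values())  # times seen -> number of distinct sequences
    percent_distinct = 100 * len(counts) / len(sequences)
    return percent_distinct, dict(levels)

# Four made-up reads: one sequence seen twice, two sequences seen once.
reads = ["ACGT", "ACGT", "TTGA", "GGCA"]
print(duplication_summary(reads))  # (75.0, {2: 1, 1: 2})
```

Applied to the figures quoted in the talk, 44% of roughly 48 million reads corresponds to about 21 million distinct reads.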
Now, after getting a rough idea about the sequencing quality, we are ready to align the reads to the reference genome. To do this we will use Bowtie in the Galaxy platform. Galaxy is a user-friendly online system with many integrated bioinformatics tools. After registering and logging in to the Galaxy website, which is usegalaxy.org, you will need to upload your FastQ files to the Galaxy server. Normally, you just click ‘upload file’ under the get data tab on the left. You can choose the files from your computer to upload. However, note here, if the files are too large they will fail to upload in the browser, so it is better to upload your FastQ files via Galaxy FTP.
Any FTP client will do the trick. Here, I'm using FileZilla as an example. In the host field fill in the address of the Galaxy website, which is usegalaxy.org. The user name will be the email address you used when you registered your account on the Galaxy website, and then fill in the password. Simply click quick connect, and you will see some log information below indicating you have successfully connected to the Galaxy FTP. The left hand side shows the files on your local computer, and the right hand side shows your files on the Galaxy FTP; there are none at the moment. So now locate your FastQ files on your computer on the left, then simply drag them to the right; the files will start to upload. It is also recommended to compress your FastQ files to save uploading time. The upload speed is, of course, not as fast as your local network, but it is still quite decent. When it is done, the files uploaded will be shown on the right. Then if you go to the Galaxy website the files you just uploaded will be shown in the middle.
Now we are ready to upload the files from the FTP to the Galaxy server. Simply tick the file you are going to upload, then in the file format dropdown menu choose the right file format. Our files are FastQ files, and you can see here there are actually four different kinds of FastQ files. The FastQ CS Sanger format is for the color space SOLiD sequencing platform, and we won't cover this in this webinar. So the formats we should consider are FastQ Illumina®, FastQ Sanger and FastQ Solexa. The main difference among these three FastQ formats is the way they encode sequencing quality. The FastQ Sanger format uses the phred score to represent sequencing quality, and it uses the ASCII character with code phred score plus 33 to encode the quality score. FastQ files from the Illumina® pipeline 1.8 and after are encoded in this way. I will show you later what exactly this means. If your sequencing experiment was done recently, then most likely your FastQ files will be in FastQ Sanger format.
The FastQ Illumina® format uses the phred score as well, but it encodes the quality by the ASCII character with code phred score plus 64. FastQ files from the Illumina® pipeline starting from 1.3 and before 1.8 will be in this format. If you remember from the previous FastQC results, the FOXA1/LoVo ChIP sample is from the Illumina® pipeline 1.5, so it should be in FastQ Illumina® format. The other three samples are all from Illumina® pipeline 1.9, so they will be in FastQ Sanger format.
The FastQ Solexa format uses a different system to encode the quality score, and it is highly unlikely that your data files will be in this format, unless you're analyzing some very old Solexa data. More information about this can be found in this paper from NAR, and from the Wikipedia page at the bottom of the slide.
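If you are unsure which of the three formats a file is in, the range of quality characters gives a hint. The sketch below is a rough heuristic, not what FastQC or Galaxy actually run; the function name and the returned labels are chosen here for illustration. It relies on the fact that phred plus 33 data can contain characters with ASCII codes below 59, Solexa data can go down to code 59, and phred plus 64 data starts at code 64.

```python
def guess_fastq_encoding(quality_lines):
    """Guess the quality encoding from the lowest quality character seen."""
    lowest = min(ord(ch) for line in quality_lines for ch in line)
    if lowest < 59:
        return "fastqsanger"    # Phred+33: codes this low cannot occur with a +64 offset
    if lowest < 64:
        return "fastqsolexa"    # Solexa+64 scores can be as low as -5 (code 59)
    return "fastqillumina"      # Phred+64, or a small all-high-quality Phred+33 sample

# Made-up quality strings for illustration.
print(guess_fastq_encoding(["B@CFFFFF", "##55IIII"]))  # fastqsanger
print(guess_fastq_encoding(["hhhhghhB", "ffffdddd"]))  # fastqillumina
```

The last branch is only a guess: a small sample of very high quality phred plus 33 reads could also contain nothing below code 64, so it is worth checking many reads, or simply trusting the encoding reported by FastQC.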
So now click the file and choose the right file format. Click the execute button. The file will start to upload from the FTP to the Galaxy server. After they are done they will appear on the right hand side in green. If you click the little eye button here you will see what is actually in the file in the middle. We use the FOXA1/MCF7 data as an example. Basically, the FastQ file is just a simple text file, and each sequencing read is represented by four lines of text. There are about 27 million reads in this sample, so there are about 108 million lines of text in this file.
Let's look into the details of one read, shown here on the top in the black box. The first line is the name of the read, and it must start with an @ character. The second line is the actual DNA sequence of the read. The third line means nothing, but it must start with the plus character; only this plus character is required, and the rest of the line is optional. In the data I've got up on the slide, this line is the same as the first line except for the beginning character, but most FastQ files only have a plus character as the third line of each read. In this way you reduce the file size and save storage space. The fourth line is the ASCII string encoding the quality score of each base.
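The four-line layout just described can be read with a short Python function. This is a minimal sketch that assumes each read occupies exactly four lines with no wrapped sequences; the record type, function name and example read are made up for illustration.

```python
import io
from typing import Iterator, NamedTuple

class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str

def read_fastq(handle) -> Iterator[FastqRecord]:
    """Yield one record per four lines: @name, sequence, +, quality."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file
            return
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)

# A made-up read; the third line carries only the required plus character.
example = io.StringIO("@read_1\nAGCT\n+\nB@II\n")
for record in read_fastq(example):
    print(record.name, record.sequence, record.quality)
```

With a real file you would pass an open file handle instead of the `io.StringIO` object.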
Here is an ASCII table that you can easily find via the internet, and let's look into each base of this read. So the first base of this read is A and the quality character for this base is B. By looking at the ASCII table we can find that the ASCII code for the character B is 66. Since this is FastQ Sanger format the phred score will be 66 minus 33, which is 33. A phred score of 33 indicates that the error probability of this base is about 0.0005.
The second base of this read is G and the quality character for this base is the @ character. Again, by looking at the ASCII table we find that the ASCII code for the @ character is 64, so the phred score for this base will be 64 minus 33, which is 31. A phred score of 31 indicates that the error probability for this base is about 0.0008. You can check every base within this read by yourself in this way, and that is basically how you read and interpret the data from FastQ files.
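The lookup in the ASCII table and the conversion to an error probability can also be written out in Python. The sketch below uses the standard phred relationship, error probability = 10^(-Q/10); the function names are made up for illustration.

```python
def phred_score(quality_char: str, offset: int = 33) -> int:
    """Phred score of one quality character (offset 33 for FastQ Sanger)."""
    return ord(quality_char) - offset

def error_probability(score: int) -> float:
    """Probability that the base call is wrong for a given phred score."""
    return 10 ** (-score / 10)

for ch in "B@":
    q = phred_score(ch)
    print(ch, ord(ch), q, round(error_probability(q), 4))
# B 66 33 0.0005
# @ 64 31 0.0008
```

For a FastQ Illumina® file the same functions apply with `offset=64`.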
Now we will use Bowtie to align the sequencing reads to the genome, but the Bowtie software in Galaxy only supports FastQ Sanger files, so we need to convert the FOXA1/LoVo ChIP file from FastQ Illumina® to FastQ Sanger. To do this, on the left hand side click FastQ Groomer under the NGS QC and manipulation tab. Then in the middle choose the file you are going to convert and the input format, and hit execute. After it's done it will appear on the right in green, and you can click the pencil button here to rename it to something meaningful like this.
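The core of this conversion is just a change of offset from 64 to 33 for every quality character. The following sketch shows that idea only; it is not the FastQ Groomer code, which does more checking, and the function name is hypothetical.

```python
def illumina_to_sanger(quality: str) -> str:
    """Re-encode a Phred+64 quality string as Phred+33."""
    converted = []
    for ch in quality:
        score = ord(ch) - 64  # decode with the Illumina 1.3-1.7 offset
        if score < 0:
            raise ValueError(f"{ch!r} is not a valid Phred+64 quality character")
        converted.append(chr(score + 33))  # encode with the Sanger offset
    return "".join(converted)

# 'h' is Q40, 'B' is Q2 and '@' is Q0 in Phred+64.
print(illumina_to_sanger("hB@"))  # I#!
```

Only the fourth line of each read changes; the read name and the DNA sequence stay the same.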
Finally, we are ready to align or map the sequencing reads to the reference genome. We will be using the Bowtie program, so on the left hand side click ‘map with Bowtie for Illumina®’ under the NGS mapping tab. Then in the middle select the reference genome. In this case, all samples from both experiments are from human cells, so we choose hg19 here, which is not the latest, but it is a stable version of the human genome assembly. For the input file choose the FastQ file you want to map, and in the Bowtie settings choose full parameter list; then you will get many options. Now, we will keep everything at the default, except change this -m option to 1. So what does this mean? To put it simply, some reads can be mapped to multiple locations within the genome, and you can handle these reads differently according to your need. Here we put 1, which simply discards those reads. This is a good starting point for beginners and, actually, many people are still doing this when they analyze DNA-centric sequencing data, like ChIP-seq, DNase-seq and ATAC-seq.
So you do the same for the other three files, and after they are finished they will appear on the right hand side in green. As you can see here, I have already renamed them to something meaningful. Again, you can click the eye button to actually see what's inside the output alignment file. The output alignment file from Bowtie is called a SAM file. The SAM file is also a simple text file, with each line representing a read and containing the location in the genome it is mapped to, and other information. The details about the SAM format can be found in the PDF link shown at the top.
Now, after mapping, we are going to perform peak calling to identify FOXA1 binding sites. However, the SAM files at the moment contain both mapped and unmapped reads. We only want the mapped reads for the peak calling, so we need to remove all the unmapped reads. To do this we will use SAMtools, shown on the left hand side. Just click ‘filter SAM or BAM’, then in the middle choose the file you want to filter and set ‘filter on bitwise flag’ to yes; then you get more options. Under ‘skip alignments with any of these flag bits set’ check ‘the read is unmapped’, so in this way SAMtools will remove all the unmapped reads. After it is finished the output will appear on the right in green, and I have already renamed them to something meaningful. By default the outputs from SAMtools are BAM files. The BAM file is a binary file, not a text file, so you won't be able to view the content. You can simply think of the BAM file as a compressed SAM file; it significantly reduces the file size.
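To make the bitwise flag less mysterious: the second column of every SAM alignment line is a number whose individual bits carry yes/no information, and the bit with value 4 means "the read is unmapped". The sketch below shows the same filter on plain SAM text lines. It is an illustration only, with made-up function names and example lines, and it does not read BAM files.

```python
UNMAPPED = 0x4  # flag bit defined in the SAM specification: segment unmapped

def is_mapped(sam_line: str) -> bool:
    """True if the alignment line does not have the 'unmapped' bit set."""
    flag = int(sam_line.split("\t")[1])
    return not (flag & UNMAPPED)

def keep_mapped(lines):
    """Yield header lines (starting with @) and mapped alignment lines."""
    for line in lines:
        if line.startswith("@") or is_mapped(line):
            yield line

# Made-up SAM lines: flag 0 = mapped forward, 4 = unmapped, 16 = mapped reverse.
sam = [
    "@HD\tVN:1.0",
    "read_1\t0\tchr1\t100\t255\t4M\t*\t0\t0\tACGT\tIIII",
    "read_2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "read_3\t16\tchr2\t500\t255\t4M\t*\t0\t0\tACGT\tIIII",
]
for line in keep_mapped(sam):
    print(line.split("\t")[0])
# @HD
# read_1
# read_3
```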
Now we are ready to perform peak calling. For this we will be using a very popular peak caller called MACS, developed in Shirley Liu's lab. On the left hand side under the NGS peak calling tab click MACS to use it. In the middle there are many options, but we will only change a few of them. First, give it a meaningful name, then choose the ChIP file, which is the FOXA1 ChIP file, and the control file, which is the matching control experiment: input control for MCF7, IgG control for LoVo cells. The default genome size is 2.7 billion base pairs, and this is for human cells. For a different organism check the MACS website for the genome size. Then change the tag size to your sequencing read length. Tick ‘Parse xls files into interval files’; we will see this later. Then choose to save wig files so that you will get the binding signal files later for visualization. Last, choose ‘do not build the shifting model’, and set 100 base pairs as the shift size.
So why change these two options? As you know, for a regular ChIP-seq experiment the sequencing reads you get are from the ends of your fragments, which are not the actual TF binding sites. The actual TF binding sites are somewhere in the middle of the fragment. By default, MACS will estimate the fragment length and shift the reads towards the middle by half of the fragment length to represent the actual TF binding. However, for some reason sometimes it cannot reliably estimate the fragment size. A common practice is to simply disable this and ask MACS to shift the reads by a certain distance. In this case we choose 100 base pairs, which is the default when you disable the model building, and it works quite well. Now click execute. After it is finished you get some outputs on the right hand side in green, and the number of output files depends on your MACS settings. In this case, we have six output files for each experiment.
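The idea of shifting reads can be illustrated with a toy calculation. Reads on the forward strand start at the left end of a fragment and are moved to the right; reads on the reverse strand start at the right end and are moved to the left. This is a conceptual sketch with made-up names and coordinates, not the MACS implementation.

```python
def shifted_position(five_prime_end: int, strand: str, shift_size: int = 100) -> int:
    """Move a read's 5' end towards the presumed middle of its fragment."""
    if strand == "+":
        return five_prime_end + shift_size
    if strand == "-":
        return five_prime_end - shift_size
    raise ValueError("strand must be '+' or '-'")

# A made-up 200 bp fragment spanning positions 1000 to 1200:
# the forward read starts at 1000 and the reverse read starts at 1200.
print(shifted_position(1000, "+"))  # 1100
print(shifted_position(1200, "-"))  # 1100
```

Both reads end up at the same position in the middle of the fragment, which is where the protein is presumed to bind.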
Now, let's have a look at what each file is. The first file is the HTML report containing the running log of MACS. The second and third files are wig files containing the signal intensity across the genome. The treatment wig file is your ChIP sample signal, and the control wig file is your control sample signal. Again, the wig files are simple text files. We won't look at the negative peaks interval files here. The fifth file here is a peak interval file containing the enriched region of your protein, in this case FOXA1. The last file is a bed file for visualizing peak positions in a genome browser, and both the interval and bed files are simple text files.
If you click the name of the interval file you can download it to your local computer, and then it can be opened and easily handled and manipulated by any spreadsheet software like Excel. This is probably one of the most important output files; it contains all the information you need to know about the binding sites, including the location and the statistics. At this stage the peak calling is done.
A common and tricky question that most beginners have to face at this stage is: how can I tell whether my experiment worked or not? Actually, there are many things you can check, but the following three things are the most efficient ways of telling whether the ChIP experiment worked or not. First, by looking at the interval files; second, by visual inspection of the binding signal; and third, by looking at the enriched motifs within the binding sites. We will now go through these one by one.
We will start with examining the interval files. To do this you sort first by FDR, smallest to largest, then by fold enrichment, largest to smallest. The first thing we should look at is the number of binding sites. If you choose an FDR cutoff of 1%, which is commonly used, there are about 42,000 peaks left in the FOXA1/MCF7 data, but only 430 peaks left in the FOXA1/LoVo data. Of course, the number of binding sites depends on the protein, the peak caller, the cell line and many other things. Typically, the number you should expect is from around a thousand to several thousand, or even several tens of thousands. As you can see here, the FOXA1/LoVo dataset apparently does not satisfy this expectation. The second thing we could look at is the range of the fold enrichment over local background. The FOXA1/MCF7 data have a wide range from more than 300-fold to about fourfold, while the largest FOXA1/LoVo peak has only around 20-fold enrichment, indicating the signal to noise ratio in this data is really low. The third thing we could look at is the number of sequencing reads within the peak region. So, again, the FOXA1/MCF7 data have a much wider range than the FOXA1/LoVo data. So these three metrics we just went through clearly suggest that there is something wrong with the FOXA1/LoVo dataset.
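The same sort and cutoff can be expressed in a few lines of Python if you prefer that to a spreadsheet. The sketch below works on a list of dictionaries; the keys `fdr` and `fold_enrichment` and the example peaks are assumptions made for illustration, so adapt them to the column headers in your own interval file.

```python
def filter_and_sort_peaks(peaks, max_fdr_percent=1.0):
    """Keep peaks at or below the FDR cutoff; sort by FDR, then fold enrichment."""
    kept = [p for p in peaks if p["fdr"] <= max_fdr_percent]
    # Ascending FDR first; within equal FDR, larger fold enrichment first.
    return sorted(kept, key=lambda p: (p["fdr"], -p["fold_enrichment"]))

# Made-up peaks; FDR is given in percent.
peaks = [
    {"name": "peak_a", "fdr": 0.5, "fold_enrichment": 12.0},
    {"name": "peak_b", "fdr": 0.0, "fold_enrichment": 40.0},
    {"name": "peak_c", "fdr": 5.0, "fold_enrichment": 6.0},
    {"name": "peak_d", "fdr": 0.0, "fold_enrichment": 310.0},
]
selected = filter_and_sort_peaks(peaks)
print([p["name"] for p in selected])  # ['peak_d', 'peak_b', 'peak_a']
print(len(selected), "peaks pass the 1% FDR cutoff")
```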
The next thing we could do to judge the quality of the data is to visualize the binding signal, and we found this seems to be the most efficient way of telling whether a ChIP-seq experiment worked or not. So the wig files generated by MACS are the signal files, and you can visualize them directly in a genome browser. However, for large datasets it is highly recommended to convert the wig files to bigWig files before visualization. The bigWig format is an indexed binary format. When you visualize the bigWig files only the portions of the files being displayed are transferred to the genome browser, which is much faster than loading the wig files. So to convert the wig files to bigWig, on the left hand side click ‘Wig/BedGraph-to-bigWig’ under the convert formats tab. Then in the middle choose the wig file you are going to convert, and click execute.
After it is done change the file name to something meaningful. If you click the name of the file you will see an option to display at UCSC main. Click that, and a new tab or window of the UCSC genome browser will appear. In the genome browser, to configure the visualization, set the display mode to full, so that it will display the histogram of the signal. Set the vertical viewing range from zero to a reasonable maximum value like 100 or 200, and set the data view scaling to ‘use vertical viewing range setting’, so that it won't automatically rescale. You will be able to visualize the binding signal as shown here.
Now, what kind of patterns are we expecting? There are a few things you can look at at this stage. You can have a look at whether the peak shape is normal, i.e. whether they are smooth bell-curve-shaped peaks or some strange spikes. If you have some known target genes, can you find some binding peaks in the promoters, or near the transcription start sites? Finally, to get a bigger view of the binding pattern you can look at the whole chromosome view.
This is an example of signals of all four samples across the entire chromosome 12 of the human genome. The two control files are almost flat, as you would expect them to be. The FOXA1/MCF7 ChIP sample has many peaks, but the FOXA1/LoVo ChIP sample looks very much like its matched control sample. There are only a few very small peaks. Now, this is a clear indication that the FOXA1/LoVo ChIP experiment failed, or at least it is not optimal. So that's how you can judge the quality of the data visually.
The next step is to identify enriched motifs within the binding sites. We will use a program called MEME-ChIP to do this. In order to use MEME-ChIP we need to extract the DNA sequences from the peak coordinates into a FASTA file. Now I need to clarify two terms before we go any further. When MACS is used for peak calling, peak means the whole enriched region from start to end, indicated by the black bar on the top left. Summit, indicated by the thin, orange line, means the highest pile-up point within the peak region. It is supposed to be the exact TF binding site. It is not a good idea to put too much sequence into the motif discovery, and a common practice for TF motif discovery is to use the 100 base pair region centered on the summit as the input. Column E in the MACS interval file is the distance of the summit from the start position. So to create the coordinates of the peak summit ± 50 base pairs, just create four new columns after the FDR column of the interval file.
The first new column is the same as column A, which is the chromosome name. The second column will be column B plus column E, minus 50. The third column will be column B plus column E, plus 50. In the fourth column give a unique name to each individual region like this, and you can do this very easily using Excel. Now, save these four new columns as a tab delimited text file. Let’s call it FOXA1_summit_100bp.txt. For simplicity, I will only choose the top 1,000 regions for motif discovery in the webinar.
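The spreadsheet arithmetic above can be sketched in Python as well. The function names, the `region` name prefix and the example peak are made up for illustration; the calculation itself follows the description in the talk, i.e. summit position = peak start plus summit offset, then 50 base pairs either side.

```python
def summit_window(chrom, peak_start, summit_offset, half_width=50):
    """Return (chrom, start, end) of the window centered on the peak summit."""
    summit = peak_start + summit_offset  # column B plus column E
    return chrom, summit - half_width, summit + half_width

def to_interval_lines(peaks, prefix="region"):
    """Build tab delimited lines: chrom, start, end, unique name."""
    lines = []
    for i, (chrom, peak_start, summit_offset) in enumerate(peaks, start=1):
        c, start, end = summit_window(chrom, peak_start, summit_offset)
        lines.append(f"{c}\t{start}\t{end}\t{prefix}_{i}")
    return lines

# A made-up peak starting at chr1:1000 with its summit 120 bp from the start.
print(summit_window("chr1", 1000, 120))  # ('chr1', 1070, 1170)
print(to_interval_lines([("chr1", 1000, 120)]))
```

Writing the returned lines to a text file, one per line, gives the same kind of four column file as the one saved from Excel.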
Now, we are going to extract the 100 base pair DNA sequences of these regions using the coordinates. To do this, first upload this text file to the Galaxy server using a web browser. On the left hand side click ‘upload data’ under the get data tab, then in the middle choose the file format as interval. Choose the text file you just saved on your computer, and make sure the genome assembly is correct. After uploading, click ‘extract genomic DNA’ under the fetch sequences tab. Choose the file you just uploaded and simply click execute. When it has finished you will see the DNA sequence within each 100 base pair region in FASTA format. You can save this FASTA file to your local computer; let’s call it FOXA1_summit_100bp.fasta.
Now, go to the MEME-ChIP website using the address on the slide. The options here are pretty much self-explanatory. In the input choose the FASTA file you just downloaded, then enter your email address and give a job name. The default setting is always a good starting point, so you just simply click ‘start search’. Then you will receive an email with a link to retrieve the results.
The result page will look like this. The motifs found, on the left hand side, are de novo motifs that are enriched within your input regions. The known or similar motifs on the right hand side indicate the similarity between each motif and known TF motifs. Now shown on the screen is the FOXA1/MCF7 data. The top motif is a typical forkhead motif with a very small E-value. This is pretty much what you expect from a successful experiment. There are also some other motifs returned, indicating potential interaction with other transcription factors. If the motif of the transcription factor you are ChIPping is not enriched in the binding regions, like in the FOXA1/LoVo dataset we are going to show in the next slide, then you need to put a question mark on the data, unless you have other strong evidence suggesting that the experiment worked.
Now shown on the screen is the motif result from the FOXA1/LoVo dataset. As you can see here, although the top two motifs look like the forkhead motif, the E-values are very big. There are also two motifs with very low complexity, which are just simple repeats. These motif results further confirm that the FOXA1/LoVo ChIP experiment did not work. Now we have finished the motif discovery step, and we have a better impression of the quality of the data.
Another thing that most biologists are interested in is assigning the binding sites to genes, and then performing gene ontology analysis to find potential biological functions of the transcription factor. To do this we'll be using a web tool called GREAT, developed in Gill Bejerano's lab. It is very straightforward to use; just go to the GREAT website using the address shown on the top right. Then in the genome assembly, choose the right genome, in this case it's hg19. Then choose the peak file from your computer as the test regions. Since this is a ChIP-seq experiment choose the whole genome as background. If you click ‘show settings’ you can see how exactly GREAT assigns the peaks to genes, and you can change it as you want. But according to the paper the default setting seems to perform the best, so just click ‘submit’.
The output looks like this. The top contains some basic information about peak gene association, and the bottom contains the enriched GO terms from specific categories like molecular function, biological process and cellular component, as well as a lot of information from many other sequence databases. It is quite informative to look at those, but we won't go through them here. One thing that is particularly useful is that, if you click the ‘job description’ on the top, this section will expand, and if you click ‘view all genomic region-gene associations’ you will see two tables: one gives you for each peak its associated genes, and the other gives you for each gene its associated peaks, together with the distance between the peak and the gene. You can download them as a text file and save them for future use. This is quite convenient. Now we have finished the gene ontology analysis.
Last, but not least, a good way of showing binding signals is a heatmap. To do this we will need alignment files, which are the BAM files we created after the mapping earlier. So click download datasets to save the BAM files to your local computer. We also need a program called seqMINER, developed in Laszlo Tora’s lab. Just download seqMINER from the address at the top, run it according to your operating system, then you should see an interface like this. First, to load data into it, in the load reference section choose the peak file, for example, the 100 base pair region text file which we just created. Since this is a small file it is immediately loaded and you can see the information in the upper center. In the load aligned reads section choose the BAM files just downloaded, and then click ‘load files’. These two files will be loaded into memory, and this seems to be the time consuming part of all the analysis in seqMINER. After it is done, the datasets will be shown in the middle. When you click ‘extract data’, seqMINER will pile up the reads from each alignment file you just loaded over the regions in your peak file and generate the heatmap matrix.
After it is done the result will show up on the right. If you right click the result, choose ‘visualization of HeatMap’. A heatmap of the two files you loaded will be displayed, and you will also have a density graph at the bottom right. The good thing about seqMINER is that you can load multiple experiments at once, then perform k-means clustering to find different binding patterns, and extract different clusters for more detailed analysis. But that is beyond the scope of this webinar.
So far, that's all the basic ChIP-seq routines done without any programming or command line involved. Hopefully, by running the workflow in this webinar on your own data, you will quickly get a general idea of what ChIP-seq datasets look like, and what to expect from them. Here, I've put some references, mainly from Nature Protocols, which give detailed step-by-step guidance, including troubleshooting tips, on the key aspects of ChIP-seq analysis. Some of them require some programming, or at least command-line experience. It is highly recommended that wet-lab biologists who are doing genomic experiments learn some basics of Linux usage and perform the data analysis themselves. It gives you a better understanding of the data itself, and once you've done that, all the things we just went through in this webinar can be done in just a few single commands.
This is the end of my presentation. Now I will hand over to Miriam who will introduce some related products and the protocol from Abcam, and then I will come back to answer questions. Thank you.
MF: Thank you, Xi, for such an interesting and comprehensive presentation. I would like to take this opportunity to talk to you about some Abcam products and resources. We have just launched a high sensitivity ChIP kit that has been specifically designed to be used when there is a limited amount of sample available to perform ChIP; for example, when working with patient material, transgenic mouse tissues or stem cells. With the high sensitivity ChIP kit you can obtain enrichment of your sequence of interest by using as few as 2,000 cells, or 0.5 mg of tissue, per reaction. You can get your results in only 5 hr, so the experiment can be done in one working day, and the eluted DNA can be used straight away in sequencing, microarray or qPCR.
Our high sensitivity ChIP kit is part of our range of high sensitivity products designed for when there is a limited amount of starting material. If you want to create a DNA library with a limited amount of DNA, our high sensitivity DNA library preparation kit for Illumina® sequencing allows you to create a DNA library from only 0.2 ng of DNA. Our ChIP-seq high sensitivity kit combines the benefits of the two kits I've just mentioned. This kit is designed to carry out a successful ChIP-seq starting directly from as few as 5-10,000 cells.
Bisulfite sequencing is the use of bisulfite treatment to determine the pattern of methylation. To help you take your methylation pattern studies further, we now offer a couple of products to prepare bisulfite DNA libraries for Illumina® sequencing. With our first bisulfite DNA library preparation kit, you can prepare a library using pre-treated DNA in only 5 hr. If you haven't treated your DNA, then we recommend our bisulfite-seq high sensitivity kit, which has all the reagents to perform the bisulfite modification followed immediately by a DNA library preparation step. The library is modified and ready in only 6 hr.
Without further delay, I will pass the microphone back to Xi who is ready to answer some of your questions.
XC: Thank you, Miriam. The first question is: Is paired-end or single-end read data better? Well, for most ChIP-seq data, single-end experiments are more prevalent. If you do paired-end sequencing for your ChIP experiments, of course, the cost will be much higher than for single-end data, but if you look at the information you get from paired-end data, in terms of ChIP-seq you don't gain a lot compared to single-end data. So, to be cost-effective, most people are still doing single-end sequencing for ChIP-seq.
The second question is: Is the input or IgG a better control? Actually, they are both okay for ChIP-seq experiments as controls, and different people use different controls. For us, we prefer the input as a control, because it provides much more complex libraries.
The third question is: Is there a simple way to quantify changes in peak height between two ChIP-seq experiments using the same antibody, but chromatin from different cell types or conditions? Well, this is a very tricky and a very good question. In the ChIP-seq community, a lot of people have put in a lot of effort to develop software to identify differentially bound peaks, but, at the moment, there's no consensus about which method to use. There are already a lot of methods out there, and if you Google ‘differential binding ChIP-seq methods’, then I'm sure you will find a lot, and you can have a look at the math behind each method and choose for yourself.
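To give a feel for the kind of math involved, here is a minimal Python sketch of one very simple comparison: scale the read count in a peak by library size (reads per million) and take a log2 ratio between conditions. This is only an illustration and not a recommended or consensus method; the pseudocount and all numbers are assumptions, and published tools use more careful normalisation and statistical testing with replicates.

```python
import math


def reads_per_million(count, library_size):
    """Scale a raw read count by the total number of mapped reads."""
    return count * 1_000_000 / library_size


def log2_fold_change(count_a, library_a, count_b, library_b, pseudocount=1.0):
    """log2 ratio of condition B over condition A for one peak.

    The pseudocount (an assumption of this sketch) avoids division by zero
    when a peak has no reads in one condition.
    """
    rpm_a = reads_per_million(count_a, library_a)
    rpm_b = reads_per_million(count_b, library_b)
    return math.log2((rpm_b + pseudocount) / (rpm_a + pseudocount))


# Hypothetical peak: 30 reads out of 10 million in condition A,
# 140 reads out of 20 million in condition B.
change = log2_fold_change(30, 10_000_000, 140, 20_000_000)
print(change)  # 1.0, i.e. about a two-fold increase after normalisation
```

Note that the raw counts differ almost five-fold, but after accounting for sequencing depth the change is only about two-fold, which is why normalisation matters when comparing experiments.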
Another question is: What are the differences between reads and peaks? The reads are the raw sequencing reads that you get from the sequencing machine, containing the actual DNA sequences and the quality scores. Nowadays, they are usually between 50 and 100 base pairs long. After mapping, the mapped reads contain the genomic locations where the reads are aligned. So the mapped reads will be genomic coordinates 50 to 100 base pairs long, depending on your sequencing read length. Peaks, on the other hand, are the genomic regions returned by the peak caller; they are the regions where the reads aggregate or pile up and pass a certain statistical threshold of your peak caller.
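The distinction can be sketched in a few lines of Python. Mapped reads are short intervals; stacking them gives a per-base depth, and a peak is a region where that depth is high. The fixed depth cut-off used here is an assumption for illustration only: real peak callers such as MACS use a statistical model and a control sample rather than a simple threshold.

```python
def coverage(reads, length):
    """Per-base read depth over a region of `length` bases.

    `reads` is a list of (start, end) half-open intervals for mapped reads.
    """
    depth = [0] * length
    for start, end in reads:
        for pos in range(max(start, 0), min(end, length)):
            depth[pos] += 1
    return depth


def call_regions(depth, min_depth):
    """Return (start, end) half-open runs where depth >= min_depth."""
    regions = []
    run_start = None
    for pos, value in enumerate(depth):
        if value >= min_depth and run_start is None:
            run_start = pos
        elif value < min_depth and run_start is not None:
            regions.append((run_start, pos))
            run_start = None
    if run_start is not None:
        regions.append((run_start, len(depth)))
    return regions


# Hypothetical 50 bp mapped reads on a 500 bp stretch of one chromosome.
reads = [(100, 150), (110, 160), (120, 170), (400, 450)]
depth = coverage(reads, 500)
regions = call_regions(depth, min_depth=3)
print(regions)  # [(120, 150)]
```

The four reads are the input, and the single region where three of them pile up is the "peak"; the lone read at position 400 does not pass the cut-off.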
This is the last question I will answer. So the question is: If the number of binding sites falls below 1,000, should I consider that my experiment has failed? The short answer to this question is no, you cannot tell whether your ChIP-seq experiment failed or not solely based on the number of binding sites, because the number of binding sites depends on many things, like the expression of the protein, the cell line, the peak caller and also the cut-off thresholds used in the peak calling. If you use the method in this webinar, the number we normally get from transcription factor ChIP-seq is around 1,000 to several thousand. If the number of binding sites falls below 1,000, you'll need to put a question mark over it. However, you cannot simply judge whether the experiment worked or not solely based on the number of binding sites. You also need to check other things like the motif results, and make sure you inspect your signal intensity. This is the last question. Now I will hand over to Vicky, and thank you all for attending this webinar.
VY: Thank you Xi and Miriam for your presentations today. We have received quite a lot of questions; unfortunately, we were not able to answer all of them. For those whose questions were not answered, our scientific support team will contact you shortly with a response to your question. If you have any questions about what has been discussed in this webinar or have any technical enquiries, our scientific support team will be very happy to help you, and they can be contacted at technical@abcam.com. We hope you have found this webinar informative and useful to your work. We look forward to welcoming you to another webinar in the future. Thank you again for attending, and good luck with your research!
ChIP-seq pipeline: Bardet, A.F., He, Q., Zeitlinger, J. & Stark, A. A computational pipeline for comparative ChIP-seq analyses. Nat Protoc 7, 45-61 (2012).
Peak calling: Feng, J. et al. Identifying ChIP-seq enrichment using MACS. Nat Protoc 7, 1728-40 (2012).
Motif discovery:
Ma, W., Noble, W.S. & Bailey, T.L. Motif-based analysis of large nucleotide data sets using MEME-ChIP. Nat Protoc 9, 1428-50 (2014).
Thomas-Chollier, M. et al. A complete workflow for the analysis of full-size ChIP-seq (and similar) data sets using peak-motifs. Nat Protoc 7, 1551-68 (2012).
Gene ontology:
McLean, C.Y. et al. GREAT improves functional interpretation of cis-regulatory regions. Nat Biotechnol 28, 495-501 (2010).
Huang da, W., Sherman, B.T. & Lempicki, R.A. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc 4, 44-57 (2009).
HOMER suite: Go to https://www.salk.edu/ for more information