A step-by-step guide to ChIP-seq data analysis

On-demand webinar

Summary:

Watch our on-demand webinar to learn what ChIP-seq datasets should look like and the types of results you can extract. This webinar is an ideal resource for wet-lab biologists with little experience in programming or the Unix/Linux environment. It may also help any bioinformatic beginners looking to learn ChIP-seq analysis skills for the future.

Topics covered:

This webinar will give you a ChIP-seq workflow that you can run with your own dataset. All ChIP-seq data analysis software included has either a web-based interface or a graphical user interface, so no command-line experience is necessary. However, it is highly recommended that wet-lab biologists learn some basics of Linux usage, as the pipeline outlined in this webinar can be followed in just a few single commands.

Questions covered in this webinar:

About the presenter:

Xi Chen completed his undergraduate studies at the Health Science Centre in Peking University, China. He obtained his PhD in the lab of Professor Andy Sharrocks at the University of Manchester, where he investigated the DNA binding specificity of several forkhead transcription factors using both traditional biochemical assays, and state-of-the-art genomic approaches. You can read more about that research here.

Xi is now a postdoctoral fellow in the lab of Dr. Sarah Teichmann at EBI and Sanger in Hinxton. His work focuses on understanding how transcription factors control the fate decision of mouse T helper cells by integrating TF binding data and gene expression data.

Moderators:

Miriam Ferrer Ph.D., Product Manager for Cellular Assays, Abcam

Vicky Yang, Marketing Co-Ordinator, Abcam

Video Transcript

  • 00:00 - 00:06: Hello. Welcome to Abcam’s webinar on a step-by-step guide to ChIP-seq data analysis.
  • 00:07 - 00:12: Today’s principal speaker is Xi Chen, postdoctoral fellow in the lab of Dr. Sarah
  • 00:12 - 00:19: Teichmann at EBI and Sanger in Hinxton. Xi did his undergraduate studies at the Health Science
  • 00:19 - 00:25: Center in Peking University. He obtained his PhD in Professor Andy Sharrocks’s lab
  • 00:25 - 00:31: at the University of Manchester, where he investigated the DNA binding specificity of
  • 00:31 - 00:38: several forkhead transcription factors using both traditional biochemical assays and state-of-the-art
  • 00:38 - 00:44: genomic approaches. Xi’s work focuses on understanding how transcription factors
  • 00:44 - 00:50: control the fate decision of mouse T helper cells by integrating TF binding data and gene
  • 00:50 - 00:58: expression data. Joining Xi today will be Miriam Ferrer, Product Manager for Cellular Assays at
  • 00:58 - 01:05: Abcam. Miriam completed her biology degree at the University of Barcelona and has a PhD from
  • 01:05 - 01:13: Free University in Amsterdam. After completing her PhD, she joined the MRC Laboratory of Molecular
  • 01:13 - 01:19: Biology in Cambridge. I will now hand over to Xi, who will start this webinar. Xi?
  • 01:21 - 01:25: Thank you, Vicky, for the introduction, and thank you all for coming to the webinar.
  • 01:26 - 01:32: So this webinar is more like a technical tutorial of how to perform ChIP-seq data analysis,
  • 01:32 - 01:39: especially for wet-lab biologists who do not have much programming or Linux experience.
  • 01:40 - 01:45: It is also useful for bioinformatics beginners who need to perform ChIP-seq data analysis in the future.
  • 01:47 - 01:52: In this webinar, the starting point is that you’ve got your raw sequencing reads files,
  • 01:53 - 01:58: most likely FASTQ files, and we will only cover sequencing data from the Illumina platform.
  • 01:59 - 02:04: I will show you how to align or map the reads to a reference genome,
  • 02:05 - 02:10: how to perform peak calling to identify enriched regions for the protein of interest,
  • 02:12 - 02:19: and how to find DNA motifs enriched within the binding region, and how to assign binding sites
  • 02:19 - 02:24: to genes and get enriched gene ontology terms to discover potential biological functions.
  • 02:25 - 02:32: After this webinar, the audience should be able to get a general idea about what ChIP-seq datasets
  • 02:32 - 02:39: look like and what you can expect from ChIP-seq data. Let me give you an outline of this
  • 02:39 - 02:46: webinar first. We will cover all the basic ChIP-seq routines by using the software listed
  • 02:46 - 02:54: in this slide. All the software here either has a web-based interface, like Galaxy and GREAT,
  • 02:55 - 03:00: or a graphical user interface, like FastQC and seqMINER.
  • 03:01 - 03:04: So, no command-line experience is needed here.
  • 03:06 - 03:14: The datasets I’m using here are FOXA1 ChIP-seq experiments. FOXA1 is a forkhead transcription
  • 03:14 - 03:19: factor, and it is one of the first transcription factors that has been ChIPed genome-wide.
  • 03:19 - 03:24: We will do a side-by-side comparison of a successful FOXA1 ChIP-seq experiment
  • 03:25 - 03:33: and a failed one. The successful FOXA1 ChIP-seq data is from the MCF7 breast cancer
  • 03:33 - 03:39: cell line published in Nature Genetics from Jason Carroll’s lab, shown on the left-hand side.
  • 03:40 - 03:47: In this paper, the authors discovered that FOXA1 is absolutely required for estrogen
  • 03:47 - 03:55: receptor DNA binding and activity in breast cancer cells. The failed one is from the LOVO
  • 03:55 - 04:01: colon cancer cell line published in Cell from Jussi Taipale’s lab, shown on the right-hand side.
  • 04:02 - 04:07: In this paper, the authors found that transcription factor binding tends to
  • 04:07 - 04:13: occur in a highly clustered manner, and most clusters contain cohesins.
  • 04:14 - 04:17: They performed hundreds of transcription factor ChIP-seq experiments,
  • 04:18 - 04:24: and FOXA1 is one of them. However, the FOXA1 ChIP-seq dataset didn’t pass the
  • 04:24 - 04:32: authors’ QC standards, so they labeled this as a failed experiment. This provides a perfect
  • 04:32 - 04:38: opportunity for us to get an idea about what a failed experiment looks like, which is sometimes
  • 04:38 - 04:45: even more important than seeing a successful one. Thanks to the authors for making this data publicly available.
  • 04:48 - 04:54: Now, we will start with QC of your sequencing reads using a software called FastQC.
  • 04:55 - 05:00: It is developed at the Babraham Institute, and you can download it
  • 05:00 - 05:03: from the website indicated on the top of the slide.
  • 05:03 - 05:10: As you can see here, FastQC can be run interactively, so you are able to locate
  • 05:10 - 05:15: your sequencing reads files and load them into FastQC by simply clicking your mouse.
  • 05:17 - 05:25: Here, we have four files, two for the FOXA1 MCF7 dataset, and the other two for the FOXA1 LOVO
  • 05:26 - 05:33: dataset. For each dataset, we have a FOXA1 ChIP sample and a control sample.
  • 05:35 - 05:41: In the MCF7 dataset, the control sample is the input control, while in the LOVO dataset,
  • 05:42 - 05:50: the control is the IgG control. Once you load the files into FastQC, you should see something
  • 05:50 - 05:56: like this in the middle of the slide. It gives you many QC metrics of your sequencing reads.
  • 05:58 - 06:03: The usage of FastQC is beyond the scope of this webinar, and we won’t cover all the metrics in
  • 06:03 - 06:13: the program. Here, we will only focus on three metrics. The first one is basic statistics,
  • 06:14 - 06:19: which contains some basic information about your sequencing reads, like read length and
  • 06:19 - 06:25: number of total reads. Well, perhaps the most important piece of information here is the
  • 06:25 - 06:33: encoding. It tells you how the quality scores of the sequencing reads are encoded. The FOXA1
  • 06:33 - 06:42: MCF7 experiment is encoded as Sanger / Illumina 1.9, and the FOXA1 LOVO experiment as
  • 06:42 - 06:49: Illumina 1.5. We will need this information later, and I will come back to it then.
  • 06:51 - 07:01: The second metric is read quality. In this graph, the X-axis shows all the individual bases from the
  • 07:01 - 07:09: reads. For every base, the Y-axis is the box plot of the distribution of quality scores of all reads
  • 07:09 - 07:17: across that particular base. The yellow boxes show from the 25th to 75th percentile,
  • 07:18 - 07:23: and the red line indicates the median, and the blue line indicates the mean of the quality scores.
  • 07:25 - 07:28: You really want the quality scores well above 20 to be good bases.
  • 07:30 - 07:37: In the FOXA1 LOVO data on the right, the quality drops towards the end of the reads.
  • 07:37 - 07:42: In this case, you might want to trim the last few, or even five, bases of your reads
  • 07:42 - 07:48: before downstream analysis. In this webinar, we won’t do this just for simplicity.
  • 07:50 - 07:55: The third metric we are looking at is sequence duplication level.
  • 07:57 - 08:01: It tells you how unique your reads are within the library,
  • 08:02 - 08:04: and this is a measurement for library complexity.
  • 08:06 - 08:12: Due to the random nature of sequencing, what you expect to see is that most sequencing reads
  • 08:12 - 08:21: only occur once, like the FOXA1 MCF7 data on the left. However, in the FOXA1 LOVO data on
  • 08:21 - 08:28: the right, the majority of reads occur twice. A lot of them also occur four times.
  • 08:29 - 08:37: The overall percentage of unique sequencing reads is shown above the graph here. The FOXA1 MCF7 data
  • 08:37 - 08:45: have more than 70% unique sequences, while the FOXA1 LOVO data have only around 44%
  • 08:45 - 08:50: unique sequencing reads. The total number of reads in the FOXA1 LOVO data set
  • 08:51 - 08:59: is more than 48 million. So, 44% indicates that there are about 21 million unique reads,
  • 08:59 - 09:06: which is quite okay, actually. Now, it is worth mentioning that the FastQC results are only
  • 09:06 - 09:11: indications of the quality of your sequencing, but they cannot tell you whether your ChIP
  • 09:11 - 09:19: experiments work or not. That’s what we are going to do next. Now, after getting a rough idea about
  • 09:19 - 09:24: the sequencing quality, we are ready to align the reads to the reference genome.
  • 09:25 - 09:33: To do this, we will use Bowtie in the Galaxy platform. Galaxy is a user-friendly online
  • 09:33 - 09:39: system with many integrated bioinformatics tools. After registering and logging into the Galaxy
  • 09:39 - 09:46: website, which is usegalaxy.org, you need to upload your FASTQ files to the Galaxy server.
  • 09:47 - 09:54: Normally, you just click the upload file under the Get Data tab on the left. You can choose the
  • 09:54 - 10:02: file from your computer to upload. However, note here, if the files are too large, they will fail
  • 10:02 - 10:08: to upload via browser. So, it is better to upload your FASTQ files via Galaxy FTP.
  • 10:08 - 10:13: Any FTP client will do the trick. Here, I’m using FileZilla as an example.
  • 10:14 - 10:22: In the host, fill in the address of the Galaxy website, which is usegalaxy.org. The username
  • 10:22 - 10:28: will be your email address when you registered your account on the Galaxy website. And then,
  • 10:29 - 10:37: fill in the password. Simply click Quick Connect. You will see some log information below
  • 10:37 - 10:45: indicating you have successfully connected to the Galaxy FTP. The left-hand side shows the
  • 10:45 - 10:52: files on your local computer, and the right-hand side shows your files on the Galaxy FTP. There are
  • 10:52 - 11:00: none at the moment. So now, locate your FASTQ files on your computer on the left,
  • 11:01 - 11:06: then simply drag them to the right. The files will start to upload.
  • 11:07 - 11:14: It is also recommended to zip your FASTQ files to save uploading time. The upload speed is,
  • 11:14 - 11:19: of course, not as fast as your local network, but it is still quite decent.
  • 11:20 - 11:24: When it is done, the files uploaded will be shown on the right.
  • 11:26 - 11:32: Then, if you go to the Galaxy website, the files you just uploaded will be shown in the middle.
  • 11:33 - 11:38: Now, we are ready to upload the files from the FTP to the Galaxy server.
  • 11:40 - 11:46: Simply tick the file you are going to upload. Then, in the File Format drop-down menu,
  • 11:46 - 11:54: choose the right file format. Our files are FASTQ files, and you can see here,
  • 11:55 - 12:02: there are actually four different kinds of FASTQ files. The FASTQ-CS-Sanger format is for
  • 12:02 - 12:06: color-space SOLiD sequencing platforms, and we won’t cover this in this webinar.
  • 12:08 - 12:14: So, the formats we should consider are FASTQ-Illumina, FASTQ-Sanger, and FASTQ-Solexa.
  • 12:15 - 12:21: The main differences among these three FASTQ formats are the ways they encode sequencing quality.
  • 12:23 - 12:27: The FASTQ-Sanger format uses the Phred score to represent sequencing quality,
  • 12:28 - 12:34: and it uses the ASCII character whose code is the Phred score plus 33 to encode the quality score.
  • 12:35 - 12:41: And FASTQ files from Illumina pipeline 1.8 and after are encoded in this way.
  • 12:41 - 12:47: I will show you later what exactly this means. If your sequencing experiments were done recently,
  • 12:48 - 12:51: most likely your FASTQ files will be in FASTQ-Sanger format.
  • 12:53 - 13:01: The FASTQ-Illumina format uses the Phred score as well, but it encodes the quality as the ASCII
  • 13:01 - 13:10: character whose code is the Phred score plus 64. FASTQ files from Illumina pipeline 1.3 and later, but before
  • 13:11 - 13:16: 1.8, will be in this format. If you remember from the previous FastQC results,
  • 13:18 - 13:25: the FOXA1 LOVO ChIP sample is from Illumina pipeline 1.5, so it should be in FASTQ-Illumina
  • 13:25 - 13:33: format. The other three samples are all from Illumina pipeline 1.9, so they will be in
  • 13:33 - 13:40: FASTQ-Sanger format. The FASTQ-Solexa format uses a different system to encode the quality score.
  • 13:41 - 13:46: And it is highly unlikely that your data files will be in this format unless you are analyzing
  • 13:46 - 13:54: some very old Solexa data. More information about this can be found in this paper from NAR
  • 13:54 - 13:57: and from the Wikipedia page at the bottom.
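
The two offsets described above can be illustrated with a minimal Python sketch. It is not taken from Galaxy or any sequencing pipeline; the function name `encode_quality` and the example scores are hypothetical and only show how the same Phred scores end up as different characters under the two encodings.

```python
# Offsets as described in the talk.
SANGER_OFFSET = 33    # FASTQ-Sanger (Illumina pipeline 1.8 and later)
ILLUMINA_OFFSET = 64  # FASTQ-Illumina (pipeline 1.3 and later, before 1.8)


def encode_quality(phred_scores, offset):
    """Turn a list of Phred scores into a FASTQ quality string (hypothetical helper)."""
    return "".join(chr(score + offset) for score in phred_scores)


# Made-up Phred scores for a three-base read.
scores = [30, 20, 2]
sanger_string = encode_quality(scores, SANGER_OFFSET)
illumina_string = encode_quality(scores, ILLUMINA_OFFSET)

print(sanger_string)    # same scores, Phred + 33
print(illumina_string)  # same scores, Phred + 64
```

The same read quality therefore looks different in the two formats, which is why the format has to be declared correctly when uploading.
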
  • 13:59 - 14:06: So now tick the file and choose the right file format. Click the Execute button.
  • 14:06 - 14:11: The file will start to upload from the FTP to the Galaxy server.
  • 14:12 - 14:17: After they are done, they will appear on the right-hand side with green color.
  • 14:19 - 14:25: If you click the little eye button here, you will see what is actually in the file in the middle.
  • 14:26 - 14:35: We use the FOXA1 MCF7 data as an example. Basically, the FASTQ file is just a simple text
  • 14:35 - 14:43: file. Each sequencing read is represented by four lines of text. There are about 27 million reads in
  • 14:43 - 14:52: this sample, so there are about 108 million lines of text in this file. Let’s look into details of
  • 14:52 - 14:59: one read shown here on the top in the black box. The first line is the name of the read,
  • 15:00 - 15:08: and it must start with an at character (@). The second line is the actual DNA sequence of the read.
  • 15:09 - 15:13: The third line means nothing, but it must start with the plus character (+),
  • 15:14 - 15:20: and only this plus character is required. The rest of the line is optional.
  • 15:21 - 15:26: Since I got the data from SRA, this line is the same as the first line except for the beginning
  • 15:26 - 15:33: character, and most FASTQ files only have a plus character at the third line of each read.
  • 15:34 - 15:38: In this way, you reduce the file size and save storage space.
  • 15:39 - 15:45: The fourth line is the ASCII string encoding the quality score at each base.
  • 15:46 - 15:53: Here is an ASCII table that you can easily find on the internet, and let’s look into each base
  • 15:53 - 16:01: of this read. So the first base of this read is A, and the quality character for this base is B.
  • 16:02 - 16:10: By looking at the ASCII table, we can find that the ASCII code for the character B is 66.
  • 16:11 - 16:19: Since this is FASTQ-Sanger format, the Phred score will be 66 minus 33, which is 33.
  • 16:20 - 16:27: A Phred score of 33 indicates that the error probability of this base is 0.0005.
  • 16:29 - 16:36: And the second base of this read is G, and the quality character for this base is the at character (@).
  • 16:36 - 16:44: Again, by looking at the ASCII table, we find that the ASCII code for the at character is 64.
  • 16:45 - 16:51: Then the Phred score for this base will be 64 minus 33, which is 31.
  • 16:52 - 16:59: A Phred score of 31 indicates that the error probability for this base is 0.0008.
  • 17:00 - 17:04: And you can check every base within this read by yourself in this way.
  • 17:05 - 17:10: And that is basically how you read and interpret the data from FASTQ files.
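
The manual lookup just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the read below is made up (it is not the read on the slide), the helper names are hypothetical, and the error probability uses the standard Phred relation P = 10^(-Q/10).

```python
def parse_fastq_record(lines):
    """Split one four-line FASTQ record into (name, sequence, quality)."""
    name, sequence, separator, quality = (line.rstrip("\n") for line in lines)
    if not name.startswith("@"):
        raise ValueError("first line of a record must start with '@'")
    if not separator.startswith("+"):
        raise ValueError("third line of a record must start with '+'")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality string differ in length")
    return name[1:], sequence, quality


def phred_from_char(char, offset=33):
    """ASCII code minus the offset (33 for FASTQ-Sanger)."""
    return ord(char) - offset


def error_probability(phred):
    """Standard Phred relation: P = 10 ** (-Q / 10)."""
    return 10 ** (-phred / 10)


# A made-up record whose first two quality characters have ASCII codes 66 and 64.
record = ["@example_read\n", "AGCT\n", "+\n", "B@II\n"]
name, sequence, quality = parse_fastq_record(record)

for base, char in zip(sequence, quality):
    score = phred_from_char(char)
    print(base, char, score, round(error_probability(score), 4))
```

For a FASTQ-Illumina file, the same functions would be called with `offset=64` instead.
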
  • 17:10 - 17:14: Now, we will use Bowtie to align the sequencing reads to the genome.
  • 17:14 - 17:19: But the Bowtie software in Galaxy only supports FASTQ Sanger files,
  • 17:20 - 17:26: so we need to convert the FOXA1 LOVO ChIP file from FASTQ-Illumina to FASTQ-Sanger.
  • 17:27 - 17:34: To do this, on the left-hand side, click FASTQ Groomer under the NGS QC and Manipulation tab.
  • 17:34 - 17:40: Then in the middle, choose the file you are going to convert, and specify the input format,
  • 17:42 - 17:48: and hit Execute. After it’s done, it will appear on the right with green color.
  • 17:49 - 17:55: And you can click the pencil button here to rename it to something meaningful, like this.
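
Conceptually, this conversion only re-bases each quality character from an offset of 64 to an offset of 33. The sketch below illustrates that idea in Python; it is not the FASTQ Groomer implementation, and the function name `illumina_to_sanger` is hypothetical.

```python
def illumina_to_sanger(quality):
    """Re-encode a Phred+64 quality string as Phred+33 (illustrative only)."""
    converted = []
    for char in quality:
        phred = ord(char) - 64           # decode with the FASTQ-Illumina offset
        if phred < 0:
            raise ValueError(f"{char!r} is not a valid Phred+64 quality character")
        converted.append(chr(phred + 33))  # re-encode with the FASTQ-Sanger offset
    return "".join(converted)


# Made-up quality string encoding the Phred scores 30, 20 and 2 with offset 64.
print(illumina_to_sanger("^TB"))
```

Only the fourth line of each read changes; the read names and DNA sequences stay the same.
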
  • 17:58 - 18:02: Finally, we are ready to align or map the sequencing reads to the genome.
  • 18:02 - 18:08: To align or map the sequencing reads to the reference genome, we will be using the Bowtie
  • 18:08 - 18:15: program. So on the left-hand side, click Map with Bowtie for Illumina under the NGS Mapping tab.
  • 18:16 - 18:24: Then in the middle, select the reference genome. In this case, all samples from both experiments
  • 18:24 - 18:30: are from human cells. So we choose hg19 here, which is not the latest,
  • 18:30 - 18:37: but it is a stable version of the Human Genome Assembly. For the input file, choose the FASTQ
  • 18:37 - 18:44: file you want to map. And in the Bowtie setting, choose Full Parameter List. Then you will get
  • 18:44 - 18:52: many options. Now we will keep everything at default, except change the “-m” option to 1.
  • 18:53 - 19:00: So what does this mean? To put it simply, some reads can be mapped to multiple
  • 19:00 - 19:05: locations within the genome. And you can handle these reads differently according to your needs.
  • 19:06 - 19:13: But now, we put 1 here, which simply discards those reads. This is a good starting point for
  • 19:13 - 19:20: beginners. And actually, many people are still doing this when they analyze DNA-centric sequencing
  • 19:20 - 19:29: data, like ChIP-seq, DNase-seq, and ATAC-seq. So you do the same for the other three files,
  • 19:29 - 19:33: and after they are finished, they will appear on the right-hand side in green color.
  • 19:34 - 19:38: As you can see here, I already renamed them to something meaningful.
  • 19:41 - 19:47: Again, you can click the eye button to actually see what’s inside the output alignment file.
  • 19:48 - 19:55: The output alignment file from Bowtie is called a SAM file. The SAM file is also a simple text file
  • 19:56 - 20:02: with each line representing a read, containing the location in the genome to which it is mapped
  • 20:02 - 20:09: and other information. The details about the SAM format can be found in the PDF link shown at the
  • 20:10 - 20:19: top. Now, after mapping, we are going to perform peak calling to identify FOXA1 binding sites.
  • 20:21 - 20:28: However, the SAM files at the moment contain both mapped and unmapped reads. We only want
  • 20:28 - 20:32: the mapped reads for the peak calling, so we need to remove all the unmapped reads.
  • 20:33 - 20:40: To do this, we will use SAMtools, shown on the left-hand side. Just click Filter SAM or BAM,
  • 20:41 - 20:48: then in the middle, choose the file you want to filter, and set Filter on bitwise flag to Yes.
  • 20:50 - 20:56: Then you will get more options. Under Skip alignments with any of these flag bits set,
  • 20:57 - 21:04: tick The read is unmapped. So in this way, SAMtools will remove all the unmapped reads.
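
What this filter does can be sketched in Python. In the SAM format, the second column of each alignment line is a bitwise flag, and bit 0x4 marks a read as unmapped. The example below is a simplified illustration on made-up SAM lines, not how SAMtools is implemented; the function names are hypothetical.

```python
UNMAPPED_FLAG = 0x4  # SAM flag bit meaning "the read is unmapped"


def is_mapped(sam_line):
    """Return True when the unmapped bit is not set in the FLAG column."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    return (flag & UNMAPPED_FLAG) == 0


def drop_unmapped(sam_lines):
    """Keep header lines (starting with '@') and mapped reads only."""
    for line in sam_lines:
        if line.startswith("@") or is_mapped(line):
            yield line


# Made-up SAM content: one header line, two mapped reads, one unmapped read.
toy_sam = [
    "@SQ\tSN:chr12\tLN:1000",
    "read_1\t0\tchr12\t100\t255\t4M\t*\t0\t0\tACGT\tIIII",
    "read_2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "read_3\t16\tchr12\t500\t255\t4M\t*\t0\t0\tACGT\tIIII",
]

kept = list(drop_unmapped(toy_sam))
print(len(kept))
```
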
  • 21:05 - 21:10: After it is finished, the output will appear on the right in green color,
  • 21:12 - 21:19: and I already renamed them to something meaningful. By default, SAMtools outputs BAM
  • 21:19 - 21:26: files. The BAM file is a binary file, not a text file, so you won’t be able to view the content.
  • 21:27 - 21:33: You can simply think of the BAM file as just a compressed SAM file. It significantly reduces
  • 21:33 - 21:43: the file size. Okay, now we are ready to perform peak calling. For this, we will be using a very
  • 21:43 - 21:50: popular peak caller called MACS, developed in Shirley Liu’s lab. On the left-hand side, under the
  • 21:50 - 21:59: NGS peak calling tab, click MACS to use it. In the middle, there are many options, but we will only
  • 21:59 - 22:07: change a few of them. First, give it a meaningful name, then choose the ChIP file, which is the
  • 22:07 - 22:15: FOXA1 ChIP file, and the control file, which is the matching control experiment: input control for
  • 22:15 - 22:26: MCF7, IgG control for LOVO cells. The default genome size is 2.7 billion base pairs, and this
  • 22:26 - 22:32: is for human cells. For different organisms, check the MACS website for the genome size,
  • 22:33 - 22:37: then change the tag size to your sequencing read length.
  • 22:39 - 22:47: Tick the option to parse the output files into interval files. We will see this later. Then, choose save wig
  • 22:47 - 22:54: files so that you will get the binding signal files later for visualization. Then last,
  • 22:54 - 22:59: choose do not build the shifting model, and 100 base pairs for arbitrary shift size.
  • 22:59 - 23:06: So, why change these two options? As you know, for a regular ChIP-seq experiment,
  • 23:06 - 23:12: the sequencing reads you get are from the ends of your fragment, which are not the actual TF
  • 23:12 - 23:17: binding sites. The actual TF binding sites are somewhere in the middle of the fragment.
  • 23:19 - 23:25: By default, MACS will estimate the fragment length and shift the reads towards the middle by half of
  • 23:26 - 23:35: the fragment length to represent the actual TF binding. However, for some reason, sometimes it
  • 23:35 - 23:42: cannot reliably estimate the fragment size. A common practice is to simply disable this
  • 23:42 - 23:50: and ask MACS to shift reads to a certain distance. In this case, we choose 100 base pairs,
  • 23:51 - 23:57: which is the default when you disable the model build, and it works quite well.
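
The idea of a fixed shift can be illustrated with a short Python sketch. It is a simplification for intuition only, not MACS code: the function name is hypothetical, and it assumes each read is summarized by the position of its 5' end and its strand, with plus-strand reads moved downstream and minus-strand reads moved upstream.

```python
def shifted_position(five_prime_end, strand, shift_size=100):
    """Move a read's 5' end towards the presumed centre of its fragment."""
    if strand == "+":
        return five_prime_end + shift_size
    if strand == "-":
        return five_prime_end - shift_size
    raise ValueError("strand must be '+' or '-'")


# Two made-up reads from opposite ends of a 200 bp fragment spanning 1000 to 1200.
forward_read = shifted_position(1000, "+")
reverse_read = shifted_position(1200, "-")
print(forward_read, reverse_read)
```

After shifting, reads from both ends of the same fragment pile up around the middle, which is where the binding site is expected to be.
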
  • 23:59 - 24:06: Now, click execute. After it is finished, you will get some outputs on the right-hand side
  • 24:06 - 24:12: in green color, and the number of output files depends on your MACS setting.
  • 24:12 - 24:17: Well, in this case, we have six output files for each experiment.
  • 24:18 - 24:24: Now, let’s have a look at what each file is. The first file is the HTML report containing the
  • 24:24 - 24:32: running log of MACS. The second and third files are WIG files, containing the signal intensity
  • 24:32 - 24:40: across the genome. The treatment WIG file is your ChIP sample signal, and the control WIG file is
  • 24:41 - 24:50: your control sample signal. Again, the WIG files are simple text files. We won’t look at the
  • 24:50 - 24:58: negative peaks interval files here. The fifth file here is a peak interval file containing the
  • 24:58 - 25:07: enriched region of your protein, in this case, FOXA1. The last file is a bed file for visualizing
  • 25:07 - 25:14: peak positions in a genome browser, and both the interval and bed files
  • 25:14 - 25:20: are simple text files. If you click the name of the interval file,
  • 25:21 - 25:26: you can download this interval file to your local computer, and it can be opened
  • 25:26 - 25:32: and easily handled and manipulated by any spreadsheet software like Excel.
  • 25:33 - 25:38: This is probably one of the most important output files. It contains all the information
  • 25:38 - 25:44: you need to know about the binding site, including the location and the statistics.
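
The interval file can also be handled with a short script instead of a spreadsheet. The Python sketch below sorts peaks by FDR (smallest first), then by fold enrichment (largest first), and keeps those at or below a 1% FDR cutoff. The column names and the toy rows are assumptions made for illustration, including the assumption that FDR is stored as a percentage; check the header of your own file before adapting it.

```python
import csv
import io

# Made-up, tab-delimited content standing in for a MACS interval file.
TOY_INTERVAL_FILE = (
    "chr\tstart\tend\tsummit\tfold_enrichment\tfdr_percent\n"
    "chr1\t1000\t1400\t200\t35.2\t0.0\n"
    "chr2\t5000\t5300\t150\t4.1\t3.5\n"
    "chr12\t800\t1200\t210\t120.7\t0.0\n"
    "chr3\t100\t400\t90\t8.0\t0.9\n"
)


def load_peaks(handle):
    """Read peaks into a list of dictionaries with numeric fields converted."""
    peaks = []
    for row in csv.DictReader(handle, delimiter="\t"):
        peaks.append({
            "chr": row["chr"],
            "start": int(row["start"]),
            "end": int(row["end"]),
            "summit": int(row["summit"]),
            "fold_enrichment": float(row["fold_enrichment"]),
            "fdr_percent": float(row["fdr_percent"]),
        })
    return peaks


def rank_and_filter(peaks, max_fdr_percent=1.0):
    """Keep peaks passing the FDR cutoff, sorted by FDR then fold enrichment."""
    kept = [peak for peak in peaks if peak["fdr_percent"] <= max_fdr_percent]
    return sorted(kept, key=lambda peak: (peak["fdr_percent"], -peak["fold_enrichment"]))


peaks = load_peaks(io.StringIO(TOY_INTERVAL_FILE))
ranked = rank_and_filter(peaks)
print(len(ranked), [peak["chr"] for peak in ranked])
```
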
  • 25:45 - 25:53: Well, at this stage, the peak calling is done. A common and tricky question that most beginners
  • 25:53 - 25:59: have to face at this stage is that, how can I tell whether my experiment works or not?
  • 25:59 - 26:04: Actually, there are many things you can check, but the following three things are the most
  • 26:04 - 26:11: efficient ways of telling whether the ChIP experiment works or not. First, by looking
  • 26:11 - 26:20: at the interval file. Second, by visual inspection of the binding signal. And third, by looking
  • 26:20 - 26:26: at the enriched motif within the binding site. We will now go through this one by one.
  • 26:29 - 26:37: We will start with examining the interval files. To do this, you sort first by FDR,
  • 26:37 - 26:46: smallest to largest. Then, by fold enrichment, largest to smallest. The first thing we should
  • 26:46 - 26:54: look at is the number of binding sites. If you choose an FDR cutoff at 1%, which is commonly
  • 26:54 - 27:06: used, there are about 42,000 peaks left in the FOXA1 MCF7 data, but only 430 peaks left
  • 27:07 - 27:14: in the FOXA1 LOVO data. Of course, the number of binding sites depends on the protein,
  • 27:14 - 27:20: the peak caller, the cell line, and many other things. But typically, the number you should
  • 27:20 - 27:28: expect is from around 1,000 to several thousand, or even several tens of thousands. As you can see
  • 27:28 - 27:37: here, the FOXA1 LOVO dataset apparently does not satisfy the expectation. The second thing
  • 27:37 - 27:44: we could look at is the range of the fold enrichment over a local background. The FOXA1
  • 27:44 - 27:53: MCF7 data have a wide range, from more than 300-fold to about 4-fold, while the largest
  • 27:53 - 28:00: FOXA1 LOVO peak has only around 20-fold enrichment, indicating the signal-to-noise
  • 28:00 - 28:08: ratio in this data is really low. The third thing we could look at is the number of sequencing reads
  • 28:08 - 28:17: within the peak region. Again, the FOXA1 MCF7 data have a much wider range than the FOXA1 LOVO
  • 28:17 - 28:25: data. So these three metrics we just went through clearly suggest that there is something wrong with
  • 28:25 - 28:33: FOXA1 LOVO datasets. And the next thing we could do to judge the quality of the data is to visualize
  • 28:33 - 28:39: the binding signal, and we found this seems to be the most efficient way of telling whether a
  • 28:39 - 28:48: ChIP-seq experiment works or not. So the WIG files generated by MACS are the signal files,
  • 28:51 - 28:57: and you can visualize them directly in the genome browser. However, for large datasets,
  • 28:57 - 29:03: it is highly recommended to convert the WIG files to bigWIG files before visualization.
  • 29:05 - 29:12: The bigWIG format is an indexed binary format. When you visualize the bigWIG files, only the
  • 29:12 - 29:18: portions of the files being displayed are transferred to the genome browser, which is much
  • 29:18 - 29:26: faster than loading the WIG files. So to convert WIG files to bigWIG, on the left-hand side,
  • 29:27 - 29:33: click the WIG to bigWIG under the Convert Format tab. Then in the middle,
  • 29:34 - 29:39: choose the WIG file you are going to convert, and click Execute.
  • 29:41 - 29:48: After it is done, change the file name to something meaningful, and click the name of the file.
  • 29:49 - 29:58: You will see an option to display at UCSC main. Then click that. A new tab or window of UCSC
  • 29:58 - 30:04: genome browser will appear. In the genome browser, to configure the visualization,
  • 30:05 - 30:10: set the display mode to full so that it will display the histogram of the signal.
  • 30:11 - 30:18: Set the vertical viewing range from 0 to a reasonable maximum value, like 100 or 200.
  • 30:19 - 30:25: And set data viewing scaling to use vertical viewing range setting, so that it won’t
  • 30:25 - 30:32: automatically rescale. You would be able to visualize the binding signal like shown here.
  • 30:34 - 30:40: Now, what kind of patterns are we expecting? There are a few things you can look at at this
  • 30:40 - 30:48: stage. You can have a look at whether the peak shape is normal, i.e., whether they are smooth
  • 30:48 - 30:55: bell-curve shaped peaks or some strange spikes. And if you have some known target genes,
  • 30:55 - 31:01: can you find some binding peaks in the promoters or near the transcription start sites?
  • 31:02 - 31:05: And finally, to get a bigger view of the binding pattern,
  • 31:06 - 31:08: you can look at the whole chromosome view.
  • 31:11 - 31:17: This is an example of signals of all four samples across the entire chromosome 12 of the human
  • 31:17 - 31:27: genome. The two control files are almost flat as you would expect them to be. The FOXA1-MCF7
  • 31:27 - 31:34: ChIP sample has many peaks, but the FOXA1-LOVO ChIP sample looks very much like its matching
  • 31:34 - 31:42: control sample. There are only a few very small peaks. Now, this is a clear indication
  • 31:42 - 31:48: that the FOXA1-LOVO ChIP experiment failed, or at least it is not optimal.
  • 31:49 - 31:53: So, that’s how you can judge the quality of the data visually.
  • 31:54 - 31:59: The next step is to identify enriched motifs within the binding sites.
  • 31:59 - 32:07: We will use a program called MEME-ChIP to do this. In order to use MEME-ChIP, we need to extract the
  • 32:07 - 32:14: DNA sequence from the peak coordinates to a FASTA file. Now, I need to clarify two terms before we
  • 32:14 - 32:23: go any further. When MACS is used for peak calling, peak means the whole enriched region from start to
  • 32:24 - 32:33: end, indicated by the black bar on the top left, while summit, indicated by the thin orange line,
  • 32:33 - 32:41: means the highest pileup point within the peak region. It is supposed to be the exact TF binding
  • 32:41 - 32:49: site. It is not a good idea to put too many sequences for the motif discovery, and a common
  • 32:50 - 32:57: practice for TF motif discovery is to use the 100 base pair region centered on the summit as the
  • 32:57 - 33:06: input for motif discovery. So, the column E in the MACS interval file is the distance of the summit
  • 33:06 - 33:14: to the start position. So, to create the coordinates of peak summit plus minus 50 base pairs,
  • 33:15 - 33:21: just create four new columns after the FDR column of the interval file.
  • 33:22 - 33:30: The first new column is the same as column A, which is the chromosome name. The second column
  • 33:30 - 33:41: will be column B plus column E minus 50, and the third column will be column B plus column E plus
  • 33:41 - 33:51: 50. In the fourth column, give a unique name for each individual region like this, and you can do
  • 33:51 - 34:00: this very easily using Excel. Now, save these four new columns as a tab-delimited text file.
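
The same four columns can be produced with a short Python sketch instead of a spreadsheet. It follows the arithmetic described above (peak start plus summit offset, minus and plus 50 base pairs). The peak values, region names and helper name are made up for illustration.

```python
def summit_window(chrom, peak_start, summit_offset, name, half_width=50):
    """Return (chrom, start, end, name) for the region centred on the summit."""
    summit = peak_start + summit_offset  # column B + column E in the talk
    return chrom, summit - half_width, summit + half_width, name


# Made-up peaks: (chromosome, peak start, distance of summit from peak start).
toy_peaks = [
    ("chr12", 1000, 210),
    ("chr1", 52000, 95),
]

rows = [
    summit_window(chrom, start, offset, f"region_{index}")
    for index, (chrom, start, offset) in enumerate(toy_peaks, start=1)
]

# Tab-delimited lines, one region per line, ready to be written to a text file.
lines = ["\t".join(str(value) for value in row) for row in rows]
print("\n".join(lines))
```
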
  • 34:00 - 34:09: Let’s call this file FOXA1-Summit-100BP.txt. For simplicity, I will only choose the top 1,000
  • 34:09 - 34:16: regions for motif discovery in the webinar. Now, we are going to extract the 100 base pair DNA
  • 34:16 - 34:25: sequence of this region using the coordinates. To do this, first upload this text file to the
  • 34:25 - 34:34: Galaxy server using a web browser. On the left-hand side, click Upload Data under the ‘Get Data’ tab.
  • 34:34 - 34:42: Then, in the middle, choose the file format as ‘Interval’. Choose the text file you just saved on
  • 34:42 - 34:54: your computer, and make sure the genome assembly is correct. After uploading, click ‘Extract Genomic DNA’
  • 34:54 - 35:02: under the ‘Fetch Sequences’ tab. Choose the file you just uploaded, and simply click ‘Execute’.
  • 35:04 - 35:10: When it is finished, you will see the DNA sequence within each 100 base pair region
  • 35:10 - 35:16: in a FASTA format, and you can save this FASTA file to your local computer.
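What the sequence-extraction step does can be illustrated offline with a toy example. This is a minimal sketch, assuming the genome is held in a dictionary that maps chromosome names to sequence strings, and that coordinates are 0-based and half-open. The server's own coordinate convention may differ, so check it before comparing outputs. The function `extract_fasta` and the `chrT` toy sequence are hypothetical.

```python
def extract_fasta(genome, regions, line_width=60):
    """Return FASTA text for a list of (chrom, start, end, name) regions.

    Assumption: start is 0-based and inclusive, end is exclusive.
    """
    records = []
    for chrom, start, end, name in regions:
        seq = genome[chrom][start:end]
        # Wrap long sequences so each FASTA line has at most line_width bases.
        lines = [seq[i:i + line_width] for i in range(0, len(seq), line_width)]
        records.append(f">{name} {chrom}:{start}-{end}\n" + "\n".join(lines))
    return "\n".join(records) + "\n"


# Toy genome for illustration only; a real genome would be read from a file.
toy_genome = {"chrT": "ACGTACGTAA" * 3}
print(extract_fasta(toy_genome, [("chrT", 2, 6, "region_1")]), end="")
```

Each record starts with a `>` header line followed by the sequence, which is the layout a motif discovery tool expects from a FASTA input.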
  • 35:17 - 35:27: Let’s call it FOXA1-Summit-100BP.FASTA. Now, go to the MEME-ChIP website using the address at the
  • 35:27 - 35:36: top of the slide. The options here are pretty much self-explanatory. In the Input, choose the FASTA
  • 35:37 - 35:44: file you just downloaded. Then, enter your email address and give a job name.
  • 35:45 - 35:51: The default setting is always a good starting point, so you just simply click Start Search.
  • 35:52 - 35:56: Then, you will receive an email with a link to retrieve the results.
  • 35:58 - 36:04: The results page will look like this. The motif found at the left-hand side
  • 36:04 - 36:08: is a de novo motif that is enriched within your input regions.
  • 36:09 - 36:16: And the known or similar motif on the right-hand side indicates the similarity between this motif
  • 36:16 - 36:22: with known TF motifs. Now, shown on the screen is the FOXA1-MCF7 data.
  • 36:23 - 36:32: The top motif is a typical forkhead motif with a very small E value. This is pretty much what you expect
  • 36:32 - 36:36: from a successful experiment. There are also some other motifs returned,
  • 36:37 - 36:44: indicating potential interaction with other transcription factors. And if the family motif
  • 36:44 - 36:48: of the transcription factor you are checking is not enriched in the binding region,
  • 36:49 - 36:56: like the FOXA1-LOVO dataset we are going to show in the next slide, then you need to put a
  • 36:56 - 37:03: question mark on the data unless you have other strong evidence suggesting that the experiments
  • 37:03 - 37:11: are working. Now, shown on the screen is a motif result from the FOXA1-LOVO dataset.
  • 37:13 - 37:18: As you can see here, although the top two motifs look like the forkhead motif,
  • 37:19 - 37:26: the E values are very high. There are also two motifs with very low complexities,
  • 37:27 - 37:33: which are just simple repeats. This motif result further confirms that the FOXA1-LOVO
  • 37:33 - 37:40: ChIP experiment is not working. Now, we have finished the motif discovery step,
  • 37:41 - 37:44: and we have a better impression about the quality of the data.
  • 37:46 - 37:51: Another thing that most biologists are interested in is to assign the binding
  • 37:51 - 37:57: sites to genes and perform gene ontology analysis to find potential biological functions
  • 37:57 - 38:03: of transcription factors. To do this, we will be using a web tool called GREAT,
  • 38:04 - 38:09: developed in Gill Bejerano’s lab. It is very straightforward to use.
  • 38:10 - 38:14: Just go to the GREAT website using the address shown on the top right.
  • 38:14 - 38:22: Then in the genome assembly, choose the reference genome. In this case, it’s hg19.
  • 38:23 - 38:26: Then choose the peak file from your computer as test region.
  • 38:28 - 38:33: Since this is a ChIP-seq experiment, choose the whole genome as background.
  • 38:35 - 38:41: If you click show settings, you can see how exactly GREAT assigns the peaks to genes,
  • 38:42 - 38:45: and you can change it as you want. But according to the paper,
  • 38:45 - 38:50: the default setting seems to perform the best. So just click submit.
  • 38:52 - 39:00: The output looks like this. The top contains some basic information about peak gene association,
  • 39:01 - 39:06: and the bottom contains the enriched GO terms from specific categories,
  • 39:06 - 39:10: like molecular function, biological process, and cellular component,
  • 39:11 - 39:18: as well as a lot of information from many other databases. It is quite informative
  • 39:18 - 39:24: to look at those, but we won’t go through them here. One thing that is particularly useful is
  • 39:24 - 39:30: that if you click the job description on the top, this section will expand.
  • 39:32 - 39:37: And click view all genomic region gene association. You will see two tables.
  • 39:38 - 39:44: One gives you for each peak its associated genes, and the other gives you for each gene
  • 39:45 - 39:50: its associated peaks, together with the distance between the peak and the gene.
  • 39:51 - 39:58: And you can download them as a text file and save for future use. This is quite convenient.
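As a rough illustration of the two association tables, here is a minimal sketch that assigns each peak to its nearest transcription start site (TSS) and then inverts the result into a per-gene table. This nearest-TSS rule is a deliberate simplification and is not the web tool's own association rule, which is configurable under its settings. Strand is ignored, and the function names, gene names and coordinates are hypothetical.

```python
def nearest_gene(peaks, tss_by_chrom):
    """Assign each peak to the gene with the closest TSS.

    peaks: list of (chrom, start, end, name).
    tss_by_chrom: dict mapping chrom to a list of (tss_position, gene_name).
    Returns a list of (peak_name, gene_name, distance), where distance is the
    peak midpoint minus the TSS position (strand is ignored in this sketch).
    """
    results = []
    for chrom, start, end, name in peaks:
        midpoint = (start + end) // 2
        candidates = tss_by_chrom.get(chrom, [])
        if not candidates:
            results.append((name, None, None))  # no annotated gene on this chromosome
            continue
        tss, gene = min(candidates, key=lambda item: abs(midpoint - item[0]))
        results.append((name, gene, midpoint - tss))
    return results


def peaks_per_gene(assignments):
    """Invert the per-peak table into a per-gene table."""
    table = {}
    for peak, gene, distance in assignments:
        if gene is not None:
            table.setdefault(gene, []).append((peak, distance))
    return table


# Made-up annotation and peaks for illustration only.
toy_tss = {"chr1": [(100, "GENE_A"), (500, "GENE_B")]}
toy_peaks = [("chr1", 100, 200, "peak_1"), ("chr1", 400, 480, "peak_2")]
per_peak = nearest_gene(toy_peaks, toy_tss)
print(per_peak)
print(peaks_per_gene(per_peak))
```

The first table answers "which gene does this peak belong to", and the second answers "which peaks sit near this gene", mirroring the two downloadable tables described above.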
  • 39:59 - 40:02: Now we have finished the gene ontology analysis.
  • 40:02 - 40:08: Last but not least, a heat map is a good way of showing the binding signal.
  • 40:09 - 40:17: To do this, we will need the alignment files, which are the BAM files we created after the
  • 40:17 - 40:24: mapping earlier. So click download dataset to save the BAM files to your local computer.
  • 40:24 - 40:28: We also need a program called seqMINER, developed in László Tora’s lab.
  • 40:33 - 40:41: Just download seqMINER from the address at the top. Run it according to your operating system.
  • 40:43 - 40:45: Then you should see an interface like this.
  • 40:45 - 40:54: First, to load data into it, in the load reference section, choose the peak file.
  • 40:55 - 41:02: For example, the 100 base pair region text file we just created. Since this is a small file,
  • 41:03 - 41:07: it is immediately loaded, and you can see the information in the upper center.
  • 41:09 - 41:18: In the load aligned reads section, choose the BAM files just downloaded, and then click load files.
  • 41:18 - 41:25: These two files will be loaded into memory, and this seems to be the time-consuming part
  • 41:25 - 41:33: of all the analysis in seqMINER. After it is done, the datasets will be shown in the middle.
  • 41:35 - 41:42: When you click extract data, seqMINER will plot the reads from each alignment file you just loaded
  • 41:42 - 41:47: on top of your peak file and generate a heat map matrix.
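The idea behind a heat map matrix can be sketched in a few lines: for every region, count the reads that fall into fixed-size bins across a window centered on that region. This is a minimal sketch of the concept only and is not the program's actual algorithm. It assumes each read is represented by a single start position, and the function name, window size, bin size and toy data are all illustrative.

```python
def signal_matrix(regions, read_starts, half_window=500, bin_size=50):
    """Build one row of binned read counts per region.

    regions: list of (chrom, start, end).
    read_starts: dict mapping chrom to a list of read start positions.
    Each row covers half_window bases either side of the region midpoint.
    """
    n_bins = (2 * half_window) // bin_size
    matrix = []
    for chrom, start, end in regions:
        center = (start + end) // 2
        left = center - half_window
        row = [0] * n_bins
        for pos in read_starts.get(chrom, []):
            index = (pos - left) // bin_size
            if 0 <= index < n_bins:   # skip reads outside the window
                row[index] += 1
        matrix.append(row)
    return matrix


# Toy data for illustration only.
toy_regions = [("chr1", 1000, 1100)]
toy_reads = {"chr1": [950, 999, 1000, 1149, 1150, 900]}
print(signal_matrix(toy_regions, toy_reads, half_window=100, bin_size=50))
```

Each row of the matrix becomes one horizontal line of the heat map, and averaging the rows column by column gives the kind of density profile shown alongside it.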
  • 41:49 - 41:56: After it is done, the result will show up on the right. If you right-click the result,
  • 41:57 - 42:04: choose visualization of heat map. A heat map of the two files you loaded will be displayed.
  • 42:04 - 42:08: You will also have a density plot at the bottom right.
  • 42:10 - 42:15: The good thing about seqMINER is that you can load multiple experiments at once,
  • 42:15 - 42:22: and easily perform k-means clustering to find different binding patterns, and extract different
  • 42:22 - 42:27: clusters for more detailed analysis. But that is beyond the scope of this webinar.
  • 42:32 - 42:38: So far, that’s all the basic ChIP-seq routines done without any programming and command line involved.
  • 42:39 - 42:42: Hopefully, by running the workflow in the webinar on your own data by yourself,
  • 42:42 - 42:48: you will quickly get a general idea of what ChIP-seq datasets look like,
  • 42:48 - 43:03: and what we expect from ChIP-seq datasets. Here, I put some references, mainly from Nature Protocols, containing detailed step-by-step guidance, including troubleshooting tips, on the
  • 43:03 - 43:09: key aspects of ChIP-seq analysis. Some of them require some programming, or at least
  • 43:09 - 43:13: command line experience. Well, it is highly recommended for wet-lab biologists who are
  • 43:13 - 43:21: doing genomic experiments to learn some basics about Linux usage, and perform the data analysis yourself.
  • 43:22 - 43:34: It gives you a better understanding of the data itself. And once you’ve done that, everything we just went through in this webinar can be done with just a few simple commands.
  • 43:34 - 43:43: Okay, this is the end of my presentation. Now, I will hand over to Miriam, who will introduce some
  • 43:43 - 43:46: related products and protocols from Abcam.
  • 43:46 - 43:49: And then, I will come back to answer questions. Thank you.
  • 43:51 - 43:55: Thank you, Xi, for such an interesting and comprehensive presentation.
  • 43:56 - 44:01: And I would like to take this opportunity to talk to you about some up-to-date products and resources.
  • 44:02 - 44:08: We have just launched a high-sensitivity ChIP kit that has been specifically designed to be
  • 44:08 - 44:15: used when there is a limited amount of sample available to perform ChIP. For example, when
  • 44:15 - 44:22: working with patient material, transgenic mouse tissues, or stem cells. With the high-sensitivity
  • 44:22 - 44:29: ChIP kit, you can obtain enrichment of your sequence of interest by using as little as 2,000
  • 44:29 - 44:36: cells or half a milligram of tissue per reaction. You can get your result in only five hours,
  • 44:36 - 44:42: so the experiment can be done in one working day. And the eluted DNA can be used straightaway in
  • 44:42 - 44:51: sequencing, or microarray, or qPCR. Our high-sensitivity ChIP kit is part of our range of high-sensitivity
  • 44:51 - 44:57: products designed for when there is a limited amount of starting material.
  • 44:59 - 45:06: If you want to create a DNA library with a limited amount of DNA, our high-sensitivity DNA library
  • 45:06 - 45:13: preparation kit for Illumina sequencing allows you to create a DNA library from only 0.2 nanograms
  • 45:13 - 45:22: of DNA. And our ChIP-Seq high-sensitivity kit combines the benefits of the two kits I’ve just
  • 45:22 - 45:29: mentioned. This kit is designed to carry out a successful ChIP-Seq starting directly from as
  • 45:29 - 45:38: little as 10 to 5,000 cells. Bisulfite sequencing is the use of bisulfite treatment
  • 45:38 - 45:44: to determine the pattern of methylation. To help you to take your methylation pattern studies
  • 45:44 - 45:51: further, we now offer a couple of products to prepare post-bisulfite DNA libraries for Illumina.
  • 45:52 - 45:58: With our post-bisulfite DNA library preparation kit, you can prepare a library using
  • 45:59 - 46:05: pre-treated DNA in only five hours. If you haven’t treated your DNA,
  • 46:05 - 46:14: then we recommend our bisulfite-seq high-sensitivity kit, which has all the reagents to perform the
  • 46:14 - 46:21: bisulfite modification, followed immediately by a DNA library preparation step. The library is
  • 46:21 - 46:28: modified and ready in only six hours. With no delay, I will pass the microphone back to
  • 46:28 - 46:35: Xi, who is ready to answer some of your questions. Okay, thank you, Miriam.
  • 46:37 - 46:43: The first question is that, is paired-end or single-end read data better? Well,
  • 46:44 - 46:53: for most ChIP-seq data, single-end data is more prevalent. If you do
  • 46:54 - 46:59: paired-end sequencing for a ChIP experiment, of course, the cost will be much more than the
  • 47:00 - 47:06: single-end data. But if you look at the information you get from paired-end data,
  • 47:06 - 47:12: in terms of ChIP-seq, you don’t get a lot compared to the single-end data.
  • 47:12 - 47:18: So, in a cost-effective way, most people still do single-end data in terms of ChIP-seq.
  • 47:21 - 47:29: And the second question is that, is input or IgG a better control? Actually, they are both okay
  • 47:29 - 47:36: for ChIP-seq experiments as controls. Different people use different controls. For us, we prefer
  • 47:36 - 47:40: input as a control because it provides much more complex libraries.
  • 47:42 - 47:48: And the third question is that, is there a simple way to quantify changes in peak
  • 47:48 - 47:54: states between two ChIP-seq experiments using the same antibody, using chromatin
  • 47:54 - 47:59: from different cell types or conditions? Well, this is a very tricky and very good question.
  • 48:00 - 48:09: Now, in the ChIP-seq community, a lot of people have put a lot of effort
  • 48:09 - 48:16: into developing software to identify differentially bound peaks. But at the
  • 48:16 - 48:22: moment, there’s no consensus about which method to use. There are already a lot of methods out
  • 48:22 - 48:30: there. If you Google differentially bound ChIP-seq methods, then I’m sure you’ll get a lot.
  • 48:30 - 48:36: And you can have a look at the math behind the methods and choose by yourself.
  • 48:37 - 48:43: And another question is that, what are the differences between reads and peaks?
  • 48:44 - 48:53: So, the reads are the raw sequencing reads that you get from the sequencing machine containing
  • 48:53 - 48:59: the actual DNA sequences and the quality scores. Nowadays, they are usually between 50 base pairs
  • 48:59 - 49:04: to 100 base pairs long. After mapping, the mapped reads contain the genomic locations where
  • 49:04 - 49:11: the reads are aligned. So, the mapped reads will be the genomic coordinates of 50 base pairs to
  • 49:11 - 49:17: 100 base pairs long, depending on your sequencing length. While peaks are the genomic
  • 49:17 - 49:24: regions returned by the peak caller, they are the regions where the reads aggregate or pile up
  • 49:24 - 49:27: and pass certain statistical thresholds of your peak caller.
  • 49:29 - 49:35: And then, this is the last question I will answer. So, the question is, if the number of
  • 49:35 - 49:43: binding sites falls below 1,000, should I consider my experiment has failed? The short
  • 49:43 - 49:52: answer to this question is that, no, you cannot tell whether your ChIP-seq experiments fail or not
  • 49:52 - 49:56: solely based on the number of binding sites, because the number of binding sites depends
  • 49:56 - 50:01: on many things, like the expression of the protein, the cell line, the peak caller,
  • 50:02 - 50:06: and also the cutoff threshold used in the peak calling.
  • 50:07 - 50:14: If you use the method in this webinar, the number we normally get from transcription factor ChIP-seq
  • 50:14 - 50:20: is around 1,000 to several thousand. Well, if the number of binding sites falls below 1,000,
  • 50:20 - 50:26: you need to put a question mark. However, you cannot simply judge whether the experiment
  • 50:26 - 50:30: works or not solely based on the number of binding sites. You also need to check other
  • 50:30 - 50:37: things, like motif results and visual inspection of your signal intensity.
  • 50:39 - 50:43: Okay, so this is the last question. Now, I will hand over to Vicky,
  • 50:43 - 50:45: and thank you all for attending this webinar.
  • 50:48 - 50:51: Okay. Thank you, Xi and Miriam, for your presentations today.
  • 50:52 - 50:57: We have received quite a lot of questions. Unfortunately, we were not able to answer
  • 50:57 - 51:03: all of them. For those whose questions were not answered, our scientific support team will contact
  • 51:03 - 51:08: you shortly with the response to your question. If you have any questions about what has been
  • 51:08 - 51:14: discussed in this webinar, or have any technical inquiries, our scientific support team will be
  • 51:14 - 51:20: very happy to help you, and they can be contacted at technical@abcam.com.
  • 51:22 - 51:26: We hope you have found this webinar informative and useful to your work.
  • 51:26 - 51:30: We look forward to welcoming you to another webinar in the future.
  • 51:30 - 51:33: Thank you again for attending, and good luck with your research.

You may be interested in...