1 - Introduction

1.1 - Goal

The aim is to :

  • understand the nature of ChIP-Seq data
  • perform a complete analysis workflow including quality check (QC), read mapping, visualization in a genome browser and peak-calling. Use command line and open source software for each step of the workflow and feel the complexity of the task
  • have an overview of some possible downstream analyses
  • perform a motif analysis with online web programs

1.2 - Summary

This training gives an introduction to ChIP-seq data analysis, covering the processing steps starting from the reads to the peaks. Among all possible downstream analyses, the practical aspect will focus on motif analyses. A particular emphasis will be put on deciding which downstream analyses to perform depending on the biological question. This training does not cover all methods available today. It does not aim at bringing users to a professional NGS analyst level but provides enough information to allow biologists understand what is DNA sequencing in practice and to communicate with NGS experts for more in-depth needs.

1.3 - Dataset description

For this training, we will use two datasets:

  • a dataset produced by Myers et al Pubmed involved in the regulation of gene expression under anaerobic conditions in bacteria. We will focus on one factor: FNR. The advantage of this dataset is its small size, allowing real time execution of all steps of the dataset
  • a dataset of ChIP-seq peaks obtained in different mouse tissues for the p300 co-activator protein by Visel et al. Pubmed; we will use this dataset to illustrate downstream annotation of peaks using R.

2 - Downloading ChIP-seq reads from NCBI

Goal: Identify the datasets corresponding to the studied article and retrieve the data (reads as FASTQ files) corresponding to 2 replicates of a condition and the corresponding control.

2.1 - Obtaining an identifier for a chosen dataset

NGS datasets are (usually) made freely accessible for other scientists, by depositing these datasets into specialized databanks. Sequence Read Archive (SRA) located in USA hosted by NCBI, and its European equivalent European Nucleotide Archive (ENA) located in England hosted by EBI both contains raw reads.

Functional genomic datasets (transcriptomics, genome-wide binding such as ChIP-seq,…) are deposited in the databases Gene Expression Omnibus (GEO) or its European equivalent ArrayExpress.

Within an article of interest, search for a sentence mentioning the deposition of the data in a database. Here, the following sentence can be found at the end of the Materials and Methods section: “All genome-wide data from this publication have been deposited in NCBI’s Gene Expression Omnibus (GSE41195).” We will thus use the GSE41195 identifier to retrieve the dataset from the NCBI GEO (Gene Expression Omnibus) database.

2.2 - Accessing GSE41195 from GEO

  1. The GEO database hosts processed data files and many details related to the experiments. SRA (Sequence Read Archive) stores the actual raw sequence data.
  2. Search in Google GSE41195. Click on the first link to directly access the correct page on the GEO database. alt text
  3. This GEO entry is a mixture of expression analysis (Nimblegen Gene Expression Array), chip-chip and chip-seq. At the bottom of the page, click on the subseries related to the chip-seq datasets. (this subseries has its own identifier: GSE41187).