The aim of this training is to give an introduction to ChIP-seq data analysis, covering the processing steps from the reads to the peaks. Among all possible downstream analyses, the practical part will focus on motif analyses. Particular emphasis will be put on deciding which downstream analyses to perform depending on the biological question. This training does not cover all methods available today. It does not aim at bringing users to a professional NGS analyst level, but it provides enough information to allow biologists to understand what DNA sequencing is in practice and to communicate with NGS experts for more in-depth needs.
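For this training, we will use two datasets: an FNR ChIP-seq experiment in Escherichia coli K-12 (GEO accession GSE41195), and a set of p300 ChIP-seq peaks from mouse embryonic tissues (forebrain, midbrain and limb).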
Goal: Identify the datasets corresponding to the studied article and retrieve the data (reads as FASTQ files) corresponding to 2 replicates of a condition and the corresponding control.
NGS datasets are (usually) made freely accessible to other scientists by depositing them in specialized databanks. The Sequence Read Archive (SRA), located in the USA and hosted by NCBI, and its European equivalent, the European Nucleotide Archive (ENA), located in England and hosted by EBI, both contain raw reads.
Functional genomic datasets (transcriptomics, genome-wide binding such as ChIP-seq,…) are deposited in the databases Gene Expression Omnibus (GEO) or its European equivalent ArrayExpress.
Within an article of interest, search for a sentence mentioning the deposition of the data in a database. Here, the following sentence can be found at the end of the Materials and Methods section: “All genome-wide data from this publication have been deposited in NCBI’s Gene Expression Omnibus (GSE41195).” We will thus use the GSE41195 identifier to retrieve the dataset from the NCBI GEO (Gene Expression Omnibus) database.
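Although direct access to the SRA database at the NCBI is doable, SRA does not store sequences in FASTQ format but in its own SRA format, from which FASTQ files have to be extracted. In practice, it is simpler and quicker to download the FASTQ files from the ENA database, which encompasses the data from SRA.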
tip: To download the replicate and control datasets, we should redo the same steps starting from the GEO web page specific to the ChIP-seq datasets (see step 2.4) and choose FNR IP ChIP-seq Anaerobic B and anaerobic INPUT DNA. Downloaded FASTQ files are available in the data folder (FNR_IP_ChIP-seq_Anaerobic_B.fastq.gz and Anaerobic_INPUT_DNA.fastq.gz respectively, corresponding to runs SRR576934 and SRR576938).
At this point, you have three FASTQ files, two IPs, one control (INPUT).
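As a quick sanity check on downloaded files, the number of reads can be counted from the command line, since each FASTQ record spans exactly four lines. The sketch below is only an illustration on a made-up three-read file (toy.fastq.gz is hypothetical); on the cluster, the same commands would be pointed at the files in the data folder.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Toy FASTQ with 3 invented reads (4 lines per read), gzip-compressed
# like the files of the training.
workdir=$(mktemp -d)
printf '%s\n' \
  '@read1' 'ACGTACGTAC' '+' 'IIIIIIIIII' \
  '@read2' 'TTGCAATGCA' '+' 'IIIIIIIIII' \
  '@read3' 'GGCATTACGA' '+' 'IIIIIIIIII' \
  | gzip > "$workdir/toy.fastq.gz"

# Number of reads = number of lines / 4
n_lines=$(gzip -dc "$workdir/toy.fastq.gz" | awk 'END { print NR }')
n_reads=$(( n_lines / 4 ))

# Length of the first read = length of the second line of the file
read_len=$(gzip -dc "$workdir/toy.fastq.gz" | awk 'NR == 2 { print length($0) }')

echo "reads: $n_reads, length of first read: $read_len"
```

The counts obtained this way should agree with the "Total Sequences" and "Sequence length" values reported later by FastQC.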
During this training, we will work on the cluster provided by the Institut Français de Bioinformatique (IFB) using JupyterLab through the ondemand system.
Go to ondemand
Select JupyterLab: Core
Fill the form as such:
cd /shared/projects/<your_project>
mkdir EBAII2024_chipseq
cd EBAII2024_chipseq
cp -r /shared/projects/2422_ebaii_n1/chipseq/EBAII2024_chipseq/data .
/shared/projects/<your_project>/EBAII2024_chipseq
│
└───data
If you wish, you can check your directory structure:
tree
Goal: Get some basic information on the data (read length, number of reads, global quality of datasets)
Before you analyze the data, it is crucial to check the quality of the data. We will use the standard tool for checking the quality of data generated on the Illumina platform: FASTQC.
mkdir 01-QualityControl
cd 01-QualityControl
Your directory structure should be like this
/shared/projects/<your_project>/EBAII2024_chipseq
│
└───data
│
└───01-QualityControl <- you should be in this folder
module load fastqc/0.11.9
fastqc --help
fastqc ../data/FNR_IP_ChIP-seq_Anaerobic_A.fastq.gz -o .
ls
FNR_IP_ChIP-seq_Anaerobic_A_fastqc.html FNR_IP_ChIP-seq_Anaerobic_A_fastqc.zip
Go to the directory
/shared/projects/
Launch the FASTQC program on the replicate (FNR_IP_ChIP-seq_Anaerobic_B.fastq.gz) and on the control file (Anaerobic_INPUT_DNA.fastq.gz)
Analyze the result of the FASTQC program:
module unload fastqc/0.11.9
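Besides the HTML report, each FastQC zip archive contains a summary.txt file with one PASS/WARN/FAIL line per analysis module, which is convenient when several samples have to be compared. The sketch below assumes that three-column, tab-separated layout and works on a toy summary with invented statuses; with a real archive the file could be printed with something like `unzip -p FNR_IP_ChIP-seq_Anaerobic_A_fastqc.zip FNR_IP_ChIP-seq_Anaerobic_A_fastqc/summary.txt`.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Toy summary.txt in the FastQC layout: status <TAB> module <TAB> file.
# The statuses are invented for illustration.
workdir=$(mktemp -d)
summary="$workdir/summary.txt"
printf 'PASS\tBasic Statistics\ttoy.fastq.gz\n'             > "$summary"
printf 'PASS\tPer base sequence quality\ttoy.fastq.gz\n'   >> "$summary"
printf 'WARN\tPer base sequence content\ttoy.fastq.gz\n'   >> "$summary"
printf 'FAIL\tSequence Duplication Levels\ttoy.fastq.gz\n' >> "$summary"

# List the modules that did not pass, and count them
flagged=$(awk -F '\t' '$1 != "PASS" { print $1 ": " $2 }' "$summary")
n_flagged=$(awk -F '\t' '$1 != "PASS" { n++ } END { print n + 0 }' "$summary")

echo "$flagged"
echo "modules to inspect: $n_flagged"
```

A WARN or FAIL is not necessarily a problem for ChIP-seq data (duplication levels are often flagged), so the flagged modules still have to be interpreted in the HTML report.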
Goal: Obtain the coordinates of each read on the reference genome.
There are multiple programs to perform the mapping step. For reads produced by an Illumina machine for ChIP-seq, the current “standard” program is Bowtie (versions 1 and 2) (Langmead et al. 2009; Langmead and Salzberg 2012). We will use Bowtie version 2.5.1 for this exercise.
module load bowtie2/2.5.1
bowtie2
This prints the help of the program. However, it is a bit difficult to read! If you need to know more about the program, it’s easier to check the manual directly on the website.
Bowtie needs the reference genome to align each read on it. The genome needs to be in a specific format (an index) for Bowtie to be able to use it. Several pre-built indexes are available for download on the Bowtie webpage, but our genome is not there. You will need to build this index yourself.
Create a directory named 02-Mapping in which to output mapping results
cd ..
mkdir 02-Mapping
cd 02-Mapping
To make the index file, you will need the complete genome in FASTA format. It has already been downloaded from the NCBI to save time (Escherichia_coli_K12.fasta in the course folder).
Create a directory named index in which to output bowtie indexes
mkdir index
cd index
bowtie2-build
## Creating genome index : provide the path to the genome file and the name to give to the index (Escherichia_coli_K12)
bowtie2-build ../../data/Escherichia_coli_K12.fasta Escherichia_coli_K12
cd ..
mkdir bam
cd bam
Your directory structure should be like this:
/shared/projects/<your_project>/EBAII2024_chipseq
│
└───data
│
└───01-QualityControl
│
└───02-Mapping
| └───index
| └───bam <- you should be here
## Run alignment
## Tip: first type the bowtie2 command line, then add quotes around it and prefix it with "sbatch -p fast --cpus-per-task 10 --wrap="
sbatch -p fast -o FNR_IP_ChIP-seq_Anaerobic_A.mapping.out --cpus-per-task 10 --wrap="bowtie2 -p 10 --mm -3 1 -x ../index/Escherichia_coli_K12 -U ../../data/FNR_IP_ChIP-seq_Anaerobic_A.fastq.gz -S FNR_IP_ChIP-seq_Anaerobic_A.sam"
This should take a few minutes, as we work with a small genome. For the human genome, we would need both more time and more resources.
Analyze the result of the mapped reads:
Open the file FNR_IP_ChIP-seq_Anaerobic_A.mapping.out (for example using the less command), which contains some statistics about the mapping. How many reads were mapped? How many multi-mapped reads were originally present in the sample? To quit less, press ‘q’.
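The statistics can also be pulled out of the log without reading it by eye. The sketch below assumes the usual layout of the bowtie2 summary for single-end reads and runs on a toy log whose numbers are entirely invented; on the cluster, the same awk commands would be applied to FNR_IP_ChIP-seq_Anaerobic_A.mapping.out.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Toy summary in the layout bowtie2 prints for single-end reads.
# All numbers are invented for illustration.
log=$(mktemp)
cat > "$log" <<'EOF'
1000 reads; of these:
  1000 (100.00%) were unpaired; of these:
    200 (20.00%) aligned 0 times
    700 (70.00%) aligned exactly 1 time
    100 (10.00%) aligned >1 times
80.00% overall alignment rate
EOF

# The first field of each line is the read count
unique_reads=$(awk '/aligned exactly 1 time/ { print $1 }' "$log")
multi_reads=$(awk '/aligned >1 times/ { print $1 }' "$log")
mapped_reads=$(( unique_reads + multi_reads ))

echo "uniquely mapped: $unique_reads"
echo "multi-mapped:    $multi_reads"
echo "mapped in total: $mapped_reads"
```

Reads reported as "aligned >1 times" are the multi-mapped reads; they typically receive a low mapping quality, which matters for the filtering step that follows.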
Bowtie output is a SAM file. The SAM format corresponds to large text files that can be compressed (“zipped”) into the BAM format. BAM files take up to 4 times less disk space and are usually sorted and indexed for fast access to the data they contain. The index of a given BAM file is stored next to it as a .bai file (for example, sample.bam.bai for sample.bam).
## First load samtools
module load samtools/1.18
## Then run samtools
samtools view -@ 2 -q 10 -b FNR_IP_ChIP-seq_Anaerobic_A.sam | samtools sort -@ 2 - -o FNR_IP_ChIP-seq_Anaerobic_A.bam
samtools index FNR_IP_ChIP-seq_Anaerobic_A.bam
gzip FNR_IP_ChIP-seq_Anaerobic_A.sam
module unload samtools/1.18 bowtie2/2.5.1
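In the samtools command above, the -q 10 option discards alignments whose mapping quality (MAPQ, fifth column of a SAM record) is below 10, which removes most multi-mapped reads. The following sketch mimics that filter with awk on a toy SAM file with invented reads, only to make the rule explicit; it is not a replacement for samtools.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Toy SAM file: one header line and four invented alignments.
# Columns: QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL
sam=$(mktemp)
printf '@SQ\tSN:toy\tLN:1000\n' > "$sam"
printf 'r1\t0\ttoy\t10\t42\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\n'  >> "$sam"
printf 'r2\t0\ttoy\t50\t1\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\n'   >> "$sam"
printf 'r3\t16\ttoy\t90\t10\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\n' >> "$sam"
printf 'r4\t0\ttoy\t200\t0\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\n'  >> "$sam"

# Skip header lines (starting with @) and keep records with MAPQ >= 10
total=$(awk -F '\t' '!/^@/ { n++ } END { print n + 0 }' "$sam")
kept=$(awk -F '\t' '!/^@/ && $5 >= 10 { n++ } END { print n + 0 }' "$sam")

echo "alignments: $total, kept with MAPQ >= 10: $kept"
```

Because of this filter, the number of reads in the final BAM file is expected to be lower than the number of mapped reads reported by bowtie2.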
Analyze the result of the mapped reads:
How many reads were mapped for samples Anaerobic_INPUT_DNA and FNR_IP_ChIP-seq_Anaerobic_B?
Goal: Detect duplicated reads. Duplicated reads, i.e. reads mapped at the same position in the genome, are present in ChIP-seq results. They can arise for several reasons, including biased amplification during the PCR step of the library preparation, DNA fragments coming from repetitive elements of the genome, sequencing saturation, or the same cluster being read several times on the flowcell (i.e. optical duplicates). As analyzing ChIP-seq data consists in detecting signal enrichment, we cannot keep duplicated reads for subsequent analyses. So let’s detect them using Picard (“Picard Tools - By Broad Institute” n.d.).
cd /shared/projects/<your_project>/EBAII2024_chipseq/02-Mapping/bam
## Load picard
module load picard/2.23.5
## Run picard
picard MarkDuplicates \
-CREATE_INDEX true \
-INPUT FNR_IP_ChIP-seq_Anaerobic_A.bam \
-OUTPUT Marked_FNR_IP_ChIP-seq_Anaerobic_A.bam \
-METRICS_FILE metric
To determine the number of duplicated reads marked by Picard, we can run the samtools flagstat command:
## Add samtools to your environment
module load samtools/1.18
## run samtools
samtools flagstat Marked_FNR_IP_ChIP-seq_Anaerobic_A.bam
Run picard MarkDuplicates on the 2 other samples. How many duplicates are found in each sample?
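Picard does not remove duplicates by default: it marks them by setting the 0x400 bit (decimal 1024) in the FLAG field, the second column of each alignment. The sketch below shows how that bit can be tested, using a toy SAM file with invented reads and plain awk arithmetic; on real data the equivalent count would be obtained with `samtools view -c -f 1024 Marked_FNR_IP_ChIP-seq_Anaerobic_A.bam`.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Toy SAM records with invented flags:
#   0    = forward read        1024 = forward read marked as duplicate
#   16   = reverse read        1040 = reverse read marked as duplicate (16 + 1024)
sam=$(mktemp)
printf '@SQ\tSN:toy\tLN:1000\n' > "$sam"
printf 'r1\t0\ttoy\t10\t42\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\n'    >> "$sam"
printf 'r2\t1024\ttoy\t10\t42\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\n' >> "$sam"
printf 'r3\t16\ttoy\t90\t42\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\n'   >> "$sam"
printf 'r4\t1040\ttoy\t90\t42\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\n' >> "$sam"

# The duplicate bit is set when the integer part of FLAG / 1024 is odd
n_dup=$(awk -F '\t' '!/^@/ && int($2 / 1024) % 2 == 1 { n++ } END { print n + 0 }' "$sam")
n_all=$(awk -F '\t' '!/^@/ { n++ } END { print n + 0 }' "$sam")

echo "duplicates: $n_dup out of $n_all alignments"
```

Since the reads are only flagged, downstream tools must be told to skip them, which is what options such as --ignoreDuplicates in deepTools do.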
Go back to working home directory (i.e
/shared/projects/
## Unload picard and samtools
module unload samtools/1.18 picard/2.23.5
## If you are in 02-Mapping/bam
cd ../..
Goal: This exercise aims at plotting the Lorenz curve to assess the quality of the ChIP.
mkdir 03-ChIPQualityControls
cd 03-ChIPQualityControls
## Load deeptools in your environment
module load deeptools/3.5.4
## Run deeptools fingerprint
plotFingerprint \
-p 2 \
--numberOfSamples 10000 \
-b ../02-Mapping/bam/FNR_IP_ChIP-seq_Anaerobic_A.bam \
../02-Mapping/bam/FNR_IP_ChIP-seq_Anaerobic_B.bam \
../02-Mapping/bam/Anaerobic_INPUT_DNA.bam \
-plot fingerprint_10000.png
cp /shared/home/slegras/2421_m22_bims/slegras/03-ChIPQualityControls/fingerprint.png .
Look at the result file fingerprint.png (add the plot to this report). How do you explain the shape of the curves?
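To help interpret the curves: plotFingerprint counts the reads in sampled genomic bins, sorts the bins by increasing count and plots the cumulative fraction of reads. The sketch below reproduces that computation on two toy profiles of five bins each (all counts are invented, and the cumfrac function is a hypothetical helper): one with uniform coverage, as expected for an input, and one where most reads pile up in a single bin, as expected for a well-enriched IP.

```bash
#!/usr/bin/env bash
set -euo pipefail

# cumfrac: read one count per bin on stdin, sort the bins by increasing
# count and print the cumulative fraction of reads after each bin.
cumfrac() {
  sort -n | awk '
    { count[NR] = $1; total += $1 }
    END {
      for (i = 1; i <= NR; i++) {
        cum += count[i]
        printf "%.2f%s", cum / total, (i < NR ? " " : "\n")
      }
    }'
}

# Invented bin counts: uniform coverage vs. one strongly enriched bin
input_curve=$(printf '%s\n' 20 20 20 20 20 | cumfrac)
ip_curve=$(printf '%s\n' 96 1 1 1 1 | cumfrac)

echo "input-like: $input_curve"
echo "IP-like:    $ip_curve"
```

The uniform profile follows the diagonal, whereas the enriched profile stays close to zero and rises sharply at the end; the further a ChIP curve departs from the input curve, the stronger the enrichment.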
Go back to the working home directory (i.e. /shared/projects/<your_project>/EBAII2024_chipseq)
## Unload deepTools
module unload deeptools/3.5.4
## If you are in 03-ChIPQualityControls
cd ..
Goal: Check whether the IP worked: visualize the data in their genomic context.
There are several options for genome browsers, divided between the local browsers (which you need to install on your computer, eg. IGV) and the online genome browsers (eg. UCSC genome browser, Ensembl). We often use both types, depending on the aim and the localization of the data. If the data are on your computer, to prevent data transfer, it’s easier to visualize the data locally (IGV). Note that if you’re working on a non-model organism, the local viewer will be the only choice. If the aim is to share the results with your collaborators, view many tracks in the context of many existing annotations, then the online genome browsers are more suitable.
Browse around in the genome. Specifically go to the following genes: pepT (geneID:b1127), ycfP (geneID:b1108). Do you see peaks? (Add screenshots to this report.)
However, looking at the BAM files as such does not allow a direct comparison of the two samples, as the data are not normalized. Let’s generate normalized data for visualization.
bamCoverage from deepTools generates BigWig files from BAM files. Try it out:
## Load deeptools in your environment
module load deeptools/3.5.4
## run bamCoverage
bamCoverage --help
mkdir 04-Visualization
cd 04-Visualization
Your directory structure should be like this:
/shared/projects/<your_project>/EBAII2024_chipseq
│
└───data
│
└───01-QualityControl
│
└───02-Mapping
| └───index
| └───bam
│
└───03-ChIPQualityControls
│
└───04-Visualization <- you should be in this folder
bamCoverage \
--bam ../02-Mapping/bam/Marked_FNR_IP_ChIP-seq_Anaerobic_A.bam \
--outFileName FNR_IP_ChIP-seq_Anaerobic_A_nodup.bw \
--outFileFormat bigwig \
--effectiveGenomeSize 4639675 \
--normalizeUsing CPM \
--skipNonCoveredRegions \
--extendReads 200 \
--ignoreDuplicates
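The CPM normalization requested above divides the number of reads in each bin by the total number of mapped reads, expressed in millions, so that samples sequenced at different depths become comparable. A minimal sketch of the arithmetic, with invented read counts and a hypothetical helper function cpm:

```bash
#!/usr/bin/env bash
set -euo pipefail

# cpm COUNT TOTAL: counts per million mapped reads for one bin
cpm() {
  awk -v count="$1" -v total="$2" 'BEGIN { printf "%.1f\n", count * 1e6 / total }'
}

# Same raw count in one bin (50 reads), two invented sequencing depths
cpm_deep=$(cpm 50 2000000)     # sample with 2,000,000 mapped reads
cpm_shallow=$(cpm 50 500000)   # sample with   500,000 mapped reads

echo "deep sample:    $cpm_deep CPM"
echo "shallow sample: $cpm_shallow CPM"
```

The same raw count therefore represents a four times stronger signal in the shallower sample, which is why the raw BAM coverage tracks cannot be compared directly.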
Go back to the genes we looked at earlier: pepT, ycfP (add screenshots to this report). Look at the shape of the signal.
Keep IGV opened.
Go back to working home directory (i.e
/shared/projects/
## If you are in 04-Visualization
cd ..
Goal: Detect the peaks which are regions with high densities of reads and that correspond to where the studied factor was bound
There are multiple programs to perform the peak-calling step. Some are more directed towards histone marks (broad peaks) while others are specific to transcription factors which present narrow peaks. Here we will use the callpeak function of MACS2 (version 2.2.7.1) because it’s known to produce generally good results, and it is well-maintained by the developer.
mkdir 05-PeakCalling
mkdir 05-PeakCalling/replicates
cd 05-PeakCalling/replicates
## Load macs2 in your environment
module load macs2/2.2.7.1
macs2 callpeak --help
This prints the help of the program.
macs2 callpeak \
-t ../../02-Mapping/bam/FNR_IP_ChIP-seq_Anaerobic_A.bam \
-c ../../02-Mapping/bam/Anaerobic_INPUT_DNA.bam \
--format BAM \
--gsize 4639675 \
--name 'FNR_Anaerobic_A' \
--bw 400 \
--fix-bimodal \
-p 1e-2 \
&> repA_MACS.out
Run macs2 for replicate A and replicate B.
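The value given to --gsize is the size of the mappable genome; for a small bacterial genome it is reasonable to assume that this is simply the total length of the reference sequence. As an illustration of how such a number can be obtained, the sketch below sums the sequence lengths of a toy FASTA file (invented sequence); applied to ../../data/Escherichia_coli_K12.fasta, the same awk command should presumably return the 4639675 used above.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Toy FASTA file: one invented sequence written over two lines
fasta=$(mktemp)
printf '>toy_chromosome\nACGTACGTAC\nGGCC\n' > "$fasta"

# Sum the length of every line that is not a header (headers start with ">")
genome_size=$(awk '!/^>/ { n += length($0) } END { print n + 0 }' "$fasta")

echo "genome size: $genome_size bp"
```

For large genomes with many repeats the effective genome size is smaller than the total length, which is why MACS2 provides shortcuts such as hs and mm for human and mouse.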
In a new directory called pool, run macs2 for the pooled replicates A and B by giving both bam files as input treatment files (-t).
# If you are in 05-PeakCalling/replicates, go back up to 05-PeakCalling
cd ..
mkdir pool
cd pool
# Run macs2 for pooled replicates
macs2 callpeak \
-t ../../02-Mapping/bam/FNR_IP_ChIP-seq_Anaerobic_A.bam \
../../02-Mapping/bam/FNR_IP_ChIP-seq_Anaerobic_B.bam \
-c ../../02-Mapping/bam/Anaerobic_INPUT_DNA.bam \
--format BAM \
--gsize 4639675 \
--name 'FNR_Anaerobic_pool' \
--bw 400 \
--fix-bimodal \
-p 1e-2 \
&> pool_MACS.out
Look at the files that were created by MACS. Explain the content of the result files.
How many peaks were detected by MACS2 for each sample and in the pool of samples?
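The narrowPeak files produced by MACS2 are tab-separated BED-like files with one peak per line, so the number of peaks is simply the number of lines. Among the ten columns, column 9 holds the -log10 q-value and column 10 the offset of the summit from the peak start. The sketch below illustrates this on a toy file with three invented peaks; on real results the same commands would be run on files such as FNR_Anaerobic_A_peaks.narrowPeak.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Toy narrowPeak file with 3 invented peaks.
# Columns: chrom start end name score strand signal -log10(p) -log10(q) summit_offset
peaks=$(mktemp)
printf 'toy\t100\t300\tpeak_1\t250\t.\t12.5\t30.1\t27.0\t90\n'   > "$peaks"
printf 'toy\t1000\t1400\tpeak_2\t80\t.\t4.2\t9.3\t7.1\t210\n'   >> "$peaks"
printf 'toy\t5000\t5200\tpeak_3\t20\t.\t2.1\t2.5\t0.9\t100\n'   >> "$peaks"

# Number of peaks = number of lines
n_peaks=$(awk 'END { print NR }' "$peaks")

# Peaks with q-value <= 0.01, i.e. -log10(q) >= 2
n_strong=$(awk -F '\t' '$9 >= 2 { n++ } END { print n + 0 }' "$peaks")

# Genomic position of the first summit = start + summit offset
summit_1=$(awk -F '\t' 'NR == 1 { print $2 + $10 }' "$peaks")

echo "peaks: $n_peaks, with q <= 0.01: $n_strong, first summit at: $summit_1"
```

With the relaxed threshold used here (-p 1e-2), many weak peaks are expected in each replicate; they are deliberately kept so that the reproducibility analysis that follows can rank them.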
In order to take advantage of having biological replicates, we will create a combined set of peaks based on the reproducibility of the peak calling between the individual replicates. We will use the Irreproducible Discovery Rate (IDR) algorithm.
## If you are in 05-PeakCalling/pool, go back up to 05-PeakCalling
cd ..
mkdir idr
cd idr
Your directory structure should be like this:
/shared/projects/<your_project>/EBAII2024_chipseq
│
└───data
│
└───01-QualityControl
│
└───02-Mapping
| └───index
| └───bam
│
└───03-ChIPQualityControls
│
└───04-Visualization
|
└───05-PeakCalling
| └───replicates
| └───pool
| └───idr <- you should be in this folder
## Load idr in your environment
module load idr/2.0.4.2
idr --help
idr \
--samples ../replicates/FNR_Anaerobic_A_peaks.narrowPeak \
../replicates/FNR_Anaerobic_B_peaks.narrowPeak \
--peak-list ../pool/FNR_Anaerobic_pool_peaks.narrowPeak \
--input-file-type narrowPeak \
--output-file FNR_anaerobic_idr_peaks.bed \
--plot
Add the IDR graph to this report. How many peaks are found with the IDR method?
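The IDR output file reports all peaks shared between the replicates, each with a scaled IDR score in column 5. According to the IDR documentation, this score is min(int(-125 × log2(IDR)), 1000), so peaks passing IDR ≤ 0.05 have a score of at least 540. The sketch below applies that cut-off with awk to a toy file containing only the first five columns and invented scores; treat the threshold as an assumption to check against the documentation of the installed version.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Toy IDR output restricted to its first 5 columns (the real file has more).
# Column 5 is the scaled IDR score; the values are invented.
idr_out=$(mktemp)
printf 'toy\t100\t300\t.\t1000\n'  > "$idr_out"
printf 'toy\t1000\t1400\t.\t540\n' >> "$idr_out"
printf 'toy\t5000\t5200\t.\t300\n' >> "$idr_out"

# Scaled score corresponding to IDR = 0.05: int(-125 * log2(0.05))
threshold=$(awk 'BEGIN { print int(-125 * log(0.05) / log(2)) }')

# Keep peaks whose scaled score is at least the threshold
n_total=$(awk 'END { print NR }' "$idr_out")
n_pass=$(awk -F '\t' -v thr="$threshold" '$5 >= thr { n++ } END { print n + 0 }' "$idr_out")

echo "threshold: $threshold, peaks passing IDR <= 0.05: $n_pass of $n_total"
```

Comparing the number of lines in the output file with the number of peaks passing this cut-off shows how many shared peaks are considered reproducible.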
module unload macs2/2.2.7.1
module unload idr/2.0.4.2
## If you are in 05-PeakCalling/idr
cd ../..
Go back again to the genes we looked at earlier: pepT, ycfP. Do you see peaks (add the 2 screenshots to this report)? Navigate through the genome to find peaks that were detected in the replicates (peak calling per replicate) but not kept by the IDR method.
From now on, the peak set we keep is the IDR peak set.
Goal: Define binding motif(s) for the ChIPed transcription factor and identify potential cofactors
For the motif analysis, you first need to extract the sequences corresponding to the peaks. There are several ways to do this (as usual…). If you work on a UCSC-supported organism, the easiest is to use RSAT fetch-sequences or Galaxy. Here, we will use Bedtools (Quinlan and Hall 2010), as we have the genome of interest on our computer (Escherichia_coli_K12.fasta).
1. Create a directory named 06-MotifAnalysis to store data needed for motif analysis
mkdir 06-MotifAnalysis
cd 06-MotifAnalysis
Your directory structure should be like this:
/shared/projects/<your_project>/EBAII2024_chipseq
│
└───data
│
└───01-QualityControl
│
└───02-Mapping
| └───index
| └───bam
│
└───03-ChIPQualityControls
│
└───04-Visualization
│
└───05-PeakCalling
│
└───06-MotifAnalysis <- you should be in this folder
## First load samtools
module load samtools/1.18
## Create an index of the genome fasta file
samtools faidx ../data/Escherichia_coli_K12.fasta
## First load bedtools
module load bedtools/2.30.0
## Extract fasta sequence from genomic coordinate of peaks
bedtools getfasta \
-fi ../data/Escherichia_coli_K12.fasta \
-bed ../05-PeakCalling/idr/FNR_anaerobic_idr_peaks.bed \
-fo FNR_anaerobic_idr_peaks.fa
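BED coordinates are 0-based and half-open: the start is the position just before the first base of the interval, and the end is the last base included. The sketch below mimics what bedtools getfasta does for one interval, using awk on a toy genome (invented sequence and coordinates), to make that convention explicit; it is not meant to replace bedtools.

```bash
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)

# Toy genome (16 bases) and one toy interval in BED format
printf '>toy\nACGTACGTAAGGCCTT\n' > "$workdir/genome.fa"
printf 'toy\t4\t10\n' > "$workdir/peaks.bed"

# First file: load the sequences. Second file: extract each interval.
# BED start is 0-based, so the first base is at 1-based position start + 1,
# and the interval length is end - start.
extracted=$(awk -F '\t' '
  NR == FNR {
    if ($0 ~ /^>/) { split(substr($0, 2), header, " "); name = header[1] }
    else           { seq[name] = seq[name] $0 }
    next
  }
  { printf ">%s:%d-%d\n%s\n", $1, $2, $3, substr(seq[$1], $2 + 1, $3 - $2) }
' "$workdir/genome.fa" "$workdir/peaks.bed")

echo "$extracted"
```

The interval 4-10 therefore yields six bases, starting at the fifth base of the sequence. The header follows the chrom:start-end pattern that bedtools getfasta writes by default.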
Is there anything interesting in RSAT results? If so, which motif is of interest and why (add screenshot of the results).
Goals: Associate ChIP-seq peaks to genomic features, identify closest genes and run ontology analyses
## Go back to the parent directory if needed
cd ..
mkdir 07-PeakAnnotation
cd 07-PeakAnnotation
annotatePeaks.pl from the Homer suite (Heinz et al. 2010) associates peaks with nearby genes.
cut \
-f 1-5 \
../05-PeakCalling/idr/FNR_anaerobic_idr_peaks.bed | \
awk -F "\t" '{print $0"\t+"}' \
> FNR_anaerobic_idr_peaks.bed
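The command above keeps the first five columns of the IDR output and appends a sixth column containing "+", because Homer expects a strand in the sixth column of a BED file. The sketch below shows the effect on a single toy line with invented values.

```bash
#!/usr/bin/env bash
set -euo pipefail

# One toy IDR output line with 8 tab-separated columns (values invented)
idr_line=$(printf 'toy\t100\t300\t.\t1000\t.\t12.5\t30.1')

# Keep columns 1-5 and append a "+" strand as the sixth column
bed6=$(printf '%s\n' "$idr_line" | cut -f 1-5 | awk -F "\t" '{ print $0 "\t+" }')

# Number of columns in the result
n_cols=$(printf '%s\n' "$bed6" | awk -F '\t' '{ print NF }')

printf '%s\n' "$bed6"
echo "columns: $n_cols"
```

The strand is arbitrary here, since ChIP-seq peaks are not stranded; it only serves to satisfy the expected file layout.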
## First load Homer
module load homer/4.11
## run Homer annotatePeaks
annotatePeaks.pl --help
Let’s see the parameters:
annotatePeaks.pl peak/BEDfile genome > outputfile
User defined annotation files (default is UCSC refGene annotation):
annotatePeaks.pl accepts GTF (gene transfer formatted) files to annotate positions relative to custom annotations, such as those from de novo transcript discovery or Gencode.
-gtf <gtf format file> (Use -gff and -gff3 if appropriate, but GTF is better)
annotatePeaks.pl \
FNR_anaerobic_idr_peaks.bed \
../data/Escherichia_coli_K12.fasta \
-gtf ../data/Escherichia_coli_K_12_MG1655.annotation.fixed.gtf \
> FNR_anaerobic_idr_annotated_peaks.tsv
Look at the file you generated. Gene symbols are not present. Let’s add them with some R code.
Launch Rstudio in ondemand
Add gene symbol annotation using R with Rstudio
## set working directory
setwd("/shared/projects/<your_project>/EBAII2024_chipseq/07-PeakAnnotation")
## Or navigate using the "Files" tab and click on "More">"Set as Working Directory"
## read the file with peaks annotated with homer
## data are loaded into a data frame
## sep="\t": this is a tab separated file
## header=TRUE: there is a line with headers (ie. column names)
d <- read.table("FNR_anaerobic_idr_annotated_peaks.tsv", sep="\t", header=TRUE)
## Load a 2-column file which contains gene IDs in the first column
## and gene symbols in the second column
## data are loaded into a data frame
## header=FALSE: there is no header line
gene.symbol <- read.table("../data/Escherichia_coli_K_12_MG1655.annotation.tsv.gz", header=FALSE)
## Merge the 2 data frames based on a common field
## by.x gives the name of the column holding the common field in the d data frame
## by.y gives the name of the column holding the common field in the gene.symbol data frame
## d contains several columns with no information. We select only interesting columns
d.annot <- merge(d[,c(1,2,3,4,5,6,8,10,11)], gene.symbol, by.x="Nearest.PromoterID", by.y="V1")
## Change column names of the resulting data frame
colnames(d.annot)[2] <- "PeakID" # name the 2nd column of the new data frame "PeakID"
colnames(d.annot)[dim(d.annot)[2]] <- "Gene.Symbol"
## output the merged data frame to a file named "FNR_anaerobic_idr_final_peaks_annotation.tsv"
## col.names=TRUE: output column names
## row.names=FALSE: don't output row names
## sep="\t": table fields are separated by tabs
## quote=FALSE: don't put quote around text.
write.table(d.annot, "FNR_anaerobic_idr_final_peaks_annotation.tsv", col.names=TRUE, row.names=FALSE, sep="\t", quote=FALSE)
What information is listed in each column of the file? (print column names and explain them)
How many genes are associated to the “promoter-TSS” feature?
What are all the possible gene features? (see in column Annotation - extract information like promoter-TSS, TSS, …). Create a plot (pie chart, barplot…) showing the proportion of each of them (include both the plot and the code that created it in the report).
## If you are in 07-PeakAnnotation
cd ..
Use the official gene symbols from the file FNR_anaerobic_idr_final_peaks_annotation.tsv to search for enriched gene ontologies with the tool DAVID (Database for Annotation, Visualization and Integrated Discovery). Input your gene list on the DAVID website: https://david.ncifcrf.gov/. Use the DAVID ID conversion tool if needed.
Are there biological processes enriched in the list of genes associated to the peaks? Show the top results of the Functional Annotation Clustering. Are these genes enriched in some KEGG pathway? Which ones?
In this part, we will use a different set of peaks obtained using a peak caller from a set of p300 ChIP-seq experiments in different mouse embryonic tissues (midbrain, forebrain and limb).
cd /shared/projects/<your_project>/EBAII2024_chipseq
mkdir 07-PeakAnnotation-bonus
cd 07-PeakAnnotation-bonus
Download each GSMxxxxx_p300_peaks.txt.gz file to the newly created folder (where xxxxx represents the GSM number). You should now have downloaded 3 files:
GSM348064_p300_peaks.txt.gz (Forebrain)
GSM348065_p300_peaks.txt.gz (Midbrain)
GSM348066_p300_peaks.txt.gz (Limb)
Beware: Make sure to check which genome version was used to call the peaks (remember: this is mouse data!)
Now, we will use RStudio to perform the rest of the analysis in R. For the analysis, we will need some R/Bioconductor libraries
# load the required libraries
library(RColorBrewer)
library(ChIPseeker)
library(TxDb.Mmusculus.UCSC.mm9.knownGene)
library(org.Mm.eg.db)
# define the annotation of the mouse genome
txdb = TxDb.Mmusculus.UCSC.mm9.knownGene
# define colors
col = brewer.pal(9,'Set1')
# set the working directory to the folder in which the peaks are stored
setwd("/shared/projects/<your_project>/EBAII2024_chipseq/07-PeakAnnotation-bonus")
# read the peaks for each dataset
peaks.forebrain = readPeakFile('GSM348064_p300_peaks.txt.gz')
peaks.midbrain = readPeakFile('GSM348065_p300_peaks.txt.gz')
peaks.limb = readPeakFile('GSM348066_p300_peaks.txt.gz')
# create a list containing all the peak sets
all.peaks = list(forebrain=peaks.forebrain,
midbrain=peaks.midbrain,
limb=peaks.limb)
The peaks are stored as a GenomicRanges object; this is an R format which looks like the BED format, but is optimized in terms of memory requirements and speed of execution.
We can start by computing some basic statistics on the peak sets.
# check the number of peaks for the forebrain dataset
length(peaks.forebrain)
## [1] 2453
# compute the number of peaks for all datasets using the list object
sapply(all.peaks,length)
## forebrain midbrain limb
## 2453 561 2105
# display this as a barplot
barplot(sapply(all.peaks,length),col=col)
# statistics on the peak length for forebrain
summary(width(peaks.forebrain))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 276.0 551.0 751.0 815.9 1001.0 2701.0
# size distribution of the peaks
peaks.width = lapply(all.peaks,width)
lapply(peaks.width,summary)
## $forebrain
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 276.0 551.0 751.0 815.9 1001.0 2701.0
##
## $midbrain
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 276 526 676 717 876 2126
##
## $limb
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 276.0 476.0 601.0 682.6 826.0 2301.0
# boxplot of the sizes
boxplot(peaks.width,col=col)
Can you adapt the previous code to display a boxplot of the peak score distribution for the Forebrain peak set (column Maximum.Peak.Height)?
We can now display the genomic distribution of the peaks along the chromosomes, including the peak scores, using the covplot function from ChIPseeker:
# genome wide distribution
covplot(peaks.forebrain, weightCol="Maximum.Peak.Height")
Exercise: use the option “lower” in covplot to display only the peaks with a score (Maximum.Peak.Height) above 10.
In addition to the genome wide plot, we can check if there is a tendency for the peaks to be located close to gene promoters.
# define gene promoters
promoter = getPromoters(TxDb=txdb, upstream=5000, downstream=5000)
# compute the density of peaks within the promoter regions
tagMatrix = getTagMatrix(peaks.limb, windows=promoter)
## >> preparing start_site regions by gene... 2024-11-19 15:26:10
## >> preparing tag matrix... 2024-11-19 15:26:10
# plot the density
tagHeatmap(tagMatrix, palette = "RdYlBu")
We can now assign the peaks to the closest genes and genomic compartments (introns, exons, promoters, distal regions, etc.). This is done using the function annotatePeak, which compares the peak files with the annotation file of the mouse genome. This function returns a complex object which contains all this information.
peakAnno.forebrain = annotatePeak(peaks.forebrain, tssRegion=c(-3000, 3000), TxDb=txdb, annoDb="org.Mm.eg.db")
## >> preparing features information... 2024-11-19 15:26:31
## >> identifying nearest features... 2024-11-19 15:26:31
## >> calculating distance from peak to TSS... 2024-11-19 15:26:31
## >> assigning genomic annotation... 2024-11-19 15:26:31
## >> adding gene annotation... 2024-11-19 15:26:35
## 'select()' returned 1:many mapping between keys and columns
## >> assigning chromosome lengths 2024-11-19 15:26:35
## >> done... 2024-11-19 15:26:35
peakAnno.midbrain = annotatePeak(peaks.midbrain, tssRegion=c(-3000, 3000), TxDb=txdb, annoDb="org.Mm.eg.db")
## >> preparing features information... 2024-11-19 15:26:35
## >> identifying nearest features... 2024-11-19 15:26:35
## >> calculating distance from peak to TSS... 2024-11-19 15:26:35
## >> assigning genomic annotation... 2024-11-19 15:26:35
## >> adding gene annotation... 2024-11-19 15:26:36
## 'select()' returned 1:1 mapping between keys and columns
## >> assigning chromosome lengths 2024-11-19 15:26:36
## >> done... 2024-11-19 15:26:36
peakAnno.limb = annotatePeak(peaks.limb, tssRegion=c(-3000, 3000), TxDb=txdb, annoDb="org.Mm.eg.db")
## >> preparing features information... 2024-11-19 15:26:36
## >> identifying nearest features... 2024-11-19 15:26:36
## >> calculating distance from peak to TSS... 2024-11-19 15:26:36
## >> assigning genomic annotation... 2024-11-19 15:26:36
## >> adding gene annotation... 2024-11-19 15:26:37
## 'select()' returned 1:many mapping between keys and columns
## >> assigning chromosome lengths 2024-11-19 15:26:37
## >> done... 2024-11-19 15:26:37
We can now analyze in more detail the localization of the peaks (introns, exons, promoters, distal regions,…)
# distribution of genomic compartments for forebrain peaks
plotAnnoPie(peakAnno.forebrain)
# for all the peaks
plotAnnoBar(list(forebrain=peakAnno.forebrain, midbrain=peakAnno.midbrain,limb=peakAnno.limb))
Question: do you see differences between the three peak sets?
An important step in ChIP-seq analysis is to interpret the genes that are located close to the ChIP peaks. Hence, we need to (1) assign genes to peaks and (2) compute functional enrichments of the target genes.
Beware: By doing so, we assume that the target gene of a peak is always the closest one. Hi-C/4C analyses have shown that in higher eukaryotes, this is not always the case. However, in the absence of data on the real target gene of ChIP peaks, we can work with this approximation.
We will compute the enrichment of the Gene Ontology “Biological Process” categories in the set of putative target genes.
# load the library
library(clusterProfiler)
# define the list of all mouse genes as a universe for the enrichment analysis
universe = mappedkeys(org.Mm.egACCNUM)
## extract the gene IDs of the forebrain target genes
genes.forebrain = peakAnno.forebrain@anno$geneId
ego.forebrain = enrichGO(gene = genes.forebrain,
universe = universe,
OrgDb = org.Mm.eg.db,
ont = "BP",
pAdjustMethod = "BH",
pvalueCutoff = 0.01,
qvalueCutoff = 0.05,
readable = TRUE)
# display the results as barplots
barplot(ego.forebrain,showCategory=10)
Question: do you see an enrichment of the expected categories? What does the x-axis mean? What does the color mean?
Exercise: redo this analysis for the limb dataset and check if the enriched categories make sense.
Although direct access to the SRA database at the NCBI is doable, SRA does not store sequences in a FASTQ format. So, in practice, it’s simpler (and quicker!!) to download datasets from the ENA database (European Nucleotide Archive) hosted by EBI (European Bioinformatics Institute) in UK. ENA encompasses the data from SRA.
tip: To download the replicate and control datasets, we should redo the same steps starting from the GEO web page specific to the ChIP-seq datasets (see step 2.4) and choose FNR IP ChIP-seq Anaerobic B and anaerobic INPUT DNA. Downloaded FASTQ files are available in the data folder (FNR_IP_ChIP-seq_Anaerobic_B.fastq.gz and Anaerobic_INPUT_DNA.fastq.gz respectively).
At this point, you have three FASTQ files, two IPs, one control (INPUT).
The processed peaks (BED file) are sometimes available on the GEO website or in the supplementary data. Unfortunately, most of the time, the peak coordinates are embedded into supplementary tables and thus not usable “as is”. This is the case for the studied article. To be able to use these peaks (visualize them in a genome browser, compare them with the peaks found with another program, perform downstream analyses…), you will need to (re)create a BED file from the information available. Here, Table S5 provides the coordinates of the summits of the peaks. The coordinates are for the same assembly as the one we used.
perl -lane 'print "gi|49175990|ref|NC_000913.2|\t".($F[0]-50)."\t".($F[0]+50)."\t" ' retained_peaks.txt > retained_peaks.bed
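The Perl one-liner above turns each summit position into a 100 bp interval centred on the summit. The sketch below does the same with awk on a toy list of invented summit positions, with two small differences that are choices of this sketch rather than part of the original command: no trailing tab is written, and the start is clamped at 0 for summits closer than 50 bp to the beginning of the sequence.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Toy list of summit positions, one per line (invented values)
summits=$(mktemp)
printf '1000\n25000\n30\n' > "$summits"

# Build a 3-column BED: sequence name, summit - 50, summit + 50
bed=$(awk -v chrom='gi|49175990|ref|NC_000913.2|' '
  BEGIN { OFS = "\t" }
  {
    start = $1 - 50
    if (start < 0) start = 0   # BED coordinates cannot be negative
    print chrom, start, $1 + 50
  }' "$summits")

printf '%s\n' "$bed"
```

The sequence name must be exactly the one found in the FASTA header of the genome used for mapping, otherwise the intervals will not be displayed or matched by downstream tools.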
Annotation files can be found on genome websites, the NCBI FTP server, Ensembl, … However, IGV requires the GFF or BED format, which are often not directly available. Here, I downloaded the annotation from the UCSC Table browser as “Escherichia_coli_K_12_MG1655.annotation.gtf”. Then, I changed the “chr” to the name of our genome with the following Perl command:
perl -pe 's/^chr/gi\|49175990\|ref\|NC_000913.2\|/' Escherichia_coli_K_12_MG1655.annotation.gtf > Escherichia_coli_K_12_MG1655.annotation.fixed.gtf
This file will work directly in IGV.
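The renaming matters because IGV, like most tools, matches annotations to the genome by sequence name. A simple way to verify that two files agree is to compare the names found in the FASTA headers with those in the first column of the GTF file. The sketch below does this on toy files (invented content; names_match is a hypothetical helper), before and after a sed substitution equivalent to the Perl command above.

```bash
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)

# Toy genome and toy annotation using "chr" as sequence name
printf '>gi|49175990|ref|NC_000913.2| toy sequence\nACGTACGT\n' > "$workdir/genome.fa"
printf 'chr\ttoy\texon\t1\t8\t.\t+\t.\tgene_id "b0001";\n' > "$workdir/annot.gtf"

# names_match FASTA GTF: succeed if both files use the same sequence names
names_match() {
  local fasta_names gtf_names
  fasta_names=$(grep '^>' "$1" | sed 's/^>//; s/ .*//' | sort -u)
  gtf_names=$(grep -v '^#' "$2" | cut -f 1 | sort -u)
  [ "$fasta_names" = "$gtf_names" ]
}

if names_match "$workdir/genome.fa" "$workdir/annot.gtf"; then before=match; else before=mismatch; fi

# Replace the leading "chr" by the name used in the FASTA header
sed 's/^chr/gi|49175990|ref|NC_000913.2|/' "$workdir/annot.gtf" > "$workdir/annot.fixed.gtf"

if names_match "$workdir/genome.fa" "$workdir/annot.fixed.gtf"; then after=match; else after=mismatch; fi

echo "before renaming: $before, after renaming: $after"
```

The same check is worth running on any BED or GTF file before loading it next to BAM files mapped on a custom genome.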