Gene set enrichment analysis
Foreword
R code for advanced users
In this presentation, there will be screen captures for you to follow the lesson. There will also be R command lines (hidden by default). Do not worry about the command lines if you find them too challenging. Our goal here is to understand the main mechanism of gene set enrichment analysis. R is just a tool.
In fact, the whole practical session can be done in a few lines of code:
# Load the differential expression results
wilcox_de <- readRDS(file = "dataset/Wilcox_DE.rds")

# Enrichr analysis
enrich <- singleCellTK::runEnrichR(
  inSCE=wilcox_de,
  features=c("Jund", "Mindy1"),
  analysisName="Cond_G3_vs_G4"
)
enrich <- singleCellTK::getEnrichRResult(
  inSCE=enrich,
  analysisName="Cond_G3_vs_G4"
)
write.table(enrich$result, "entich_table.tsv")

# Pathway analysis
imported <- singleCellTK::importGeneSetsFromMSigDB(
  inSCE=wilcox_de,
  categoryIDs="C5-BP"
)
vam_results <- singleCellTK::runVAM(
  inSCE=imported,
  geneSetCollectionName="C5-BP",
  useAssay="SLNst",
  resultNamePrefix="VAM_GO_BP",
  center=TRUE
)
Our goal is to understand these lines, not to be able to write them.
Purpose of this session
Up to now, we have:
- Identified which cell each sequenced read comes from
- Identified which gene each read comes from
- Identified possible bias in gene expression for each cell
- Annotated cell clusters
- Identified differentially expressed genes among clusters of interest
We would like to identify the list of genes that characterize differences between the G3 and G4 groups.
At the end of this session you will know:
- What gene set analysis is
- How to choose a gene set database
- How to perform an enrichment analysis
- How to read gene set analysis results
Enrichr
How to do enrichment analysis?
We are interested in the genes Jund and Mindy1. Manually, as we are working with mice, we go to MGI (Mouse Genome Informatics), and look for “Jund”.
Multiple pathways are found; clicking on one gives us more information:
With lots of genes, and lots of databases, this will be tedious.
Enrichr with SingleCellTK
Enrichr does the same as above: it looks up the provided list of genes in the provided list of databases.
You can either enter a small list of genes, or let SingleCellTK select the list of genes of interest.
The magnitude of the fold change and the order in which the genes are entered do not count. The only thing that matters is the names of the genes.
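To make the "only the names matter" point concrete, here is a minimal base R sketch of an over-representation test, the general kind of test this family of tools relies on. The pathway content, the background size, and the helper name `ora_p_value` are made up for illustration, and Enrichr's actual scoring may differ from the plain hypergeometric test used here.

```r
# Toy over-representation test. All values below are hypothetical.
genes_of_interest <- c("Jund", "Mindy1", "Fos", "Atf3")
# Hypothetical pathway: NOT a real database entry
toy_pathway <- c("Jund", "Fos", "Atf3", "Junb", "Egr1")
# Assumed number of genes that could have been detected
background_size <- 20000

# Hypothetical helper: probability of seeing at least this much overlap
# when drawing length(genes) genes at random from the background
ora_p_value <- function(genes, gene_set, background_size) {
  overlap <- length(intersect(genes, gene_set))
  phyper(
    q = overlap - 1,                         # P(X > overlap - 1) = P(X >= overlap)
    m = length(gene_set),                    # genes in the pathway
    n = background_size - length(gene_set),  # genes outside the pathway
    k = length(genes),                       # size of our gene list
    lower.tail = FALSE
  )
}

p_value <- ora_p_value(genes_of_interest, toy_pathway, background_size)
# Same genes in reverse order: the result is identical
p_reordered <- ora_p_value(rev(genes_of_interest), toy_pathway, background_size)
```

No fold change is passed to the function at any point: the test only sees which names are in the list and which names are in the pathway.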
Gene set enrichment analysis
Why?
The large table containing 7,000 differentially expressed genes is not useful for humans.
Gene set analysis methods search gene names against pathways:
Gene set analysis methods also search genes against protein databases:
Most of you know these methods under the umbrella term “GSEA”. Please be aware that GSEA is the name of a tool developed by the Broad Institute.
Gene names and gene identifiers
We just did a whole analysis using names like ‘Jund’ or ‘Mindy1’. These names are understandable by humans, and using them while presenting your work or publishing your paper is perfectly acceptable.
Let me tell you about the gene ACTR2.
This gene name is a synonym for NCOA3, a nuclear receptor coactivator involved in transcription. It is located in Chr20.p. It is also the name of the Actin Related Protein 2, located in Chr2.p. Now you can guess the issue. Arp2 is a synonym of Arpc2. Arpc2 is a gene identified as differentially expressed. So… we may have a problem here.
That is why we, bioinformaticians, had the idea of giving every gene a unique identifier. We, bioinformaticians, however, did not have the idea of making these identifiers readable by humans. Sorry. We’ve been spending too much time with computers.
For us:
Ensembl ID | Synonyms |
---|---|
ENSG00000163466 | ACTR2 / ARPC2 / P34-Arc / PNAS-139 / … |
ENSG00000138071 | ACTR2 / ARP2 |
ENSG00000124151 | ACTR / NCOA / NCOA3 / TRAM-1 / … |
In fact, ARP2 is not even related to mice! I just showed you gene information about humans!
Ensembl ID, UniProt ID, Entrez ID: each database has its own gene, protein, or transcript identifiers. They are unique. They are not human-readable. You should use them when you can.
Many tools can “translate” these identifiers into human-readable names. The SingleCellTK graphical interface does not offer these translations. The GSEA lesson of the bulk RNA-Seq course has the example R code required for such translation.
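The ambiguity between names and identifiers can be sketched with a small lookup table in base R. The table below only copies the three human entries shown in this section, with their truncated synonym lists. The helper name `ids_for_symbol` is made up for this illustration, and a real translation would query a complete annotation resource instead.

```r
# Synonyms copied from the table in this section (human genes, truncated lists)
synonyms <- list(
  ENSG00000163466 = c("ACTR2", "ARPC2", "P34-Arc", "PNAS-139"),
  ENSG00000138071 = c("ACTR2", "ARP2"),
  ENSG00000124151 = c("ACTR", "NCOA", "NCOA3", "TRAM-1")
)

# Hypothetical helper: return every identifier whose synonyms include the symbol
ids_for_symbol <- function(symbol, lookup = synonyms) {
  hits <- vapply(lookup, function(syn) symbol %in% syn, logical(1))
  names(lookup)[hits]
}

ids_for_symbol("ACTR2")  # two identifiers: the name alone is ambiguous
ids_for_symbol("NCOA3")  # one identifier
```

Going the other way, each identifier points to exactly one entry, which is why identifiers are the safer key to carry through an analysis.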
Databases
You may use any gene set file, as long as you provide it in the input pane: