Differential Expression Analysis
Forewords
R for advanced users
In this presentation, there will be screen captures, for you to follow the lesson. There will also be R command lines (hidden by default). Do not take care of the command lines if you find them too challenging. Our goal here, is to understand the main mechanism of Differential Expression Analysis. R is just a tool.
In fact, the whole TP can be done in two lines of code:
<- readRDS(file = "dataset/INTEGRATED_02d_ClustLab.rds")
integrated_sctk
<- singleCellTK::runWilcox(
integrated_sctk inSCE=integrated_sctk,
groupName1="G3",
groupName2="G4",
analysisName="G3_vs_G4_Wilcox",
log2fcThreshold=0.01,
fdrThreshold=0.05,
class="sample",
useAssay="SLNst",
classGroup1="G3",
classGroup2="G4"
)
<- singleCellTK::runDESeq(
integrated_sctk inSCE=integrated_sctk,
groupName1="G3",
groupName2="G4",
analysisName="G3_vs_G4_DESeq",
log2fcThreshold=0.01,
fdrThreshold=0.05,
class="sample",
useAssay="counts",
classGroup1="G3",
classGroup2="G4"
)
Our goal is to understand these lines, not to be able to write them.
Purpose of this session
Up to now, we have:
- Identified to which cell each sequenced reads come from
- Identified to which gene each read come from
- Identified possible bias in gene expression for each cell
- Annotated cell clusters
We would like to identify the list of genes that caracterize differences between G3 and G4 groups.
At the end of this session you will know:
- how to select a differential analysis method
- how to select the correct normalization (if any?) that must be provided to your differential analysis method
- How to read differential expression results
Load RDS dataset
You aleady have a dataset loaded ? Good. Keep on going with it ! You don’t have one ? Use mine:
In SingleCellTK, click “Data”.
- Import SingleCellExperiment or Seurat object stored in an RDS file
- click “Browse” and select INTEGRATED_02d_ClustLab.rds
- click “Add To Sample List”
- click “Import”
- wait …
- Set “Rownames (Default)” as “Feature for Display”
This may be a bit long, depending on the size of the input dataset.
Insight
Have a look at your dataset. The Shiny interface returns the following description:
Aside, we might want to look at this dataset through R, using the
function head
.
A RDS (R Dataset) contains lots of information.
head(integrated_sctk)
class: SingleCellExperiment
dim: 6 7311
metadata(9): assayType sctk ... misc diffExp
assays(2): counts SLNst
rownames(6): Mrpl15 Lypla1 ... Rb1cc1 Pcmtd1
rowData names(4): rownames counts_filter
seurat_variableFeatures_vst_varianceStandardized
seurat_variableFeatures_vst_mean
colnames(7311): G3.1 G3.2 ... G4.4586 G4.4587
colData names(27): sample nCount ... SingleR_immgen_fine_delta.next
SingleR_immgen_fine_pruned.labels
reducedDimNames(5): QC_UMAP SLNst2K_seuratPCA
SLNst2K_seuratPCA_seuratUMAP20 PCA_FMNN PCA_FMNN_seuratUMAP30
mainExpName: NULL
altExpNames(1): SLNst2K
We can see the name of the normalized dataset and raw counts, we read the name of the dimension reduction steps, we can see the number of cells (7311) and the list of available metadata (8). This information is quite precious when you recieve an unknown dataset, but here, we know where these data come from.
Select a DE method
Available methods
There are many DEA methods available in SingleCellTK:
- wilcox: The wilcoxon test tests the mean of expression and looks for a difference in these means.
- MAST: This tool has been built for Single Cell. It is based on a statistical model called “Hurdle Model”, which excells with data that contains lots of zeros (which is our case in Single Cell RNA-Seq: most of the genes are not expressed.)
- DESeq2: This tool has originally been built for bulk RNA-Seq but now includes specific funcitons for Single Cell. It performs well when counts are highly variable or when you wand to compare a handful of cells.
- Limma: This tool is based on similar methods as DESeq2, and IMHO their results are quite alike.
- ANOVA: ANOVA stands for “Analysis of Variance” and is also called “Fisher analysis of Variance”.
The main question now is how to choose the right test: spoilers, there are no option better than another in all ways.
From Soneson & Robinson (2018) Nature Methods:
Here, researchers have designed an artificial dataset where they knew in advance the list of differentially expressed genes. They have used all these algorithms and consigned the results.
- DESeq2, Limma seems to have a higher number of false positives (genes called differentially expressed while they were not.)
- Wilcoxon seems to be better in general
- Mast performs well in absolute
ANOVA was not present in this study.
In conclusion, to select your method, use the following:
- If you have already done a study with one of these methods, keep using the same. This is crutial if you ever want to compare your new result with the old ones.
- If you want to compare your study with a published one, use the same method.
- If you have no idea, use Wilcoxon.
- If you have bulk data analyzed with DESeq2/Limma, use DESeq2/Limma. It will be easier to take both works in consideration.
Please, never use a simple Wilcoxon on bulk RNA-Seq data.
Hands on a single gene: Jund
Question is now to guess wether this gene is differnetially expressed or not.
Cell observations
Let’s have a look at the gene named ‘Jund’, involved in senescence and apoptosis.
::plotSCEViolin(
singleCellTKinSCE=integrated_sctk,
feature="Jund",
slotName="assays",
itemName="SLNst",
groupBy="sample",
ylab="Expression",
xlab="Sample",
defaultTheme=FALSE,
title="Expression of Jund in both conditions"
)
Using ‘intuition’, is that gene differentially expressed ?
I can give you expression density for that gene if it helps:
::plotSCEDensity(
singleCellTKinSCE=integrated_sctk,
feature="Jund",
slotName="assays",
itemName="SLNst",
groupBy="sample",
ylab="Expression",
xlab="Sample",
defaultTheme=FALSE,
title="Expression density of Jund in both conditions"
)
Okay, let’s have some informations about these distributions.
Within the G3 group:
<- as.data.frame(
countsG3 ::assay(x=integrated_sctk, i="SLNst")
SummarizedExperiment%>% select( matches("^G3."))
)
<- as.data.frame(t(countsG3["Jund", ]))
JundG3 $Group <- "G3"
JundG3
summary(JundG3)
Jund Group
Min. :-0.8072 Length:3468
1st Qu.:-0.8072 Class :character
Median :-0.8072 Mode :character
Mean :-0.4501
3rd Qu.:-0.8072
Max. : 2.0387
Withing the G4 group:
<- as.data.frame(
countsG4 ::assay(x=integrated_sctk, i="SLNst")
SummarizedExperiment%>% select( matches("^G4."))
)
<- as.data.frame(t(countsG4["Jund", ]))
JundG4 $Group <- "G4"
JundG4
summary(JundG4)
Jund Group
Min. :-0.8072 Length:3843
1st Qu.:-0.8072 Class :character
Median : 0.7629 Mode :character
Mean : 0.4061
3rd Qu.: 1.2086
Max. : 2.8188
From biology to statistics
Okay, let us resort on statistics to evaluate our chances to be guess correctly.
We have lots of observations: 3 468 cells in G3 group, and 3 843 cells in G4 groups. Statisticians really like to have a lot of observations! Ideally, statisticians always want to have more observation than tests. We have a total of 7311 observations and we are testing 1 gene. For them, this is a dream come true!
Our cells are supposed to be interactig with each others ? Are they independent from each others ? This is very important, and usually, it requires a discussion.
Does the expression in our cells follow a normal distribution ? It’s easy to check. Let’s draw the expression like we did above:
<- rbind(JundG3, JundG4)
Jund gghistogram(Jund, x="Jund", y = '..density..', fill = "steelblue", bins=15, add_density=TRUE)
Our distribution seems to be binomial we will have to rely on non-parametric tests.
Let’s run a non-parametric test base on the mean of distributions, since it’s the clothest to our ‘intuitive’ approach. Let there be Wilcoxon test.
In R, it’s quite straightforward: we have the function wilcoxon_test
to perform the test, then we can plot the result.
<- Jund %>% wilcox_test(Jund ~ Group) %>% add_significance()
stat.test <- Jund %>% wilcox_effsize(Jund ~ Group) eff.size
.y. | group1 | group2 | n1 | n2 | statistic | p | p.signif |
---|---|---|---|---|---|---|---|
Jund | G3 | G4 | 3468 | 3843 | 3766722 | 0 | **** |
Wilcoxon test says: the distributions are different, with a \(9.31e-280\) % of risks of being wrong. The gene Jund can safely be said differentially expressed. We can compute a fold change and conclude.
The same thing with DESeq2 leads to the following results:
baseMean | log2FoldChange | lfcSE | stat | pvalue | padj | |
---|---|---|---|---|---|---|
Jund | 1.742497 | 0.7222429 | 0.0274657 | 26.29621 | 0 | 0 |
With an adjusted pvalue of \(1.200032e-149\) (raw p-value = \(2.118861e-152\)) DEseq2 also says Jund is differentially expressed.
Here, two methods (Wilcoxon and DESeq2) lead to the same conclusion.
Hands on a second gene: Mindy1
Cell observations
Let’s have a look at the gene named ‘Mindy1’, mutation in that gene are associated with Encephalitis.
::plotSCEViolin(
singleCellTKinSCE=integrated_sctk,
feature="Mindy1",
slotName="assays",
itemName="SLNst",
groupBy="sample",
ylab="Expression",
xlab="Sample",
defaultTheme=FALSE,
title="Expression of Mindy1 in both conditions"
)
::plotSCEDensity(
singleCellTKinSCE=integrated_sctk,
feature="Mindy1",
slotName="assays",
itemName="SLNst",
groupBy="sample",
ylab="Expression",
xlab="Sample",
defaultTheme=FALSE,
title="Expression density of Mindy1 in both conditions"
)
Okay, let’s have some informations about these distributions.
Within the G3 group:
<- as.data.frame(
countsG3 ::assay(x=integrated_sctk, i="SLNst")
SummarizedExperiment%>% select( matches("^G3."))
)
<- as.data.frame(t(countsG3["Mindy1", ]))
Mindy1G3 $Group <- "G3"
Mindy1G3
summary(Mindy1G3)
Mindy1 Group
Min. :-0.1172 Length:3468
1st Qu.:-0.1172 Class :character
Median :-0.1172 Mode :character
Mean : 0.1271
3rd Qu.:-0.1172
Max. :10.0000
Withing the G4 group:
<- as.data.frame(
countsG4 ::assay(x=integrated_sctk, i="SLNst")
SummarizedExperiment%>% select( matches("^G4."))
)
<- as.data.frame(t(countsG4["Mindy1", ]))
Mindy1G4 $Group <- "G4"
Mindy1G4
summary(Mindy1G4)
Mindy1 Group
Min. :-0.1172 Length:3843
1st Qu.:-0.1172 Class :character
Median :-0.1172 Mode :character
Mean :-0.1172
3rd Qu.:-0.1172
Max. :-0.1172
<- rbind(Mindy1G3, Mindy1G4)
Mindy1 $Cells <- rownames(Mindy1) Mindy1
Not easy, isn’t it ?
Outlayers exists, but only in one condition. Are these outlayers related to the G3 population or are they linked to another unknown effect ?
Statistical results
Using the same parameters as for Jund, we can run Wilcoxon method, and have the following result:
<- Mindy1 %>% wilcox_test(Mindy1 ~ Group) %>% add_significance()
stat.test <- Mindy1 %>% wilcox_effsize(Mindy1 ~ Group) eff.size
.y. | group1 | group2 | n1 | n2 | statistic | p | p.signif |
---|---|---|---|---|---|---|---|
Mindy1 | G3 | G4 | 3468 | 3843 | 6865520 | 0 | **** |
Then with DESeq2:
kable(res["Mindy1", ], caption = "DESeq2 test result")
baseMean | log2FoldChange | lfcSE | stat | pvalue | padj | |
---|---|---|---|---|---|---|
Mindy1 | 1.068407 | -0.0518225 | 0.0335339 | -1.545376 | 0.1222553 | 0.2427601 |
Here the conclusion is not easy. Two tools, two answers with an adjusted P-Value under 5%. While DESeq2 rejects the hypothesis with a p-value = \(0.1222553\), adjusted to \(0.2427601\), Wilcoxon accepts it with an adjusted p-value of \(0.0497590536460365\).
DESeq2 identified these points as outlayers while Wilcoxon, standing on mean expression, identified it as differentially expressed. It is not possible to state who’s right and who’s wrong.
Select a dataset
Dataset depends on selected method
There it is quite easier:
Counts | |
---|---|
wilcox | Normalized counts |
MAST | Raw counts |
DESeq2 | Raw counts |
Limma | Raw counts |
ANOVA | Normalized counts |
Name your conditions
You may want to manually select your cells, here we have 7311 cells, we won’t click them one-by-one.
Name your conditions the way you want them to be named in your publication, or you’ll have to re-run all this work.
Choose your annotation
A missing annotation class here, means a missing parameter in your SingleCellExperiment object.
Choose parameters
FDR and Fold Change
About thresholds on FDR (False Discovery Rate) and Log2(FC) (Log of the Fold Change), there are many discussions.
Remember, here the threshold on Fold Change is Logged. A
log2(1) =
0. And keep in mind the following:
- If one selects a fold change threshold above 1.7, then their study concludes that smoking is not related to lung cancer.
- If one selects a fold change threshold above 1, then their study concludes that a fast-food based diet does not lead to weight gain.
- If one selects a fold change threshold above 1/25 000 000, then their study concludes: it is a complete hazard that mice have featal malformation when in presence of Bisphanol.
In conclusion, there one, and only one reason to filter on fold change: in my experience, a fold change below 0.7 will be hard to see/verify on wet-lab (qRT).
If you need to reduce a too large number of differentially expressed genes, then reduce the FDR to 0.01, or even better, to 0.001. With that, you reduce your number of false claims.
Now run the differential analysis.
Results and interpretation
Big table
There you have the results, make sure to download it.
Remember, a p-value can never be an absolute 0. Here, the numbers are so small, that R rounded them to 0.
Never consider raw P-Values, adjusted P-Values are better by far. So FDR column is the one that decides wether a gene is DE or not.
Note this table contains only 4075 entries, so 4075 genes. This means that a “missing” gene is a non-differentially-expressed gene.
Remember the FC a logged in your conclusions.
What about our genes: Jund and Mindy1 ?
Heatmap
::plotDEGHeatmap(
singleCellTKinSCE=integrated_sctk,
useResults="G3_vs_G4_Wilcox",
log2fcThreshold=0.01,
fdrThreshold=0.05,
class="sample",
useAssay="SLNst"
)
Violin plot
::plotDEGViolin(
singleCellTKinSCE=integrated_sctk,
useResults="G3_vs_G4_Wilcox"
)
Linear model plot
::plotDEGRegression(
singleCellTKinSCE=integrated_sctk,
useResults="G3_vs_G4_Wilcox"
)
TLDR – Too long, didn’t read
- Perform differential analysis with Wilxocon as long as you have lots of cells (> 300)
- Use the same method as the paper/work you’re referring to
- Don’t filter on Fold Change if you don’t have good reasons
- Save your results regularly