Differential Expression Analysis

logos

Forewords

R for advanced users

In this presentation, there will be screen captures, for you to follow the lesson. There will also be R command lines (hidden by default). Do not take care of the command lines if you find them too challenging. Our goal here, is to understand the main mechanism of Differential Expression Analysis. R is just a tool.

In fact, the whole TP can be done in two lines of code:

integrated_sctk <- readRDS(file = "dataset/INTEGRATED_02d_ClustLab.rds")

integrated_sctk <- singleCellTK::runWilcox(
  inSCE=integrated_sctk, 
  groupName1="G3", 
  groupName2="G4", 
  analysisName="G3_vs_G4_Wilcox", 
  log2fcThreshold=0.01, 
  fdrThreshold=0.05, 
  class="sample", 
  useAssay="SLNst", 
  classGroup1="G3", 
  classGroup2="G4"
)

integrated_sctk <- singleCellTK::runDESeq(
  inSCE=integrated_sctk, 
  groupName1="G3", 
  groupName2="G4", 
  analysisName="G3_vs_G4_DESeq", 
  log2fcThreshold=0.01, 
  fdrThreshold=0.05, 
  class="sample", 
  useAssay="counts", 
  classGroup1="G3", 
  classGroup2="G4"
)

Our goal is to understand these lines, not to be able to write them.

Purpose of this session

Up to now, we have:

  1. Identified to which cell each sequenced reads come from
  2. Identified to which gene each read come from
  3. Identified possible bias in gene expression for each cell
  4. Annotated cell clusters

We would like to identify the list of genes that caracterize differences between G3 and G4 groups.

At the end of this session you will know:

  1. how to select a differential analysis method
  2. how to select the correct normalization (if any?) that must be provided to your differential analysis method
  3. How to read differential expression results

Load RDS dataset

You aleady have a dataset loaded ? Good. Keep on going with it ! You don’t have one ? Use mine:

In SingleCellTK, click “Data”.

  1. Import SingleCellExperiment or Seurat object stored in an RDS file
  2. click “Browse” and select INTEGRATED_02d_ClustLab.rds
  3. click “Add To Sample List”
  4. click “Import”
  5. wait …
  6. Set “Rownames (Default)” as “Feature for Display”

This may be a bit long, depending on the size of the input dataset.

Insight

Have a look at your dataset. The Shiny interface returns the following description:

singlecelltk_input

Aside, we might want to look at this dataset through R, using the function head. A RDS (R Dataset) contains lots of information.

head(integrated_sctk)
class: SingleCellExperiment 
dim: 6 7311 
metadata(9): assayType sctk ... misc diffExp
assays(2): counts SLNst
rownames(6): Mrpl15 Lypla1 ... Rb1cc1 Pcmtd1
rowData names(4): rownames counts_filter
  seurat_variableFeatures_vst_varianceStandardized
  seurat_variableFeatures_vst_mean
colnames(7311): G3.1 G3.2 ... G4.4586 G4.4587
colData names(27): sample nCount ... SingleR_immgen_fine_delta.next
  SingleR_immgen_fine_pruned.labels
reducedDimNames(5): QC_UMAP SLNst2K_seuratPCA
  SLNst2K_seuratPCA_seuratUMAP20 PCA_FMNN PCA_FMNN_seuratUMAP30
mainExpName: NULL
altExpNames(1): SLNst2K

We can see the name of the normalized dataset and raw counts, we read the name of the dimension reduction steps, we can see the number of cells (7311) and the list of available metadata (8). This information is quite precious when you recieve an unknown dataset, but here, we know where these data come from.

Select a DE method

Available methods

dea_methods

There are many DEA methods available in SingleCellTK:

  1. wilcox: The wilcoxon test tests the mean of expression and looks for a difference in these means.
  2. MAST: This tool has been built for Single Cell. It is based on a statistical model called “Hurdle Model”, which excells with data that contains lots of zeros (which is our case in Single Cell RNA-Seq: most of the genes are not expressed.)
  3. DESeq2: This tool has originally been built for bulk RNA-Seq but now includes specific funcitons for Single Cell. It performs well when counts are highly variable or when you wand to compare a handful of cells.
  4. Limma: This tool is based on similar methods as DESeq2, and IMHO their results are quite alike.
  5. ANOVA: ANOVA stands for “Analysis of Variance” and is also called “Fisher analysis of Variance”.

The main question now is how to choose the right test: spoilers, there are no option better than another in all ways.

From Soneson & Robinson (2018) Nature Methods:

de_tools

Here, researchers have designed an artificial dataset where they knew in advance the list of differentially expressed genes. They have used all these algorithms and consigned the results.

  1. DESeq2, Limma seems to have a higher number of false positives (genes called differentially expressed while they were not.)
  2. Wilcoxon seems to be better in general
  3. Mast performs well in absolute

ANOVA was not present in this study.

In conclusion, to select your method, use the following:

  1. If you have already done a study with one of these methods, keep using the same. This is crutial if you ever want to compare your new result with the old ones.
  2. If you want to compare your study with a published one, use the same method.
  3. If you have no idea, use Wilcoxon.
  4. If you have bulk data analyzed with DESeq2/Limma, use DESeq2/Limma. It will be easier to take both works in consideration.

Please, never use a simple Wilcoxon on bulk RNA-Seq data.

Hands on a single gene: Jund

Question is now to guess wether this gene is differnetially expressed or not.

Cell observations

Let’s have a look at the gene named ‘Jund’, involved in senescence and apoptosis.

singleCellTK::plotSCEViolin(
  inSCE=integrated_sctk, 
  feature="Jund", 
  slotName="assays", 
  itemName="SLNst", 
  groupBy="sample", 
  ylab="Expression", 
  xlab="Sample", 
  defaultTheme=FALSE,
  title="Expression of Jund in both conditions"
)

Using ‘intuition’, is that gene differentially expressed ?

I can give you expression density for that gene if it helps:

singleCellTK::plotSCEDensity(
  inSCE=integrated_sctk, 
  feature="Jund", 
  slotName="assays", 
  itemName="SLNst", 
  groupBy="sample", 
  ylab="Expression", 
  xlab="Sample", 
  defaultTheme=FALSE, 
  title="Expression density of Jund in both conditions"
)

Okay, let’s have some informations about these distributions.

Within the G3 group:

countsG3 <- as.data.frame(
  SummarizedExperiment::assay(x=integrated_sctk, i="SLNst")
) %>% select( matches("^G3."))

JundG3 <- as.data.frame(t(countsG3["Jund", ]))
JundG3$Group <- "G3"

summary(JundG3)
      Jund            Group          
 Min.   :-0.8072   Length:3468       
 1st Qu.:-0.8072   Class :character  
 Median :-0.8072   Mode  :character  
 Mean   :-0.4501                     
 3rd Qu.:-0.8072                     
 Max.   : 2.0387                     

Withing the G4 group:

countsG4 <- as.data.frame(
  SummarizedExperiment::assay(x=integrated_sctk, i="SLNst")
) %>% select( matches("^G4."))

JundG4 <- as.data.frame(t(countsG4["Jund", ]))
JundG4$Group <- "G4"

summary(JundG4)
      Jund            Group          
 Min.   :-0.8072   Length:3843       
 1st Qu.:-0.8072   Class :character  
 Median : 0.7629   Mode  :character  
 Mean   : 0.4061                     
 3rd Qu.: 1.2086                     
 Max.   : 2.8188                     

From biology to statistics

Okay, let us resort on statistics to evaluate our chances to be guess correctly.

We have lots of observations: 3 468 cells in G3 group, and 3 843 cells in G4 groups. Statisticians really like to have a lot of observations! Ideally, statisticians always want to have more observation than tests. We have a total of 7311 observations and we are testing 1 gene. For them, this is a dream come true!

Our cells are supposed to be interactig with each others ? Are they independent from each others ? This is very important, and usually, it requires a discussion.

Does the expression in our cells follow a normal distribution ? It’s easy to check. Let’s draw the expression like we did above:

Jund <- rbind(JundG3, JundG4)
gghistogram(Jund, x="Jund", y = '..density..', fill = "steelblue", bins=15, add_density=TRUE)

Our distribution seems to be binomial we will have to rely on non-parametric tests.

Let’s run a non-parametric test base on the mean of distributions, since it’s the clothest to our ‘intuitive’ approach. Let there be Wilcoxon test.

In R, it’s quite straightforward: we have the function wilcoxon_test to perform the test, then we can plot the result.

stat.test <- Jund %>% wilcox_test(Jund ~ Group) %>% add_significance()
eff.size <- Jund %>% wilcox_effsize(Jund ~ Group)
Wilcoxon test result
.y. group1 group2 n1 n2 statistic p p.signif
Jund G3 G4 3468 3843 3766722 0 ****

Wilcoxon test says: the distributions are different, with a \(9.31e-280\) % of risks of being wrong. The gene Jund can safely be said differentially expressed. We can compute a fold change and conclude.

The same thing with DESeq2 leads to the following results:

DESeq2 test result
baseMean log2FoldChange lfcSE stat pvalue padj
Jund 1.742497 0.7222429 0.0274657 26.29621 0 0

With an adjusted pvalue of \(1.200032e-149\) (raw p-value = \(2.118861e-152\)) DEseq2 also says Jund is differentially expressed.

Here, two methods (Wilcoxon and DESeq2) lead to the same conclusion.

Hands on a second gene: Mindy1

Cell observations

Let’s have a look at the gene named ‘Mindy1’, mutation in that gene are associated with Encephalitis.

singleCellTK::plotSCEViolin(
  inSCE=integrated_sctk, 
  feature="Mindy1", 
  slotName="assays", 
  itemName="SLNst", 
  groupBy="sample", 
  ylab="Expression", 
  xlab="Sample", 
  defaultTheme=FALSE,
  title="Expression of Mindy1 in both conditions"
)

singleCellTK::plotSCEDensity(
  inSCE=integrated_sctk, 
  feature="Mindy1", 
  slotName="assays", 
  itemName="SLNst", 
  groupBy="sample", 
  ylab="Expression", 
  xlab="Sample", 
  defaultTheme=FALSE, 
  title="Expression density of Mindy1 in both conditions"
)

Okay, let’s have some informations about these distributions.

Within the G3 group:

countsG3 <- as.data.frame(
  SummarizedExperiment::assay(x=integrated_sctk, i="SLNst")
) %>% select( matches("^G3."))

Mindy1G3 <- as.data.frame(t(countsG3["Mindy1", ]))
Mindy1G3$Group <- "G3"

summary(Mindy1G3)
     Mindy1           Group          
 Min.   :-0.1172   Length:3468       
 1st Qu.:-0.1172   Class :character  
 Median :-0.1172   Mode  :character  
 Mean   : 0.1271                     
 3rd Qu.:-0.1172                     
 Max.   :10.0000                     

Withing the G4 group:

countsG4 <- as.data.frame(
  SummarizedExperiment::assay(x=integrated_sctk, i="SLNst")
) %>% select( matches("^G4."))

Mindy1G4 <- as.data.frame(t(countsG4["Mindy1", ]))
Mindy1G4$Group <- "G4"

summary(Mindy1G4)
     Mindy1           Group          
 Min.   :-0.1172   Length:3843       
 1st Qu.:-0.1172   Class :character  
 Median :-0.1172   Mode  :character  
 Mean   :-0.1172                     
 3rd Qu.:-0.1172                     
 Max.   :-0.1172                     
Mindy1 <- rbind(Mindy1G3, Mindy1G4)
Mindy1$Cells <- rownames(Mindy1)

Not easy, isn’t it ?

Outlayers exists, but only in one condition. Are these outlayers related to the G3 population or are they linked to another unknown effect ?

Statistical results

Using the same parameters as for Jund, we can run Wilcoxon method, and have the following result:

stat.test <- Mindy1 %>% wilcox_test(Mindy1 ~ Group) %>% add_significance()
eff.size <- Mindy1 %>% wilcox_effsize(Mindy1 ~ Group)
Wilcoxon test result
.y. group1 group2 n1 n2 statistic p p.signif
Mindy1 G3 G4 3468 3843 6865520 0 ****

Then with DESeq2:

kable(res["Mindy1", ], caption = "DESeq2 test result")
DESeq2 test result
baseMean log2FoldChange lfcSE stat pvalue padj
Mindy1 1.068407 -0.0518225 0.0335339 -1.545376 0.1222553 0.2427601

Here the conclusion is not easy. Two tools, two answers with an adjusted P-Value under 5%. While DESeq2 rejects the hypothesis with a p-value = \(0.1222553\), adjusted to \(0.2427601\), Wilcoxon accepts it with an adjusted p-value of \(0.0497590536460365\).

DESeq2 identified these points as outlayers while Wilcoxon, standing on mean expression, identified it as differentially expressed. It is not possible to state who’s right and who’s wrong.

Select a dataset

Dataset depends on selected method

There it is quite easier:

How to select your counts
Counts
wilcox Normalized counts
MAST Raw counts
DESeq2 Raw counts
Limma Raw counts
ANOVA Normalized counts

Name your conditions

name_conditions

You may want to manually select your cells, here we have 7311 cells, we won’t click them one-by-one.

Name your conditions the way you want them to be named in your publication, or you’ll have to re-run all this work.

Choose your annotation

annotation_class

A missing annotation class here, means a missing parameter in your SingleCellExperiment object.

Choose parameters

parameters_choise

FDR and Fold Change

About thresholds on FDR (False Discovery Rate) and Log2(FC) (Log of the Fold Change), there are many discussions.

Remember, here the threshold on Fold Change is Logged. A log2(1) =0. And keep in mind the following:

  1. If one selects a fold change threshold above 1.7, then their study concludes that smoking is not related to lung cancer.
  2. If one selects a fold change threshold above 1, then their study concludes that a fast-food based diet does not lead to weight gain.
  3. If one selects a fold change threshold above 1/25 000 000, then their study concludes: it is a complete hazard that mice have featal malformation when in presence of Bisphanol.

In conclusion, there one, and only one reason to filter on fold change: in my experience, a fold change below 0.7 will be hard to see/verify on wet-lab (qRT).

If you need to reduce a too large number of differentially expressed genes, then reduce the FDR to 0.01, or even better, to 0.001. With that, you reduce your number of false claims.

Now run the differential analysis.

Results and interpretation

Big table

There you have the results, make sure to download it.

Remember, a p-value can never be an absolute 0. Here, the numbers are so small, that R rounded them to 0.

Never consider raw P-Values, adjusted P-Values are better by far. So FDR column is the one that decides wether a gene is DE or not.

Note this table contains only 4075 entries, so 4075 genes. This means that a “missing” gene is a non-differentially-expressed gene.

Remember the FC a logged in your conclusions.

What about our genes: Jund and Mindy1 ?

Heatmap

singleCellTK::plotDEGHeatmap(
  inSCE=integrated_sctk,
  useResults="G3_vs_G4_Wilcox",
  log2fcThreshold=0.01,
  fdrThreshold=0.05,
  class="sample", 
  useAssay="SLNst"
)

heatmap

Violin plot

singleCellTK::plotDEGViolin(
  inSCE=integrated_sctk,
  useResults="G3_vs_G4_Wilcox"
)

violin_plot

Linear model plot

singleCellTK::plotDEGRegression(
  inSCE=integrated_sctk,
  useResults="G3_vs_G4_Wilcox"
)

regression_plot

TLDR – Too long, didn’t read

  1. Perform differential analysis with Wilxocon as long as you have lots of cells (> 300)
  2. Use the same method as the paper/work you’re referring to
  3. Don’t filter on Fold Change if you don’t have good reasons
  4. Save your results regularly