We load the integrated, clustered Seurat object :
## Loading the Seurat object
sobj <- base::readRDS(
file = "/shared/projects/form_2022_32/SingleCellRNASeq/Normalization/Scaled_Normalized_Harmony_Clustering_Seurat_Object.RDS"
)
## A quick overview of the object
sobj
An object of class Seurat
6941 features across 3984 samples within 1 assay
Active assay: RNA (6941 features, 2000 variable features)
3 dimensional reductions calculated: pca, umap, harmony
To annotate cells, we need some knowledge base.
We will use a database focused on immunological cell types called ImmGen, thanks to the celldex R package that “provides a collection of reference expression datasets with curated cell type labels, for use in procedures like automated annotation of single-cell data or deconvolution of bulk RNA-seq”
Note : In the following chunk code, you may be prompted to create a directory. This is normal, and you can safely say yes. This is due to the fact that for your first call, the ImmGen package will download and deposit the database in a personal cache, which will make any future calls faster.
## Loading the ImmGen database
annotation <- celldex::ImmGenData()
## A quick description of the db
annotation
class: SummarizedExperiment
dim: 22134 830
metadata(0):
assays(1): logcounts
rownames(22134): Zglp1 Vmn2r65 ... Tiparp Kdm1a
rowData names(0):
colnames(830):
GSM1136119_EA07068_260297_MOGENE-1_0-ST-V1_MF.11C-11B+.LU_1.CEL
GSM1136120_EA07068_260298_MOGENE-1_0-ST-V1_MF.11C-11B+.LU_2.CEL ...
GSM920654_EA07068_201214_MOGENE-1_0-ST-V1_TGD.VG4+24ALO.E17.TH_1.CEL
GSM920655_EA07068_201215_MOGENE-1_0-ST-V1_TGD.VG4+24ALO.E17.TH_2.CEL
colData names(3): label.main label.fine label.ont
This database contains 3 levels of granularity : * A “main” level (coarse grain) * A “fine” level (self-explanatory) * The “ONT” level (data are mapped to a defined ontology)
As we are in a context of sorted cells of the same lineage, we’re going to use the fine label.
Let’s see how many cell types are described in this ImmGen database :
length(unique(annotation$label.fine))
[1] 253
The tool we will use to perform the automatic cell type annotation, SingleR works better with the normalized data. Thus, we will extract the normalized matrix from our Seurat object :
norm_exp_mat <- Seurat::GetAssayData(
object = sobj,
assay = "RNA",
slot = "data"
)
dim(norm_exp_mat)
[1] 6941 3984
We are ready to start the annotation :
ann_predictions <- if (file.exists('ann_predictions.RDS')) readRDS('ann_predictions.RDS') else {
SingleR::SingleR(
test = norm_exp_mat,
ref = annotation,
labels = annotation$label.fine,
assay.type.test = "logcounts",
assay.type.ref = "logcounts",
BPPARAM = BiocParallel::SerialParam()
)
}
The resulting object is a special kind of data.frame
:
is(ann_predictions)
[1] "DFrame" "DataFrame" "SimpleList"
[4] "RectangularData" "List" "DataFrame_OR_NULL"
[7] "Vector" "list_OR_List" "Annotated"
[10] "vector_OR_Vector"
head(ann_predictions, n = 3)
DataFrame with 3 rows and 5 columns
scores first.labels
<matrix> <character>
TD3A.1024 0.2468332:0.2531070:0.2672337:... T cells (T.CD4TESTCJ)
TD3A.3642 0.1841115:0.1843756:0.1871214:... T cells (T.DPsm)
TD3A.1731 0.0862956:0.0870638:0.0932266:... T cells (T.DPsm)
tuning.scores labels pruned.labels
<DataFrame> <character> <character>
TD3A.1024 0.294951:0.294104 T cells (T.8Nve) T cells (T.8Nve)
TD3A.3642 0.336526:0.270188 T cells (T.DPsm) T cells (T.DPsm)
TD3A.1731 0.465899:0.334169 T cells (T.DPsm) T cells (T.DPsm)
It contains 5 columns of information for each cell :
dim(ann_predictions)
[1] 3984 5
How many different kind of labels were identified ?
length(unique(ann_predictions$labels))
[1] 37
How many cells have been labelled for each annotation ?
sorted_idlab <- sort(table(ann_predictions$labels), decreasing = TRUE)
head(sorted_idlab, n = 10)
T cells (T.DPsm) T cells (T.DP69+) T cells (T.DPbl)
3457 216 71
T cells (T.ISP) T cells (T.4SP24int) T cells (T.DN3A)
55 45 42
T cells (T.8Nve) T cells (T.8NVE.OT1) T cells (T.CD4.5H)
28 9 8
T cells (T.CD4TESTCJ)
6
Besides scoring, SingleR assesses the score quality, and prunes bad results.
How many cells got a poor quality annotation ?
summary(is.na(ann_predictions$pruned.labels))
Mode FALSE TRUE
logical 3973 11
SingleR allows to visualize some control plots :
SingleR::plotScoreHeatmap(ann_predictions)
SingleR::plotDeltaDistribution(results = ann_predictions, ncol = 4)
We add the annotation to our Seurat object.
all.equal(rownames(ann_predictions), colnames(sobj))
[1] TRUE
sobj$singler_cells_labels = ann_predictions$labels
We can visualize cells annotation the the 2D projection (uMAP, here) :
seeable_palette = setNames(
c(RColorBrewer::brewer.pal(name = "Dark2", n = 8),
c(1:(length(unique(ann_predictions$labels)) - 8))),
nm = names(sort(table(ann_predictions$labels), decreasing = TRUE)))
ann_plot = Seurat::DimPlot(
object = sobj,
reduction = "umap",
group.by = "singler_cells_labels",
pt.size = 2,
cols = seeable_palette
) + ggplot2::theme(legend.position = "bottom")
clust_plot = Seurat::DimPlot(
object = sobj,
reduction = "umap",
group.by = "RNA_snn_res.0.8",
pt.size = 2,
label = TRUE,
repel = TRUE
)
print(ann_plot + clust_plot)
Maybe the annotation is not perfectly suited for our dataset. Some cell populations in the annotation are closely related, and this leads to annotation competition for our cells.
It is possible to run the annotation at the cluster level : it will be cleaner than the single cell level annotation. But, be sure that the clustering is not merging several cell populations.
We can check the number of cell types attributed to each cluster :
table(sobj$singler_cells_labels,
sobj$RNA_snn_res.0.8)
0 1 2 3 4 5 6 7 8
DC (DC.8-4-11B-) 0 0 0 0 0 0 0 0 0
DC (DC.PDC.8+) 0 0 0 0 0 0 0 0 1
Macrophages (MF.II+480LO) 0 0 0 0 0 0 0 0 0
Macrophages (MF.MEDL) 0 0 0 0 0 0 0 0 0
Monocytes (MO.6C+II-) 0 0 0 0 0 0 0 0 0
T cells (T.4FP3+25+) 0 0 0 0 0 0 0 0 1
T cells (T.4Nve) 0 0 0 0 0 0 1 0 4
T cells (T.4SP24int) 0 0 0 0 41 2 1 0 1
T cells (T.8EFF.OT1.12HR.LISOVA) 0 0 0 0 3 0 0 0 2
T cells (T.8EFF.OT1.24HR.LISOVA) 0 0 0 0 0 0 1 0 1
T cells (T.8EFF.OT1.48HR.LISOVA) 0 0 0 0 0 0 0 0 1
T cells (T.8MEM.OT1.D45.LISOVA) 0 0 0 0 0 0 0 0 1
T cells (T.8Mem) 0 0 0 0 0 0 0 0 1
T cells (T.8MEM) 0 0 0 0 0 0 0 0 2
T cells (T.8MEMKLRG1-CD127+.D8.LISOVA) 0 0 0 0 0 0 0 0 3
T cells (T.8NVE.OT1) 0 0 0 0 1 0 0 0 8
T cells (T.8Nve) 0 0 0 0 2 0 0 0 26
T cells (T.8SP24-) 0 0 0 0 0 0 0 1 1
T cells (T.8SP24int) 0 0 0 0 0 0 0 0 1
T cells (T.CD4.5H) 0 0 0 0 8 0 0 0 0
T cells (T.CD4.CTR) 0 0 0 0 0 0 0 0 1
T cells (T.CD4TESTCJ) 0 0 0 0 3 0 1 0 2
T cells (T.CD8.1H) 0 0 0 0 0 0 0 0 1
T cells (T.CD8.5H) 0 0 0 0 3 0 0 0 0
T cells (T.CD8.CTR) 0 0 0 0 0 0 0 0 1
T cells (T.DN2A) 0 0 0 0 0 0 0 0 0
T cells (T.DN3-4) 0 0 0 0 1 0 0 0 0
T cells (T.DN3A) 0 0 0 0 0 0 0 0 0
T cells (T.DP69+) 0 0 1 0 201 3 10 0 0
T cells (T.DPbl) 0 0 0 0 0 0 0 71 0
T cells (T.DPsm) 877 771 531 529 79 316 284 22 0
T cells (T.ISP) 0 0 1 0 0 0 0 54 0
T cells (T.Tregs) 0 0 0 0 0 0 0 0 2
Tgd (Tgd.imm.VG1+VD6+) 0 0 0 0 2 0 0 0 0
Tgd (Tgd.imm.vg2) 0 0 0 0 0 0 0 0 1
Tgd (Tgd.VG2+) 0 0 0 0 0 0 0 0 1
Tgd (Tgd) 0 0 0 0 0 0 0 0 1
9 10
DC (DC.8-4-11B-) 1 0
DC (DC.PDC.8+) 1 0
Macrophages (MF.II+480LO) 1 0
Macrophages (MF.MEDL) 1 0
Monocytes (MO.6C+II-) 1 0
T cells (T.4FP3+25+) 0 0
T cells (T.4Nve) 0 0
T cells (T.4SP24int) 0 0
T cells (T.8EFF.OT1.12HR.LISOVA) 0 0
T cells (T.8EFF.OT1.24HR.LISOVA) 0 0
T cells (T.8EFF.OT1.48HR.LISOVA) 0 0
T cells (T.8MEM.OT1.D45.LISOVA) 0 0
T cells (T.8Mem) 0 0
T cells (T.8MEM) 0 0
T cells (T.8MEMKLRG1-CD127+.D8.LISOVA) 0 0
T cells (T.8NVE.OT1) 0 0
T cells (T.8Nve) 0 0
T cells (T.8SP24-) 0 0
T cells (T.8SP24int) 0 0
T cells (T.CD4.5H) 0 0
T cells (T.CD4.CTR) 0 0
T cells (T.CD4TESTCJ) 0 0
T cells (T.CD8.1H) 0 0
T cells (T.CD8.5H) 0 0
T cells (T.CD8.CTR) 0 0
T cells (T.DN2A) 0 2
T cells (T.DN3-4) 0 1
T cells (T.DN3A) 0 42
T cells (T.DP69+) 1 0
T cells (T.DPbl) 0 0
T cells (T.DPsm) 46 2
T cells (T.ISP) 0 0
T cells (T.Tregs) 0 0
Tgd (Tgd.imm.VG1+VD6+) 0 0
Tgd (Tgd.imm.vg2) 0 0
Tgd (Tgd.VG2+) 0 0
Tgd (Tgd) 0 0
We can eventually check if some clusters contain multiple cell types. We compute the proportion of each cell type in each cluster. If a cluster is composed of two cell types (or more) representing more than 30 % of cells, maybe this cluster is too large, thus underclustered ?
pop_by_cluster = prop.table(table(sobj$singler_cells_labels,
sobj$RNA_snn_res.0.8),
margin = 2)
colSums(pop_by_cluster > 0.3)
0 1 2 3 4 5 6 7 8 9 10
1 1 1 1 1 1 1 2 1 1 1
Clusters : 7 maybe contain several cell types : be careful !
We will now run the annotation at the cluster level (SingleR will summarize the expression profiles of all cells from the same cluster, and then assess the resulting aggregation) :
clust_col <- 'RNA_snn_res.0.8'
clust_ann_predictions <- if (file.exists('clust_ann_predictions.RDS')) readRDS('clust_ann_predictions.RDS') else {
SingleR::SingleR(
test = norm_exp_mat,
clusters = sobj[[clust_col]],
ref = annotation,
labels = annotation$label.fine,
assay.type.test = "logcounts",
assay.type.ref = "logcounts",
BPPARAM = BiocParallel::SerialParam()
)
}
How many labels were affected to our clusters ?
length(unique(clust_ann_predictions$labels))
[1] 5
How many clusters have been labelled for each annotation label ?
head(sort(table(clust_ann_predictions$labels), decreasing = TRUE), n = 10)
T cells (T.DPsm) T cells (T.8Nve) T cells (T.DN3A) T cells (T.DP69+)
7 1 1 1
T cells (T.DPbl)
1
For how many clusters was the annotation of poor quality ?
summary(is.na(clust_ann_predictions$pruned.labels))
Mode FALSE
logical 11
We can visualize the scores for each cell type, to each cell, as a heatmap :
SingleR::plotScoreHeatmap(clust_ann_predictions)
We add the annotation to our Seurat object.
clust_labels_col <- paste0(clust_col, '_labels')
sobj@meta.data[[clust_labels_col]] <- sobj@meta.data[[clust_col]]
levels(sobj@meta.data[[clust_labels_col]]) = clust_ann_predictions$labels
We can visualize cells annotation the the 2D projection :
ann_plot = Seurat::DimPlot(
object = sobj,
reduction = "umap",
group.by = clust_labels_col,
pt.size = 2,
label = TRUE
) + ggplot2::theme(legend.position = "bottom")
clust_plot = Seurat::DimPlot(
object = sobj,
reduction = "umap",
group.by = clust_col,
pt.size = 2,
label = TRUE,
repel = TRUE
)
ann_plot + clust_plot
We save the annotated Seurat object :
path <- "/shared/projects/form_2022_32/SingleCellRNASeq/Normalization/"
base::saveRDS(
object = sobj,
file = paste0(path, "/Scaled_Normalized_Harmony_Clustering_Annotated_Seurat_Object.RDS")
)
sessionInfo()
R version 4.2.1 (2022-06-23)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 18.04.6 LTS
Matrix products: default
BLAS/LAPACK: /shared/ifbstor1/software/miniconda/envs/r-4.2.1/lib/libopenblasp-r0.3.21.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] celldex_1.6.0 SummarizedExperiment_1.26.1
[3] Biobase_2.56.0 GenomicRanges_1.48.0
[5] GenomeInfoDb_1.32.4 IRanges_2.30.1
[7] S4Vectors_0.34.0 BiocGenerics_0.42.0
[9] MatrixGenerics_1.8.1 matrixStats_0.62.0
loaded via a namespace (and not attached):
[1] AnnotationHub_3.4.0 BiocFileCache_2.4.0
[3] plyr_1.8.8 igraph_1.3.5
[5] lazyeval_0.2.2 sp_1.5-1
[7] splines_4.2.1 BiocParallel_1.30.4
[9] listenv_0.8.0 scattermore_0.8
[11] ggplot2_3.4.0 digest_0.6.30
[13] htmltools_0.5.3 viridis_0.6.2
[15] fansi_1.0.3 magrittr_2.0.3
[17] memoise_2.0.1 ScaledMatrix_1.4.1
[19] tensor_1.5 cluster_2.1.4
[21] ROCR_1.0-11 globals_0.16.1
[23] Biostrings_2.64.1 spatstat.sparse_3.0-0
[25] colorspace_2.0-3 rappdirs_0.3.3
[27] blob_1.2.3 ggrepel_0.9.2
[29] xfun_0.34 dplyr_1.0.10
[31] crayon_1.5.2 RCurl_1.98-1.9
[33] jsonlite_1.8.3 progressr_0.11.0
[35] spatstat.data_3.0-0 survival_3.4-0
[37] zoo_1.8-11 glue_1.6.2
[39] polyclip_1.10-4 gtable_0.3.1
[41] zlibbioc_1.42.0 XVector_0.36.0
[43] leiden_0.4.3 DelayedArray_0.22.0
[45] BiocSingular_1.12.0 future.apply_1.10.0
[47] abind_1.4-5 scales_1.2.1
[49] pheatmap_1.0.12 DBI_1.1.3
[51] spatstat.random_3.0-1 miniUI_0.1.1.1
[53] Rcpp_1.0.9 viridisLite_0.4.1
[55] xtable_1.8-4 reticulate_1.26
[57] rsvd_1.0.5 bit_4.0.4
[59] htmlwidgets_1.5.4 httr_1.4.4
[61] RColorBrewer_1.1-3 ellipsis_0.3.2
[63] Seurat_4.2.1 ica_1.0-3
[65] farver_2.1.1 pkgconfig_2.0.3
[67] dbplyr_2.2.1 sass_0.4.2
[69] uwot_0.1.14 deldir_1.0-6
[71] utf8_1.2.2 labeling_0.4.2
[73] tidyselect_1.2.0 rlang_1.0.6
[75] reshape2_1.4.4 later_1.3.0
[77] AnnotationDbi_1.58.0 BiocVersion_3.15.2
[79] munsell_0.5.0 tools_4.2.1
[81] cachem_1.0.6 cli_3.4.1
[83] ExperimentHub_2.4.0 generics_0.1.3
[85] RSQLite_2.2.18 ggridges_0.5.4
[87] evaluate_0.18 stringr_1.4.1
[89] fastmap_1.1.0 yaml_2.3.6
[91] goftest_1.2-3 knitr_1.40
[93] bit64_4.0.5 fitdistrplus_1.1-8
[95] purrr_0.3.5 RANN_2.6.1
[97] KEGGREST_1.36.3 sparseMatrixStats_1.8.0
[99] pbapply_1.5-0 future_1.29.0
[101] nlme_3.1-160 mime_0.12
[103] compiler_4.2.1 rstudioapi_0.14
[105] interactiveDisplayBase_1.34.0 filelock_1.0.2
[107] curl_4.3.3 plotly_4.10.1
[109] png_0.1-7 spatstat.utils_3.0-1
[111] tibble_3.1.8 bslib_0.4.1
[113] stringi_1.7.8 highr_0.9
[115] lattice_0.20-45 Matrix_1.5-3
[117] vctrs_0.5.0 pillar_1.8.1
[119] lifecycle_1.0.3 BiocManager_1.30.19
[121] spatstat.geom_3.0-3 lmtest_0.9-40
[123] jquerylib_0.1.4 BiocNeighbors_1.14.0
[125] RcppAnnoy_0.0.20 data.table_1.14.4
[127] cowplot_1.1.1 bitops_1.0-7
[129] irlba_2.3.5.1 httpuv_1.6.6
[131] patchwork_1.1.2 R6_2.5.1
[133] promises_1.2.0.1 KernSmooth_2.23-20
[135] gridExtra_2.3 parallelly_1.32.1
[137] SingleR_1.10.0 codetools_0.2-18
[139] MASS_7.3-58.1 assertthat_0.2.1
[141] withr_2.5.0 SeuratObject_4.1.3
[143] sctransform_0.3.5 GenomeInfoDbData_1.2.8
[145] parallel_4.2.1 beachmat_2.12.0
[147] grid_4.2.1 tidyr_1.2.1
[149] DelayedMatrixStats_1.18.2 rmarkdown_2.18
[151] Rtsne_0.16 spatstat.explore_3.0-5
[153] shiny_1.7.3