4.5 实战五 | CEL-seq2 | 人胰腺细胞
刘小泽写于2020.7.20

1 前言

第一代CEL-Seq由以色列理工学院研究团队于2012发表在Cell Reports
CEL-Seq的全称是:Cell expression by linear amplification and sequencing,采用线性扩增的测序方法,其主要优势在于错误率比较低,但是和PCR一样都存在序列偏向性。另外还具有以下优点:
    使用barcode允许多种类型的细胞同时分析;
    每个细胞在一个管中进行处理,避免了样本之间的污染
后来在2016年,CEL-seq2诞生,与早期版本相比具有低成本、高灵敏度,并缩短了实验操作实验时间

数据准备

这里使用的数据是: Grun et al. (2016) 中不同人类供体的胰腺细胞
1
library(scRNAseq)
2
sce.grun <- GrunPancreasData()
3
4
sce.grun
5
# class: SingleCellExperiment
6
# dim: 20064 1728
7
# metadata(0):
8
# assays(1): counts
9
# rownames(20064): A1BG-AS1__chr19 A1BG__chr19 ...
10
# ZZEF1__chr17 ZZZ3__chr1
11
# rowData names(2): symbol chr
12
# colnames(1728): D2ex_1 D2ex_2 ... D17TGFB_95
13
# D17TGFB_96
14
# colData names(2): donor sample
15
# reducedDimNames(0):
16
# altExpNames(1): ERCC
17
18
rowData(sce.grun)
19
# DataFrame with 20064 rows and 2 columns
20
# symbol chr
21
# <character> <character>
22
# A1BG-AS1__chr19 A1BG-AS1 chr19
23
# A1BG__chr19 A1BG chr19
24
# A1CF__chr10 A1CF chr10
25
# A2M-AS1__chr12 A2M-AS1 chr12
26
# A2ML1__chr12 A2ML1 chr12
27
# ... ... ...
28
# ZYG11A__chr1 ZYG11A chr1
29
# ZYG11B__chr1 ZYG11B chr1
30
# ZYX__chr7 ZYX chr7
31
# ZZEF1__chr17 ZZEF1 chr17
32
# ZZZ3__chr1 ZZZ3 chr1
Copied!

ID转换

先将symbol转为Ensembl ID
1
library(org.Hs.eg.db)
2
gene.ids <- mapIds(org.Hs.eg.db, keys=rowData(sce.grun)$symbol,
3
keytype="SYMBOL", column="ENSEMBL")
Copied!
接下来有两种处理方式:
第一种:将Ensembl ID加到原来数据中
1
library(scater)
2
rowData(sce.grun)$ensembl <- uniquifyFeatureNames(
3
rowData(sce.grun)$symbol, gene.ids)
4
rowData(sce.grun)
5
# DataFrame with 20064 rows and 3 columns
6
# symbol chr ensembl
7
# <character> <character> <character>
8
# A1BG-AS1__chr19 A1BG-AS1 chr19 ENSG00000268895
9
# A1BG__chr19 A1BG chr19 ENSG00000121410
10
# A1CF__chr10 A1CF chr10 ENSG00000148584
11
# A2M-AS1__chr12 A2M-AS1 chr12 ENSG00000245105
12
# A2ML1__chr12 A2ML1 chr12 ENSG00000166535
13
# ... ... ... ...
14
# ZYG11A__chr1 ZYG11A chr1 ENSG00000203995
15
# ZYG11B__chr1 ZYG11B chr1 ENSG00000162378
16
# ZYX__chr7 ZYX chr7 ENSG00000159840
17
# ZZEF1__chr17 ZZEF1 chr17 ENSG00000074755
18
# ZZZ3__chr1 ZZZ3 chr1 ENSG00000036549
Copied!
第二种:将没有匹配的NA去掉,并且去掉重复的行(因为可能存在一个symbol对应多个Ensembl ID的情况)【这里就试一下这种方法】
1
keep <- !is.na(gene.ids) & !duplicated(gene.ids)
2
> sum(is.na(gene.ids));sum(duplicated(gene.ids))
3
[1] 2487
4
[1] 2515
5
6
sce.grun <- sce.grun[keep,]
7
rownames(sce.grun) <- gene.ids[keep]
8
> dim(sce.grun) # 过滤掉2000多基因
9
[1] 17548 1728
Copied!

2 质控

前几次实战的质控都很顺利,但并不意味着实际操作中这一步就可以跑跑代码解决 这一次,就遇到了一个常规代码解决不了的问题,需要思考+调整

数据备份

把unfiltered数据主要用在质控的探索上
1
unfiltered <- sce.grun
Copied!
看到这里的数据中没有线粒体基因
1
table(rowData(sce.grun)$chr)
2
#
3
# chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18
4
# 1766 672 1071 947 328 551 532 740 1050 256
5
# chr19 chr2 chr20 chr21 chr22 chr3 chr4 chr5 chr6 chr7
6
# 1245 1142 484 188 400 1014 693 791 907 825
7
# chr8 chr9 chrX chrY
8
# 604 664 658 20
Copied!
倒是有几个donor的批次信息
1
table(colData(sce.grun)$donor)
2
#
3
# D10 D17 D2 D3 D7
4
# 288 480 96 480 384
Copied!
我们之前进行质控的时候,一般采用的先使用perCellQCMetrics 指定线粒体计算,然后用 quickPerCellQC 指定ERCC+线粒体进行过滤的策略。现在既然没有线粒体的信息,那么在第一步的perCellQCMetrics就不需要加subset参数了
1
stats <- perCellQCMetrics(sce.grun)
2
# 参数subsets的含义: used to identify interesting feature subsets such as ERCC spike-in transcripts or mitochondrial genes.
3
# > colnames(stats)
4
# [1] "sum" "detected"
5
# [3] "percent_top_50" "percent_top_100"
6
# [5] "percent_top_200" "percent_top_500"
7
# [7] "altexps_ERCC_sum" "altexps_ERCC_detected"
8
# [9] "altexps_ERCC_percent" "total"
Copied!

然后进行过滤,并让函数注意批次信息

1
qc <- quickPerCellQC(stats, percent_subsets="altexps_ERCC_percent",
2
batch=sce.grun$donor)
3
colSums(as.matrix(qc), na.rm=TRUE)
4
# low_lib_size low_n_features high_altexps_ERCC_percent
5
# 85 93 88
6
# discard
7
# 129
Copied!

作图

1
colData(unfiltered) <- cbind(colData(unfiltered), stats)
2
unfiltered$discard <- qc$discard
3
4
gridExtra::grid.arrange(
5
plotColData(unfiltered, x="donor", y="sum", colour_by="discard") +
6
scale_y_log10() + ggtitle("Total count"),
7
plotColData(unfiltered, x="donor", y="detected", colour_by="discard") +
8
scale_y_log10() + ggtitle("Detected features"),
9
plotColData(unfiltered, x="donor", y="altexps_ERCC_percent",
10
colour_by="discard") + ggtitle("ERCC percent"),
11
ncol=3
12
)
Copied!

【重点】发现了问题

质控并非是一帆风顺的,遇到问题应该知道如何探索解决
看到ERCC那个图,很明显D10和D3没有被正确过滤,它们中高ERCC的细胞还是没有被去掉
质控过滤有一个前提条件:大部分的细胞都是高质量的,但显然这里的D10、D3不符合这个要求,因此我们需要再次只根据D17、D2、D7这三个“好一点的样本”进行过滤
1
# 重新计算过滤条件
2
qc2 <- quickPerCellQC(stats, percent_subsets="altexps_ERCC_percent",
3
batch=sce.grun$donor,
4
subset=sce.grun$donor %in% c("D17", "D7", "D2"))
5
6
colSums(as.matrix(qc2), na.rm=TRUE)
7
# low_lib_size low_n_features high_altexps_ERCC_percent discard
8
# 450 511 606 665
Copied!
这次再作图看看
1
colData(unfiltered) <- cbind(colData(unfiltered), stats)
2
unfiltered$discard <- qc2$discard
3
4
gridExtra::grid.arrange(
5
plotColData(unfiltered, x="donor", y="sum", colour_by="discard") +
6
scale_y_log10() + ggtitle("Total count"),
7
plotColData(unfiltered, x="donor", y="detected", colour_by="discard") +
8
scale_y_log10() + ggtitle("Detected features"),
9
plotColData(unfiltered, x="donor", y="altexps_ERCC_percent",
10
colour_by="discard") + ggtitle("ERCC percent"),
11
ncol=3
12
)
Copied!
看到这次对D10、D3的过滤力度大了很多,基本满意

最后把过滤条件应用在原数据

1
sce.grun <- sce.grun[,!qc$discard]
Copied!

3 归一化

也是使用预分群+去卷积计算size factor的方法
1
library(scran)
2
set.seed(1000)
3
clusters <- quickCluster(sce.grun)
4
sce.grun <- computeSumFactors(sce.grun, cluster=clusters)
5
sce.grun <- logNormCounts(sce.grun)
6
7
summary(sizeFactors(sce.grun))
8
# Min. 1st Qu. Median Mean 3rd Qu. Max.
9
# 0.09581 0.50459 0.79505 1.00000 1.22212 12.16491
Copied!
看看两种归一化方法的差异
1
plot(librarySizeFactors(sce.grun), sizeFactors(sce.grun), pch=16,
2
col=as.integer(factor(sce.grun$donor)),
3
xlab="Library size factors", ylab="Deconvolution factors", log="xy")
4
abline(a=0, b=1, col="red")
Copied!
也是再一次证明了去卷积方法比常规的方法更加精细,对批次层面也会纳入考量

4 找表达量高变化基因

既然有批次的信息,那就加到构建模型中去,而且有ERCC,就选用modelGeneVarWithSpikes这个方法
1
block <- paste0(sce.grun$sample, "_", sce.grun$donor)
2
dec.grun <- modelGeneVarWithSpikes(sce.grun, spikes="ERCC", block=block)
3
top.grun <- getTopHVGs(dec.grun, prop=0.1)
Copied!

5 矫正批次效应

使用FastMNN方法:Correct for batch effects in single-cell expression data using a fast version of the mutual nearest neighbors (MNN) method.
1
library(batchelor)
2
set.seed(1001010)
3
merged.grun <- fastMNN(sce.grun, subset.row=top.grun, batch=sce.grun$donor)
4
merged.grun
5
# class: SingleCellExperiment
6
# dim: 1467 1063
7
# metadata(2): merge.info pca.info
8
# assays(1): reconstructed
9
# rownames(1467): ENSG00000172023 ENSG00000172016 ...
10
# ENSG00000144867 ENSG00000179820
11
# rowData names(1): rotation
12
# colnames(1063): D2ex_1 D2ex_2 ... D17TGFB_94
13
# D17TGFB_95
14
# colData names(1): batch
15
# reducedDimNames(2): corrected
16
# altExpNames(0):
Copied!
检查一下结果,使用lost.var ,值越大表示丢失的真实生物异质性越多
1
metadata(merged.grun)$merge.info$lost.var
2
# D10 D17 D2 D3 D7
3
# [1,] 0.030843209 0.030956515 0.000000000 0.00000000 0.00000000
4
# [2,] 0.007129994 0.011275730 0.036836491 0.00000000 0.00000000
5
# [3,] 0.003528589 0.005027722 0.006846244 0.05020422 0.00000000
6
# [4,] 0.012070814 0.014673302 0.013972521 0.01267586 0.05281354
Copied!
    It contains a matrix of the variance lost in each batch (column) at each merge step (row).
    Large proportions of lost variance (>10%) suggest that correction is removing genuine biological heterogeneity.

6 降维 + 聚类

降维

1
set.seed(100111)
2
merged.grun <- runTSNE(merged.grun, dimred="corrected")
Copied!

聚类

1
snn.gr <- buildSNNGraph(merged.grun, use.dimred="corrected")
2
colLabels(merged.grun) <- factor(igraph::cluster_walktrap(snn.gr)$membership)
3
4
# 聚类的分群与批次之间的对比
5
table(Cluster=colLabels(merged.grun), Donor=merged.grun$batch)
6
## Donor
7
## Cluster D10 D17 D2 D3 D7
8
## 1 32 71 33 80 29
9
## 2 5 7 3 4 4
10
## 3 3 44 0 0 13
11
## 4 11 119 0 0 55
12
## 5 12 73 30 3 78
13
## 6 14 29 3 2 64
14
## 7 1 9 0 0 7
15
## 8 2 5 2 3 4
16
## 9 5 13 0 0 10
17
## 10 15 33 11 10 37
18
## 11 5 18 0 1 33
19
## 12 4 13 0 0 1
Copied!
最后作图看看批次效应矫正如何
1
gridExtra::grid.arrange(
2
plotTSNE(merged.grun, colour_by="label"),
3
plotTSNE(merged.grun, colour_by="batch"),
4
ncol=2
5
)
Copied!
最近更新 1yr ago