1 前言
这是也是来自多个供体的人类胰腺细胞,使用Smart-seq2建库技术,数据来自Segerstolpe et al. (2016)
数据准备
复制 library (scRNAseq)
sce.seger <- SegerstolpePancreasData()
sce.seger
# class: SingleCellExperiment
# dim: 26179 3514
# metadata(0):
# assays(1): counts
# rownames(26179): SGIP1 AZIN2 ... BIVM-ERCC5 eGFP
# rowData names(2): symbol refseq
# colnames(3514): HP1502401_N13 HP1502401_D14 ...
# HP1526901T2D_O11 HP1526901T2D_A8
# colData names(8): Source Name individual ... age
# body mass index
# reducedDimNames(0):
# altExpNames(1): ERCC
看到3500多个细胞,包含ERCC,使用Symbol ID
看下样本信息:
ID转换
选择的方式是:将没有匹配的NA去掉,并且去掉重复的行
复制 # 首先得到symbol ID和对应的Ensembl ID(其中会存在无对应的NA情况)
library (AnnotationHub)
edb <- AnnotationHub() [[ "AH73881" ]]
symbols <- rowData( sce.seger ) $ symbol
ens.id <- mapIds( edb, keys = symbols, keytype = "SYMBOL" , column = "GENEID" )
# 之前见到的方法是:
# keep <- !is.na(gene.ids) & !duplicated(gene.ids)
# 这里使用了另一种方法(不是直接将NA去掉,而且替换成了symbol)
ens.id <- ifelse ( is.na (ens.id), symbols, ens.id)
keep <- ! duplicated (ens.id)
sce.seger <- sce.seger[keep,]
rownames (sce.seger) <- ens.id[keep]
小结一下:至此见到了三种ID转换的方式,根据最后保留的基因数量,可以排个序:
保留基因最多(保留了NA和重复):uniquifyFeatureNames
中等(保留了NA,去掉重复):ifelse(is.na(ens.id), symbols, ens.id)
最少(去掉了NA以及重复):!is.na(gene.ids) & !duplicated(gene.ids)
编辑样本信息
之前有8列样本的信息,有点冗余了。这里只保留3列关心的,并重新命名
复制 emtab.meta <- colData( sce.seger ) [, c ( "cell type" ,
"individual" , "single cell well quality" )]
colnames (emtab.meta) <- c ( "CellType" , "Donor" , "Quality" )
colData( sce.seger ) <- emtab.meta
另外把细胞类型这一列中的“cell”字符去掉,并把首字母大写
复制 sce.seger $ CellType <- gsub ( " cell" , "" , sce.seger $ CellType)
sce.seger $ CellType <- paste0 (
toupper ( substr (sce.seger $ CellType, 1 , 1 )),
substring (sce.seger $ CellType, 2 ))
2 质控
依然是备份一下,把unfiltered数据主要用在质控的探索上
复制 unfiltered <- sce.seger
之前作者在数据中已经标注了细胞质量,可以看到有问题的细胞还是很多的:
复制 table (sce.seger $ Quality)
#
# control, 2-cell well control, empty well low quality cell OK
# 32 96 1177 2209
因此就要注意了,这里的数据会不会满足“大部分细胞都是高质量的”这个假设?
还是需要试一下,看看结果先
复制 library (scater)
stats <- perCellQCMetrics( sce.seger )
qc1 <- quickPerCellQC( stats, percent_subsets = "altexps_ERCC_percent" ,
batch = sce.seger $ Donor )
colData( unfiltered ) <- cbind ( colData( unfiltered ) , stats)
unfiltered $ discard <- qc1 $ discard
gridExtra :: grid.arrange(
plotColData( unfiltered, x = "Donor" , y = "sum" , colour_by = "discard" ) +
scale_y_log10() + ggtitle( "Total count" ) +
theme( axis.text.x = element_text( angle = 90 )) ,
plotColData( unfiltered, x = "Donor" , y = "detected" , colour_by = "discard" ) +
scale_y_log10() + ggtitle( "Detected features" ) +
theme( axis.text.x = element_text( angle = 90 )) ,
plotColData( unfiltered, x = "Donor" , y = "altexps_ERCC_percent" ,
colour_by = "discard" ) + ggtitle( "ERCC percent" ) +
theme( axis.text.x = element_text( angle = 90 )) ,
ncol = 3
)
看到HP1509101在过滤时存在过滤不完全的情况,HP1504901过滤的ERCC数量太多,推测这两个批次效果可能并不是很好,可能存在大量的低质量细胞
因此,再次指定subset
参数,重新画图
复制 library (scater)
stats <- perCellQCMetrics( sce.seger )
qc <- quickPerCellQC( stats, percent_subsets = "altexps_ERCC_percent" ,
batch = sce.seger $ Donor,
subset =! sce.seger $ Donor %in% c ( "HP1504901" , "HP1509101" ) )
看看过滤掉多少
复制 colSums ( as.matrix (qc))
## low_lib_size low_n_features high_altexps_ERCC_percent
## 788 1056 1031
## discard
## 1246
最后将qc过滤的与本来标注低质量的一同过滤
复制 low.qual <- sce.seger $ Quality == "low quality cell"
sce.seger <- sce.seger[, ! (qc $ discard | low.qual)]
# 过滤了大概1500个细胞
> dim (unfiltered); dim (sce.seger)
[ 1 ] 26179 3514
[ 1 ] 26179 2090
3 归一化
此处会有一点小问题,值得注意!
本来有ERCC,操作应该是:
复制 library (scran)
sce.seger = computeSpikeFactors( sce.seger, "ERCC" )
sce.seger <- logNormCounts( sce.seger )
# Error in .local(x, ...) : size factors should be positive
但由于存在几个细胞中一个ERCC都没有,所以会报错
此时面临两个选择:要么把这几个细胞去掉;要么就不借助ERCC,用另一种去卷积方法
复制 > table ( colSums ( counts(altExp( sce.seger )) ) == 0 )
FALSE TRUE
2087 3
如果要去掉这几个细胞:
复制 test = sce.seger[, ! colSums ( counts(altExp( sce.seger )) ) == 0 ]
sce.test = computeSpikeFactors( test, "ERCC" )
sce.test <- logNormCounts( test )
我们这里选择保守的方法,不去掉细胞,使用另一种去卷积方法:
复制 clusters <- quickCluster( sce.seger )
sce.seger <- computeSumFactors( sce.seger, clusters = clusters )
sce.seger <- logNormCounts( sce.seger )
summary ( sizeFactors( sce.seger ) )
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 0.0000 0.1832 0.4016 1.0000 1.0996 12.9607
4 找表达量高变化基因
下面构建模型想使用modelGeneVarWithSpikes
,于是首先应该把那几个没有ERCC的细胞去掉;另外由于AZ这个批次相对其他批次的细胞数量过于少,因此在模型构建中也把它去掉吧
复制 for.hvg <- sce.seger[, librarySizeFactors(altExp( sce.seger )) > 0
& sce.seger $ Donor != "AZ" ]
dec.seger <- modelGeneVarWithSpikes( for.hvg, "ERCC" , block = for.hvg $ Donor )
chosen.hvgs <- getTopHVGs( dec.seger, n = 2000 )
如果要批量作图检查的话
复制 # 批次数量较多,因此设置多行多列显示
par (mfrow = c ( 3 , 3 ))
blocked.stats <- dec.seger $ per.block
for (i in colnames (blocked.stats)) {
current <- blocked.stats[[i]]
plot (current $ mean, current $ total, main = i, pch = 16 , cex = 0.5 ,
xlab = "Mean of log-expression" , ylab = "Variance of log-expression" )
curfit <- metadata( current )
points (curfit $ mean, curfit $ var, col = "red" , pch = 16 )
curve (curfit $ trend( x ) , col = 'dodgerblue' , add = TRUE , lwd = 2 )
}
注意,这里在找完HVGs后,没有进行批次矫正,如果继续向下做,会发现什么?
5 降维聚类
降维
复制 library (BiocSingular)
set.seed ( 101011001 )
sce.seger <- runPCA( sce.seger, subset_row = chosen.hvgs, ncomponents = 25 )
sce.seger <- runTSNE( sce.seger, dimred = "PCA" )
聚类
复制 snn.gr <- buildSNNGraph( sce.seger, use.dimred = "PCA" )
colLabels( sce.seger ) <- factor (igraph :: cluster_walktrap( snn.gr ) $ membership)
检查聚类分群与批次
复制 tab <- table (Cluster = colLabels( sce.seger ) , Donor = sce.seger $ Donor)
library (pheatmap)
pheatmap(log10 (tab + 10 ) , color = viridis :: viridis( 100 ) )
结果真的是:批次效应影响了分群,因此最好还是做一遍fastMNN
操作
tSNE图中也是显示出了强烈的批次效应
复制 gridExtra :: grid.arrange(
plotTSNE( sce.seger, colour_by = "label" ) ,
plotTSNE( sce.seger, colour_by = "Donor" ) ,
ncol = 2
)
6 补充矫正批次效应
上图看到很明显的批次效应,那么如果处理后,会有什么不同吗?
利用fastMNN矫正
复制 library (batchelor)
set.seed ( 1001010 )
merged.seger <- fastMNN( sce.seger, subset.row = chosen.hvgs,
batch = sce.seger $ Donor )
merged.seger
# class: SingleCellExperiment
# dim: 2000 2090
# metadata(2): merge.info pca.info
# assays(1): reconstructed
# rownames(2000): GCG TTR ... MAP6 LCP1
# rowData names(1): rotation
# colnames(2090): HP1502401_H13 HP1502401_J14 ...
# HP1526901T2D_N8 HP1526901T2D_A8
# colData names(2): batch label
# reducedDimNames(2): corrected TSNE
# altExpNames(0):
# metadata(merged.seger)$merge.info$lost.var
# lost.var :值越大表示丢失的真实生物异质性越多
因为fastMNN会包含PCA降维,所以下面继续进行tSNE即可
降维聚类
复制 library (BiocSingular)
set.seed ( 101011001 )
merged.seger <- runTSNE( merged.seger, dimred = "corrected" )
snn.gr <- buildSNNGraph( merged.seger, use.dimred = "corrected" )
colLabels( merged.seger ) <- factor (igraph :: cluster_walktrap( snn.gr ) $ membership)
再次作图,是不是明显比之前好很多?
复制 gridExtra :: grid.arrange(
plotTSNE( merged.seger, colour_by = "label" ) ,
plotTSNE( merged.seger, colour_by = "batch" ) ,
ncol = 2
)