# 4.6 实战六 | CEL-seq | 人胰腺细胞

## 1 前言

这次使用的数据是：[Muraro et al. (2016)](https://pubmed.ncbi.nlm.nih.gov/27693023/) 中的不同人类供体的胰腺细胞，和上一次相比使用的是更早期的CEL-seq。整体操作和上次CEL-seq2类似

### **数据准备**

```r
library(scRNAseq)
sce.muraro <- MuraroPancreasData()
sce.muraro
# class: SingleCellExperiment 
# dim: 19059 3072 
# metadata(0):
#   assays(1): counts
# rownames(19059): A1BG-AS1__chr19 A1BG__chr19 ...
# ZZEF1__chr17 ZZZ3__chr1
# rowData names(2): symbol chr
# colnames(3072): D28-1_1 D28-1_2 ... D30-8_95
# D30-8_96
# colData names(3): label donor plate
# reducedDimNames(0):
#   altExpNames(1): ERCC
```

这次有4个供体

```r
table(sce.muraro$donor)
# 
# D28 D29 D30 D31 
# 768 768 768 768
```

不过这个基因命名很奇怪，它全部加上了染色体编号

```r
> head(rownames(sce.muraro))
[1] "A1BG-AS1__chr19" "A1BG__chr19"     "A1CF__chr10"    
[4] "A2M-AS1__chr12"  "A2ML1__chr12"    "A2M__chr12"
```

### **ID转换**

选择的方式是：将没有匹配的NA去掉，并且去掉重复的行

由于基因名很奇怪，所以需要把`__chr`及后面的去掉

```r
library(AnnotationHub)
edb <- AnnotationHub()[["AH73881"]]
gene.symb <- sub("__chr.*$", "", rownames(sce.muraro))
gene.ids <- mapIds(edb, keys=gene.symb, 
    keytype="SYMBOL", column="GENEID")

keep <- !is.na(gene.ids) & !duplicated(gene.ids)
# 过滤掉2000多基因
> table(keep)
keep
FALSE  TRUE 
 2119 16940 

sce.muraro <- sce.muraro[keep,]
rownames(sce.muraro) <- gene.ids[keep]
```

## 2 质控

**依然是备份一下，把unfiltered数据主要用在质控的探索上**

```r
unfiltered <- sce.muraro
```

和上一次一样，如果只是针对ERCC和全部的批次进行质控，结果是

![](https://jieandze1314-1255603621.cos.ap-guangzhou.myqcloud.com/blog/2020-07-20-094258.png)

很明显，这个D28个捣鬼，钻了我们“大部分细胞都是高质量”的假设漏洞

**因此，在过滤时不能考虑这个D28**

```r
library(scater)
stats <- perCellQCMetrics(sce.muraro)
qc <- quickPerCellQC(stats, percent_subsets="altexps_ERCC_percent",
    batch=sce.muraro$donor, subset=sce.muraro$donor!="D28")
```

![](https://jieandze1314-1255603621.cos.ap-guangzhou.myqcloud.com/blog/2020-07-20-094502.png)

**看看过滤掉多少**

```r
colSums(as.matrix(qc))
# low_lib_size            low_n_features high_altexps_ERCC_percent                   discard 
# 663                       700                       738                       773
```

**最后把过滤条件应用在原数据**

```r
sce.muraro <- sce.muraro[,!qc$discard]
```

## 3 归一化

继续使用去卷积方法

```r
library(scran)
set.seed(1000)
clusters <- quickCluster(sce.muraro)
sce.muraro <- computeSumFactors(sce.muraro, clusters=clusters)
sce.muraro <- logNormCounts(sce.muraro)

summary(sizeFactors(sce.muraro))
# Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
# 0.08782  0.54109  0.82081  1.00000  1.21079 13.98692
```

## 4 找表达量高变化基因

再看一眼数据，发现其中有plate和donor信息，它们都是与批次相关的

```r
sce.muraro
# class: SingleCellExperiment 
# dim: 16940 2299 
# metadata(0):
#   assays(2): counts logcounts
# rownames(16940): ENSG00000268895 ENSG00000121410 ...
# ENSG00000159840 ENSG00000074755
# rowData names(2): symbol chr
# colnames(2299): D28-1_1 D28-1_2 ... D30-8_93
# D30-8_94
# colData names(4): label donor plate sizeFactor
# reducedDimNames(0):
#   altExpNames(1): ERCC

table(sce.muraro$donor)
# 
# D28 D29 D30 D31 
# 333 601 676 689 
table(sce.muraro$plate)
# 
# 1   2   3   4   5   6   7   8 
# 281 292 292 295 282 285 283 289
```

因此就把这二者结合作为批次信息，依然是使用针对ERCC的构建模型方法

```r
block <- paste0(sce.muraro$plate, "_", sce.muraro$donor)
dec.muraro <- modelGeneVarWithSpikes(sce.muraro, "ERCC", block=block)
top.muraro <- getTopHVGs(dec.muraro, prop=0.1)
```

## 5 矫正批次效应

```r
library(batchelor)
set.seed(1001010)
merged.muraro <- fastMNN(sce.muraro, subset.row=top.muraro, 
    batch=sce.muraro$donor)

metadata(merged.muraro)$merge.info$lost.var
##           D28      D29      D30     D31
## [1,] 0.060847 0.024121 0.000000 0.00000
## [2,] 0.002646 0.003018 0.062421 0.00000
## [3,] 0.003449 0.002641 0.002598 0.08162
```

## 6 降维+聚类

### **降维**

```r
set.seed(100111)
# 前面矫正过批次效应，所以这里指定参数dimred: specifying the existing dimensionality reduction results to use
merged.muraro <- runTSNE(merged.muraro, dimred="corrected")
```

### **聚类**

```r
snn.gr <- buildSNNGraph(merged.muraro, use.dimred="corrected")
colLabels(merged.muraro) <- factor(igraph::cluster_walktrap(snn.gr)$membership)
```

**如果想看一下这里的分群和之前的批次之间的关系：**

![](https://jieandze1314-1255603621.cos.ap-guangzhou.myqcloud.com/blog/2020-07-20-095422.png)

**Tip：如果感觉批次或分群数量太多，看着效果不好，可以用热图的形式展示：**

```r
tab <- table(Cluster=colLabels(merged.muraro), CellType=sce.muraro$label)
library(pheatmap)
pheatmap(log10(tab+10), color=viridis::viridis(100))
```

![](https://jieandze1314-1255603621.cos.ap-guangzhou.myqcloud.com/blog/2020-07-20-095539.png)

### **最后检查一下供体的批次效应**

```r
gridExtra::grid.arrange(
    plotTSNE(merged.muraro, colour_by="label"),
    plotTSNE(merged.muraro, colour_by="batch"),
    ncol=2
)
```

![](https://jieandze1314-1255603621.cos.ap-guangzhou.myqcloud.com/blog/2020-07-20-095634.png)
