# 4.6 实战六 | CEL-seq | 人胰腺细胞

## 1 前言

这次使用的数据是：[Muraro et al. (2016)](https://pubmed.ncbi.nlm.nih.gov/27693023/) 中的不同人类供体的胰腺细胞，和上一次相比使用的是更早期的CEL-seq。整体操作和上次CEL-seq2类似

### **数据准备**

```r
library(scRNAseq)
sce.muraro <- MuraroPancreasData()
sce.muraro
# class: SingleCellExperiment 
# dim: 19059 3072 
# metadata(0):
#   assays(1): counts
# rownames(19059): A1BG-AS1__chr19 A1BG__chr19 ...
# ZZEF1__chr17 ZZZ3__chr1
# rowData names(2): symbol chr
# colnames(3072): D28-1_1 D28-1_2 ... D30-8_95
# D30-8_96
# colData names(3): label donor plate
# reducedDimNames(0):
#   altExpNames(1): ERCC
```

这次有4个供体

```r
table(sce.muraro$donor)
# 
# D28 D29 D30 D31 
# 768 768 768 768
```

不过这个基因命名很奇怪，它全部加上了染色体编号

```r
> head(rownames(sce.muraro))
[1] "A1BG-AS1__chr19" "A1BG__chr19"     "A1CF__chr10"    
[4] "A2M-AS1__chr12"  "A2ML1__chr12"    "A2M__chr12"
```

### **ID转换**

选择的方式是：将没有匹配的NA去掉，并且去掉重复的行

由于基因名很奇怪，所以需要把`__chr`及后面的去掉

```r
library(AnnotationHub)
edb <- AnnotationHub()[["AH73881"]]
gene.symb <- sub("__chr.*$", "", rownames(sce.muraro))
gene.ids <- mapIds(edb, keys=gene.symb, 
    keytype="SYMBOL", column="GENEID")

keep <- !is.na(gene.ids) & !duplicated(gene.ids)
# 过滤掉2000多基因
> table(keep)
keep
FALSE  TRUE 
 2119 16940 

sce.muraro <- sce.muraro[keep,]
rownames(sce.muraro) <- gene.ids[keep]
```

## 2 质控

**依然是备份一下，把unfiltered数据主要用在质控的探索上**

```r
unfiltered <- sce.muraro
```

和上一次一样，如果只是针对ERCC和全部的批次进行质控，结果是

![](https://jieandze1314-1255603621.cos.ap-guangzhou.myqcloud.com/blog/2020-07-20-094258.png)

很明显，这个D28个捣鬼，钻了我们“大部分细胞都是高质量”的假设漏洞

**因此，在过滤时不能考虑这个D28**

```r
library(scater)
stats <- perCellQCMetrics(sce.muraro)
qc <- quickPerCellQC(stats, percent_subsets="altexps_ERCC_percent",
    batch=sce.muraro$donor, subset=sce.muraro$donor!="D28")
```

![](https://jieandze1314-1255603621.cos.ap-guangzhou.myqcloud.com/blog/2020-07-20-094502.png)

**看看过滤掉多少**

```r
colSums(as.matrix(qc))
# low_lib_size            low_n_features high_altexps_ERCC_percent                   discard 
# 663                       700                       738                       773
```

**最后把过滤条件应用在原数据**

```r
sce.muraro <- sce.muraro[,!qc$discard]
```

## 3 归一化

继续使用去卷积方法

```r
library(scran)
set.seed(1000)
clusters <- quickCluster(sce.muraro)
sce.muraro <- computeSumFactors(sce.muraro, clusters=clusters)
sce.muraro <- logNormCounts(sce.muraro)

summary(sizeFactors(sce.muraro))
# Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
# 0.08782  0.54109  0.82081  1.00000  1.21079 13.98692
```

## 4 找表达量高变化基因

再看一眼数据，发现其中有plate和donor信息，它们都是与批次相关的

```r
sce.muraro
# class: SingleCellExperiment 
# dim: 16940 2299 
# metadata(0):
#   assays(2): counts logcounts
# rownames(16940): ENSG00000268895 ENSG00000121410 ...
# ENSG00000159840 ENSG00000074755
# rowData names(2): symbol chr
# colnames(2299): D28-1_1 D28-1_2 ... D30-8_93
# D30-8_94
# colData names(4): label donor plate sizeFactor
# reducedDimNames(0):
#   altExpNames(1): ERCC

table(sce.muraro$donor)
# 
# D28 D29 D30 D31 
# 333 601 676 689 
table(sce.muraro$plate)
# 
# 1   2   3   4   5   6   7   8 
# 281 292 292 295 282 285 283 289
```

因此就把这二者结合作为批次信息，依然是使用针对ERCC的构建模型方法

```r
block <- paste0(sce.muraro$plate, "_", sce.muraro$donor)
dec.muraro <- modelGeneVarWithSpikes(sce.muraro, "ERCC", block=block)
top.muraro <- getTopHVGs(dec.muraro, prop=0.1)
```

## 5 矫正批次效应

```r
library(batchelor)
set.seed(1001010)
merged.muraro <- fastMNN(sce.muraro, subset.row=top.muraro, 
    batch=sce.muraro$donor)

metadata(merged.muraro)$merge.info$lost.var
##           D28      D29      D30     D31
## [1,] 0.060847 0.024121 0.000000 0.00000
## [2,] 0.002646 0.003018 0.062421 0.00000
## [3,] 0.003449 0.002641 0.002598 0.08162
```

## 6 降维+聚类

### **降维**

```r
set.seed(100111)
# 前面矫正过批次效应，所以这里指定参数dimred: specifying the existing dimensionality reduction results to use
merged.muraro <- runTSNE(merged.muraro, dimred="corrected")
```

### **聚类**

```r
snn.gr <- buildSNNGraph(merged.muraro, use.dimred="corrected")
colLabels(merged.muraro) <- factor(igraph::cluster_walktrap(snn.gr)$membership)
```

**如果想看一下这里的分群和之前的批次之间的关系：**

![](https://jieandze1314-1255603621.cos.ap-guangzhou.myqcloud.com/blog/2020-07-20-095422.png)

**Tip：如果感觉批次或分群数量太多，看着效果不好，可以用热图的形式展示：**

```r
tab <- table(Cluster=colLabels(merged.muraro), CellType=sce.muraro$label)
library(pheatmap)
pheatmap(log10(tab+10), color=viridis::viridis(100))
```

![](https://jieandze1314-1255603621.cos.ap-guangzhou.myqcloud.com/blog/2020-07-20-095539.png)

### **最后检查一下供体的批次效应**

```r
gridExtra::grid.arrange(
    plotTSNE(merged.muraro, colour_by="label"),
    plotTSNE(merged.muraro, colour_by="batch"),
    ncol=2
)
```

![](https://jieandze1314-1255603621.cos.ap-guangzhou.myqcloud.com/blog/2020-07-20-095634.png)


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://jieandze1314.osca.top/04/04-6.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
