# 4.5 实战五 | CEL-seq2 | 人胰腺细胞

## 1 前言

第一代CEL-Seq由以色列理工学院研究团队于[2012发表在Cell Reports](https://pubmed.ncbi.nlm.nih.gov/22939981/)

CEL-Seq的全称是：Cell expression by linear amplification and sequencing，采用线性扩增的测序方法，其主要优势在于错误率比较低，但是和PCR一样都存在序列偏向性。另外还具有以下优点：

* 使用barcode允许多种类型的细胞同时分析；
* 每个细胞在一个管中进行处理，避免了样本之间的污染

后来在2016年，CEL-seq2诞生，与早期版本相比具有低成本、高灵敏度，并缩短了实验操作实验时间

### **数据准备**

这里使用的数据是： [Grun et al. (2016)](https://pubmed.ncbi.nlm.nih.gov/27345837/) 中不同人类供体的胰腺细胞

```r
library(scRNAseq)
sce.grun <- GrunPancreasData()

sce.grun
# class: SingleCellExperiment 
# dim: 20064 1728 
# metadata(0):
#   assays(1): counts
# rownames(20064): A1BG-AS1__chr19 A1BG__chr19 ...
# ZZEF1__chr17 ZZZ3__chr1
# rowData names(2): symbol chr
# colnames(1728): D2ex_1 D2ex_2 ... D17TGFB_95
# D17TGFB_96
# colData names(2): donor sample
# reducedDimNames(0):
#   altExpNames(1): ERCC

rowData(sce.grun)
# DataFrame with 20064 rows and 2 columns
# symbol         chr
# <character> <character>
#   A1BG-AS1__chr19    A1BG-AS1       chr19
# A1BG__chr19            A1BG       chr19
# A1CF__chr10            A1CF       chr10
# A2M-AS1__chr12      A2M-AS1       chr12
# A2ML1__chr12          A2ML1       chr12
# ...                     ...         ...
# ZYG11A__chr1         ZYG11A        chr1
# ZYG11B__chr1         ZYG11B        chr1
# ZYX__chr7               ZYX        chr7
# ZZEF1__chr17          ZZEF1       chr17
# ZZZ3__chr1             ZZZ3        chr1
```

### **ID转换**

**先将symbol转为Ensembl ID**

```r
library(org.Hs.eg.db)
gene.ids <- mapIds(org.Hs.eg.db, keys=rowData(sce.grun)$symbol,
    keytype="SYMBOL", column="ENSEMBL")
```

**接下来有两种处理方式：**

第一种：将Ensembl ID加到原来数据中

```r
library(scater)
rowData(sce.grun)$ensembl <- uniquifyFeatureNames(
  rowData(sce.grun)$symbol, gene.ids)
rowData(sce.grun)
# DataFrame with 20064 rows and 3 columns
# symbol         chr         ensembl
# <character> <character>     <character>
#   A1BG-AS1__chr19    A1BG-AS1       chr19 ENSG00000268895
# A1BG__chr19            A1BG       chr19 ENSG00000121410
# A1CF__chr10            A1CF       chr10 ENSG00000148584
# A2M-AS1__chr12      A2M-AS1       chr12 ENSG00000245105
# A2ML1__chr12          A2ML1       chr12 ENSG00000166535
# ...                     ...         ...             ...
# ZYG11A__chr1         ZYG11A        chr1 ENSG00000203995
# ZYG11B__chr1         ZYG11B        chr1 ENSG00000162378
# ZYX__chr7               ZYX        chr7 ENSG00000159840
# ZZEF1__chr17          ZZEF1       chr17 ENSG00000074755
# ZZZ3__chr1             ZZZ3        chr1 ENSG00000036549
```

第二种：将没有匹配的NA去掉，并且去掉重复的行（因为可能存在一个symbol对应多个Ensembl ID的情况）【这里就试一下这种方法】

```r
keep <- !is.na(gene.ids) & !duplicated(gene.ids)
> sum(is.na(gene.ids));sum(duplicated(gene.ids))
[1] 2487
[1] 2515

sce.grun <- sce.grun[keep,]
rownames(sce.grun) <- gene.ids[keep]
> dim(sce.grun) # 过滤掉2000多基因
[1] 17548  1728
```

## 2 质控

> 前几次实战的质控都很顺利，但并不意味着实际操作中这一步就可以跑跑代码解决 **这一次，就遇到了一个常规代码解决不了的问题，需要思考+调整**

### 数据备份

把unfiltered数据主要用在质控的探索上

```r
unfiltered <- sce.grun
```

看到这里的数据中没有线粒体基因

```r
table(rowData(sce.grun)$chr)
# 
# chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 
# 1766   672  1071   947   328   551   532   740  1050   256 
# chr19  chr2 chr20 chr21 chr22  chr3  chr4  chr5  chr6  chr7 
# 1245  1142   484   188   400  1014   693   791   907   825 
# chr8  chr9  chrX  chrY 
# 604   664   658    20
```

倒是有几个donor的批次信息

```r
table(colData(sce.grun)$donor)
# 
# D10 D17  D2  D3  D7 
# 288 480  96 480 384
```

我们之前进行质控的时候，一般采用的先使用`perCellQCMetrics` 指定线粒体**计算**，然后用 `quickPerCellQC` 指定ERCC+线粒体进行**过滤**的策略。现在既然没有线粒体的信息，那么在第一步的`perCellQCMetrics`就不需要加`subset`参数了

```r
stats <- perCellQCMetrics(sce.grun)
# 参数subsets的含义: used to identify interesting feature subsets such as ERCC spike-in transcripts or mitochondrial genes.
# > colnames(stats)
# [1] "sum"                   "detected"             
# [3] "percent_top_50"        "percent_top_100"      
# [5] "percent_top_200"       "percent_top_500"      
# [7] "altexps_ERCC_sum"      "altexps_ERCC_detected"
# [9] "altexps_ERCC_percent"  "total"
```

### **然后进行过滤，并让函数注意批次信息**

```r
qc <- quickPerCellQC(stats, percent_subsets="altexps_ERCC_percent",
                     batch=sce.grun$donor)
colSums(as.matrix(qc), na.rm=TRUE)
# low_lib_size            low_n_features high_altexps_ERCC_percent 
# 85                        93                        88 
# discard 
# 129
```

### **作图**

```r
colData(unfiltered) <- cbind(colData(unfiltered), stats)
unfiltered$discard <- qc$discard

gridExtra::grid.arrange(
  plotColData(unfiltered, x="donor", y="sum", colour_by="discard") +
    scale_y_log10() + ggtitle("Total count"),
  plotColData(unfiltered, x="donor", y="detected", colour_by="discard") +
    scale_y_log10() + ggtitle("Detected features"),
  plotColData(unfiltered, x="donor", y="altexps_ERCC_percent",
              colour_by="discard") + ggtitle("ERCC percent"),
  ncol=3
)
```

![](https://jieandze1314-1255603621.cos.ap-guangzhou.myqcloud.com/blog/2020-07-20-082948.png)

### **【重点】发现了问题**

> 质控并非是一帆风顺的，遇到问题应该知道如何探索解决

看到ERCC那个图，很明显D10和D3没有被正确过滤，它们中高ERCC的细胞还是没有被去掉

质控过滤有一个前提条件：大部分的细胞都是高质量的，但显然这里的D10、D3不符合这个要求，因此我们需要再次只根据D17、D2、D7这三个“好一点的样本”进行过滤

```r
# 重新计算过滤条件
qc2 <- quickPerCellQC(stats, percent_subsets="altexps_ERCC_percent",
    batch=sce.grun$donor,
    subset=sce.grun$donor %in% c("D17", "D7", "D2"))

colSums(as.matrix(qc2), na.rm=TRUE)
# low_lib_size            low_n_features high_altexps_ERCC_percent                   discard 
# 450                       511                       606                       665
```

这次再作图看看

```r
colData(unfiltered) <- cbind(colData(unfiltered), stats)
unfiltered$discard <- qc2$discard

gridExtra::grid.arrange(
  plotColData(unfiltered, x="donor", y="sum", colour_by="discard") +
    scale_y_log10() + ggtitle("Total count"),
  plotColData(unfiltered, x="donor", y="detected", colour_by="discard") +
    scale_y_log10() + ggtitle("Detected features"),
  plotColData(unfiltered, x="donor", y="altexps_ERCC_percent",
              colour_by="discard") + ggtitle("ERCC percent"),
  ncol=3
)
```

![](https://jieandze1314-1255603621.cos.ap-guangzhou.myqcloud.com/blog/2020-07-20-091059.png)

看到这次对D10、D3的过滤力度大了很多，基本满意

### **最后把过滤条件应用在原数据**

```r
sce.grun <- sce.grun[,!qc$discard]
```

## 3 归一化

也是使用预分群+去卷积计算size factor的方法

```r
library(scran)
set.seed(1000)
clusters <- quickCluster(sce.grun)
sce.grun <- computeSumFactors(sce.grun, cluster=clusters)
sce.grun <- logNormCounts(sce.grun)

summary(sizeFactors(sce.grun))
# Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
# 0.09581  0.50459  0.79505  1.00000  1.22212 12.16491
```

**看看两种归一化方法的差异**

```r
plot(librarySizeFactors(sce.grun), sizeFactors(sce.grun), pch=16,
     col=as.integer(factor(sce.grun$donor)),
     xlab="Library size factors", ylab="Deconvolution factors", log="xy")
abline(a=0, b=1, col="red")
```

也是再一次证明了去卷积方法比常规的方法更加精细，对批次层面也会纳入考量

![](https://jieandze1314-1255603621.cos.ap-guangzhou.myqcloud.com/blog/2020-07-20-091250.png)

## 4 找表达量高变化基因

既然有批次的信息，那就加到构建模型中去，而且有ERCC，就选用`modelGeneVarWithSpikes`这个方法

```r
block <- paste0(sce.grun$sample, "_", sce.grun$donor)
dec.grun <- modelGeneVarWithSpikes(sce.grun, spikes="ERCC", block=block)
top.grun <- getTopHVGs(dec.grun, prop=0.1)
```

## 5 矫正批次效应

> 使用FastMNN方法：Correct for batch effects in single-cell expression data using a fast version of the mutual nearest neighbors (MNN) method.

```r
library(batchelor)
set.seed(1001010)
merged.grun <- fastMNN(sce.grun, subset.row=top.grun, batch=sce.grun$donor)
merged.grun
# class: SingleCellExperiment 
# dim: 1467 1063 
# metadata(2): merge.info pca.info
# assays(1): reconstructed
# rownames(1467): ENSG00000172023 ENSG00000172016 ...
# ENSG00000144867 ENSG00000179820
# rowData names(1): rotation
# colnames(1063): D2ex_1 D2ex_2 ... D17TGFB_94
# D17TGFB_95
# colData names(1): batch
# reducedDimNames(2): corrected
# altExpNames(0):
```

检查一下结果，使用`lost.var` ，值越大表示丢失的真实生物异质性越多

```r
metadata(merged.grun)$merge.info$lost.var
# D10         D17          D2         D3         D7
# [1,] 0.030843209 0.030956515 0.000000000 0.00000000 0.00000000
# [2,] 0.007129994 0.011275730 0.036836491 0.00000000 0.00000000
# [3,] 0.003528589 0.005027722 0.006846244 0.05020422 0.00000000
# [4,] 0.012070814 0.014673302 0.013972521 0.01267586 0.05281354
```

* It contains a matrix of the **variance lost in each batch** (column) at each merge step (row).
* Large proportions of lost variance **(>10%)** suggest that correction is **removing genuine biological heterogeneity.**&#x20;

## 6 降维 + 聚类

### **降维**

```r
set.seed(100111)
merged.grun <- runTSNE(merged.grun, dimred="corrected")
```

### **聚类**

```r
snn.gr <- buildSNNGraph(merged.grun, use.dimred="corrected")
colLabels(merged.grun) <- factor(igraph::cluster_walktrap(snn.gr)$membership)

# 聚类的分群与批次之间的对比
table(Cluster=colLabels(merged.grun), Donor=merged.grun$batch)
##        Donor
## Cluster D10 D17  D2  D3  D7
##      1   32  71  33  80  29
##      2    5   7   3   4   4
##      3    3  44   0   0  13
##      4   11 119   0   0  55
##      5   12  73  30   3  78
##      6   14  29   3   2  64
##      7    1   9   0   0   7
##      8    2   5   2   3   4
##      9    5  13   0   0  10
##      10  15  33  11  10  37
##      11   5  18   0   1  33
##      12   4  13   0   0   1
```

**最后作图看看批次效应矫正如何**

```r
gridExtra::grid.arrange(
    plotTSNE(merged.grun, colour_by="label"),
    plotTSNE(merged.grun, colour_by="batch"),
    ncol=2
)
```

![](https://jieandze1314-1255603621.cos.ap-guangzhou.myqcloud.com/blog/2020-07-20-092141.png)


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://jieandze1314.osca.top/04/04-5.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
