# 4.8 实战八 | Smart-seq2  | 人胰腺细胞

## 1 前言

这是也是来自多个供体的人类胰腺细胞，使用Smart-seq2建库技术，数据来自[Segerstolpe et al. (2016)](https://pubmed.ncbi.nlm.nih.gov/27864352/)

### **数据准备**

```r
library(scRNAseq)
sce.seger <- SegerstolpePancreasData()
sce.seger
# class: SingleCellExperiment 
# dim: 26179 3514 
# metadata(0):
#   assays(1): counts
# rownames(26179): SGIP1 AZIN2 ... BIVM-ERCC5 eGFP
# rowData names(2): symbol refseq
# colnames(3514): HP1502401_N13 HP1502401_D14 ...
# HP1526901T2D_O11 HP1526901T2D_A8
# colData names(8): Source Name individual ... age
# body mass index
# reducedDimNames(0):
#   altExpNames(1): ERCC
```

看到3500多个细胞，包含ERCC，使用Symbol ID

看下样本信息：

![](https://jieandze1314-1255603621.cos.ap-guangzhou.myqcloud.com/blog/2020-07-20-114942.png)

### **ID转换**

选择的方式是：将没有匹配的NA去掉，并且去掉重复的行

```r
# 首先得到symbol ID和对应的Ensembl ID（其中会存在无对应的NA情况）
library(AnnotationHub)
edb <- AnnotationHub()[["AH73881"]]
symbols <- rowData(sce.seger)$symbol
ens.id <- mapIds(edb, keys=symbols, keytype="SYMBOL", column="GENEID")

# 之前见到的方法是：
# keep <- !is.na(gene.ids) & !duplicated(gene.ids)

# 这里使用了另一种方法（不是直接将NA去掉，而且替换成了symbol）
ens.id <- ifelse(is.na(ens.id), symbols, ens.id)
keep <- !duplicated(ens.id)

sce.seger <- sce.seger[keep,]
rownames(sce.seger) <- ens.id[keep]
```

> **小结一下：至此见到了三种ID转换的方式，根据最后保留的基因数量，可以排个序：**
>
> 保留基因最多（保留了NA和重复）：`uniquifyFeatureNames` 中等（保留了NA，去掉重复）：`ifelse(is.na(ens.id), symbols, ens.id)` 最少（去掉了NA以及重复）：`!is.na(gene.ids) & !duplicated(gene.ids)`

### **编辑样本信息**

之前有8列样本的信息，有点冗余了。这里只保留3列关心的，并重新命名

```r
emtab.meta <- colData(sce.seger)[,c("cell type", 
    "individual", "single cell well quality")]
colnames(emtab.meta) <- c("CellType", "Donor", "Quality")
colData(sce.seger) <- emtab.meta
```

另外把细胞类型这一列中的“cell”字符去掉，并把首字母大写

```r
sce.seger$CellType <- gsub(" cell", "", sce.seger$CellType)
sce.seger$CellType <- paste0(
    toupper(substr(sce.seger$CellType, 1, 1)),
    substring(sce.seger$CellType, 2))
```

## 2 质控

**依然是备份一下，把unfiltered数据主要用在质控的探索上**

```r
unfiltered <- sce.seger
```

之前作者在数据中已经标注了细胞质量，可以看到有问题的细胞还是很多的：

```r
table(sce.seger$Quality)
# 
# control, 2-cell well  control, empty well     low quality cell                   OK 
# 32                   96                 1177                 2209
```

因此就要注意了，这里的数据会不会满足“大部分细胞都是高质量的”这个假设？

还是需要试一下，看看结果先

```r
library(scater)
stats <- perCellQCMetrics(sce.seger)
qc1 <- quickPerCellQC(stats, percent_subsets="altexps_ERCC_percent",
                     batch=sce.seger$Donor)


colData(unfiltered) <- cbind(colData(unfiltered), stats)
unfiltered$discard <- qc1$discard

gridExtra::grid.arrange(
  plotColData(unfiltered, x="Donor", y="sum", colour_by="discard") +
    scale_y_log10() + ggtitle("Total count") +
    theme(axis.text.x = element_text(angle = 90)),
  plotColData(unfiltered, x="Donor", y="detected", colour_by="discard") +
    scale_y_log10() + ggtitle("Detected features") +
    theme(axis.text.x = element_text(angle = 90)),
  plotColData(unfiltered, x="Donor", y="altexps_ERCC_percent",
              colour_by="discard") + ggtitle("ERCC percent") +
    theme(axis.text.x = element_text(angle = 90)),
  ncol=3
)
```

看到HP1509101在过滤时存在过滤不完全的情况，HP1504901过滤的ERCC数量太多，推测这两个批次效果可能并不是很好，可能存在大量的低质量细胞

![](https://jieandze1314-1255603621.cos.ap-guangzhou.myqcloud.com/blog/2020-07-20-140234.png)

因此，再次指定`subset` 参数，重新画图

```r
library(scater)
stats <- perCellQCMetrics(sce.seger)
qc <- quickPerCellQC(stats, percent_subsets="altexps_ERCC_percent",
    batch=sce.seger$Donor,
    subset=!sce.seger$Donor %in% c("HP1504901", "HP1509101"))
```

![](https://jieandze1314-1255603621.cos.ap-guangzhou.myqcloud.com/blog/2020-07-20-140512.png)

**看看过滤掉多少**

```r
colSums(as.matrix(qc))
##              low_lib_size            low_n_features high_altexps_ERCC_percent 
##                       788                      1056                      1031 
##                   discard 
##                      1246
```

**最后将qc过滤的与本来标注低质量的一同过滤**

```r
low.qual <- sce.seger$Quality == "low quality cell"
sce.seger <- sce.seger[,!(qc$discard | low.qual)]

# 过滤了大概1500个细胞
> dim(unfiltered);dim(sce.seger)
[1] 26179  3514
[1] 26179  2090
```

## 3 归一化

> **此处会有一点小问题，值得注意！**

本来有ERCC，操作应该是：

```r
library(scran)
sce.seger  = computeSpikeFactors(sce.seger, "ERCC")
sce.seger <- logNormCounts(sce.seger) 
# Error in .local(x, ...) : size factors should be positive
```

**但由于存在几个细胞中一个ERCC都没有，所以会报错**&#x20;

此时面临两个选择：要么把这几个细胞去掉；要么就不借助ERCC，用另一种去卷积方法

```r
> table(colSums(counts(altExp(sce.seger)))==0)

FALSE  TRUE 
 2087     3
```

如果要去掉这几个细胞：

```r
test=sce.seger[,!colSums(counts(altExp(sce.seger)))==0]
sce.test = computeSpikeFactors(test, "ERCC")
sce.test <- logNormCounts(test)
```

我们这里**选择保守的方法，不去掉细胞，使用另一种去卷积方法：**

```r
clusters <- quickCluster(sce.seger)
sce.seger <- computeSumFactors(sce.seger, clusters=clusters)
sce.seger <- logNormCounts(sce.seger) 

summary(sizeFactors(sce.seger))
# Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
# 0.0000  0.1832  0.4016  1.0000  1.0996 12.9607
```

## 4 找表达量高变化基因

下面构建模型想使用`modelGeneVarWithSpikes`，于是首先应该把那几个没有ERCC的细胞去掉；另外由于AZ这个批次相对其他批次的细胞数量过于少，因此在模型构建中也把它去掉吧

```r
for.hvg <- sce.seger[,librarySizeFactors(altExp(sce.seger)) > 0
    & sce.seger$Donor!="AZ"]
dec.seger <- modelGeneVarWithSpikes(for.hvg, "ERCC", block=for.hvg$Donor)
chosen.hvgs <- getTopHVGs(dec.seger, n=2000)
```

![](https://jieandze1314-1255603621.cos.ap-guangzhou.myqcloud.com/blog/2020-07-20-143533.png)

**如果要批量作图检查的话**

```r
# 批次数量较多，因此设置多行多列显示
par(mfrow=c(3,3))
blocked.stats <- dec.seger$per.block
for (i in colnames(blocked.stats)) {
    current <- blocked.stats[[i]]
    plot(current$mean, current$total, main=i, pch=16, cex=0.5,
        xlab="Mean of log-expression", ylab="Variance of log-expression")
    curfit <- metadata(current)
    points(curfit$mean, curfit$var, col="red", pch=16)
    curve(curfit$trend(x), col='dodgerblue', add=TRUE, lwd=2)
}
```

> 注意，这里在找完HVGs后，没有进行批次矫正，如果继续向下做，会发现什么？

## 5 降维聚类

### **降维**

```r
library(BiocSingular)
set.seed(101011001)
sce.seger <- runPCA(sce.seger, subset_row=chosen.hvgs, ncomponents=25)
sce.seger <- runTSNE(sce.seger, dimred="PCA")
```

### **聚类**

```r
snn.gr <- buildSNNGraph(sce.seger, use.dimred="PCA")
colLabels(sce.seger) <- factor(igraph::cluster_walktrap(snn.gr)$membership)
```

### **检查聚类分群与批次**

```r
tab <- table(Cluster=colLabels(sce.seger), Donor=sce.seger$Donor)
library(pheatmap)
pheatmap(log10(tab+10), color=viridis::viridis(100))
```

结果真的是：批次效应影响了分群，因此最好还是做一遍`fastMNN`操作

![](https://jieandze1314-1255603621.cos.ap-guangzhou.myqcloud.com/blog/2020-07-20-144127.png)

**tSNE图中也是显示出了强烈的批次效应**

```r
gridExtra::grid.arrange(
    plotTSNE(sce.seger, colour_by="label"),
    plotTSNE(sce.seger, colour_by="Donor"),
    ncol=2
)
```

![](https://jieandze1314-1255603621.cos.ap-guangzhou.myqcloud.com/blog/2020-07-20-144300.png)

## 6 补充矫正批次效应

> 上图看到很明显的批次效应，那么**如果处理后，会有什么不同吗？**

### **利用fastMNN矫正**

```r
library(batchelor)
set.seed(1001010)
merged.seger <- fastMNN(sce.seger, subset.row=chosen.hvgs, 
                         batch=sce.seger$Donor)
merged.seger
# class: SingleCellExperiment 
# dim: 2000 2090 
# metadata(2): merge.info pca.info
# assays(1): reconstructed
# rownames(2000): GCG TTR ... MAP6 LCP1
# rowData names(1): rotation
# colnames(2090): HP1502401_H13 HP1502401_J14 ...
# HP1526901T2D_N8 HP1526901T2D_A8
# colData names(2): batch label
# reducedDimNames(2): corrected TSNE
# altExpNames(0):

# metadata(merged.seger)$merge.info$lost.var
# lost.var ：值越大表示丢失的真实生物异质性越多
```

因为fastMNN会包含PCA降维，所以下面继续进行tSNE即可

### **降维聚类**

```r
library(BiocSingular)
set.seed(101011001)
merged.seger <- runTSNE(merged.seger, dimred="corrected")

snn.gr <- buildSNNGraph(merged.seger, use.dimred="corrected")
colLabels(merged.seger) <- factor(igraph::cluster_walktrap(snn.gr)$membership)
```

**再次作图，是不是明显比之前好很多？**

```r
gridExtra::grid.arrange(
  plotTSNE(merged.seger, colour_by="label"),
  plotTSNE(merged.seger, colour_by="batch"),
  ncol=2
)
```

![](https://jieandze1314-1255603621.cos.ap-guangzhou.myqcloud.com/blog/2020-07-20-145143.png)


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://jieandze1314.osca.top/04/04-8.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
