# 4.7 实战七 | SMARTer | 人胰腺细胞

## 1 前言

这次使用的数据是：[Lawlor et al. (2017) ](https://pubmed.ncbi.nlm.nih.gov/27864352/)中的不同人类供体的胰腺细胞

SMARTer其实不是个像Fluidigm、10X一样的制备系统，它是一个试剂盒。SMART技术是Clontech（Takara旗下全资子公司）的专利技术，**2009年升级为SAMRTer技术后**，采用更灵敏的SMARTer Oligo和高效的SMARTScribe RT进行逆转录，使其灵敏度提高至皮克级。利用SMARTer技术**只需单管、单酶即可完成逆转录，无需接头连接**，减少了样品操作步骤，也极大降低了样品损失，保留了原始信息，为RNA-Seq提供了可靠的基础。

2013年 Clontech公司就为Fluidigm的C1单细胞全自动制备系统推出了SMARTer Ultra Low RNA Kit。当时也是能够从C1捕获的单细胞中产生mRNA-seq文库，方便了研究。Fluidigm在C1平台上检验了多款试剂盒，发现SMARTer cDNA合成是最可靠的方法之一

> 参考：<http://www.ebiotrade.com/newsf/2013-9/20139493414227.htm>

后来很多平台也在使用SMARTer试剂（如WaferGen ICELL8 Single-Cell System），不过后来发现Smart-seq2的扩增效果优于SMARTer试剂盒，而且Smart-seq2技术比传统的SMARTer方法能产生更长和更多的cDNA，在低表达量时，Smart-seq2比SMARTer的基因检出率更高，结果更稳定

### **数据准备**

```r
library(scRNAseq)
sce.lawlor <- LawlorPancreasData()
sce.lawlor
# class: SingleCellExperiment 
# dim: 26616 638 
# metadata(0):
#   assays(1): counts
# rownames(26616): ENSG00000229483 ENSG00000232849 ...
# ENSG00000251576 ENSG00000082898
# rowData names(0):
#   colnames(638): 10th_C10_S104 10th_C11_S96 ...
# 9th-C96_S81 9th-C9_S13
# colData names(8): title age ... race Sex
# reducedDimNames(0):
#   altExpNames(0):
```

### **ID转换**

```r
library(AnnotationHub)
edb <- AnnotationHub()[["AH73881"]]
anno <- select(edb, keys=rownames(sce.lawlor), keytype="GENEID", 
    columns=c("SYMBOL", "SEQNAME"))
rowData(sce.lawlor) <- anno[match(rownames(sce.lawlor), anno[,1]),-1]
rowData(sce.lawlor)
# DataFrame with 26616 rows and 2 columns
# SYMBOL     SEQNAME
# <character> <character>
#   ENSG00000229483   LINC00362          13
# ENSG00000232849   LINC00363          13
# ENSG00000229558    SACS-AS1          13
# ENSG00000232977   LINC00327          13
# ENSG00000227893   LINC00352          13
# ...                     ...         ...
# ENSG00000232746   LINC02022           3
# ENSG00000150867     PIP4K2A          10
# ENSG00000255021  AC093496.1           3
# ENSG00000251576   LINC01267           3
# ENSG00000082898        XPO1           2
```

## 2 质控

**依然是备份一下，把unfiltered数据主要用在质控的探索上**

```r
unfiltered <- sce.lawlor
```

**检查是否有线粒体基因和批次信息**

```r
# 其中有MT基因
table(rowData(sce.lawlor)$SEQNAME=="MT")
# 
# FALSE  TRUE 
# 25269    13 

# 还有一些批次信息
table(sce.lawlor$`islet unos id`)
# 
# ACCG268 ACCR015A ACEK420A  ACEL337  ACHY057  ACIB065  ACIW009  ACJV399 
# 136       57       45      103       39       57       93      108
```

**进行质控**

```r
stats <- perCellQCMetrics(sce.lawlor, 
    subsets=list(Mito=which(rowData(sce.lawlor)$SEQNAME=="MT")))
qc <- quickPerCellQC(stats, percent_subsets="subsets_Mito_percent",
    batch=sce.lawlor$`islet unos id`)
# 过滤了34个细胞
table(qc$discard)
# 
# FALSE  TRUE 
# 604    34 

sce.lawlor <- sce.lawlor[,!qc$discard]
```

**看看过滤掉多少**

```r
colSums(as.matrix(qc))
# low_lib_size            low_n_features high_subsets_Mito_percent                   discard 
# 9                         5                        25                        34
```

**作图看一下**

```r
colData(unfiltered) <- cbind(colData(unfiltered), stats)
unfiltered$discard <- qc$discard

gridExtra::grid.arrange(
    plotColData(unfiltered, x="islet unos id", y="sum", colour_by="discard") +
        scale_y_log10() + ggtitle("Total count") +
        theme(axis.text.x = element_text(angle = 90)),
    plotColData(unfiltered, x="islet unos id", y="detected", 
        colour_by="discard") + scale_y_log10() + ggtitle("Detected features") +
        theme(axis.text.x = element_text(angle = 90)), 
    plotColData(unfiltered, x="islet unos id", y="subsets_Mito_percent",
        colour_by="discard") + ggtitle("Mito percent") +
        theme(axis.text.x = element_text(angle = 90)),
    ncol=2
)
```

![](https://jieandze1314-1255603621.cos.ap-guangzhou.myqcloud.com/blog/2020-07-20-112030.png)

**看一下文库大小和线粒体占比的关系**

```r
plotColData(unfiltered, x="sum", y="subsets_Mito_percent",
    colour_by="discard") + scale_x_log10()
```

![](https://jieandze1314-1255603621.cos.ap-guangzhou.myqcloud.com/blog/2020-07-20-112119.png)

**最后把过滤条件应用在原数据**

```r
sce.lawlor <- sce.lawlor[,!qc$discard]
```

## 3 归一化

继续使用去卷积方法

```r
library(scran)
set.seed(1000)
clusters <- quickCluster(sce.lawlor)
sce.lawlor <- computeSumFactors(sce.lawlor, clusters=clusters)
sce.lawlor <- logNormCounts(sce.lawlor)

summary(sizeFactors(sce.lawlor))
# Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
# 0.2955  0.7807  0.9633  1.0000  1.1820  2.6287
```

## 4 找表达量高变化基因

这里没有ERCC也没有UMI，所以就用最基础的方法构建模型：`modelGeneVar`

不过还是要指定批次信息

```r
dec.lawlor <- modelGeneVar(sce.lawlor, block=sce.lawlor$`islet unos id`)
chosen.genes <- getTopHVGs(dec.lawlor, n=2000)
```

## 5 【尝试】矫正批次

这里写“尝试”是因为这里有一个问题：**细胞总数不多，才600个，但批次的数量很多**，所以归到单独的批次上细胞数就很少。这时如果继续矫正批次，**不知道会不会抹除一些真实的生物学特性。**

归根结底，还是一个技术噪音与生物因素之间的取舍问题

```r
table(sce.lawlor$`islet unos id`)
# 
# ACCG268 ACCR015A ACEK420A  ACEL337  ACHY057  ACIB065  ACIW009  ACJV399 
# 136       57       45      103       39       57       93      108
```

所以可以尝试一下：

```r
library(batchelor)
set.seed(1001010)
merged.lawlor <- fastMNN(sce.lawlor, subset.row=chosen.genes, 
                         batch=sce.lawlor$`islet unos id`)

metadata(merged.lawlor)$merge.info$lost.var
```

关于这个结果：**`lost.var` ，值越大表示丢失的真实生物异质性越多**

* It contains a matrix of the **variance lost in each batch** (column) at each merge step (row).
* Large proportions of lost variance **(>10%)** suggest that correction is **removing genuine biological heterogeneity.**&#x20;

![](https://jieandze1314-1255603621.cos.ap-guangzhou.myqcloud.com/blog/2020-07-20-113144.png)

看到的确有损失生物异质性的可能性，**那么就先放弃这个计划，直接进行下面的降维**

## 5 降维聚类

### **降维**

```r
library(BiocSingular)
set.seed(101011001)
sce.lawlor <- runPCA(sce.lawlor, subset_row=chosen.genes, ncomponents=25)
sce.lawlor <- runTSNE(sce.lawlor, dimred="PCA")
```

### **聚类**

```r
snn.gr <- buildSNNGraph(sce.lawlor, use.dimred="PCA")
colLabels(sce.lawlor) <- factor(igraph::cluster_walktrap(snn.gr)$membership)
```

**看分群与细胞类型之间关系**

```r
tab <- table(colLabels(sce.lawlor), sce.lawlor$`cell type`)
library(pheatmap)
pheatmap(log10(tab+10), color=viridis::viridis(100))
```

![](https://jieandze1314-1255603621.cos.ap-guangzhou.myqcloud.com/blog/2020-07-20-113541.png)

**看分群与批次之间**

```r
tab2 <- table(colLabels(sce.lawlor), sce.lawlor$`islet unos id`)
library(pheatmap)
pheatmap(log10(tab2+10), color=viridis::viridis(100))
```

![](https://jieandze1314-1255603621.cos.ap-guangzhou.myqcloud.com/blog/2020-07-20-113638.png)

### **最后看看批次效应**

```r
gridExtra::grid.arrange(
    plotTSNE(sce.lawlor, colour_by="label"),
    plotTSNE(sce.lawlor, colour_by="islet unos id"),
    ncol=2
)
```

![](https://jieandze1314-1255603621.cos.ap-guangzhou.myqcloud.com/blog/2020-07-20-113811.png)


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://jieandze1314.osca.top/04/04-7.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
