# 4.11 实战十一 | Smart-seq2 | 小鼠造血干细胞

## 1 前言

数据来自 ([Nestorowa et al. 2016](https://pubmed.ncbi.nlm.nih.gov/27365425/)) 的小鼠造血干细胞 haematopoietic stem cell (HSC) ，使用的技术是Smart-seq2

### **准备数据**

```r
library(scRNAseq)
sce.nest <- NestorowaHSCData()

sce.nest
# class: SingleCellExperiment 
# dim: 46078 1920 
# metadata(0):
#   assays(1): counts
# rownames(46078): ENSMUSG00000000001
# ENSMUSG00000000003 ... ENSMUSG00000107391
# ENSMUSG00000107392
# rowData names(0):
#   colnames(1920): HSPC_007 HSPC_013 ... Prog_852
# Prog_810
# colData names(2): cell.type FACS
# reducedDimNames(1): diffusion
# altExpNames(1): ERCC

counts(sce.nest)[1:3,1:3]
# 3 x 3 sparse Matrix of class "dgCMatrix"
# HSPC_007 HSPC_013 HSPC_019
# ENSMUSG00000000001        .        7        1
# ENSMUSG00000000003        .        .        .
# ENSMUSG00000000028        4        1        2
```

看到使用了ERCC、Ensembl ID

### **ID转换**

```r
library(AnnotationHub)
ens.mm.v97 <- AnnotationHub()[["AH73905"]]
anno <- select(ens.mm.v97, keys=rownames(sce.nest), 
    keytype="GENEID", columns=c("SYMBOL", "SEQNAME"))
# 这里全部对应
> sum(is.na(anno$SYMBOL))
[1] 0
> sum(is.na(anno$SEQNAME))
[1] 0
# 接下来只需要匹配顺序即可
rowData(sce.nest) <- anno[match(rownames(sce.nest), anno$GENEID),]

sce.nest
# class: SingleCellExperiment 
# dim: 46078 1920 
# metadata(0):
#   assays(1): counts
# rownames(46078): ENSMUSG00000000001
# ENSMUSG00000000003 ... ENSMUSG00000107391
# ENSMUSG00000107392
# rowData names(3): GENEID SYMBOL SEQNAME
# colnames(1920): HSPC_007 HSPC_013 ... Prog_852
# Prog_810
# colData names(2): cell.type FACS
# reducedDimNames(1): diffusion
# altExpNames(1): ERCC
```

## 2 质控

**依然是备份一下，把unfiltered数据主要用在质控的探索上**

```r
unfiltered <- sce.nest
```

这里没有线粒体基因，因此只能用ERCC计算过滤条件

```r
library(scater)
stats <- perCellQCMetrics(sce.nest)
qc <- quickPerCellQC(stats, percent_subsets="altexps_ERCC_percent")
sce.nest <- sce.nest[,!qc$discard]

# 看下过滤的细胞
colSums(as.matrix(qc))
# low_lib_size            low_n_features 
# 146                        28 
# high_altexps_ERCC_percent                   discard 
# 241                       264
```

做个图

```r
colData(unfiltered) <- cbind(colData(unfiltered), stats)
unfiltered$discard <- qc$discard

gridExtra::grid.arrange(
    plotColData(unfiltered, y="sum", colour_by="discard") +
        scale_y_log10() + ggtitle("Total count"),
    plotColData(unfiltered, y="detected", colour_by="discard") +
        scale_y_log10() + ggtitle("Detected features"),
    plotColData(unfiltered, y="altexps_ERCC_percent",
        colour_by="discard") + ggtitle("ERCC percent"),
    ncol=2
)
```

![](https://jieandze1314-1255603621.cos.ap-guangzhou.myqcloud.com/blog/2020-07-21-125348.png)

**最后对数据进行过滤**

```r
sce.nest <- sce.nest[,!qc$discard]

# 过滤前后
> dim(unfiltered);dim(sce.nest)
[1] 46078  1920
[1] 46078  1656
```

## 3 归一化

```r
library(scran)
set.seed(101000110)
clusters <- quickCluster(sce.nest)
sce.nest <- computeSumFactors(sce.nest, clusters=clusters)
sce.nest <- logNormCounts(sce.nest)
```

## 4 找表达量高变化基因

使用基于ERCC的构建模型方法

```r
set.seed(00010101)
dec.nest <- modelGeneVarWithSpikes(sce.nest, "ERCC")
top.nest <- getTopHVGs(dec.nest, prop=0.1)

class(dec.nest)
# [1] "DFrame"
# attr(,"package")
# [1] "S4Vectors"

# 其中ERCC的信息就存储在dec.nest的metadata中
curfit <- metadata(dec.nest)
class(curfit)
# [1] "list"
names(curfit)
# [1] "mean"    "var"     "trend"   "std.dev"
length(unique(names(curfit$mean))) # 一共92个ERCC spike-in
# [1] 92 

# 其中的mean、var就定义了横纵坐标
head(curfit$mean)
# ERCC-00002  ERCC-00003  ERCC-00004  ERCC-00009  ERCC-00012  ERCC-00013 
# 14.91183375 11.27060119 13.31197197 11.94866319  0.02211546  0.21249156 
head(curfit$var)
# ERCC-00002 ERCC-00003 ERCC-00004 ERCC-00009 ERCC-00012 ERCC-00013 
# 0.02375131 0.29308411 0.05376959 0.41814635 0.14928826 1.08599155
```

**然后把基因（黑点）、ERCC（红点）、根据ERCC拟合的线（蓝线）画出来**

```r
plot(dec.nest$mean, dec.nest$total, pch=16, cex=0.5,
    xlab="Mean of log-expression", ylab="Variance of log-expression")
curve(curfit$trend(x), col='dodgerblue', add=TRUE, lwd=2)
points(curfit$mean, curfit$var, col="red")
```

![](https://jieandze1314-1255603621.cos.ap-guangzhou.myqcloud.com/blog/2020-07-21-130616.png)

## 5 降维聚类

### **降维**

```r
set.seed(101010011)
sce.nest <- denoisePCA(sce.nest, technical=dec.nest, subset.row=top.nest)
sce.nest <- runTSNE(sce.nest, dimred="PCA")

# 检查PC的数量
ncol(reducedDim(sce.nest, "PCA"))
## [1] 9
```

### **聚类**

```r
snn.gr <- buildSNNGraph(sce.nest, use.dimred="PCA")
colLabels(sce.nest) <- factor(igraph::cluster_walktrap(snn.gr)$membership)

table(colLabels(sce.nest))
## 
##   1   2   3   4   5   6   7   8   9 
## 203 472 258 175 142 229  20  83  74
```

### 作图&#x20;

```r
plotTSNE(sce.nest, colour_by="label")
```

![](https://jieandze1314-1255603621.cos.ap-guangzhou.myqcloud.com/blog/2020-07-21-130920.png)

## 6 marker基因检测

```r
markers <- findMarkers(sce.nest, colLabels(sce.nest), 
    test.type="wilcox", direction="up", lfc=0.5,
    row.data=rowData(sce.nest)[,"SYMBOL",drop=FALSE])
```

比如检测一下cluster8：

```r
chosen <- markers[['8']]
best <- chosen[chosen$Top <= 10,]
length(best)
# [1] 13

# 将cluster8与其他clusters对比的AUC结果提取出来
aucs <- getMarkerEffects(best, prefix="AUC")
rownames(aucs) <- best$SYMBOL

library(pheatmap)
pheatmap(aucs, color=viridis::plasma(100))
```

看到其中血红蛋白相关基因（Hba1、Hba2、Hbb）、Car2、Hebp1基因上调，说明**clsuter8可能包含红细胞前体细胞**

![](https://jieandze1314-1255603621.cos.ap-guangzhou.myqcloud.com/blog/2020-07-21-131927.png)

## 7 细胞类型注释

> 将会使用内置的参考注释数据，[`SingleR`](https://bioconductor.org/packages/release/bioc/vignettes/SingleR/inst/doc/SingleR.html)中就包含了一些内置数据集，大部分是bulk RNA-Seq或芯片数据中经过筛选的细胞类型。

### **准备参考数据**

```r
library(SingleR)
mm.ref <- MouseRNAseqData()
mm.ref
# class: SummarizedExperiment 
# dim: 21214 358 
# metadata(0):
#   assays(1): logcounts
# rownames(21214): Xkr4 Rp1 ... LOC100039574
# LOC100039753
# rowData names(0):
#   colnames(358): ERR525589Aligned
# ERR525592Aligned ... SRR1044043Aligned
# SRR1044044Aligned
# colData names(3): label.main label.fine
# label.ont
```

### **进行转换**

```r
renamed <- sce.nest
# 参考数据集中使用的是symbol name，这里也转换一下
rownames(renamed) <- uniquifyFeatureNames(rownames(renamed),
                                          rowData(sce.nest)$SYMBOL)
# 然后把我们的细胞在参考数据集中找对应的细胞类型
# 返回的pred结果是一个数据框，每行是我们自己数据的一个细胞
pred <- SingleR(test=renamed, ref=mm.ref, labels=mm.ref$label.fine)
table(pred$labels)
# 
# B cells Endothelial cells      Erythrocytes      Granulocytes       Macrophages         Monocytes          NK cells           T cells 
# 61                 1              1005                 1                 2               500                 1                85
```

这里也看到cluster8与红细胞更相近

![](https://jieandze1314-1255603621.cos.ap-guangzhou.myqcloud.com/blog/2020-07-21-133040.png)


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://jieandze1314.osca.top/04/04-11.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
