# 4.3 实战三 | 10X | 未过滤的PBMC

## 1 前言

相信学习单细胞数据分析的大家对PBMC都不陌生，虽然不是相关背景，但Seurat的PBMC数据深入人心。PBMC全称是peripheral blood mononuclear cell ，外周血单核细胞。

我们这里用的数据集来自[Zheng et al. 2017](https://pubmed.ncbi.nlm.nih.gov/28091601/)，并在[10X Genomics官网也可以获取](https://support.10xgenomics.com/single-cell-gene-expression/datasets/2.1.0/pbmc4k)，数量比Seurat使用的的2700个细胞数据更大

## 2 数据准备

### **下载**

当然也可以跳过这一步，自己下载好之后读入。这里只是学习一下另一种方法

```r
library(BiocFileCache)
bfc <- BiocFileCache("raw_data", ask = FALSE)
raw.path <- bfcrpath(bfc, file.path("http://cf.10xgenomics.com/samples",
    "cell-exp/2.1.0/pbmc4k/pbmc4k_raw_gene_bc_matrices.tar.gz"))
untar(raw.path, exdir=file.path(tempdir(), "pbmc4k"))
# 最后也就是得到这三个文件：barcodes.tsv genes.tsv    matrix.mtx
```

### **读取**

```r
library(DropletUtils)
fname <- file.path(tempdir(), "pbmc4k/raw_gene_bc_matrices/GRCh38")
sce.pbmc <- read10xCounts(fname, col.names=TRUE)
# 看这里的数量惊人，但是后面还需要过滤
sce.pbmc
# class: SingleCellExperiment 
# dim: 33694 737280 
# metadata(1): Samples
# assays(1): counts
# rownames(33694): ENSG00000243485
# ENSG00000237613 ... ENSG00000277475
# ENSG00000268674
# rowData names(2): ID Symbol
# colnames(737280): AAACCTGAGAAACCAT-1
# AAACCTGAGAAACCGC-1 ... TTTGTCATCTTTAGTC-1
# TTTGTCATCTTTCCTC-1
# colData names(2): Sample Barcode
# reducedDimNames(0):
#   altExpNames(0):
```

### **转换ID，添加染色体信息**

```r
library(scater)
rownames(sce.pbmc) <- uniquifyFeatureNames(
    rowData(sce.pbmc)$ID, rowData(sce.pbmc)$Symbol)

library(EnsDb.Hsapiens.v86)
location <- mapIds(EnsDb.Hsapiens.v86, keys=rowData(sce.pbmc)$ID, 
    column="SEQNAME", keytype="GENEID")
```

## 3 质控

10X数据面临的一大问题就是空液滴，因此需要`emptyDrops`检验一下

```r
set.seed(100)
# 对70多万个液滴进行检验
e.out <- emptyDrops(counts(sce.pbmc))
# > e.out
# DataFrame with 737280 rows and 5 columns
# Total   LogProb    PValue   Limited       FDR
# <integer> <numeric> <numeric> <logical> <numeric>
#   AAACCTGAGAAACCAT-1         1        NA        NA        NA        NA
# AAACCTGAGAAACCGC-1         0        NA        NA        NA        NA
# AAACCTGAGAAACCTA-1         1        NA        NA        NA        NA
# AAACCTGAGAAACGAG-1         0        NA        NA        NA        NA
# AAACCTGAGAAACGCC-1         1        NA        NA        NA        NA
# ...                      ...       ...       ...       ...       ...
# TTTGTCATCTTTACAC-1         2        NA        NA        NA        NA
# TTTGTCATCTTTACGT-1        33        NA        NA        NA        NA
# TTTGTCATCTTTAGGG-1         0        NA        NA        NA        NA
# TTTGTCATCTTTAGTC-1         0        NA        NA        NA        NA
# TTTGTCATCTTTCCTC-1         1        NA        NA        NA        NA

# 最后过滤，剩下4000多
sce.pbmc <- sce.pbmc[,which(e.out$FDR <= 0.001)]
sce.pbmc
# class: SingleCellExperiment 
# dim: 33694 4233 
# metadata(1): Samples
# assays(1): counts
# rownames(33694): RP11-34P13.3 FAM138A ...
# AC213203.1 FAM231B
# rowData names(2): ID Symbol
# colnames(4233): AAACCTGAGAAGGCCT-1
# AAACCTGAGACAGACC-1 ... TTTGTCACAGGTCCAC-1
# TTTGTCATCCCAAGAT-1
# colData names(2): Sample Barcode
# reducedDimNames(0):
#   altExpNames(0):
```

### 数据备份

把unfiltered数据主要用在质控的探索上

```r
unfiltered <- sce.pbmc
```

这里过滤的条件不需要太严苛，**只需要去除线粒体含量太高的细胞即可**

> 成熟的mRNA会通过核孔来到细胞质。如果细胞遭到破坏，大量细胞质中mRNA流失，而线粒体体积比较大，流不出去，含量基本不变，最后导致细胞质中捕获的线粒体RNA占比升高

```r
stats <- perCellQCMetrics(sce.pbmc, subsets=list(Mito=which(location=="MT")))
high.mito <- isOutlier(stats$subsets_Mito_percent, type="higher")
sce.pbmc <- sce.pbmc[,!high.mito]

summary(high.mito)
##    Mode   FALSE    TRUE 
## logical    3922     311
```

### **根据原来的数据，加上质控标准作图**

```r
colData(unfiltered) <- cbind(colData(unfiltered), stats)
unfiltered$discard <- high.mito

gridExtra::grid.arrange(
    plotColData(unfiltered, y="sum", colour_by="discard") +
        scale_y_log10() + ggtitle("Total count"),
    plotColData(unfiltered, y="detected", colour_by="discard") +
        scale_y_log10() + ggtitle("Detected features"),
    plotColData(unfiltered, y="subsets_Mito_percent",
        colour_by="discard") + ggtitle("Mito percent"),
    ncol=3
)
```

![](https://jieandze1314-1255603621.cos.ap-guangzhou.myqcloud.com/blog/2020-07-19-100732.png)

### **再看下文库大小分别和线粒体含量的关系**

```r
plotColData(unfiltered, x="sum", y="subsets_Mito_percent",
    colour_by="discard") + scale_x_log10()
```

![](https://jieandze1314-1255603621.cos.ap-guangzhou.myqcloud.com/blog/2020-07-19-100808.png)

## 4 归一化

也是使用预分群+去卷积计算size factor的方法

```r
library(scran)
set.seed(1000)
clusters <- quickCluster(sce.pbmc)
sce.pbmc <- computeSumFactors(sce.pbmc, cluster=clusters)
sce.pbmc <- logNormCounts(sce.pbmc)

summary(sizeFactors(sce.pbmc))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.009   0.710   0.871   1.000   1.094  13.948
```

**看看两种归一化方法的差异**

```r
plot(librarySizeFactors(sce.pbmc), sizeFactors(sce.pbmc), pch=16,
    xlab="Library size factors", ylab="Deconvolution factors", log="xy")
abline(a=0, b=1, col="red")
```

![](https://jieandze1314-1255603621.cos.ap-guangzhou.myqcloud.com/blog/2020-07-19-101041.png)

## 5 找表达量高变化基因

既然有UMI，就可以用第三种方法【在之前 [3.3 挑选高变化基因](https://jieandze1314.osca.top/03/03-3) 的 2.3 考虑技术噪音中介绍过，如果没有spike-in】：可以考虑利用数据分布来表示技术噪音，例如只考虑技术噪音的话，UMI counts通常会呈现近似泊松分布

```r
set.seed(1001)
dec.pbmc <- modelGeneVarByPoisson(sce.pbmc)
top.pbmc <- getTopHVGs(dec.pbmc, prop=0.1)

# 作图
plot(dec.pbmc$mean, dec.pbmc$total, pch=16, cex=0.5,
    xlab="Mean of log-expression", ylab="Variance of log-expression")
curfit <- metadata(dec.pbmc)
# 可视化一条线（下图的蓝线），这条线指所有的基因都会存在的一种偏差
curve(curfit$trend(x), col='dodgerblue', add=TRUE, lwd=2)
```

![](https://jieandze1314-1255603621.cos.ap-guangzhou.myqcloud.com/blog/2020-07-19-101244.png)

## 6 降维

```r
set.seed(10000)
sce.pbmc <- denoisePCA(sce.pbmc, subset.row=top.pbmc, technical=dec.pbmc)

set.seed(100000)
sce.pbmc <- runTSNE(sce.pbmc, dimred="PCA")

set.seed(1000000)
sce.pbmc <- runUMAP(sce.pbmc, dimred="PCA")
```

看看保留了几个PC

```r
ncol(reducedDim(sce.pbmc, "PCA"))
## [1] 8
```

## 7 聚类

```r
g <- buildSNNGraph(sce.pbmc, k=10, use.dimred = 'PCA')
clust <- igraph::cluster_walktrap(g)$membership
colLabels(sce.pbmc) <- factor(clust)

table(colLabels(sce.pbmc))
## 
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18 
## 585 518 364 458 170 791 295 107  45  46 152  84  40  60 142  16  28  21

plotTSNE(sce.pbmc, colour_by="label")
```

![](https://jieandze1314-1255603621.cos.ap-guangzhou.myqcloud.com/blog/2020-07-19-101650.png)

## 8 找marker基因并解释结果

```r
markers <- findMarkers(sce.pbmc, pval.type="some", direction="up")
```

这次看看cluster 7的marker基因

```r
marker.set <- markers[["7"]]
as.data.frame(marker.set[1:30,1:3])
# p.value           FDR summary.logFC
# FCN1   4.881588e-137 1.644802e-132      2.715872
# LGALS2 3.729029e-133 6.282295e-129      2.191398
# CSTA   1.426854e-131 1.602548e-127      2.123738
# CFD    1.207067e-102  1.016773e-98      1.503274
# FGL2    8.567117e-93  5.773209e-89      1.358891
# IFI30   7.822561e-80  4.392889e-76      1.276366
# CLEC7A  6.052032e-79  2.913102e-75      1.109366
# MS4A6A  1.958033e-78  8.246744e-75      1.419465
# CFP     8.802407e-73  3.295426e-69      1.312191
# S100A8  6.193215e-70  2.086742e-66      3.431603
```

![](https://jieandze1314-1255603621.cos.ap-guangzhou.myqcloud.com/blog/2020-07-19-102158.png)

再和其他clusters对比

```r
plotExpression(sce.pbmc, features=c("CD14", "CD68",
    "MNDA", "FCGR3A"), x="label", colour_by="label")
```

![](https://jieandze1314-1255603621.cos.ap-guangzhou.myqcloud.com/blog/2020-07-19-102238.png)

其实**最后还是考验的背景知识**：根据cluster7中CD14、CD68、MNDA表达量升高，同时又检查了CD16基因下调，推测cluster7是单核细胞


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://jieandze1314.osca.top/04/04-3.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
