2.3.7 工具 | 2021-一个很有想法的工具——Ikarus，想要在单细胞水平直接鉴定肿瘤细胞

刘小泽写于2021.10.28

前言

题目：Identifying tumor cells at the single cell level

日期：2021-10-25

期刊：biorxiv预印本

链接：https://www.biorxiv.org/content/10.1101/2021.10.15.463909v2

代码: https://github.com/BIMSBbioinfo/ikarus

图表复现: https://github.com/BIMSBbioinfo/ikarus---auxiliary

设计初衷

目前主要有三类细胞注释工具：

manifold matching software：aligns the shape of an unlabeled dataset to an expertly labeled dataset
deep learning based tools：model the batch effects through latent space embedding
gene set based classifiers：使用已知的marker辅助分类

目前使用最多的应该是使用marker gene辅助

作者提出一个问题：能否做出一个分类器（重点就是选出一系列基因），可以直接从一群细胞中区分出tumor和normal？这样可以帮助肿瘤抗原的自动预测，帮助检查肿瘤细胞状态，还可以帮助诊断病理组织

设计思路

Ikarus工具为了检查肿瘤细胞，设计了2步：

首先还是根据比较靠谱的，经过专家审核的单细胞数据集，从中找到一些共有的肿瘤细胞基因signature
构建逻辑回归分类器 + network-based propagation of cell labels using a custom built cell-cell network

作者先整合了结直肠癌+肺癌的10X数据，其中数据注释了细胞是否是癌细胞。接下来寻找癌细胞相关的marker gene：

每个数据差异分析，然后取交集（但是不同癌症的marker基因直接取交集是否有意义呢？那些特异性的gene岂不是被抛弃？）=> 得到162 个交集基因集
同理选出正常组织的marker基因，它包含两部分：cell type specific markers & genes which are specifically depleted in the tumor cells

验证

找几个数据集验证一下tumor和normal 各自的基因集，找了5种癌症类型的patient-derived xenograft (PDX)、cancer cell line encyclopedia (CCLE)，发现 tumor signature score was significantly higher than the normal signature score

用几个测试数据集比较了不同方法区分tumor和normal细胞，包括标准的机器学习（SVM, random forest, and logistic regression）、SingleCellNet、ACTINN。发现Ikarus的平均准确度可以达到0.98（A图），并且AUROC也很高。

使用Lambrechts lung cancer dataset（图DE），看到Ikarus也能基本得到和作者一样的肿瘤细胞分布

测试一下如果没有足够的基因，或者缺少某几个关键基因，Ikarus还能预测准确吗

图A：当然，使用的tumor gene 数量越多越准确
图B-D：当缺少serum amyloid A (SAA1) 和fibrinogen beta chain (FGB)，会导致准确度大大降低（也就是说这个工具还是很依赖关键的marker gene的，因此前期还是要筛选合适的marker基因作为输入）；当然作者说只在Lambrechts这个数据集中发现了这个情况，其他没发现

作者认为自己的signature找的很有效，于是看看它们具体有哪些特性

发现了

在各个数据中，这些基因之间的Pearson相关性竟然大部分接近0
tumor gene signature is partially related to cell cycle, and DNA replication（C图）
tumor gene signature preferentially overlapped with the cell cycle hallmark（D图）

看一下筛出来的基因集和预后的联系

more than 75% of tumor signature genes are predictive of unfavourable prognosis in at least one cancer type
在5种癌症中（liver, renal, pancreatic, lung and endometrial）这个基因集对预后不利
与CNV： significantly higher overlap with the known CNV regions in the majority of profiled cancer types

上一页2.3.6 工具 | 2021-MACA: 一款自动注释细胞类型的工具下一页3 流程篇：分析框架

最后更新于3年前

这有帮助吗？