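Abstract:
Single-cell RNA-sequencing (scRNA-seq) data are highly sparse, noisy, and high-dimensional, lack structural and positional information, and are growing rapidly in scale, all of which makes single-cell clustering challenging. To help select suitable analysis methods for different scRNA-seq data, this study compares methods for quality control, gene selection, and clustering of scRNA-seq data. First, the filtering and normalization methods used in quality control, and their threshold settings, are analyzed. Then, six typical gene selection methods are compared in terms of model factors, sequencing technology, limitations, and advantages. Finally, six typical single-cell clustering methods are described in detail, and the data scales they suit and their strengths and weaknesses are analyzed. Fourteen gold-standard scRNA-seq datasets with ground-truth labels were collected, comprising 5 full-length sequencing datasets and 9 paired-end sequencing datasets, of which 5 contain more than 3,000 cells. The six gene selection methods and six clustering methods were compared experimentally to analyze their differences in identifying highly differentially expressed genes and in clustering performance. The results show that the different gene selection methods detected 182 and 124 common genes on the Adam and Wang_Lung datasets, respectively, as well as some unique genes. In addition, Seurat, SC3, Monocle 3, and scDeepCluster showed better clustering stability; Seurat had the best clustering stability and accuracy on all datasets, and scDeepCluster achieved good clustering accuracy on most datasets. Selecting a suitable scRNA-seq data analysis method therefore requires jointly considering factors such as the sequencing platform, data scale, and gene expression distribution.
Keywords: single-cell RNA-sequencing data; quality control; gene selection; clustering; cell type identification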
DOI: 10.13656/j.cnki.gxkx.20230928.016
Received: 2022-10-07; Revised: 2022-12-11
Funding: Supported by the National Natural Science Foundation of China (62141207).
|
Comparison of Clustering Methods for Large-scale Single-cell RNA-sequencing Data
ZHU Xiaoshu1,2, MENG Shuang1, LONG Faning2
|
(1. School of Computer Science and Engineering, Guangxi Normal University, Guilin, Guangxi, 541004, China; 2. School of Computer Science and Engineering, Yulin Normal University, Yulin, Guangxi, 537000, China)
Abstract:
Single-cell RNA-sequencing (scRNA-seq) data are characterized by high sparsity, high noise, high dimensionality, and a lack of structural and positional information, and their scale is increasing rapidly, which poses considerable challenges for single-cell clustering. To facilitate the selection of appropriate analysis methods for different scRNA-seq data, this study compares and analyzes quality control, gene selection, and clustering methods for scRNA-seq data. First, the filtering and normalization methods used in quality control and their threshold settings are analyzed. Then, six typical gene selection methods are compared with respect to model factors, sequencing technology, limitations, and advantages. Finally, six typical single-cell clustering methods are described in detail, and their applicable dataset scales, advantages, and disadvantages are analyzed. Fourteen gold-standard scRNA-seq datasets with true labels were collected, including 5 full-length sequencing datasets and 9 paired-end sequencing datasets, among which 5 datasets contain more than 3,000 cells. The six gene selection methods and six single-cell clustering methods were compared experimentally to analyze their differences in identifying highly differentially expressed genes and in clustering performance. The results showed that the different gene selection methods detected 182 and 124 common genes, as well as some unique genes, in the Adam and Wang_Lung datasets, respectively. In addition, Seurat, SC3, Monocle 3, and scDeepCluster had better clustering stability. Seurat had the best clustering stability and accuracy on all datasets, and scDeepCluster had good clustering accuracy on most datasets. Therefore, selecting an appropriate scRNA-seq data analysis method requires comprehensive consideration of factors such as sequencing platform, data scale, and gene expression distribution.
Key words: single-cell RNA-sequencing data; quality control; gene selection; clustering; cell type identification
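
The abstract reports counts of common and unique genes across gene selection methods but does not say how they were computed. A minimal sketch of one plausible reading follows, assuming "common" means selected by every method and "unique" means selected by exactly one method. The function name `common_and_unique_genes`, the method names, and the gene names are hypothetical placeholders, not taken from the paper.

```python
from collections import Counter


def common_and_unique_genes(selected):
    """Split selected genes into those shared by all methods and those unique to one.

    `selected` maps a method name to the iterable of gene names it selected.
    Assumption: "common" = chosen by every method, "unique" = chosen by exactly one.
    """
    gene_sets = {method: set(genes) for method, genes in selected.items()}
    if not gene_sets:
        return set(), {}
    # Genes chosen by every method.
    common = set.intersection(*gene_sets.values())
    # Count how many methods selected each gene.
    counts = Counter(gene for genes in gene_sets.values() for gene in genes)
    # Genes chosen by exactly one method, grouped by that method.
    unique = {
        method: {gene for gene in genes if counts[gene] == 1}
        for method, genes in gene_sets.items()
    }
    return common, unique


# Toy input: method and gene names are placeholders, not results from the paper.
toy_selection = {
    "method_a": ["G1", "G2", "G3", "G4"],
    "method_b": ["G2", "G3", "G5"],
    "method_c": ["G2", "G3", "G4", "G6"],
}
common, unique = common_and_unique_genes(toy_selection)
print(sorted(common))
print({method: sorted(genes) for method, genes in unique.items()})
```

In this toy input, a gene picked by two of the three methods (here `G4`) counts as neither common nor unique; a stricter or looser definition would change the reported totals.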