Journal of Guangxi Academy of Sciences

Cite this article:

LAI Dehuan, CHEN Qingfeng, HUANG Liyu, LIANG Jiahai. K-spectrum-based Analysis for Error Correction of Next Generation Sequencing[J]. Journal of Guangxi Academy of Sciences, 2017, 33(1): 7-11.


K-spectrum-based Analysis for Error Correction of Next Generation Sequencing
LAI Dehuan¹, CHEN Qingfeng¹,², HUANG Liyu³, LIANG Jiahai⁴
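(1. School of Computer, Electronics and Information, Guangxi University, Nanning, Guangxi 530004, China; 2. State Key Laboratory for Conservation and Utilization of Subtropical Agro-bioresources, Guangxi University, Nanning, Guangxi 530004, China; 3. Information Network Center, Guangxi University, Nanning, Guangxi 530004, China; 4. School of Electronics and Information Engineering, Qinzhou University, Qinzhou, Guangxi 535000, China)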

Abstract:

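[Objective] To analyze existing error correction algorithms and tools for Next Generation Sequencing (NGS) data and to propose an error correction algorithm based on the Hadoop platform, in order to address the insufficient memory and long running times encountered in large-scale data processing and to improve error correction performance. [Methods] Existing K-spectrum-based error correction algorithms were tested on specific datasets, and the performance of each tool was assessed by comparing running time, peak memory and correction results. On this basis, a Hadoop distributed parallel error correction algorithm (Parallel algorithm) was proposed and compared with the serial program, Lighter and Racer to analyze the feasibility of a distributed parallel implementation. [Results] Existing K-spectrum-based error correction tools generally consume a large amount of memory; among them, Racer and Sga gave the better correction results. The Hadoop distributed parallel error correction algorithm consumed less memory on each single machine, and once the data volume exceeded a certain size, the running time of the parallel distributed program was markedly shorter than that of the serial single-machine program. [Conclusion] The proposed Hadoop distributed parallel error correction algorithm not only reduces memory consumption but also improves computational performance, making it better suited to the analysis and processing of large-scale gene data.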

Key words: NGS; gene error correction; Hadoop; K-spectrum

DOI: 10.13657/j.cnki.gxkxyxb.20170228.001

Received: 2016-12-20

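Funding: Supported by the National Natural Science Foundation of China (61363025) and the Key Project of the Guangxi Natural Science Foundation (2013GXNSFDA019029).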

K-spectrum-based Analysis for Error Correction of Next Generation Sequencing

LAI Dehuan¹, CHEN Qingfeng¹,², HUANG Liyu³, LIANG Jiahai⁴

(1. School of Computer, Electronics and Information, Guangxi University, Nanning, Guangxi 530004, China; 2. State Key Laboratory for Conservation and Utilization of Subtropical Agro-bioresources, Guangxi University, Nanning, Guangxi 530004, China; 3. Information Network Center, Guangxi University, Nanning, Guangxi 530004, China; 4. School of Electronics and Information Engineering, Qinzhou University, Qinzhou, Guangxi 535000, China)

Abstract:

[Objective] Existing Next Generation Sequencing (NGS) error correction algorithms and tools are analyzed and summarized, and an error correction tool based on the Hadoop platform is proposed to solve the problems of insufficient memory and long running time in large-scale data processing. [Methods] Existing K-spectrum-based error correction algorithms are tested on specific data, and the performance of each error correction tool is measured by comparing its run time, peak memory and error correction results. A new error correction algorithm is then designed by combining the Hadoop parallel distributed framework with the algorithm proposed in this paper, and it is compared with the serial program, Lighter and Racer to analyze the feasibility of a distributed parallel program. [Results] Existing error correction tools based on the K-spectrum method generally have a large memory footprint; among them, Racer and Sga give better error correction results. The Hadoop distributed parallel error correction algorithm shows lower memory consumption on each single computer. When the data size exceeds a certain value, the running time of the parallel distributed program is significantly shorter than that of the serial single-machine program. [Conclusion] The parallel error correction program combined with Hadoop improves the memory usage and computational performance of K-spectrum-based NGS error correction, which benefits the analysis and processing of large-scale gene data.

Key words: NGS; gene error correction; Hadoop; K-spectrum
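
For readers unfamiliar with the K-spectrum idea named in the abstract, the following is a minimal Python sketch, not the authors' Hadoop implementation. It counts every length-k substring (k-mer) of the reads in a map step and a reduce step, treats k-mers seen fewer than `threshold` times as likely errors, and greedily tries single-base substitutions to repair them. The function names `map_kmers`, `reduce_counts` and `correct_read`, the fixed threshold and the greedy substitution rule are all assumptions made for illustration.

```python
from collections import Counter
from itertools import chain


def map_kmers(read, k):
    """Map step (hypothetical): emit every k-mer of one read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def reduce_counts(kmer_lists):
    """Reduce step (hypothetical): sum k-mer occurrences over all reads."""
    return Counter(chain.from_iterable(kmer_lists))


def correct_read(read, counts, k, threshold):
    """Greedy sketch: replace one base in each weak k-mer if that makes it solid.

    A k-mer is "solid" when its count is at least `threshold` (an assumed,
    fixed cutoff) and "weak" otherwise.
    """
    bases = list(read)
    for i in range(len(bases) - k + 1):
        kmer = "".join(bases[i:i + k])
        if counts[kmer] >= threshold:
            continue  # already solid, nothing to fix here
        best = None  # (count, offset within k-mer, replacement base)
        for j in range(k):
            for b in "ACGT":
                if b == kmer[j]:
                    continue
                cand = kmer[:j] + b + kmer[j + 1:]
                if counts[cand] >= threshold and (best is None or counts[cand] > best[0]):
                    best = (counts[cand], j, b)
        if best is not None:
            bases[i + best[1]] = best[2]
    return "".join(bases)


# Toy data: five error-free copies of a read plus one copy with a G->C error.
reads = ["ACGTACGT"] * 5 + ["ACGTACCT"]
counts = reduce_counts(map_kmers(r, 4) for r in reads)
print(correct_read("ACGTACCT", counts, k=4, threshold=3))  # prints ACGTACGT
```

In a distributed setting, the map and reduce functions would run on separate nodes so that no single machine has to hold the full k-mer table, which is the memory concern the abstract raises. This sketch keeps everything in one process for clarity.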
