
Gclust: A Parallel Clustering Tool for Microbial Genomic Data

Posted by site editor, Free考研考试, 2022-01-03

The accelerating growth of public microbial genomic data imposes a substantial burden on the research community that uses these resources. Building databases of non-redundant reference sequences from massive microbial genomic data through clustering analysis is therefore essential. However, existing clustering algorithms perform poorly on long genomic sequences. In this article, we present Gclust, a parallel program for clustering complete or draft genomic sequences, in which clustering is accelerated by a novel parallelization strategy and a fast sequence comparison algorithm based on sparse suffix arrays (SSAs). Genome identity measures between two sequences are calculated from their maximal exact matches (MEMs). We demonstrate the high speed and clustering quality of Gclust on four genome sequence datasets. Gclust is freely available for non-commercial use at https://github.com/niu-lab/gclust. We also provide a web server for clustering user-uploaded genomes at http://niulab.scgrid.cn/gclust.
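Public microbial genomic data worldwide continue to grow at an accelerating pace, placing a heavy burden on researchers who use these resources. Building databases of non-redundant reference sequences from large-scale microbial genomic data by means of clustering analysis is therefore essential. However, current clustering algorithms perform poorly on long genomic sequences. To address this problem, we developed Gclust, a software tool for parallel clustering of large-scale microbial whole genomes. First, the software accelerates clustering with a novel parallelization strategy and a fast sequence comparison algorithm based on sparse suffix arrays (SSAs). Second, it uses the maximal exact matches (MEMs) of the sequences to calculate the identity between two sequences. Finally, tests on four standard genome sequence datasets show that Gclust achieves the highest clustering performance and clustering quality. Gclust is open-source software that researchers can download free of charge for non-commercial use at https://github.com/niu-lab/gclust. We have also developed an online clustering platform where users can upload genomic data for testing, available at http://niulab.scgrid.cn/gclust.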
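To make the idea of a MEM-based identity measure concrete, here is a minimal Python sketch. It is not Gclust's implementation. The function names `maximal_exact_matches` and `mem_identity`, the `min_len` parameter, and the coverage-based identity formula are all assumptions chosen for illustration, and a plain k-mer hash index stands in for the sparse suffix array described in the abstract.

```python
from collections import defaultdict


def maximal_exact_matches(ref, query, min_len):
    """Return MEMs as sorted (ref_start, query_start, length) tuples.

    Hypothetical helper: seeds come from a k-mer hash index, a simple
    stand-in for a sparse suffix array, and are extended in both directions.
    """
    # Index every min_len-mer of the reference by its start position.
    index = defaultdict(list)
    for i in range(len(ref) - min_len + 1):
        index[ref[i:i + min_len]].append(i)

    mems = set()
    for j in range(len(query) - min_len + 1):
        for i in index.get(query[j:j + min_len], ()):
            # Extend the seed to the left while the characters agree.
            r_start, q_start = i, j
            while r_start > 0 and q_start > 0 and ref[r_start - 1] == query[q_start - 1]:
                r_start -= 1
                q_start -= 1
            # Extend the seed to the right while the characters agree.
            r_end, q_end = i + min_len, j + min_len
            while r_end < len(ref) and q_end < len(query) and ref[r_end] == query[q_end]:
                r_end += 1
                q_end += 1
            # Seeds inside the same match collapse to one entry in the set.
            mems.add((r_start, q_start, r_end - r_start))
    return sorted(mems)


def mem_identity(ref, query, min_len):
    """Assumed identity measure: fraction of query bases covered by a MEM."""
    if not query:
        return 0.0
    covered = [False] * len(query)
    for _, q_start, length in maximal_exact_matches(ref, query, min_len):
        for pos in range(q_start, q_start + length):
            covered[pos] = True
    return sum(covered) / len(query)


if __name__ == "__main__":
    reference = "ACGTACGTTTGACCA"
    candidate = "ACGTACGATTGACCA"  # differs from the reference at one position
    print(maximal_exact_matches(reference, candidate, 4))
    print(mem_identity(reference, candidate, 4))
```

In a clustering setting, a score like this would be compared against an identity threshold to decide whether a genome joins an existing cluster. The threshold, and the way candidate pairs are distributed across threads, are left out here because the abstract does not specify them.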





Full-text PDF download:

http://gpb.big.ac.cn/articles/download/732