Gclust: A Parallel Clustering Tool for Microbial Genomic Data

Posted by site editor, Free考研考试 / 2022-01-03

The accelerating growth of public microbial genomic data imposes a substantial burden on the research community that uses these resources. Building databases of non-redundant reference sequences from massive microbial genomic data by clustering analysis is therefore essential. However, existing clustering algorithms perform poorly on long genomic sequences. In this article, we present Gclust, a parallel program for clustering complete or draft genomic sequences, in which clustering is accelerated by a novel parallelization strategy and a fast sequence comparison algorithm that uses sparse suffix arrays (SSAs). Genome identity measures between two sequences are calculated from their maximal exact matches (MEMs). We demonstrate the high speed and clustering quality of Gclust on four genome sequence datasets. Gclust is freely available for non-commercial use at https://github.com/niu-lab/gclust. We also introduce a web server for clustering user-uploaded genomes at http://niulab.scgrid.cn/gclust.
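
To make the idea of an MEM-based identity measure concrete, the following is a minimal Python sketch, not Gclust's implementation. It finds maximal exact matches with a naive quadratic scan, whereas the abstract states that Gclust uses sparse suffix arrays. The function names, the `min_len` threshold, and the definition of identity as the fraction of query positions covered by MEMs are all assumptions made for illustration; the abstract does not give the exact formula.

```python
def maximal_exact_matches(ref, query, min_len):
    """Return sorted (ref_start, query_start, length) triples for every
    exact match of at least min_len that cannot be extended left or right.

    Naive O(len(ref) * len(query)) scan, for illustration only.
    """
    n, m = len(ref), len(query)
    mems = []
    # prev[j] holds the common-prefix length of ref[i+1:] and query[j:].
    prev = [0] * (m + 1)
    for i in range(n - 1, -1, -1):
        cur = [0] * (m + 1)
        for j in range(m - 1, -1, -1):
            if ref[i] == query[j]:
                cur[j] = prev[j + 1] + 1
        for j in range(m):
            length = cur[j]
            if length < min_len:
                continue
            # cur[j] is already right-maximal; check left-maximality.
            if i == 0 or j == 0 or ref[i - 1] != query[j - 1]:
                mems.append((i, j, length))
        prev = cur
    return sorted(mems)


def mem_identity(ref, query, min_len=4):
    """Hypothetical identity measure: fraction of query positions that lie
    inside at least one MEM of length >= min_len (an assumed definition)."""
    if not query:
        return 0.0
    covered = [False] * len(query)
    for _, q_start, length in maximal_exact_matches(ref, query, min_len):
        for k in range(q_start, q_start + length):
            covered[k] = True
    return sum(covered) / len(query)


if __name__ == "__main__":
    # One substitution at query position 4 leaves 9 of 10 positions covered.
    print(mem_identity("ACGTACGTGG", "ACGTTCGTGG", min_len=4))  # 0.9
```

On real genomes the quadratic scan is impractical, which is why an index such as a suffix array is used to locate matches instead.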

Full-text PDF download:

http://gpb.big.ac.cn/articles/download/732