The accelerating growth of public microbial genomic data imposes a substantial burden on the research community that uses these resources. Building databases of non-redundant reference sequences from massive microbial genomic data by clustering analysis is therefore essential. However, existing clustering algorithms perform poorly on long genomic sequences. In this article, we present Gclust, a parallel program for clustering complete or draft genomic sequences, in which clustering is accelerated by a novel parallelization strategy and a fast sequence comparison algorithm using sparse suffix arrays (SSAs). Genome identity between two sequences is calculated from their maximal exact matches (MEMs). We demonstrate the high speed and clustering quality of Gclust on four genome sequence datasets. Gclust is freely available for non-commercial use at https://github.com/niu-lab/gclust. We also provide a web server for clustering user-uploaded genomes at http://niulab.scgrid.cn/gclust.
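The accelerating growth of international public microbial genomic data places a heavy burden on researchers who use these resources. Building databases of non-redundant reference sequences from large-scale microbial genomic data by clustering analysis is therefore essential. However, current clustering algorithms perform poorly on long genomic sequences. To address this problem, we developed Gclust, a software tool for parallel clustering of large-scale microbial whole genomes. First, the software uses a novel parallelization strategy and a fast sequence comparison algorithm based on sparse suffix arrays (SSAs) to accelerate clustering. Second, the maximal exact matches (MEMs) between sequences are used to calculate the identity between two sequences. Finally, tests on four standard genome sequence datasets show that Gclust achieves the highest clustering performance and clustering quality. Gclust is open-source software that researchers can download free of charge for non-commercial use at https://github.com/niu-lab/gclust. We have also developed an online clustering platform where users can upload genomic data for testing, at http://niulab.scgrid.cn/gclust.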
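To make the idea of an MEM-based identity concrete, here is a minimal Python sketch. It rests on assumptions that go beyond the abstract: the function names `maximal_exact_matches` and `mem_identity`, the `min_len` threshold, and the definition of identity as the fraction of query positions covered by at least one MEM are all hypothetical, and the search uses a naive dynamic-programming scan instead of the sparse suffix arrays that Gclust uses. The abstract does not give Gclust's actual identity formula, so this shows the general technique only.

```python
def maximal_exact_matches(a, b, min_len):
    """Return (start_a, start_b, length) for each maximal exact match
    of at least min_len characters between strings a and b.

    Naive O(len(a) * len(b)) scan; NOT the sparse-suffix-array search
    used by Gclust (illustrative assumption only).
    """
    mems = []
    # prev[j] holds the length of the longest common suffix of
    # a[:i-1] and b[:j-1]; cur[j] holds the same for a[:i] and b[:j].
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                # The match is left-maximal by construction; it is
                # right-maximal if it cannot be extended by one character.
                at_end = i == len(a) or j == len(b)
                if (at_end or a[i] != b[j]) and cur[j] >= min_len:
                    length = cur[j]
                    mems.append((i - length, j - length, length))
        prev = cur
    return mems


def mem_identity(query, ref, min_len=4):
    """Fraction of query positions covered by at least one MEM.

    Hypothetical identity measure chosen for illustration.
    """
    if not query:
        return 0.0
    covered = [False] * len(query)
    for start_q, _start_r, length in maximal_exact_matches(query, ref, min_len):
        for k in range(start_q, start_q + length):
            covered[k] = True
    return sum(covered) / len(query)


if __name__ == "__main__":
    query = "ACGTACGTTT"
    ref = "ACGTACGAAA"
    print(maximal_exact_matches(query, ref, 4))
    print(mem_identity(query, ref, 4))
```

In a greedy clustering scheme, a score like this would typically be compared against a user-chosen identity threshold to decide whether a genome joins an existing cluster or starts a new one. Whether Gclust applies its threshold in exactly this way is not stated in the abstract.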
Full-text PDF download:
http://gpb.big.ac.cn/articles/download/732
Gclust: A Parallel Clustering Tool for Microbial Genomic Data
Site editor: Free考研考试 / 2022-01-03