
gFACs: Gene Filtering, Analysis, and Conversion to Unify Genome Annotations Across Alignment and Gen

Posted by the site editor, Free考研考试, 2022-01-03

Published genomes frequently contain erroneous gene models that represent issues associated with identification of open reading frames, start sites, splice sites, and related structural features. The source of these inconsistencies is often traced back to integration across text file formats designed to describe long read alignments and predicted gene structures. In addition, the majority of gene prediction frameworks do not provide robust downstream filtering to remove problematic gene annotations, nor do they represent these annotations in a format consistent with current file standards. These frameworks also lack consideration for functional attributes, such as the presence or absence of protein domains that can be used for gene model validation. To provide oversight to the increasing number of published genome annotations, we present a software package, the Gene Filtering, Analysis, and Conversion (gFACs), to filter, analyze, and convert predicted gene models and alignments. The software operates across a wide range of alignment, analysis, and gene prediction files with a flexible framework for defining gene models with reliable structural and functional attributes. gFACs supports common downstream applications, including genome browsers, and generates extensive details on the filtering process, including distributions that can be visualized to further assess the proposed gene space. gFACs is freely available and implemented in Perl with support from BioPerl libraries at https://gitlab.com/PlantGenomicsLab/gFACs.
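In recent years, with the development and spread of high-throughput sequencing technologies, genome sizes and the complexity of assembly and annotation have grown steadily. Even so, of the nearly 7,800 eukaryotic genomes in GenBank, only a small number are assembled and annotated to the chromosome level. Among these eukaryotic genomes, more than 85% contain erroneous gene annotations. These errors often arise from integrating text files that describe gene annotations in different formats. In addition, most gene prediction pipelines lack a step for filtering out redundant genes and do not provide annotation output in current standard formats. These pipelines also rarely consider functional attributes, such as protein domain information that can be used to validate gene models. To provide effective oversight of the growing number of genome annotations, we developed gFACs, a software package that filters, analyzes, and converts gene annotation files and alignment results. Using information from the reference genome, the software filters out erroneous gene models, generates summary statistics, and provides output files that can be visualized in downstream analyses. Notably, the software does not replace gene prediction based on ab initio or similarity-based annotation; rather, it is a tool for comparing conflicting annotations and thereby improving annotation accuracy. gFACs also supports commonly used additional functions, such as genome browsing, and generates detailed information about the filtering process. gFACs is an open-source package implemented in Perl with support from BioPerl libraries, and it is free to download at https://gitlab.com/PlantGenomicsLab/gFACs.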
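To make the idea of structural filtering concrete, the following is a minimal Python sketch of the kind of checks the abstract describes (start sites, open reading frames, in-frame stops). It is not gFACs code, which is written in Perl, and it does not reproduce gFACs options or output. The function names `check_cds` and `filter_models`, the problem labels, and the toy sequences are all hypothetical and chosen only for illustration.

```python
from __future__ import annotations

# Illustrative sketch only; NOT the gFACs implementation.
# All names and labels below are hypothetical.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_cds(cds: str) -> list[str]:
    """Return the structural problems found in one coding sequence."""
    cds = cds.upper()
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length_not_multiple_of_3")
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    if not codons or codons[0] != "ATG":
        problems.append("missing_start_codon")
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("missing_stop_codon")
    # A stop codon anywhere before the final codon truncates the ORF.
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("in_frame_stop_codon")
    return problems


def filter_models(models: dict[str, str]):
    """Split gene models into kept ones and rejected ones with reasons."""
    kept, rejected = {}, {}
    for gene_id, cds in models.items():
        problems = check_cds(cds)
        if problems:
            rejected[gene_id] = problems
        else:
            kept[gene_id] = cds
    return kept, rejected
```

Calling `filter_models({"g1": "ATGGCCTAA", "g2": "ATGTAAGCCTAA"})` keeps `g1` and rejects `g2` for an in-frame stop codon. Recording a reason for every rejected model mirrors the abstract's point that a filtering step should report details of what it removed, so the resulting gene space can be assessed afterwards.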





Full-text PDF download:

http://gpb.big.ac.cn/articles/download/712