The BIG Data Center's database resources

Yuansheng Zhang¹,²,³, Lin Xia¹,²,³, Jian Sang¹,²,³, Man Li¹,²,³, Lin Liu¹,²,³, Mengwei Li¹,²,³, Guangyi Niu¹,²,³, Jiabao Cao¹,²,³, Xufei Teng¹,²,³, Qing Zhou¹,²,³, Zhang Zhang¹,²,³

1. BIG Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China

2. CAS Key Laboratory of Genomics and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China

3. University of Chinese Academy of Sciences, Beijing 100049, China

Corresponding author: Zhang Zhang, PhD, Research Professor; research interest: bioinformatics. E-mail: zhangzhang@big.ac.cn

Editor: Kun Xia
Received: 2018-07-05; Revised: 2018-09-12; Published online: 2018-11-20

Funding:

Supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (Nos. XDA19050302, XDB13040500 and XDA08020102), the National Key Research & Development Program of China (No. 2016YFC0901603), and the 13th Five-year Informatization Plan of the Chinese Academy of Sciences (No. XXH13505-05).

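About the authors
Yuansheng Zhang, master's student; field: bioinformatics. E-mail: zhangyuansheng@big.ac.cn

Lin Xia, PhD student; field: bioinformatics. E-mail: xialin@big.ac.cn

Jian Sang, PhD student; field: bioinformatics. E-mail: sangj@big.ac.cn

Yuansheng Zhang, Lin Xia and Jian Sang are co-first authors.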

Abstract

Multi-omics data in the life and health sciences are a fundamental basis for scientific research and for the development of biomedical technology. However, China has lacked a platform for biological data management and sharing. This makes it difficult to meet the growing needs of biomedical and related research fields in China and severely restricts the integration, sharing and translation of the country's biological big data. To address these issues, the Beijing Institute of Genomics (BIG) of the Chinese Academy of Sciences founded the BIG Data Center (BIGD) in early 2016. BIGD is dedicated to establishing a biological big data management platform and a system of multi-omics data resources, with a particular focus on national population health and important strategic biological resources. In this paper, we describe the core database resources of BIGD, including GSA (Genome Sequence Archive), GWH (Genome Warehouse), GVM (Genome Variation Map), GEN (Gene Expression Nebulas), MethBank (Methylation Bank), BioCode (Biological Tool Codes) and Science Wikis. Together, these resources provide services for data deposition, integration and sharing, laying a solid foundation for improving life science data management in China and for advancing the construction of a national bioinformatics center.

Keywords: big data; omics; data sharing; data resource; bioinformatics

Cite this article:
Yuansheng Zhang, Lin Xia, Jian Sang, Man Li, Lin Liu, Mengwei Li, Guangyi Niu, Jiabao Cao, Xufei Teng, Qing Zhou, Zhang Zhang. The BIG Data Center’s database resources[J]. Hereditas(Beijing), 2018, 40(11): 1039-1043 doi:10.16288/j.yczz.18-190

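With the rapid development of high-throughput sequencing technologies and the steady decline in sequencing costs, life and health data are growing explosively, and China has become one of the world's major producers of genomic data. By 2023, China is projected to generate more than 200 PB of genomic data per year. However, China's life and health data currently face two major problems. The first is data outflow. Although China produces about 40% of the world's biological omics data, it has long lacked an internationally recognized national biological database system. Scientists have therefore had to submit their data to the National Center for Biotechnology Information (NCBI) in the United States, the European Bioinformatics Institute (EBI) and the DNA Data Bank of Japan (DDBJ), resulting in an outflow of China's valuable biological and genetic data. The second is data silos. Because there is no mechanism for managing data sharing, valuable omics data are scattered across individual laboratories and institutions in China and are not effectively shared or integrated. The resulting data silos severely restrict the integration, sharing and translation of China's biological big data. From the perspective of both national strategy and the development of the field, there is an urgent need to establish a platform for depositing and sharing biological big data that serves China's scientific data management, and to develop a system of multi-omics data resources oriented toward China's population health and important strategic biological resources, so as to change China's awkward position as "a major sequencing country but a weak data country". To address these problems, the BIG Data Center at the Beijing Institute of Genomics, Chinese Academy of Sciences, focuses on omics data related to national population health and important strategic biological resources. It is building a platform for the deposition, sharing and management of biological big data and developing a system of multi-omics data resources for life and health, laying an important foundation for improving life science data management in China and for advancing the construction of a national bioinformatics center.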

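1 Tasks and mission of the BIG Data Center

The BIG Data Center (BIGD) was formally established in February 2016 at the Beijing Institute of Genomics, Chinese Academy of Sciences^[1,2]. Focusing on China's population health and nationally important strategic biological resources, BIGD's main tasks are to establish a system for the deposition and storage, integration and mining, sharing and management, and translation and application of biological big data, and to develop a platform for biological big data deposition and management together with a system of multi-omics data resources, in support of public-interest scientific research and industrial innovation in China. Since its founding, BIGD has built a life and health big data resource system covering multiple omics data types. It mainly includes the Genome Sequence Archive (GSA), Genome Warehouse (GWH), Genome Variation Map (GVM), Gene Expression Nebulas (GEN), Methylation Bank (MethBank), Biological Tool Codes (BioCode) and Science Wikis. These resources provide sharing of and access to biological big data for users worldwide and form an important data resource and infrastructure for scientific discovery and industrial development in China.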

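2 Data resources serving the global community

2.1 Genome Sequence Archive (GSA)

GSA (http://bigd.big.ac.cn/gsa)^[3,4] is a public data management platform for the deposition, storage, management and sharing of raw omics data from around the world. As the first omics data release platform in China to be recognized by international journals, GSA accepts raw omics data generated on different sequencing platforms from submitters worldwide and provides free, long-term storage, management and sharing services. In addition to raw sequencing data, GSA accepts secondary analysis files in formats such as BAM (Binary Alignment Map) and VCF (Variant Call Format). Data in GSA consist of two parts, metadata and raw sequences. The metadata objects are linked by many-to-one relationships in a logical order from the smallest unit to the largest: Run, Experiment, BioSample and BioProject. To ensure compatibility with the systems of the International Nucleotide Sequence Database Collaboration (INSDC), GSA assigns a unique accession number to each set of submitted data: project accessions are prefixed with "PRJCA", sample accessions with "SAMC", and run accessions with "CRR". As of July 2018, GSA had received data submitted by 308 users from 93 research institutions worldwide, comprising 535 BioProjects, 21,843 BioSamples, 28,050 Experiments and 29,624 Runs, covering 178 species, with more than 536 TB of raw omics data stored in total. GSA is currently recognized by more than 30 authoritative international journals, including Cell, Nature, PNAS, AJHG, GPB and Cell Research, and supports major national research programs.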
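The metadata hierarchy and accession prefixes described above can be illustrated with a minimal Python sketch. This is not GSA's implementation: the class layout, the rule that a prefix is followed only by digits, and the example accessions (including the placeholder Experiment identifier, whose prefix the text does not give) are assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Prefixes come from the text; "prefix followed by digits" is an assumption.
ACCESSION_PATTERNS = {
    "BioProject": re.compile(r"^PRJCA\d+$"),
    "BioSample": re.compile(r"^SAMC\d+$"),
    "Run": re.compile(r"^CRR\d+$"),
}


def accession_type(accession: str) -> Optional[str]:
    """Return the object type implied by an accession prefix, or None."""
    for kind, pattern in ACCESSION_PATTERNS.items():
        if pattern.match(accession):
            return kind
    return None


# Each object points to exactly one parent, so many Runs can share one
# Experiment, many Experiments one BioSample, and many BioSamples one BioProject.
@dataclass(frozen=True)
class BioProject:
    accession: str


@dataclass(frozen=True)
class BioSample:
    accession: str
    project: BioProject


@dataclass(frozen=True)
class Experiment:
    accession: str  # hypothetical identifier; its prefix is not stated in the text
    sample: BioSample


@dataclass(frozen=True)
class Run:
    accession: str
    experiment: Experiment


def project_of(run: Run) -> BioProject:
    """Walk up the many-to-one chain from a Run to its BioProject."""
    return run.experiment.sample.project
```

With hypothetical accessions, `accession_type("CRR000003")` yields `"Run"`, and `project_of` resolves any Run to the single BioProject it belongs to.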

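2.2 Genome Warehouse (GWH)

GWH (http://bigd.big.ac.cn/gwh) provides data services for the deposition, storage, release and sharing of genome sequences and gene annotations for multiple species, covering species that are nationally important strategic resources, including animals, plants, fungi and bacteria. Following the INSDC data standards, GWH organizes data into three objects: BioSample, BioProject and Assembly. Data in GWH come from two sources: (1) direct submissions from users; and (2) integration of published genome information for important species. For user submissions, GWH applies strict quality-control standards that check the completeness and consistency of genome sequence IDs, sequence content and gene structures. As of July 2018, GWH housed genome data for 254 species: 116 species submitted directly by users (7 animals, 10 plants, 1 fungus, 74 bacteria, 23 archaea and 1 virus) and 138 species whose published genomes were integrated by GWH (61 animals and 77 plants).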
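The text does not specify GWH's quality-control rules. As a rough illustration of what checks on sequence IDs and sequence content might look like, the following Python sketch scans FASTA text for missing or duplicate IDs, empty records and unexpected characters. The function name, the accepted alphabet (A, C, G, T, N in either case) and the message wording are assumptions, not GWH's actual criteria.

```python
from typing import List

# Assumption: plain nucleotide codes plus N; real pipelines may allow more.
VALID_BASES = set("ACGTNacgtn")


def check_fasta(text: str) -> List[str]:
    """Return a list of human-readable problems found in FASTA text."""
    problems: List[str] = []
    seen_ids = set()
    current_id = None  # ID of the record being read, None before the first header
    current_len = 0    # number of bases seen so far for the current record
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Closing the previous record: flag it if it had no sequence.
            if current_id is not None and current_len == 0:
                problems.append(f"sequence '{current_id}' is empty")
            fields = line[1:].split()
            current_id = fields[0] if fields else ""
            current_len = 0
            if not current_id:
                problems.append(f"line {line_no}: header without an ID")
            elif current_id in seen_ids:
                problems.append(f"line {line_no}: duplicate ID '{current_id}'")
            seen_ids.add(current_id)
        else:
            if current_id is None:
                problems.append(f"line {line_no}: sequence data before any header")
            bad = set(line) - VALID_BASES
            if bad:
                problems.append(f"line {line_no}: unexpected characters {sorted(bad)}")
            current_len += len(line)
    if current_id is not None and current_len == 0:
        problems.append(f"sequence '{current_id}' is empty")
    return problems
```

An empty result means the file passed these particular checks. Checks on gene structure would need the annotation file as well and are outside this sketch.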

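2.3 Genome Variation Map (GVM)

GVM (http://bigd.big.ac.cn/gvm)^[5,6] is a database for the deposition, management, integration and retrieval of genome variation data. It provides information on the genetic diversity of multiple species and on associations between genetic variants and phenotypes. GVM is organized by species and collects the variant sites in each genome together with their annotations. The main data types are single nucleotide polymorphisms (SNPs) and small insertions and deletions (InDels). Unlike other variation databases such as dbSNP, GVM covers species that represent China's important strategic resources and biodiversity. As of July 2018, GVM covered 25 species with about 5 billion variants in total, including human; livestock and poultry such as pig (Sus scrofa), cattle (Bos taurus), goat (Capra hircus), chicken (Gallus gallus) and duck (Anas platyrhynchos); and crops such as rice (Oryza sativa), maize (Zea mays), sorghum (Sorghum bicolor), wheat (Triticum aestivum), tomato (Solanum lycopersicum) and soybean (Glycine max). In addition, through manual curation of the scientific literature, GVM integrates genotype-phenotype (G2P) associations, comprising 180,911 G2P records for human and 13,262 for non-human species. GVM also provides multi-condition combined searches and visualization of variant information, and supports the submission and download of different types of genome variation data. For important featured species, dedicated databases have been established: the Virtual Chinese Genome Database^[7], the Sorghum Genome SNP Database^[9] and the Dog Genome SNP Database^[8].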
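To make the distinction between the variant types named above concrete, the sketch below classifies a variant from VCF-style reference and alternate alleles by comparing their lengths. This is a common convention rather than GVM's documented procedure, and the function name and category labels are illustrative assumptions.

```python
from collections import Counter
from typing import Iterable, Tuple


def classify_variant(ref: str, alt: str) -> str:
    """Classify a variant from its reference and alternate alleles."""
    ref, alt = ref.upper(), alt.upper()
    if len(ref) == 1 and len(alt) == 1:
        return "SNP" if ref != alt else "no-change"
    if len(ref) < len(alt):
        return "insertion"
    if len(ref) > len(alt):
        return "deletion"
    return "other"  # same-length multi-base substitution


def tally_variants(pairs: Iterable[Tuple[str, str]]) -> Counter:
    """Count variant types over (ref, alt) allele pairs."""
    return Counter(classify_variant(ref, alt) for ref, alt in pairs)
```

For example, the pair `("A", "G")` is counted as a SNP, `("A", "AT")` as an insertion and `("CAG", "C")` as a deletion; insertions and deletions together make up the InDel category.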

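2.4 Gene Expression Nebulas (GEN)

GEN (http://bigd.big.ac.cn/gen) is a platform for collecting and sharing gene expression data. It systematically integrates expression data across multiple species, tissues and samples, providing gene expression information for research in precision medicine, molecular breeding, biosafety and biodiversity. GEN currently covers expression data based on next-generation sequencing for mammals such as human (Homo sapiens), pig (S. scrofa) and mouse (Mus musculus) and for plants such as rice (O. sativa), as well as manually curated information on internal control genes in multiple species. These are provided through the Mammalian Transcriptomic Database (MTD, http://mtd.cbi.ac.cn)^[10], the Rice Expression Database (RED, http://expression.ic4r.org)^[11] and the Internal Control Genes knowledgebase (ICG, http://icg.big.ac.cn)^[12], respectively. As high-throughput sequencing data accumulate, GEN will support the collection and management of expression data for more species, enable visual comparison of the expression profiles of homologous genes across species, and integrate gene expression data at the single-cell level, offering researchers comprehensive, multi-dimensional data and display functions.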

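2.5 Methylation Bank (MethBank)

MethBank (http://bigd.big.ac.cn/methbank)^[13,14] integrates whole-genome, high-resolution DNA methylation maps of humans and important animals and plants, and provides user-friendly functions for data browsing, searching, analysis and download. MethBank offers information on gene methylation, differential methylation, specific methylation and age-associated methylation, serving as an important data resource and analysis platform for research in precision medicine, public security, and animal and plant breeding: (1) for human aging, it contains 450K array data from 4577 peripheral blood samples of healthy individuals, integrated and curated into consensus reference methylomes (CRMs) for 34 age groups; (2) for animals, it integrates 18 single-base resolution methylomes of gametes and early embryos from two model animals (zebrafish, Danio rerio, and mouse, M. musculus); (3) for plants, it integrates 336 single-base resolution methylomes from multiple tissues at different developmental stages of five important economic crops: rice (O. sativa), soybean (G. max), cassava (Manihot esculenta), tomato (S. lycopersicum) and common bean (Phaseolus vulgaris). In addition, MethBank provides two online analysis tools, Age Predictor and IDMP (Identification of Differentially Methylated Promoter), for predicting human methylation age and identifying differentially methylated promoters, respectively.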
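The text does not describe how the consensus reference methylomes are computed. Purely as an illustration of the idea of summarizing many samples into one profile per age group, the sketch below bins samples by age and averages the methylation level (beta value) of each probe within a bin. The fixed bin width, the use of a simple mean, the function names and the probe identifiers in the usage note are all assumptions and should not be read as MethBank's method.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def age_group(age: int, width: int = 5) -> str:
    """Label the fixed-width age bin that contains `age` (assumed binning)."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def consensus_methylomes(
    samples: Iterable[Tuple[int, Dict[str, float]]], width: int = 5
) -> Dict[str, Dict[str, float]]:
    """Average per-probe beta values within each age group.

    `samples` yields (age, {probe_id: beta}) pairs, one per individual.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for age, betas in samples:
        group = age_group(age, width)
        for probe, beta in betas.items():
            grouped[group][probe].append(beta)
    return {
        group: {probe: mean(values) for probe, values in probes.items()}
        for group, probes in grouped.items()
    }
```

Given two hypothetical samples aged 21 and 23 with beta values 0.2 and 0.4 at the same probe, the "20-24" group would report a mean of about 0.3 for that probe.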

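2.6 Biological Tool Codes (BioCode)

BioCode (http://bigd.big.ac.cn/biocode) is a database of open-source bioinformatics software and tools. Its main function is to collect and classify the source code, software packages and key metadata of bioinformatics tools, including software name, functional description, category, associated publications, citation counts, and developer and affiliation information. As of July 2018, it contained 6980 software packages, drawn mainly from bioinformatics journals such as Bioinformatics, Genome Biology and Nucleic Acids Research. BioCode allows users to submit tools and packages they have developed, providing centralized archiving and management of bioinformatics software and supporting one-stop search of and open access to these tools. This makes it convenient for developers to host, archive and release bioinformatics tools, and allows users to browse, search and download content of interest.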

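2.7 Science Wikis

Science Wikis (http://bigd.big.ac.cn/sciencewikis) is a database system that uses wiki technology, or the wiki approach, to integrate and share life science knowledge and data. It allows users to add and edit life science knowledge and data, with the aim of harnessing collective intelligence for the integration and curation of biological knowledge and data. Science Wikis currently includes the following sub-databases: (1) ICG, a knowledgebase of internal control genes, which contains more than 700 internal control genes from 209 species (73 animals, 115 plants, 12 fungi and 9 bacteria); (2) LncRNAWiki^[15], a knowledgebase of human long non-coding RNAs (lncRNAs), which integrates 106,063 human lncRNAs and provides curation and annotation for more than 1000 lncRNAs supported by the literature; (3) RiceWiki^[16], a wiki-based database of rice genes, which currently contains 86,216 genes from japonica and indica rice, with high-quality curation of more than 1000 genes and annotations covering expression, function, evolution and literature references; (4) Database Commons, a catalog of biological databases worldwide, which integrates information on nearly 5000 published biological databases.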

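3 Conclusions and perspectives

In the two years since its founding, BIGD has made breakthrough progress, and in 2018 it was recognized in the Database Issue of Nucleic Acids Research as one of the world's major biological data centers. Looking ahead, in terms of data resources, BIGD will continue to focus on population health and nationally important strategic biological resources, attending not only to data deposition but also to data integration and translation. It aims to build a biological big data center that is founded on data deposition, advanced through data integration, oriented toward mining and analysis, and driven by frontier research fields, and thereby to form a platform for the deposition and sharing of biological big data and a system of multi-omics data resources with Chinese characteristics. In terms of methods and technology, BIGD will make full use of frontier interdisciplinary technologies (including cloud computing, artificial intelligence, deep learning, wikis and biocuration) in building the platform for sharing and accessing the data center's resources. Although a long road lies ahead, BIGD will actively pursue collaboration with NCBI, EBI and other major international bioinformatics centers, and will work with other bioinformatics resource management and service organizations in China to jointly advance the construction of a national bioinformatics center.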

The authors have declared that no competing interests exist.


References

[1] BIG Data Center Members. The BIG Data Center: from deposition to integration to translation. Nucleic Acids Res, 2017, 45(Database Issue): D18-D24.

[2] BIG Data Center Members. Database resources of the BIG Data Center in 2018. Nucleic Acids Res, 2018, 46(Database Issue): D14-D20.

[3] Wang, Song, Zhu, Zhang, Yang, Chen, Tang, Dong, Ding, Zhang, Bai, Dong, Chen, Sun, Zhai, Sun, Yu, Lan, Xiao, Fang, Lei, Zhang, Zhao. GSA: Genome Sequence Archive. Genom Proteom Bioinf, 2017, 15(1): 14-18.

[4] Zhang S, Chen T, Zhu J, Zhou Q, Chen X, Wang Y, Zhao W. GSA: Genome Sequence Archive. Hereditas (Beijing), 2018, 40(11): 1044-1047.

[5] Song, Tian, Li, Tang, Dong, Xiao, Bao, Zhao, He, Zhang. Genome Variation Map: a data repository of genome variations in BIG Data Center. Nucleic Acids Res, 2018, 46(Database Issue): D944-D949.

[6] Song S, Teng X, Xiao J. Database resources of the reference genome and genetic variation maps for the Chinese population. Hereditas (Beijing), 2018, 40(11): 1048-1054.

[7] Ling, Jin, Su, Zhong, Zhao, Yu, Wu, Xiao. VCGDB: a dynamic genome database of the Chinese population. BMC Genomics, 2014, 15: 265.

[8] Bai, Zhao, Tang, Wang, Zhang, Yang, Liu, Zhu, Irwin, Wang, Zhang. DoGSD: the dog and wolf genome SNP database. Nucleic Acids Res, 2015, 43(Database Issue): D777-D783.

[9] Luo, Zhao, Wang, Xia, Wu, Zhang, Tang, Zhu, Fang, Du, Bekele, Tai, Jordan, Godwin, Snowdon, Mace, Jing, Luo. SorGSD: a sorghum genome SNP database. Biotechnol Biofuels, 2016, 9(1): 6.

[10] Sheng, Wu, Sun, Xian, Li, Sun, Fang, Chen, Yu, Xiao. MTD: a mammalian transcriptomic database to explore gene expression regulation. Brief Bioinform, 2017, 18(1): 28-36.

[11] Xia, Zou, Sang, Xu, Yin, Li, Wu, Hu, Hao, Zhang. Rice Expression Database (RED): an integrated RNA-Seq-derived gene expression database for rice. J Genet Genomics, 2017, 44(5): 235-241.

[12] Sang, Wang, Li, Cao, Niu, Xia, Zou, Wang, Xu, Han, Fan, Yang, Zuo, Zhang, Zhao, Bao, Xiao, Hu, Hao, Zhang. ICG: a wiki-driven knowledgebase of internal control genes for RT-qPCR normalization. Nucleic Acids Res, 2018, 46(Database Issue): D121-D126.

[13] Zou, Sun, Li, Liu, Zhang. MethBank: a database integrating next-generation sequencing single-base-resolution DNA methylation programming data. Nucleic Acids Res, 2015, 43(Database Issue): D54-D58.

[14] […], Liang, Li, Zou, Sun, Zhao, Bao, Xiao, Zhang. MethBank 3.0: a database of DNA methylomes across a variety of species. Nucleic Acids Res, 2018, 46(Database Issue): D288-D295.

[15] […], Li, Zou, Xu, Xia, Yu, Bajic, Zhang. LncRNAWiki: harnessing community knowledge in collaborative curation of human long non-coding RNAs. Nucleic Acids Res, 2015, 43(Database Issue): D187-D192.

[16] Zhang, Sang, Ma, Wu, Huang, Zou, Liu, Li, Hao, Tian, Xu, Wang, Wu, Xiao, Dai, Chen, Hu, Yu. RiceWiki: a wiki-based database for community curation of rice genes. Nucleic Acids Res, 2014, 42(Database Issue): D1222-D1228.