
限制性两阶段多位点全基因组关联分析方法的特点与计算程序


贺建波, 刘方东, 邢光南, 王吴彬, 赵团结, 管荣展, 盖钧镒*
南京农业大学大豆研究所 / 农业部大豆生物学与遗传育种重点实验室 / 国家大豆改良中心 / 作物遗传与种质创新国家重点实验室, 江苏南京 210095

Characterization and Analytical Programs of the Restricted Two-stage Multi-locus Genome-wide Association Analysis

HE Jian-Bo, LIU Fang-Dong, XING Guang-Nan, WANG Wu-Bin, ZHAO Tuan-Jie, GUAN Rong-Zhan, GAI Jun-Yi*
Soybean Research Institute / National Center for Soybean Improvement, Ministry of Agriculture / Key Laboratory of Biology and Genetic Improvement of Soybean (General), Ministry of Agriculture / State Key Laboratory for Crop Genetics and Germplasm Enhancement, Nanjing Agricultural University, Nanjing 210095, Jiangsu, China

* Corresponding author: GAI Jun-Yi, E-mail: sri@njau.edu.cn
First author contact: E-mail: hjbxyz@gmail.com
Received: 2018-03-19; Accepted: 2018-06-12; Published online: 2018-06-29
Funding: This study was supported by the National Natural Science Foundation of China (31701447, 31671718), the National Key R&D Program for Crop Breeding in China (2017YFD0101500), the MOE 111 Project (B08025), the MOE Program for Changjiang Scholars and Innovative Research Team in University (PCSIRT_17R55), the China Agriculture Research System (CARS-04), the Jiangsu Higher Education PAPD Program, the Fundamental Research Funds for the Central Universities, and the Jiangsu JCIC-MCP.


摘要
全基因组关联分析(genome-wide association study, GWAS)的理论及应用是近十几年来国内外数量性状研究的热点, 但是以往GWAS方法注重于个别主要QTL/基因的检测与发掘。为了相对全面地解析全基因组QTL及其等位基因构成, 本研究提出了限制性两阶段多位点GWAS方法(RTM-GWAS, https://github.com/njau-sri/rtm-gwas)。RTM-GWAS首先将多个相邻且紧密连锁的SNP分组, 成为具有多个单倍型(复等位变异)的连锁不平衡区段(SNPLDB)标记, 然后采用两阶段分析策略, 基于多位点复等位变异遗传模型, 在节省计算空间的条件下保障全基因组QTL及其复等位变异检出的精确度。和以往GWAS方法相比, RTM-GWAS以性状遗传率为上限, 能够较充分地检测出QTL及其相应的复等位变异并能有效地控制假阳性的膨胀。由其结果建立的QTL-allele矩阵代表了群体中所研究性状的全部遗传组成。依据这种QTL-allele矩阵的信息, 可以设计最优基因型的遗传组成, 预测群体中最优化的杂交组合, 并用以进行群体遗传和特有与新生等位变异的研究。本研究首先对RTM-GWAS方法的特点和计算程序功能进行说明, 然后通过大豆试验数据说明RTM-GWAS计算程序的使用方法。
关键词: 限制性两阶段多位点全基因组关联分析;连锁不平衡区段;多位点模型;QTL-allele矩阵;种质资源群体;优化组合设计

Abstract
Genome-wide association studies (GWAS) have been widely used for the genetic dissection of quantitative trait loci (QTL). Previous GWAS procedures concentrated on finding a handful of major loci, whereas plant breeders are more interested in exploring the whole QTL system for both forward selection and background control. We proposed the restricted two-stage multi-locus genome-wide association analysis (RTM-GWAS, https://github.com/njau-sri/rtm-gwas/) for a relatively thorough detection of QTL and their multiple alleles. First, RTM-GWAS groups tightly linked sequential SNPs into linkage disequilibrium blocks (SNPLDBs) to form genomic markers with multiple haplotypes as alleles. Second, it uses a two-stage association analysis based on a multi-locus multi-allele model, which reduces the computational burden while maintaining precise genome-wide identification of QTL and their multiple alleles. Compared with previous GWAS methods, RTM-GWAS takes the trait heritability as the upper limit of the detected genetic contribution, which avoids a large number of false positives and allows a precise detection of the QTL system of the trait. The QTL-allele matrix, a compact form of the population genetic constitution, can be used to design optimal genotypes, to predict optimal crosses in plant breeding, and to study the genetic properties of the population as well as novel and newly emerged alleles. In the present study, we first describe the features of the RTM-GWAS method and the functions of its analytical programs, and then use experimental data from a soybean research program to illustrate their application.
Keywords: restricted two-stage multi-locus genome-wide association study; SNP linkage disequilibrium block; multi-locus model; QTL-allele matrix; germplasm population; optimal cross design


Cite this article
贺建波, 刘方东, 邢光南, 王吴彬, 赵团结, 管荣展, 盖钧镒. 限制性两阶段多位点全基因组关联分析方法的特点与计算程序[J]. 作物学报, 2018, 44(9): 1274-1289. doi:10.3724/SP.J.1006.2018.01274
HE Jian-Bo, LIU Fang-Dong, XING Guang-Nan, WANG Wu-Bin, ZHAO Tuan-Jie, GUAN Rong-Zhan, GAI Jun-Yi. Characterization and Analytical Programs of the Restricted Two-stage Multi-locus Genome-wide Association Analysis[J]. Acta Agronomica Sinica, 2018, 44(9): 1274-1289. doi:10.3724/SP.J.1006.2018.01274


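Genome-wide association study (GWAS) based on high-density genome-wide single-nucleotide polymorphism (SNP) markers has become an important approach for dissecting the genetic basis of quantitative traits. GWAS exploits the large number of historical recombination events accumulated in natural populations and therefore offers high mapping resolution. It has been widely used to discover genes underlying complex traits in animals and plants, and its theory and applications have been a focus of quantitative-trait research for more than a decade [1]. However, natural populations usually carry unknown population structure, which interferes with GWAS and inflates the false positive rate. Current GWAS statistical methods that correct for population structure mainly include structured association (SA) [2], principal components analysis (PCA) [3], and the linear mixed model (LMM) [4,5,6,7]. Structured association first infers population structure with model-based clustering programs such as STRUCTURE [8] or ADMIXTURE [9], and then includes the inferred structure as covariates in the association test. PCA uses the eigenvectors of a marker-based genetic relationship matrix as covariates. The mixed model builds on these two approaches by adding the genetic background effect as a random effect, with a marker-based kinship matrix as its covariance structure, so that population structure and familial relatedness are controlled simultaneously; it is a commonly used method in plant GWAS [10,11,12,13].

Plant GWAS usually takes a germplasm population as the experimental population. Such populations harbor broad genetic variation, and multiple alleles per locus are common. Plant breeding is essentially a process of pyramiding elite alleles, so detecting multiple alleles and estimating their effects is crucial for breeding; the discovery of elite alleles not only supports marker-assisted selection but is also a prerequisite for breeding by design [14]. However, the SNP markers used by previous GWAS methods have only two alleles per locus and therefore cannot estimate the effects of the multiple alleles that abound in germplasm populations, which limits their use in plant breeding. Although GWAS has played an important role in the genetic study of quantitative traits, previous methods focus on detecting a few major QTL/genes and usually apply a stringent significance level for multiple-testing correction. The Bonferroni correction, for example, uses 0.05/m as the genome-wide significance level, where m is the number of markers. Such a stringent threshold may lead to a high false negative rate: the detected loci often explain only a small fraction of phenotypic variation, and the genome-wide set of loci is not fully resolved. For example, GWAS of 41 traits in rice detected on average only 5 loci per trait, explaining about 22% of phenotypic variation [15,16]. To characterize the genetic constitution of a germplasm population, a relatively thorough detection of genome-wide QTL is therefore needed. In addition, the common GWAS methods above are all based on a single-locus model. When a quantitative trait is controlled by multiple loci, the estimated effect of each locus may be influenced by neighboring loci [17]; the most obvious consequence is inflation of the estimated phenotypic variance explained by a locus, and the power to detect small-effect loci may also be reduced. Because GWAS usually involves a huge number of markers, the main difficulty in extending a single-locus model to a multi-locus model is that the number of variables far exceeds the number of observations, so the model cannot be solved directly; this has limited the use of multi-locus models in GWAS. To address these limitations in plant applications, He et al. [18] combined multiple adjacent, tightly linked SNPs into SNPLDB markers with multiple alleles and detected genome-wide QTL under a multi-locus multi-allele model, proposing the restricted two-stage multi-locus genome-wide association analysis (RTM-GWAS). The method removes the restriction that a single SNP has only two alleles and is better suited to germplasm populations with widespread multiple alleles; by fitting multiple QTL simultaneously, the multi-locus model increases detection power and reduces false positives.

By thoroughly resolving the QTL of a germplasm population and their multiple alleles, RTM-GWAS establishes the genetic constitution of the population, which can then be used for gene discovery for quantitative traits, studies of population genetic differentiation, and genome-wide selection of optimal parental crosses. RTM-GWAS has already been applied to the genome-wide genetic dissection of quantitative traits in soybean. Zhang et al. [19] analyzed 100-seed weight in a Chinese soybean landrace population and detected 55 significantly associated SNPLDB loci explaining 98.5% of phenotypic variation; based on the estimated effects of 263 alleles at these 55 loci, they predicted optimal parental crosses for breeding large-seeded varieties within and between eco-regions. Meng et al. [20] analyzed isoflavone content in Chinese soybean landraces and detected 44 significantly associated SNPLDB loci explaining 72.2% of phenotypic variation, likewise predicting transgressive crosses for breeding high isoflavone content. In addition, Li et al. [21] found that RTM-GWAS is also applicable to nested association mapping (NAM) populations: analysis of a soybean NAM population comprising four recombinant inbred line families showed that RTM-GWAS performed better than previous methods. After these reports were published, many readers wrote to ask for more detail on the RTM-GWAS programs and their usage. This paper therefore describes the features of the method and the functions of the analytical programs, and illustrates their usage with soybean experimental data.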

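1 Features of the RTM-GWAS method

RTM-GWAS can be summarized in five key points: (1) SNPLDB markers with multiple alleles are constructed from high-density genome-wide SNP markers; (2) a genetic similarity coefficient matrix for population structure correction is calculated from the SNPLDB markers; (3) genome-wide QTL are detected under a two-stage multi-locus multi-allele model; (4) an ordinary significance level is used, with no multiple-testing correction required; (5) trait heritability serves as the upper limit of the phenotypic variance explained by the loci in the model.

1.1 Construction of SNPLDB markers

Genomic blocks are first defined with a block-partitioning method based on confidence intervals of linkage disequilibrium [21]. All SNPs within a block are then combined into one SNPLDB marker; the haplotypes formed by these SNPs are the multiple alleles of the marker, and the genotype of each individual is determined by the haplotypes it carries. To limit rare alleles for subsequent statistical analysis, each rare haplotype (frequency below 0.01) is replaced with the haplotype most similar to it. Similarity between two haplotypes is defined as the proportion of SNPs in the block that are identical by state. Under the chosen linkage disequilibrium criteria some blocks contain only a single SNP, and such a block is also treated as an independent SNPLDB marker. There are thus two types of SNPLDB markers, those containing multiple SNPs and those containing a single SNP; as SNP density increases, the number of single-SNP blocks decreases accordingly.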
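The rare-haplotype replacement described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the RTM-GWAS implementation: haplotypes are plain strings with one character per SNP, the function names are hypothetical, and ties between equally similar haplotypes are broken by frequency and then by string order, a rule the text does not specify.

```python
from collections import Counter

def ibs_similarity(h1, h2):
    """Proportion of SNP positions at which two haplotypes carry the same allele."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNPs")
    return sum(a == b for a, b in zip(h1, h2)) / len(h1)

def merge_rare_haplotypes(haplotypes, min_freq=0.01):
    """Replace every haplotype rarer than min_freq with its most similar common one."""
    counts = Counter(haplotypes)
    n = len(haplotypes)
    common = [h for h, cnt in counts.items() if cnt / n >= min_freq]
    if not common:
        return list(haplotypes)  # nothing to merge into
    mapping = {}
    for h in counts:
        if h in common:
            mapping[h] = h
        else:
            # Assumed tie-break: higher frequency first, then string order.
            mapping[h] = max(
                common, key=lambda c: (ibs_similarity(h, c), counts[c], c)
            )
    return [mapping[h] for h in haplotypes]
```

With a block of three SNPs in which "ACA" is rare, the rare haplotype is merged into "ACG" (two of three SNPs identical) rather than "TTA" (one of three).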

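1.2 Genetic similarity coefficient matrix based on SNPLDB markers

Marker-based kinship matrices previously used for population structure correction in GWAS apply only to SNP markers [22,23,24] and are unsuitable for SNPLDB markers with multiple alleles. RTM-GWAS therefore uses a genetic similarity coefficient matrix based on SNPLDBs as an overall estimate of population structure. The genetic similarity coefficient between two individuals is defined as the proportion of alleles that are identical by state across loci, that is,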

\[ s_{ij} = \sum_{k=1}^{m} c_{ijk} / (2m) \]

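where c_ijk is the number of alleles shared by individuals i and j at the k-th SNPLDB (0, 1 or 2), and m is the total number of SNPLDBs. The eigenvectors of this matrix can be used as covariates in a linear model to reduce the influence of population structure on the association analysis.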
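As a small illustration of the similarity coefficient s_ij, the Python sketch below counts shared alleles locus by locus and divides by 2m. The data layout (each genotype as a pair of allele labels) and the function names are assumptions made for this example; the eigen-decomposition step is left to any standard linear algebra routine and is not shown.

```python
def shared_alleles(g1, g2):
    """Number of alleles (0, 1 or 2) shared by two diploid genotypes at one locus."""
    remaining = list(g2)
    shared = 0
    for allele in g1:
        if allele in remaining:
            remaining.remove(allele)  # each allele of g2 can be matched only once
            shared += 1
    return shared

def similarity_matrix(genotypes):
    """genotypes[i][k] is the allele pair of individual i at SNPLDB k."""
    n = len(genotypes)
    m = len(genotypes[0])
    s = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            c = sum(
                shared_alleles(genotypes[i][k], genotypes[j][k]) for k in range(m)
            )
            s[i][j] = s[j][i] = c / (2 * m)
    return s
```

For example, two homozygous individuals that carry the same allele at one of two loci and different alleles at the other have a similarity of 2/4 = 0.5.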

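1.3 Two-stage multi-locus association analysis

GWAS usually involves tens of thousands to millions of markers, so fitting a multi-locus model directly leads to an excessively large model space and computational difficulty. In practice, most markers are unrelated to the target trait. To shrink the model space for multi-locus fitting, RTM-GWAS adopts a two-stage strategy. For simplicity, individuals are assumed to be homozygous. In the first stage, all SNPLDB markers are screened by association analysis under a single-locus model; the linear model accounting for multiple alleles is as follows.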

\[ y_{i} = \mu + \sum_{j=1}^{J} w_{ij}\alpha_{j} + \sum_{l=1}^{L} x_{il}\beta_{l} + \varepsilon_{i} \quad (1) \]

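where y_i is the phenotypic observation of individual i; μ is the population mean; w_ij is the coefficient of the j-th eigenvector of the genetic similarity coefficient matrix for individual i, α_j is the effect of the j-th eigenvector, and J is the number of eigenvectors used for population structure correction; x_il is the genotype indicator variable (0 or 1) of individual i for the l-th allele of the tested marker; β_l is the effect of the l-th allele; L is the number of alleles at the tested marker; and ε_i is the residual, assumed to be normally distributed.

In the second stage, model (1) is extended to a multi-locus model over the SNPLDB markers retained in the first stage, and QTL are detected under the following multi-locus multi-allele model.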

\[ y_{i} = \mu + \sum_{j=1}^{J} w_{ij}\alpha_{j} + \sum_{k=1}^{K} \sum_{l=1}^{L_{k}} x_{ikl}\beta_{kl} + \varepsilon_{i} \quad (2) \]

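where x_ikl is the genotype indicator variable (0 or 1) of individual i for the l-th allele at the k-th locus; β_kl is the effect of the l-th allele at the k-th locus; L_k is the number of alleles at the k-th locus; and K is the total number of QTL. Other symbols are as in model (1).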
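The indicator variables x_il and x_ikl can be illustrated with a short Python sketch that codes one multi-allelic locus for homozygous individuals. The function name and the allele labels are hypothetical. Because every row sums to 1, the indicator columns are collinear with the intercept μ, so a fitting routine needs a constraint (for example, dropping one column or forcing the allele effects to sum to zero); which constraint RTM-GWAS uses is not stated here.

```python
def indicator_columns(alleles):
    """Code one locus as 0/1 indicators, one column per allele.

    alleles: the allele label carried by each (homozygous) individual.
    Returns the sorted allele labels and one indicator row per individual.
    """
    labels = sorted(set(alleles))
    rows = [[1 if a == lab else 0 for lab in labels] for a in alleles]
    return labels, rows
```

In the multi-locus model (2), the blocks produced for each of the K loci would be placed side by side, giving L_1 + ... + L_K indicator columns in total.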

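Model (1) can be solved by regression analysis. We suggest a relatively lenient significance level in the first stage, for example no smaller than 0.05, for preliminary marker screening, so that true loci are not wrongly discarded. Model (2) can be solved by stepwise regression. Because the multi-locus model has built-in control of the experiment-wise error, we suggest a conventional significance level, for example 0.01 or 0.05, for QTL detection. Because QTL detection is based on a multi-locus model, the total genetic variation explained by the detected QTL should be less than the total genetic variation of the population; in other words, the phenotypic variance explained should not exceed the trait heritability.

2 RTM-GWAS analytical programs

We developed analytical programs implementing RTM-GWAS, which can be downloaded from the project website https://github.com/njau-sri/rtm-gwas/. The programs are written in C++ and run on mainstream operating systems including Microsoft Windows, Linux and Mac OS X. Using high-performance linear algebra libraries optimized for different processors, the programs are computationally efficient. They provide a user-friendly graphical user interface and a command-line interface for batch jobs (Fig. 1). The framework is shown in Fig. 2 and consists of three core modules: SNPLDB marker construction, genetic similarity coefficient matrix calculation, and association analysis. Each module can be run from the graphical or the command-line interface; detailed instructions are available at https://github.com/njau-sri/rtm-gwas/wiki. Results are stored as text files and can be viewed with any text editor.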

Fig. 1 Graphical user interface of the RTM-GWAS analytical program

Fig. 2 Framework of the RTM-GWAS analytical program
Names in italics are the binary programs that implement each function.




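2.1 Data file formats

Marker data use the widely adopted VCF file format (https://github.com/samtools/hts-specs), which accommodates various marker types and is among the most broadly supported marker data formats, facilitating joint analysis across software. Phenotype data are stored in a text file delimited by spaces or tabs. Fig. 3 shows a phenotype file with three traits (only the first seven accessions are shown): the first line holds the column names, and each subsequent line is one observation. The first column is the individual ID and the remaining columns are trait observations; the name of the first column is arbitrary, and the other column names are trait names. Observations must be numeric, and missing values can be written as "NaN", "?", "NA" or ".". For data from multi-environment randomized block designs, the file must contain columns indicating the environment and block factors, named "_ENV_" and "_BLK_" respectively.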
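To make the phenotype file layout concrete, here is a minimal Python reader for it. This is a sketch for checking a file before analysis, not part of RTM-GWAS; the function name and the sample values in the usage note are invented, and treating "_ENV_" and "_BLK_" as text labels rather than numbers is an assumption of this example.

```python
import math

MISSING = {"NaN", "?", "NA", "."}   # missing-value codes listed in the text
FACTORS = {"_ENV_", "_BLK_"}        # environment and block indicator columns

def read_phenotypes(text):
    """Parse a space- or tab-delimited phenotype table.

    Returns (ids, data), where data maps each column name after the first
    to a list of values: floats for traits, strings for factor columns.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    columns = header[1:]
    ids = []
    data = {c: [] for c in columns}
    for ln in lines[1:]:
        fields = ln.split()
        if len(fields) != len(header):
            raise ValueError(f"expected {len(header)} fields, got {len(fields)}: {ln!r}")
        ids.append(fields[0])
        for c, v in zip(columns, fields[1:]):
            if c in FACTORS:
                data[c].append(v)
            else:
                data[c].append(math.nan if v in MISSING else float(v))
    return ids, data
```

Calling `read_phenotypes("Indiv SW OC\nG001 18.2 20.1\nG002 NaN .\n")` would return the two IDs and, for each trait, one number followed by a missing value.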

Fig. 3 File format of phenotype data
Indiv is the name of the column containing individual/accession labels; SW, OC and PR are trait names; NaN represents a missing value.




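2.2 SNPLDB marker construction

Once the genome-wide SNP genotype file in VCF format is specified, the calculation can start. The program outputs a VCF genotype file of SNPLDB markers, the allele coding of each marker, and summary statistics of the genomic blocks. By default, the minimum minor haplotype frequency (Min. minor haplotype frequency) is 0.01 and the maximum block length (Max. length of blocks) is 200 kb; we suggest setting the latter to the linkage disequilibrium half-decay distance of the population. Three groups of core parameters govern SNPLDB construction: the LD confidence interval thresholds (Lower/Upper limit CI for strong LD), which place minimum requirements on both confidence bounds when defining strong LD (lower bound > 70, upper bound > 98); the upper confidence limit for strong recombination (Upper limit CI for strong recombination), < 90; and the minimum fraction of informative strong LD within a block (Min. fraction of informative strong LD), > 0.95 (Fig. 4).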

Fig. 4 Program dialog for SNPLDB construction
VCF: path of the genotype file in VCF format; Min.: minimum; Max.: maximum; CI: confidence interval.




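2.3 Genetic similarity coefficient calculation

Once the constructed SNPLDB marker file (VCF format) is specified, the calculation can start. The program outputs the genetic similarity coefficient matrix and its eigenvectors; the first 10 eigenvectors are output by default. The eigenvector file is then used as covariates for population structure correction in the association analysis (Fig. 5).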

Fig. 5 Program dialog for genetic similarity coefficient calculation
VCF: path of the genotype file in VCF format.




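2.4 Two-stage multi-locus association analysis

The association analysis dialog requires the SNPLDB genotype file (VCF format), the phenotype file of the quantitative trait, and the covariate file for population structure correction (eigenvectors of the SNPLDB genetic similarity coefficient matrix). The program outputs the SNPLDB loci associated with the trait, the analysis of variance of the multi-locus model, and the estimated allele effects of each locus. The default significance level for QTL detection (significance level) is 0.05; we suggest 0.01 or 0.05. The threshold for preliminary marker screening (first stage) defaults to 0.05 and generally should not be changed. To prevent overfitting, the upper limit of phenotypic variance explained by the model (Max. model r-square) defaults to 0.95; we suggest setting it to the estimated trait heritability (Fig. 6). The program also supports multiple testing correction (multiple testing correction) with the Bonferroni (BON) and FDR methods; loci detected after correction are usually also among the uncorrected results. In addition, the program can analyze raw phenotype data from multi-environment trials. By default it tests QTL-by-environment interaction effects, but when the genotype-by-environment interaction variance is relatively small, the QTL-by-environment interaction effects (genotype-environment interaction) can be excluded to reduce the complexity of the statistical model.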

Fig. 6 Program dialog for association analysis
VCF: path of the genotype file in VCF format; Max.: maximum; r-square: coefficient of determination.




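3 Application of RTM-GWAS to a soybean germplasm population

The following uses a genome-wide association analysis of plant height in a Chinese cultivated soybean germplasm population to illustrate the application of RTM-GWAS; further applications are described in the published literature [18,19,20,21].

3.1 Experimental data

The 723 cultivated soybean accessions came from the Chinese soybean germplasm population and were tested in field trials in 2013 and 2014 in a randomized block design with three replications; plant height was measured at maturity. Plant height ranged from 15 to 165 cm, with a mean of 62.78 cm and a coefficient of variation of 41.57%. Across the two years, the error coefficient of variation was 14.33%, heritability was 0.921, and the genotype-by-environment interaction heritability was 0.049. Genotypes were obtained by RAD-seq (restriction site-associated DNA sequencing) of the population [18]. Sequence reads were aligned to the soybean Williams 82 reference genome and SNPs were called. SNP quality control retained SNPs with missing and heterozygous rates of at most 20% and a minor allele frequency of at least 1%. Finally, missing SNP genotypes were imputed with fastPHASE [26], yielding 145 558 high-quality SNP markers covering the whole genome.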
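The SNP quality-control criteria above can be expressed as a small Python filter. This sketch rests on several assumptions that the text does not settle: the missing rate and the heterozygous rate are checked separately against 20%, the heterozygous rate is computed over called genotypes only, and SNPs are biallelic. The function name is hypothetical and this is not the pipeline used in the study.

```python
from collections import Counter

def snp_passes_qc(genotypes, max_missing=0.20, max_het=0.20, min_maf=0.01):
    """Return True if one SNP passes the missing-rate, heterozygosity and MAF filters.

    genotypes: one entry per individual, either a pair of alleles such as
    ("A", "G") or None for a missing call.
    """
    n = len(genotypes)
    called = [g for g in genotypes if g is not None]
    if n == 0 or not called:
        return False
    missing_rate = 1 - len(called) / n
    het_rate = sum(a != b for a, b in called) / len(called)
    allele_counts = Counter(a for g in called for a in g)
    if len(allele_counts) < 2:
        maf = 0.0  # monomorphic site
    else:
        maf = min(allele_counts.values()) / sum(allele_counts.values())
    return missing_rate <= max_missing and het_rate <= max_het and maf >= min_maf
```

A SNP called in 98 of 100 accessions with three heterozygotes and a minor allele frequency near 7% passes, whereas a monomorphic SNP or one with 30% missing calls does not.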

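3.2 Detection of genome-wide QTL for plant height with RTM-GWAS

First, SNPLDB markers were constructed from the 145 558 genome-wide SNPs with the RTM-GWAS program using default parameters. The program output a VCF genotype file of 36 952 SNPLDB markers, which was used in all subsequent analyses. Second, the genetic similarity coefficient matrix among individuals was calculated from the SNPLDB markers, and the 10 eigenvectors with the largest eigenvalues were extracted as covariates to control population structure. Finally, using the raw plant height observations, the SNPLDB marker data and the eigenvectors, the association analysis function was used to detect genome-wide QTL for plant height and to estimate the effects of QTL alleles. The significance level for QTL detection was set to 0.01 and the upper limit of model variance explained to 0.921; other parameters were left at their defaults. The association program output five result files: associated marker names (assoc.out.loc), Type I analysis of variance (assoc.out.aov1), Type III analysis of variance (assoc.out.aov3), allele effect estimates (assoc.out.est), and marker test P-values (assoc.out.ps).

From the file of association test P-values for all markers, plotting software such as R (http://www.r-project.org/) can be used to draw a Q-Q plot (Fig. 7) and a Manhattan plot (Fig. 8). RTM-GWAS detected 114 SNPLDB loci associated with plant height. According to the analysis of variance output, the main effect was not significant for 10 of these loci, and the locus-by-environment interaction was not significant for 63. The 104 loci with significant main effects explained 78.103% of phenotypic variation in total, and the 51 loci with significant interaction effects explained 10.312%. Twenty-one loci each had a main-effect phenotypic variance explained above 1%; these are summarized in Table 1, and results for all loci are given in Supplementary table 1. The RTM-GWAS program does not output locus contributions directly, but they can be calculated from the sum-of-squares partition in the output as the ratio of the locus sum of squares to the total sum of squares.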
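The two derived quantities mentioned above, the locus contribution and the -lg P value plotted in a Manhattan plot, are simple to compute once the sums of squares and P-values have been read from the output files. The Python helpers below are a sketch with hypothetical names; how the values are extracted from the result files is not shown, because the file layouts are not described here.

```python
import math

def locus_r2_percent(ss_locus, ss_total):
    """Phenotypic variance explained by a locus, as a percentage of the total SS."""
    if ss_total <= 0:
        raise ValueError("total sum of squares must be positive")
    return 100.0 * ss_locus / ss_total

def neg_log10_p(p):
    """-lg P, the quantity on the vertical axis of a Manhattan plot."""
    if not 0 < p <= 1:
        raise ValueError("P-value must lie in (0, 1]")
    return -math.log10(p)
```

Summing `locus_r2_percent` over the loci with significant main effects would reproduce a total such as the 78.103% reported above, provided the same total sum of squares is used for every locus.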

Fig. 7 Quantile-quantile plot of the genome-wide association study of soybean plant height



Table 1 Large-effect SNPLDBs significantly associated with soybean plant height
SNPLDB            Chr  Position             Model -lg P (a)  QTL -lg P  QTL R2 (%)  QTL×Env. -lg P (b)  QTL×Env. R2 (%)
LDB_19_44964630   19   44964630-45029584    58.26            206.01     7.361       52.86               1.693
LDB_6_44183574    6    44183574-44281248    46.46            199.50     7.359       11.64               0.495
LDB_16_8004288    16   8004288-8203845      42.01            171.86     6.140       6.15                0.280
LDB_13_11539212   13   11539212-11625990    28.64            120.36     4.091       0.66                0.053
LDB_17_36474880   17   36474880-36494652    22.81            96.33      3.229       0.80                0.059
LDB_4_37782684    4    37782684-37923093    22.06            83.36      2.809       3.39                0.170
LDB_8_7075139     8    7075139-7077091      26.19            81.67      2.616       16.09               0.505
LDB_3_26698545    3    26698545-26898267    17.24            68.85      2.423       4.56                0.261
LDB_15_16773982   15   16773982-16774010    22.09            72.65      2.273       5.22                0.154
LDB_16_28838874   16   28838874-28868118    16.44            56.11      1.887       1.01                0.077
LDB_4_11093449    4    11093449-11192120    13.61            51.18      1.800       3.13                0.195
LDB_2_5863888     2    5863888-5979031      14.88            47.63      1.493       1.06                0.042
LDB_16_7494681    16   7494681              16.21            48.32      1.440       1.04                0.018
LDB_1_50277902    1    50277902             17.34            47.47      1.413       8.98                0.239
LDB_14_2467475    14   2467475              15.57            45.66      1.356       0.88                0.014
LDB_4_29936477    4    29936477-29950803    14.74            38.98      1.217       5.63                0.186
LDB_3_22147965    3    22147965-22342699    11.98            32.08      1.209       11.34               0.516
LDB_7_20253563    7    20253563-20451607    13.83            36.68      1.174       2.68                0.108
LDB_14_46095634   14   46095634-46106570    12.67            35.50      1.076       0.28                0.008
LDB_3_8197776     3    8197776-8202466      11.51            32.01      1.026       1.05                0.052
LDB_6_22108685    6    22108685-22191360    10.97            28.35      1.004       5.03                0.241
A locus with phenotypic variance explained greater than 1% was considered a large-effect locus. R2: phenotypic variance explained; a: significance test performed in the QTL detection model; b: QTL-by-environment interaction effect.

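Soybean germplasm from across China, from the northeast to the southwest, showed variation at 114 loci for plant height when grown in Nanjing, with fluctuation between years. The 114 associated SNPLDB loci carried 442 alleles in total, of which the 104 loci with significant main effects carried 417. Because the genetic basis of plant height in this study was dominated by main-effect loci, the subsequent analysis uses the main-effect loci for simplicity; for analyses targeting a specific environment, the main effect of a locus can be added to the corresponding environment interaction effect before analysis. From the effect estimates at the 104 main-effect loci, allele effects ranged from -43.55 to +38.26. Combined with the SNPLDB genotypes of the population, the allele effects can be arranged into a locus × accession QTL-allele matrix (104 loci carrying 417 alleles) that represents the genetic constitution of plant height in the population (Fig. 9), and the matrix can be visualized with plotting software (Fig. 10).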
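Assembling a locus × accession QTL-allele matrix from allele effect estimates and genotypes can be sketched as follows. The dictionary-based data layout, the function names and the numbers in the usage note are assumptions made for this illustration; they do not reflect the format of the RTM-GWAS output files.

```python
def qtl_allele_matrix(loci, accessions, genotypes, effects):
    """Rows are loci, columns are accessions; each cell is the effect of the allele carried.

    genotypes[accession][locus] is the allele label carried by a homozygous accession;
    effects[locus][allele] is the estimated effect of that allele.
    """
    return [
        [effects[locus][genotypes[acc][locus]] for acc in accessions]
        for locus in loci
    ]

def genotypic_values(matrix):
    """Column sums of the matrix: the summed QTL effects of each accession."""
    return [sum(column) for column in zip(*matrix)]
```

With two loci and three accessions, for instance, the result is a 2 × 3 matrix, and its column sums give the QTL-based genotypic value of each accession.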

Fig. 8 Manhattan plot of the genome-wide association study of soybean plant height

Fig. 9 Main-effect QTL-allele matrix data file for soybean plant height
Rows represent the 104 loci with significant main effects and columns represent the 723 accessions; the data form a 104×723 matrix of allele effects.

Fig. 10 Graphical representation of the QTL-allele matrix of soybean plant height



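3.3 Optimal cross design based on the QTL-allele matrix

For all 261 003 possible single crosses among the 723 accessions, 2000 homozygous progeny genotypes per cross were simulated by continuous selfing from the F1. The sum of QTL genotypic values, calculated from the plant height QTL-allele matrix of 104 loci, was taken as the predicted genotypic value of each progeny. The predicted phenotype of the cross between parents i and j is y_ij = g_ij + (y_i - g_i + y_j - g_j)/2, where g_ij is the predicted genotypic value of the progeny, y_i and y_j are the observed phenotypes of the two parents, and g_i and g_j are their predicted genotypic values. All simulations were run with the program Cross (https://github.com/njau-sri/cross), which outputs descriptive statistics of plant height for the homozygous progeny population of every single cross (Fig. 11). The percentiles used to screen optimal crosses can be set in the program as needed; by default the program outputs the 1st (minimum), 25th (Q1), 50th (median), 75th (Q3) and 100th (maximum) percentiles, and here the 10th, 50th and 90th percentiles were used as selection criteria. A scatter plot of these three percentiles, drawn with separate plotting software, shows transgressive crosses at both the 10th and the 90th percentile (Fig. 12). Taking breeding for tall soybean as an example, screening by the 90th percentile identified 101 crosses with predicted plant height above 165 cm (Supplementary table 2), of which 8 exceeded 180 cm (Table 2); the highest prediction was 183 cm, 10.9% (18 cm) above the tallest parent at 165 cm.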
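The cross-prediction arithmetic can be illustrated with a simplified Python sketch. It is not the Cross program: loci are treated as unlinked, so each homozygous line receives the allele of either parent at each locus with probability 1/2, whereas a simulation of selfing from the F1 would account for linkage. The function names and the linear-interpolation percentile rule are also assumptions of this example.

```python
import random

def n_single_crosses(n_parents):
    """Number of distinct single crosses among n parents (no selfs, no reciprocals)."""
    return n_parents * (n_parents - 1) // 2

def simulate_inbred_values(effects_p1, effects_p2, n_lines, rng):
    """Genotypic values of homozygous lines from a cross of two homozygous parents.

    effects_p1[k] and effects_p2[k] are the effects of the alleles the parents
    carry at locus k. Loci are assumed to be unlinked in this sketch.
    """
    return [
        sum(rng.choice(pair) for pair in zip(effects_p1, effects_p2))
        for _ in range(n_lines)
    ]

def percentile(values, q):
    """Percentile by linear interpolation between order statistics, q in [0, 100]."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def predict_phenotype(g_ij, y_i, g_i, y_j, g_j):
    """y_ij = g_ij + (y_i - g_i + y_j - g_j) / 2, as defined in the text."""
    return g_ij + (y_i - g_i + y_j - g_j) / 2
```

For 723 parents, `n_single_crosses(723)` gives the 261 003 crosses mentioned above. A cross could then be ranked by applying `predict_phenotype` to the 90th percentile of its simulated genotypic values.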

Fig. 11 Prediction result file of plant height for all possible single crosses
P1 and P2 are the labels of the two parental accessions; MEAN and SD are the mean and standard deviation of the homozygous progeny population; P10, P50 and P90 are the 10th, 50th and 90th percentiles of the homozygous progeny population.

Fig. 12 Graphical representation of the prediction results of plant height for all possible single crosses
Dotted lines indicate the range (15-165 cm) of plant height in the parental population.




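4 Discussion

4.1 Power of the RTM-GWAS method

Unlike previous GWAS methods, which focus on detecting a few major QTL, RTM-GWAS can resolve the QTL system of quantitative traits in plant germplasm and breeding populations relatively thoroughly. First, previous GWAS relied on SNP markers with only two alleles and could not detect multiple alleles at a locus. When a locus carries multiple alleles, a single-SNP test captures only part of its genetic variation, so statistical power is expected to be lower. RTM-GWAS constructs SNPLDB markers with multiple alleles to match multi-allelic loci, which in theory makes GWAS better suited to germplasm populations with widespread multiple alleles. Second, previous single-locus GWAS methods ignore the influence of other loci and may therefore yield a high false positive rate. Because GWAS is usually based on high-density genome-wide markers, however, fitting a multi-locus model directly is computationally prohibitive. By combining a two-stage strategy with a multi-locus model, RTM-GWAS can fit multiple loci with unequal numbers of alleles simultaneously, which increases detection power, while greatly reducing the computational load, so that it can be applied to large-scale GWAS data.

Phenotyping experiments in plants are usually replicated across multiple environments. Previous mainstream GWAS methods generally do not support joint analysis of multi-environment phenotype data and instead use adjusted genotype means across environments as the phenotype; they can neither analyze the interaction of main-effect QTL with environments nor detect QTL that have interaction effects but no main effect. The RTM-GWAS programs support the analysis of raw phenotype data from multi-environment randomized block designs and can fit QTL-by-environment interaction effects for QTL with and without main effects, giving more complete results. In addition, simulations indicate that RTM-GWAS requires a sufficiently large sample (for example, > 400) and a high trait heritability (for example, > 0.8), so phenotyping needs a sound experimental design and precise field operations [18].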

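4.2 Prospects for application of the RTM-GWAS method

RTM-GWAS is applicable not only to germplasm populations but also to multi-parent NAM populations [21]. By constructing SNPLDB markers carrying parental haplotypes, RTM-GWAS treats the different RIL families within a NAM population as one natural whole in which each marker locus has different allele types, rather than treating the RIL families as mutually independent populations as in previous analyses [27]. Because of the genetic design of a NAM population, its population structure is known, so RTM-GWAS can control population structure well and thereby achieve high detection power and a low false discovery rate. Pan Liyuan et al. (unpublished) also applied RTM-GWAS to QTL mapping in a soybean RIL population; the comparison showed that RTM-GWAS not only recovered the results of composite interval mapping but also detected more previously reported minor QTL. Because RTM-GWAS tests marker loci and cannot test arbitrary positions within marker intervals, genome-wide SNP genotyping is required for NAM and RIL populations before genome-wide QTL detection is possible.

The high detection power of RTM-GWAS allows its results to reflect the genetic constitution of a quantitative trait in the population comprehensively. This makes it possible to predict the potential of parental crosses and to design optimal crosses at the level of genome-wide QTL, and thus to select on QTL directly before actual breeding. Based on the QTL-allele matrix obtained with RTM-GWAS, molecular markers can be designed for selection among the progeny of biparental crosses, improving selection efficiency and shortening the breeding cycle. Cross prediction and progeny selection based on genome-wide QTL select on QTL directly, unlike conventional genomic selection (GS), which selects on genome-wide markers [28]. Conventional GS requires genome-wide genotyping of the generations under selection, which is currently very costly for plant breeding. The genetic relationship between the GS training and selection populations and the way the prediction model is built directly affect selection efficiency. GS is mainly applied to selection among the progeny of crosses; applying it directly to optimal cross design would require simulating genome-wide markers for the progeny of each cross, leading to an unacceptable computational load.

In addition, information on genome-wide QTL and their allelic constitution can be used to characterize the genetic features of populations at the QTL level and to study population differentiation and evolutionary relationships among populations. For example, Meng et al. [20] analyzed allele frequencies across eco-regions for the 44 isoflavone content QTL detected by RTM-GWAS and found that 84.1% (37) of the loci differed significantly in allele frequency among eco-regions, whereas only 50.6% (14 735) of the 29 119 genome-wide SNPLDB markers did so, further indicating that the genetic constitution of isoflavone content has differentiated among eco-regions.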

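5 Conclusions

The RTM-GWAS method proposed in this study groups multiple adjacent, tightly linked SNPs to construct SNPLDB markers with multiple alleles, and then uses a two-stage strategy to detect genome-wide QTL and their multiple alleles under a multi-locus model. Compared with previous GWAS methods, RTM-GWAS detects QTL and their multiple alleles more fully and controls false positives effectively. The resulting QTL-allele matrix represents the full genetic constitution of the studied trait in the experimental population; it can be used to design the genetic makeup of optimal genotypes and to predict optimal crosses, and also to study population genetics and the specific and newly emerged alleles.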

Supplementary table 1 SNPLDBs significantly associated with soybean plant height
SNPLDB            Chr  Position             Model -lg P (a)  QTL -lg P  QTL R2 (%)  QTL×Env. -lg P (b)  QTL×Env. R2 (%)
LDB_19_44964630   19   44964630-45029584    58.26            206.01     7.361       52.86               1.693
LDB_6_44183574    6    44183574-44281248    46.46            199.50     7.359       11.64               0.495
LDB_16_8004288    16   8004288-8203845      42.01            171.86     6.140       6.15                0.280
LDB_13_11539212   13   11539212-11625990    28.64            120.36     4.091       0.66                0.053
LDB_17_36474880   17   36474880-36494652    22.81            96.33      3.229       0.80                0.059
LDB_4_37782684    4    37782684-37923093    22.06            83.36      2.809       3.39                0.170
LDB_8_7075139     8    7075139-7077091      26.19            81.67      2.616       16.09               0.505
LDB_3_26698545    3    26698545-26898267    17.24            68.85      2.423       4.56                0.261
LDB_15_16773982   15   16773982-16774010    22.09            72.65      2.273       5.22                0.154
LDB_16_28838874   16   28838874-28868118    16.44            56.11      1.887       1.01                0.077
LDB_4_11093449    4    11093449-11192120    13.61            51.18      1.800       3.13                0.195
LDB_2_5863888     2    5863888-5979031      14.88            47.63      1.493       1.06                0.042
LDB_16_7494681    16   7494681              16.21            48.32      1.440       1.04                0.018
LDB_1_50277902    1    50277902             17.34            47.47      1.413       8.98                0.239
LDB_14_2467475    14   2467475              15.57            45.66      1.356       0.88                0.014
LDB_4_29936477    4    29936477-29950803    14.74            38.98      1.217       5.63                0.186
LDB_3_22147965    3    22147965-22342699    11.98            32.08      1.209       11.34               0.516
LDB_7_20253563    7    20253563-20451607    13.83            36.68      1.174       2.68                0.108
LDB_14_46095634   14   46095634-46106570    12.67            35.50      1.076       0.28                0.008
LDB_3_8197776     3    8197776-8202466      11.51            32.01      1.026       1.05                0.052
LDB_6_22108685    6    22108685-22191360    10.97            28.35      1.004       5.03                0.241
LDB_6_1271502     6    1271502-1275159      12.04            32.96      0.997       0.61                0.018
LDB_12_34374105   12   34374105             12.35            32.82      0.956       0.40                0.005
LDB_5_27543725    5    27543725-27547377    11.25            27.31      0.851       1.50                0.056
LDB_5_4082209     5    4082209-4121721      10.60            26.74      0.834       2.20                0.079
LDB_4_16641482    4    16641482-16839772    8.72             22.21      0.786       2.84                0.150
LDB_19_39240844   19   39240844-39276967    11.01            23.49      0.759       5.16                0.188
LDB_5_15648588    5    15648588             10.85            25.48      0.731       0.14                0.001
LDB_4_12429588    4    12429588-12605113    9.93             20.00      0.715       6.46                0.276
LDB_9_45791340    9    45791340             10.33            22.42      0.638       0.56                0.008
LDB_3_36428638    3    36428638-36469336    8.56             19.15      0.624       1.44                0.066
LDB_16_2807415    16   2807415-2827492      9.48             17.29      0.588       6.02                0.231
LDB_11_35694247   11   35694247-35698584    9.13             18.50      0.553       0.30                0.009
LDB_7_35303124    7    35303124-35327231    7.34             15.83      0.542       1.43                0.076
LDB_15_12007768   15   12007768-12011501    7.68             16.51      0.492       0.18                0.005
LDB_7_32985427    7    32985427-33184712    6.68             13.54      0.489       0.74                0.057
LDB_2_3911388     2    3911388-3928107      7.14             14.07      0.466       1.34                0.062
LDB_9_20673268    9    20673268-20692609    6.87             13.82      0.436       0.33                0.016
LDB_16_1151749    16   1151749              9.34             15.53      0.432       4.63                0.115
LDB_19_44669655   19   44669655-44754287    8.40             13.62      0.430       7.17                0.233
LDB_5_907001      5    907001-907042        7.84             15.21      0.423       0.13                0.001
LDB_9_8062600     9    8062600-8116659      6.60             11.79      0.374       1.12                0.044
LDB_8_19152075    8    19152075-19152100    7.48             13.34      0.367       0.92                0.015
LDB_10_5777915    10   5777915-5798626      7.46             10.73      0.362       5.67                0.204
LDB_19_44550587   19   44550587-44558289    7.30             11.07      0.352       3.42                0.117
LDB_20_34089188   20   34089188             7.57             12.42      0.340       1.11                0.020
LDB_5_2803587     5    2803587              6.90             12.30      0.336       0.35                0.004
LDB_8_16373446    8    16373446-16420876    4.96             10.50      0.334       0.32                0.016
LDB_18_57452705   18   57452705-57457239    6.26             10.95      0.325       0.33                0.010
LDB_7_25096830    7    25096830             6.89             11.75      0.320       0.36                0.004
LDB_9_38183930    9    38183930-38268800    5.66             7.74       0.319       4.07                0.194
LDB_12_9936416    12   9936416-9994538      4.45             8.40       0.307       0.79                0.050
LDB_3_33432777    3    33432777-33462071    6.40             9.52       0.305       1.19                0.046
LDB_5_1382138     5    1382138              6.66             10.46      0.283       0.70                0.010
LDB_6_33023321    6    33023321             6.03             10.09      0.272       0.39                0.004
LDB_11_35162270   11   35162270             7.31             10.07      0.271       5.10                0.128
LDB_18_27478142   18   27478142             6.16             10.06      0.271       0.46                0.006
LDB_17_33708118   17   33708118-33815088    7.03             7.77       0.251       6.96                0.226
LDB_5_38350041    5    38350041-38432356    6.76             7.19       0.233       5.51                0.182
LDB_11_6370581    11   6370581-6486015      5.64             6.03       0.231       3.87                0.161
LDB_16_31959033   16   31959033-31999963    4.27             7.11       0.231       0.46                0.021
LDB_18_44942878   18   44942878-45064511    3.28             5.38       0.225       0.48                0.044
LDB_2_12669763    2    12669763-12684207    4.48             5.69       0.204       1.20                0.057
LDB_6_41523578    6    41523578             5.41             7.71       0.203       0.81                0.013
LDB_17_15782164   17   15782164-15845977    3.22             3.92       0.189       0.78                0.066
LDB_4_42809656    4    42809656-42809670    5.27             5.79       0.171       2.76                0.081
LDB_9_25244209    9    25244209             4.20             6.56       0.170       0.20                0.002
LDB_3_43245339    3    43245339             4.91             6.56       0.170       0.25                0.002
LDB_15_9202681    15   9202681              5.40             6.24       0.160       3.37                0.079
LDB_20_3498125    20   3498125-3691528      4.33             3.82       0.159       2.93                0.129
LDB_17_11367127   17   11367127-11382009    3.93             4.76       0.159       1.85                0.068
LDB_17_32520275   17   32520275-32552871    3.03             2.75       0.147       1.72                0.107
LDB_16_32918946   16   32918946             4.25             5.59       0.142       1.19                0.022
LDB_12_3456148    12   3456148              4.19             5.44       0.137       1.63                0.033
LDB_8_6056566     8    6056566              5.03             5.37       0.136       1.85                0.038
LDB_4_9550687     4    9550687-9550998      5.03             4.57       0.135       3.77                0.111
LDB_5_2285265     5    2285265-2285296      3.31             4.50       0.133       1.00                0.030
LDB_15_27523398   15   27523398-27723343    3.84             3.02       0.132       3.47                0.147
LDB_13_4599736    13   4599736              3.30             4.78       0.119       0.10                0.000
LDB_9_10672160    9    10672160-10831207    5.87             4.01       0.118       6.77                0.200
LDB_1_50652928    1    50652928-50669137    3.70             2.97       0.117       2.84                0.113
LDB_18_52669179   18   52669179             3.92             4.71       0.117       1.55                0.031
LDB_6_18845190    6    18845190             3.30             4.49       0.111       1.13                0.020
LDB_13_17681170   13   17681170             3.51             4.26       0.104       0.25                0.002
LDB_4_15008106    4    15008106-15030261    2.72             2.85       0.099       1.58                0.059
LDB_20_39556580   20   39556580             2.34             3.88       0.094       0.61                0.009
LDB_16_20099141   16   20099141             2.27             3.87       0.093       0.18                0.001
LDB_14_20541667   14   20541667             3.05             3.80       0.091       0.56                0.008
LDB_15_3242880    15   3242880              8.41             3.74       0.090       13.78               0.380
LDB_1_52750312    1    52750312             2.93             3.72       0.089       0.25                0.002
LDB_16_21247996   16   21247996             3.85             3.59       0.085       1.90                0.040
LDB_3_28458666    3    28458666             2.67             3.26       0.076       0.89                0.015
LDB_18_3933946    18   3933946              3.53             3.14       0.073       1.65                0.033
LDB_7_15901391    7    15901391-15903281    4.33             2.93       0.067       4.15                0.101
LDB_19_2090261    19   2090261-2090611      4.43             2.92       0.067       4.40                0.108
LDB_7_27922333    7    27922333-28121779    3.45             1.47       0.067       3.33                0.129
LDB_5_41679020    5    41679020             4.15             2.90       0.067       3.78                0.091
LDB_18_57497434   18   57497434-57500329    3.29             2.79       0.064       1.07                0.019
LDB_4_571693      4    571693               2.40             2.63       0.059       1.66                0.034
LDB_6_46436793    6    46436793-46436833    2.30             2.56       0.057       1.05                0.018
LDB_20_32314703   20   32314703             3.45             2.40       0.053       2.98                0.069
LDB_8_42752500    8    42752500             4.23             2.40       0.053       3.74                0.090
LDB_4_43757706    4    43757706             2.86             2.40       0.053       2.38                0.052
LDB_11_31994374   11   31994374-31994436    2.28             2.31       0.051       0.76                0.012
LDB_4_1643625     4    1643625-1744093      2.39             1.65       0.049       3.06                0.090
LDB_16_8204099    16   8204099              2.68             2.08       0.044       1.63                0.033
LDB_13_33478164   13   33478164             2.85             1.79       0.037       3.21                0.075
LDB_14_48799491   14   48799491-48799496    4.12             1.20       0.035       4.55                0.134
LDB_7_16630442    7    16630442             3.58             1.55       0.031       3.76                0.090
LDB_3_5207322     3    5207322              4.23             1.45       0.028       4.94                0.123
LDB_7_23869970    7    23869970             3.28             1.26       0.024       3.41                0.081
LDB_1_3351751     1    3351751              3.42             0.89       0.015       3.41                0.081
LDB_7_30278846    7    30278846             4.23             0.60       0.008       6.12                0.157
LDB_1_24895878    1    24895878             2.40             0.36       0.004       3.14                0.073
Total                                       114              104 (c)    78.103      51 (d)              10.312
a: significance test performed in the QTL detection model; b: QTL-by-environment interaction effect; c: number of QTL with significant main effect; d: number of QTL with significant QTL-by-environment interaction effect.

Supplementary table 2 Optimal cross design for tall soybean breeding
Parent (P1, P2, Y1, Y2) and Cross (Mean, SD, P10, P50, P90)
P1     P2     Y1     Y2     Mean   SD    P10    P50    P90
4L060  4L311  136.3  132.3  135.4  36.1  87.5   135.5  183.3
4L119  4L361  125.0  138.6  132.3  37.2  81.6   131.8  182.2
4L213  4L361  127.6  138.6  133.5  35.9  84.4   134.4  181.3
4L060  4L119  136.3  125.0  130.6  37.4  80.5   130.3  180.8
4L054  4L060  133.5  136.3  134.6  33.4  91.3   133.7  180.8
4L311  4L361  132.3  138.6  134.3  35.2  87.7   134.4  180.7
4L054  4L361  133.5  138.6  136.3  33.6  92.8   135.3  180.7
4L060  4L371  136.3  137.2  138.0  32.5  93.9   138.8  180.5
4L361  4L371  138.6  137.2  136.9  32.4  93.6   137.8  179.4
4L060  4L213  136.3  127.6  131.7  36.0  83.4   132.1  179.3
4L159  4L361  143.6  138.6  141.2  28.9  103.5  141.6  179.1
4L060  4L297  136.3  131.0  133.9  33.4  90.5   132.6  178.9
4L060  4L159  136.3  143.6  140.3  29.5  101.2  140.2  178.9
4L361  4L367  138.6  136.5  138.2  29.8  99.7   137.6  178.6
4L234  4L361  132.0  138.6  134.8  31.6  93.8   134.1  177.8
4L297  4L361  131.0  138.6  134.3  33.0  91.6   134.1  177.5
4L054  4L114  133.5  128.3  131.9  33.7  85.5   131.6  177.0
4L274  4L361  131.5  138.6  135.4  31.5  92.5   135.8  176.4
4L114  4L371  128.3  137.2  133.7  32.3  90.0   133.8  176.3
4L114  4L311  128.3  132.3  129.4  35.7  81.8   128.1  176.0
4L114  4L213  128.3  127.6  128.2  35.9  79.9   127.9  175.5
4L060  4L367  136.3  136.5  136.2  30.8  94.1   136.8  175.4
4L114  4L159  128.3  143.6  136.1  29.8  97.8   135.9  175.4
4L060  4L274  136.3  131.5  133.6  31.4  91.7   134.1  175.3
4L060  4L148  136.3  124.2  130.7  33.6  87.2   130.2  175.1
4L114  4L119  128.3  125.0  125.6  37.7  74.1   125.6  175.0
4L248  4L361  123.4  138.6  131.0  33.4  86.9   131.0  174.6
4L114  4L297  128.3  131.0  129.0  35.0  83.7   128.9  174.6
4L193  4L361  122.8  138.6  131.5  33.1  88.3   130.9  173.7
4L060  4L234  136.3  132.0  132.7  31.7  90.2   133.4  173.6
4L114  4L234  128.3  132.0  131.0  32.2  89.0   131.0  173.3
4L114  4L367  128.3  136.5  133.1  30.6  92.8   133.0  173.3
4L027  4L361  118.0  138.6  128.0  33.2  84.9   127.6  173.2
4L060  4L193  136.3  122.8  129.3  32.7  85.5   129.4  172.8
4L060  4L248  136.3  123.4  128.6  33.2  84.6   128.8  172.5
4L260  4L361  107.0  138.6  124.3  36.2  77.5   123.5  172.3
4L060  4L302  136.3  115.0  125.1  33.3  80.6   124.1  172.1
4L302  4L361  115.0  138.6  126.3  34.4  82.0   125.7  172.0
4L060  4L111  136.3  115.0  126.7  33.8  81.9   126.2  171.9
4L315  4L361  120.4  138.6  130.4  31.9  88.3   130.9  171.8
4L148  4L361  124.2  138.6  130.8  31.6  89.7   130.6  171.6
4L111  4L361  115.0  138.6  126.5  33.8  81.7   126.8  171.5
4L146  4L361  112.8  138.6  126.5  33.2  83.2   125.4  171.5
4L049  4L361  110.6  138.6  124.6  35.7  79.0   125.1  171.3
4B181  4L361  118.2  138.6  128.5  32.0  86.4   128.9  171.3
4L283  4L361  106.2  138.6  120.8  37.4  72.2   120.4  171.2
4L060  4L315  136.3  120.4  128.8  31.8  85.9   129.0  171.2
4L114  4L274  128.3  131.5  130.6  31.1  89.6   131.5  171.2
4L284  4L361  114.2  138.6  126.8  33.7  82.1   127.0  170.9
4L114  4L193  128.3  122.8  125.6  33.7  81.4   124.3  170.6
4L027  4L060  118.0  136.3  126.1  33.7  82.1   125.4  170.6
4B181  4L060  118.2  136.3  128.0  32.2  84.9   128.2  170.6
4L060  4L284  136.3  114.2  126.0  34.5  80.1   126.7  170.5
4L060  4L112  136.3  119.2  127.2  32.0  85.0   126.9  170.5
4L112  4L361  119.2  138.6  128.5  31.2  87.3   128.0  170.4
4L242  4L361  117.6  138.6  128.4  31.5  85.9   128.3  170.0
4L191  4L361  92.4   138.6  114.9  40.7  60.0   114.9  169.8
4L124  4L361  106.0  138.6  122.9  35.0  77.1   122.5  169.7
4L114  4L248  128.3  123.4  126.4  33.4  82.9   125.3  169.6
4L145  4L361  119.6  138.6  127.9  30.9  86.3   127.6  169.5
4L049  4L060  110.6  136.3  123.3  35.8  76.8   123.6  169.5
4L060  4L145  136.3  119.6  128.5  31.2  86.5   128.2  169.4
4L001  4L361  120.8  138.6  129.7  30.1  90.4   128.9  169.4
4L201  4L361  106.2  138.6  122.8  35.1  76.0   121.8  169.2
4L060  4L260  136.3  107.0  121.8  36.2  74.3   121.4  169.1
4L154  4L361  118.3  138.6  128.4  31.6  86.4   128.3  168.9
4L352  4L361  112.0  138.6  125.9  31.7  85.1   125.4  168.8
4L022  4L159  71.5   143.6  108.7  45.7  48.7   109.2  168.8
4L224  4L361  109.6  138.6  123.7  33.5  79.6   124.1  168.8
4L186  4L361  113.6  138.6  126.7  31.2  85.4   125.7  168.7
4L254  4L361  110.2  138.6  124.4  33.6  78.5   124.2  168.6
4L060  4L154  136.3  118.3  127.4  31.1  87.1   127.2  168.5
4L276  4L361  102.3  138.6  120.1  35.8  72.4   119.9  168.5
4L001  4L060  120.8  136.3  129.2  30.0  90.0   128.9  168.4
4L060  4L242  136.3  117.6  125.6  31.9  83.3   124.4  168.3
4B181  4L114  118.2  128.3  124.3  32.8  81.6   124.5  168.3
4L022  4L060  71.5   136.3  104.6  46.7  43.0   103.2  168.1
4L060  4L254  136.3  110.2  122.3  34.1  76.5   122.7  168.0
4L114  4L148  128.3  124.2  125.6  32.6  83.0   125.6  168.0
4L060  4L146  136.3  112.8  124.0  32.6  80.4   123.9  167.8
4L296  4L361  121.5  138.6  129.3  29.2  91.3   129.3  167.8
4L060  4L191  136.3  92.4   114.7  40.0  61.9   115.2  167.8
4L060  4L283  136.3  106.2  120.1  35.7  73.3   120.7  167.7
4L114  4L284  128.3  114.2  122.2  33.7  77.5   121.6  167.5
4L060  4L352  136.3  112.0  124.3  32.7  80.7   123.9  167.5
4L060  4L201  136.3  106.2  120.4  34.6  75.5   120.1  167.2
4L042  4L361  115.5  138.6  127.1  30.5  85.9   127.3  167.1
4L060  4L276  136.3  102.3  119.8  35.8  72.2   120.2  167.0
4L114  4L315  128.3  120.4  125.2  31.7  82.0   125.9  167.0
4L027  4L114  118.0  128.3  123.4  33.0  80.7   123.9  166.8
4L114  4L260  128.3  107.0  117.4  37.3  69.0   116.9  166.7
4L060  4L186  136.3  113.6  124.2  31.4  83.9   123.2  166.6
4L049  4L114  110.6  128.3  120.2  35.2  74.5   120.4  166.6
4L361  4L369  138.6  112.4  125.4  30.9  84.6   125.0  166.4
4L022  4L054  71.5   133.5  103.9  47.9  40.6   104.9  166.4
4L042  4L060  115.5  136.3  126.3  30.1  85.9   126.1  166.2
4L117  4L361  117.0  138.6  128.3  28.5  91.6   127.2  166.2
4L111  4L114  115.0  128.3  122.1  34.2  76.7   122.1  166.2
4L060  4L124  136.3  106.0  120.4  34.5  75.6   120.6  166.2
4L360  4L361  111.0  138.6  124.3  31.3  82.8   125.6  166.1
4L107  4L361  119.4  138.6  128.8  28.6  90.0   129.7  166.0
P1 and P2 are the labels of the two parents of a single cross; Y1 and Y2 are the observed mean plant heights of P1 and P2; P10, P50 and P90 are the 10th, 50th and 90th percentiles of the homozygous progeny population.

The authors have declared that no competing interests exist.


