Restricted Two-Stage Multi-Locus Genome-Wide Association Analysis and Its Applications to Genetic and Breeding Studies
HE JianBo, LIU FangDong, WANG WuBin, XING GuangNan, GUAN RongZhan, GAI JunYi
Soybean Research Institute, Nanjing Agricultural University/National Center for Soybean Improvement/Key Laboratory of Biology and Genetic Improvement of Soybean (General), Ministry of Agriculture/State Key Laboratory for Crop Genetics and Germplasm Enhancement/Jiangsu Collaborative Innovation Center for Modern Crop Production, Nanjing 210095
Responsible editor: LI Li
Received: 2019-08-26; Accepted: 2019-11-30; Online: 2020-05-16
About authors
HE JianBo, E-mail: hjbxyz@gmail.com
Cite this article
HE JianBo, LIU FangDong, WANG WuBin, XING GuangNan, GUAN RongZhan, GAI JunYi. Restricted Two-Stage Multi-Locus Genome-Wide Association Analysis and Its Applications to Genetic and Breeding Studies[J]. Scientia Agricultura Sinica, 2020, 53(9): 1704-1716 doi:10.3864/j.issn.0578-1752.2020.09.002
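Most traits involved in crop production are quantitative traits. Studying and dissecting their genetic basis is important for plant genetics and is also a prerequisite for optimal cross design and accurate progeny selection in breeding by design. Unlike qualitative traits, which are controlled by one or a few genes, quantitative traits are controlled by a large number of genes, and a complete and accurate dissection of quantitative trait loci (QTL) remains challenging[1]. At present, marker-based linkage mapping and genome-wide association study (GWAS) are the two main approaches to QTL/gene mapping. Linkage mapping is usually based on segregating populations derived from two parents, such as recombinant inbred lines (RIL), and detects QTL using a genetic linkage map of molecular markers and interval mapping[2,3]. Because linkage mapping usually involves only two parents, the genetic variation it can detect is limited to the differences between them; in an RIL population, for example, at most two alleles differ at each locus. Linkage mapping in bi-parental populations therefore tends to detect only a limited number of large-effect QTL and cannot comprehensively detect QTL and their multiple alleles. Linkage mapping in multi-parental segregating populations broadens the genetic variation to some extent. For example, the maize nested association mapping (NAM) population consists of 25 RIL populations sharing a common parent, so in theory up to 26 alleles can differ at each locus[4,5]. However, previous statistical methods for NAM populations treated the RIL populations as mutually independent subpopulations and assumed that each locus has different allele effects in different RIL populations[6]; in a NAM population composed of 25 RIL populations, every locus is then modeled with a fixed 50 alleles. Thus, although the NAM design links the RIL populations through a common parent, these methods amount to a joint analysis of multiple RIL populations rather than an analysis of the NAM population as one integrated population. The genotype at each locus in the model is not the actual marker genotype, so the number of alleles per locus deviates from reality, which affects QTL detection and subsequent breeding applications.
Natural populations and germplasm populations harbor the broadest genetic variation. GWAS exploits the large number of historical recombination events in a natural population and tests the association between genome-wide high-density molecular markers and the phenotype to identify marker loci significantly associated with the target trait, with higher mapping resolution than linkage mapping. GWAS can in principle detect genome-wide QTL and their multiple alleles; it has become an important method for the genetic dissection of quantitative traits and is widely used in humans, animals and plants[7,8]. However, unlike the simple genetic structure of linkage mapping populations, natural populations often have complex and unknown population structure as a result of long-term natural and artificial selection. Population structure can cause non-random association between unlinked loci, which leads to a high false positive rate in GWAS[9]. Several methods have been proposed to reduce this interference, the most common being structured association (SA)[10], principal components analysis (PCA)[11] and the mixed linear model (MLM)[12]. SA uses the population structure inferred by Bayesian clustering algorithms such as STRUCTURE[13] as model covariates. Similarly, PCA uses the eigenvectors of the genetic relationship matrix as covariates. Building on SA and PCA, MLM adds the genetic background effect to the linear model as a random effect, with the kinship matrix as its covariance structure, so that bias is controlled at the level of both population structure and familial relatedness.
Previous GWAS has usually been based on genome-wide SNP markers. A SNP has only two alleles per locus, so it cannot capture the multiple alleles that are widespread in natural populations. This limits the use of GWAS in breeding, and because a single SNP resolves only the variation between one pair of alleles, it may also reduce detection power. The common GWAS methods described above are all based on single-locus models, in which each marker is tested for association with the phenotype separately. The effect estimated for each marker is consequently influenced by neighboring loci, so the phenotypic variation explained is overestimated; for example, the total phenotypic variation explained by the detected loci may exceed 100%. Because GWAS involves a huge number of markers, a multi-locus model would have far more variables than observations and the linear model cannot be solved directly, which has largely limited the use of multi-locus models in GWAS. In addition, to control the inflation of the experiment-wise error rate caused by multiple testing in single-locus models, previous GWAS usually applied a very stringent significance level, such as the Bonferroni correction. A stringent significance level also produces a high false negative rate, so previous GWAS has often detected only a few major QTL, which explain a small part of the phenotypic variation and do not give a comprehensive picture of the genome-wide loci.
To address these limitations, HE et al.[14] grouped adjacent SNPs in high linkage disequilibrium (LD) into multi-allelic SNPLDB markers, detected genome-wide QTL with a multi-locus multi-allele model, and proposed the restricted two-stage multi-locus genome-wide association analysis (RTM-GWAS). The method solves the inability of previous GWAS to estimate multiple alleles and, by fitting multiple QTL in a multi-locus model, improves detection power and reduces false positives. By comprehensively dissecting the QTL and their multiple alleles in a natural population, RTM-GWAS establishes the genetic constitution of the population, which can then be applied to gene discovery, studies of population genetic differentiation, and genome-wide selection of optimal parental crosses. This paper first summarizes the principles and methods of RTM-GWAS and then reviews its applications in genetic and breeding studies.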
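1 The restricted two-stage multi-locus genome-wide association analysis method
RTM-GWAS includes two key innovations that address the inability of previous GWAS to estimate multiple alleles and the problems of single-locus models. The first is to construct multi-allelic SNPLDB markers from genome-wide high-density SNPs and to use them for QTL detection. Because SNPLDB markers are multi-allelic, they can fit the abundant multiple alleles in natural populations. The second is a two-stage multi-locus multi-allele model for detecting genome-wide QTL and finally building a multi-QTL genetic model. The multi-locus model removes the bias of effect estimates in single-locus models. In addition, because it no longer involves multiple testing, a conventional significance level can be used, which to some extent reduces the false negatives caused by multiple-testing correction and improves detection power. Since GWAS usually involves a huge number of markers, solving the multi-locus model directly would make the model space too large to compute. RTM-GWAS therefore uses a two-stage strategy: the first stage eliminates the large number of markers unrelated to the target trait, and the second stage fits the multi-locus model on the reduced marker set. The RTM-GWAS program (https://github.com/njau-sri/rtm-gwas/) is implemented in C++ and relies on highly optimized high-performance linear algebra libraries, which gives the method high computational efficiency[15].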
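1.1 Detection of multiple alleles
SNPs are usually not evenly distributed across the genome, and the tightness of linkage between adjacent SNPs reveals a block structure in which the haplotype sequence within a block is transmitted intact to the next generation. LD among SNPs within a block is therefore high, and the different combinations of SNP alleles within a block constitute different block haplotypes. Block haplotypes behave like multiple alleles and reflect genomic variation in natural populations better than bi-allelic SNPs. LD is a general measure of the recombination history of a natural population, so genomic blocks can be identified genome-wide from the LD between SNPs. RTM-GWAS first delimits genomic blocks across the genome using the LD confidence interval method[16]. Under a given LD criterion, a block may contain many SNPs or as few as one. The haplotypes formed by these SNPs serve as the alleles of the locus, and the genotype of each individual at the locus is determined by the haplotypes it carries. A multi-allelic marker built from an LD block in this way is called a SNPLDB marker; a single SNP is likewise treated as an independent SNPLDB marker. Comparison of data sets with different sequencing depths shows that the proportion of SNPLDB markers containing only one SNP decreases as SNP density increases. For example, 36 952 SNPLDB markers were constructed from 145 558 SNPs, 70.3% of which contained a single SNP, whereas with data of average coverage depth greater than 11×, 78.2% of the SNPLDB markers contained multiple SNPs[14].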
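The recoding of block haplotypes into multi-allelic markers can be sketched as follows. This is a minimal illustration, not the RTM-GWAS implementation: it assumes homozygous (inbred) accessions so that each accession carries one haplotype, it takes the block boundaries as given rather than computing them with the LD confidence interval method, and the function name `build_snpldb` and the toy data are hypothetical.

```python
from collections import Counter

def build_snpldb(haplotypes, blocks):
    """Recode SNP haplotypes into multi-allelic block markers.

    haplotypes: one string per inbred accession, one character per SNP.
    blocks: (start, end) SNP index pairs, end exclusive (assumed precomputed).
    Returns one list of integer allele codes per block.
    """
    markers = []
    for start, end in blocks:
        segments = [h[start:end] for h in haplotypes]
        counts = Counter(segments)
        # Number alleles by decreasing frequency, ties broken alphabetically.
        order = sorted(counts, key=lambda s: (-counts[s], s))
        code = {seg: i + 1 for i, seg in enumerate(order)}
        markers.append([code[s] for s in segments])
    return markers

# Toy data: 5 accessions, 5 SNPs, three hypothetical blocks.
haps = ["AACGT", "AACGA", "GTCGT", "GTTGA", "ATCGT"]
blocks = [(0, 2), (2, 3), (3, 5)]
markers = build_snpldb(haps, blocks)
```

In this toy example the first block carries three haplotypes and so becomes a three-allele marker, while the single-SNP block stays bi-allelic, which mirrors the statement that a lone SNP is also treated as a SNPLDB marker.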
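SNPLDB markers carry richer multi-allelic information than SNPs. Because multiple alleles are an intrinsic property of natural and germplasm populations, SNPLDB markers can in theory fit QTL with different numbers of alleles, and SNPLDB-based QTL detection is more reasonable than SNP-based detection. SNPLDB markers can also be used to analyze differences in allele frequencies among subpopulations at the locus level, so they are better suited than SNPs to studying population genetic differentiation. Furthermore, conventional plant breeding is a process of pyramiding alleles: complementary alleles of parental materials are combined into an improved individual carrying elite alleles for yield, quality or other desired traits[17]. The first prerequisite of breeding by design is therefore to dissect the genome-wide QTL of the target trait and their allelic composition, for which SNPLDB-based QTL detection offers a potential approach.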
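1.2 Control of population structure
Marker-based genetic relationship matrices previously used to control population structure are suitable only for SNPs[11,18-19] and cannot be applied to multi-allelic SNPLDB markers. RTM-GWAS therefore uses a genetic similarity coefficient matrix based on SNPLDB markers to control the influence of population structure on GWAS. The genetic similarity coefficient between two individuals (assumed diploid) is defined as the proportion of SNPLDB alleles that are identical by state:
$s_{ij}=\sum\limits_{k=1}^m c_{ijk}/(2m)$
where $c_{ijk}$ is the number of alleles shared by individuals $i$ and $j$ at the $k$-th SNPLDB (0, 1 or 2), and $m$ is the total number of SNPLDB markers. Although population structure is uncertain because of varying degrees of admixture and inbreeding, the genetic similarity coefficient matrix requires no prior assumptions and can serve as a general way to estimate population structure. RTM-GWAS includes eigenvectors of the genetic similarity coefficient matrix as covariates in the linear model to correct for population structure. Here the population structure effect is treated as fixed rather than random, because the population is usually predetermined rather than randomly formed[9].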
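The similarity coefficient defined above can be computed directly from multi-allelic genotypes. The sketch below is illustrative only: it assumes each diploid genotype is stored as a pair of integer allele codes, and the helper names `shared_alleles` and `similarity` are hypothetical.

```python
def shared_alleles(g1, g2):
    """Number of alleles identical by state at one locus (0, 1 or 2)."""
    remaining = list(g2)
    shared = 0
    for allele in g1:
        if allele in remaining:
            remaining.remove(allele)  # each allele of g2 can be matched once
            shared += 1
    return shared

def similarity(ind_i, ind_j):
    """s_ij = sum_k c_ijk / (2m) over m SNPLDB markers."""
    m = len(ind_i)
    return sum(shared_alleles(a, b) for a, b in zip(ind_i, ind_j)) / (2 * m)

# Two toy individuals genotyped at three SNPLDB markers.
x = [(1, 1), (2, 3), (1, 2)]
y = [(1, 1), (3, 3), (2, 2)]
s_xx = similarity(x, x)
s_xy = similarity(x, y)
```

Filling an n × n matrix with these coefficients and taking its leading eigenvectors (for instance with `numpy.linalg.eigh`) would give the covariates $w_{ij}$ used in the models that follow; how many eigenvectors to keep is a separate choice not covered here.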
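1.3 Multi-locus association model
Although GWAS usually involves millions of markers, most are unrelated to the target trait. To shrink the multi-locus model space effectively, RTM-GWAS uses a two-stage strategy. In the first stage, every locus in the genome is tested for association under a single-locus model, and a conventional significance level (for example 0.05) is used for preliminary screening to discard markers that are not associated with the trait. The linear model is:
$y_i=\mu+\sum\limits_{j=1}^J w_{ij}\alpha_j+\sum\limits_{l=1}^L x_{il}\beta_l+\varepsilon_i$
where $y_i$ is the phenotypic observation of individual $i$; $\mu$ is the population mean; $w_{ij}$ is the coefficient of the $j$-th eigenvector of the genetic similarity coefficient matrix for individual $i$, $\alpha_j$ is the effect of the $j$-th eigenvector, and $J$ is the number of eigenvectors used to correct for population structure; $x_{il}$ is the genotype indicator variable (0 or 1) of the $l$-th allele of the tested marker for individual $i$; $\beta_l$ is the effect of the $l$-th allele; $L$ is the number of alleles of the tested marker; and $\varepsilon_i$ is the residual, assumed to be normally distributed. This linear model can be solved directly by regression analysis.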
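The first-stage test for a single multi-allelic marker can be sketched as a partial F test that compares the model with and without the allele indicators. This is an illustration under stated assumptions rather than the RTM-GWAS code: it uses NumPy least squares, the function names `rss` and `locus_f_stat` are hypothetical, and the toy covariate `W` merely stands in for eigenvectors of the similarity matrix.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares and column rank of a least-squares fit."""
    beta, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid), int(rank)

def locus_f_stat(y, W, alleles):
    """Partial F statistic for one marker given structure covariates W."""
    n = len(y)
    base = np.column_stack([np.ones(n), W])
    levels = sorted(set(alleles))
    # One 0/1 indicator column per allele; the redundancy with the
    # intercept is absorbed by the rank reported by lstsq.
    X = np.array([[float(a == lev) for lev in levels] for a in alleles])
    full = np.column_stack([base, X])
    rss0, r0 = rss(base, y)
    rss1, r1 = rss(full, y)
    df1, df2 = r1 - r0, n - r1
    return ((rss0 - rss1) / df1) / (rss1 / df2), df1, df2

y = np.array([10.0, 11.0, 12.0, 13.0, 20.0, 21.0])
W = np.array([[0.1], [-0.2], [0.3], [-0.1], [0.2], [-0.3]])
alleles = [1, 1, 2, 2, 3, 3]
f_value, df1, df2 = locus_f_stat(y, W, alleles)
```

A p-value would follow from the F distribution with `df1` and `df2` degrees of freedom (for example via `scipy.stats.f.sf`), and markers with a p-value above the conventional level would be dropped before the second stage.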
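In the second stage, the markers retained from the first stage are analyzed with the following multi-locus model to detect genome-wide QTL and finally build a multi-QTL model.
$y_i=\mu+\sum\limits_{j=1}^J w_{ij}\alpha_j+\sum\limits_{k=1}^K\sum\limits_{l=1}^{L_k}x_{ikl}\beta_{kl}+\varepsilon_i$
where $x_{ikl}$ is the genotype indicator variable (0 or 1) of the $l$-th allele of the $k$-th locus for individual $i$; $\beta_{kl}$ is the effect of the $l$-th allele of the $k$-th locus; $L_k$ is the number of alleles of the $k$-th locus; and $K$ is the total number of QTL. The other symbols are as defined above. This model can be solved by stepwise regression. Because QTL detection is based on a multi-locus model, the total genetic variation explained by the QTL detected with RTM-GWAS is smaller than the total genetic variation of the population; that is, the phenotypic variation explained does not exceed the trait heritability.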
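The second-stage model selection can be illustrated with a simplified forward selection. This sketch makes several assumptions that differ from a full stepwise regression: it only adds markers (no backward elimination), it uses a fixed F-to-enter threshold `f_enter` instead of a p-value, and the names `forward_select`, `fit_rss` and `indicators` are hypothetical. In practice `base` would also hold the eigenvector covariates.

```python
import numpy as np

def fit_rss(X, y):
    beta, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r), int(rank)

def indicators(alleles):
    levels = sorted(set(alleles))
    return np.array([[float(a == lev) for lev in levels] for a in alleles])

def forward_select(y, base, markers, f_enter=4.0):
    """Greedy forward selection of multi-allelic markers by partial F."""
    n = len(y)
    selected, X = [], base
    while True:
        rss0, r0 = fit_rss(X, y)
        best = None
        for name, alleles in markers.items():
            if name in selected:
                continue
            Xc = np.column_stack([X, indicators(alleles)])
            rss1, r1 = fit_rss(Xc, y)
            if r1 == r0 or n - r1 <= 0 or rss1 <= 0:
                continue  # marker adds nothing or model is saturated
            f = ((rss0 - rss1) / (r1 - r0)) / (rss1 / (n - r1))
            if best is None or f > best[0]:
                best = (f, name, Xc)
        if best is None or best[0] < f_enter:
            return selected
        selected.append(best[1])
        X = best[2]

# Toy data: markers A (3 alleles) and B (2 alleles) affect y, C does not.
A = [1] * 4 + [2] * 4 + [3] * 4
B = [1, 2] * 6
C = [1, 1, 2, 2] * 3
noise = [0.1, -0.05, -0.1, 0.05, -0.1, 0.05, 0.1, -0.05, 0.05, 0.1, -0.05, -0.1]
y = np.array([3.0 * (a - 1) + 1.5 * (b - 1) + e for a, b, e in zip(A, B, noise)])
selected = forward_select(y, np.ones((12, 1)), {"A": A, "B": B, "C": C})
```

Because every accepted marker is fitted jointly with those already in the model, the variation attributed to each locus is conditional on the others, which is why the summed contribution stays within the total variation.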
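1.4 Significance level for genome-wide QTL detection
Because the multi-locus model has built-in control of the experiment-wise error rate, RTM-GWAS uses conventional significance levels of 0.01 and 0.05 to detect genome-wide QTL. This differs from single-locus GWAS, which performs a large number of separate hypothesis tests marker by marker (multiple testing), so that the experiment-wise error rate at a conventional significance level is greatly inflated. In that case multiple testing must be corrected for, for example with a Bonferroni-adjusted significance level of $0.05\times10^{-8}$, to control the experiment-wise error rate[20]. In the multi-locus model of RTM-GWAS, however, all loci are fitted in one linear model and tested jointly, so a conventional significance level controls the experiment-wise error rate and no multiple-testing correction is needed. Owing to the nature of stepwise regression, in addition to the loci significant in the full model, an individual probability (significance) is reported for each selected locus. This probability is usually close to the probability corrected for multiple testing, so researchers can also select loci by their own criteria as needed. For example, using a conventional significance level, HE et al.[14] detected 139 loci for soybean 100-seed weight (Table 1), including 22 large-effect ($R^2\geq1\%$) and 117 small-effect ($R^2<1\%$) loci that explained 61.8% and 36.4% of the phenotypic variation, respectively. These included 15 of the 16 loci detected at the Bonferroni-corrected significance level. Thus, although multiple-testing correction is unnecessary for RTM-GWAS, researchers can still apply a more stringent significance level to pick out highly significant large-effect loci from the results obtained at the conventional level, without recomputation. For example, in a breeding program aimed at trait improvement, breeders can use 0.05 or 0.01 as the significance level for QTL detection, whereas for candidate gene cloning, researchers can use the reported probability of each locus to select the most important loci.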
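The multiple-testing argument can be made concrete with a short calculation. The numbers are illustrative assumptions: the source does not state which marker count underlies its Bonferroni level, so the 36 952 SNPLDB markers reported in [14] are used here as the number of tests, and the tests are treated as independent.

```python
import math

alpha = 0.05
m = 36952  # SNPLDB marker count from [14]; assumed here as the number of tests

# Chance of at least one false positive if m independent tests each use alpha.
fwer_uncorrected = 1.0 - (1.0 - alpha) ** m

# Bonferroni: per-test level that keeps the experiment-wise rate at or below alpha.
bonferroni_level = alpha / m
threshold_neg_lg_p = -math.log10(bonferroni_level)

print(f"{fwer_uncorrected:.3f} {bonferroni_level:.3e} {threshold_neg_lg_p:.2f}")
```

Under these assumptions the Bonferroni threshold corresponds to a -lgP just under 6, so loci in Table 1 with smaller -lgP values would be missed by a corrected single-locus scan even though they are retained at the conventional level in the multi-locus model.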
Table 1 SNPLDB marker loci significantly associated with 100-seed weight in the Chinese soybean germplasm population
SNPLDB | Chromosome | Position (bp) a | No. alleles | -lgP | R2 (%) |
---|---|---|---|---|---|
LDB_18_59996683 | 18 | 59996683 | 2 | 129.8 | 9.84 |
LDB_8_5286591 | 8 | 5286591 | 2 | 99.3 | 6.76 |
LDB_16_35761014 | 16 | 35761014—35771300 | 4 | 86.0 | 5.80 |
LDB_6_3703919 | 6 | 3703919 | 2 | 84.1 | 5.43 |
LDB_4_3019467 | 4 | 3019467—3046646 | 3 | 81.7 | 5.35 |
LDB_17_15063207 | 17 | 15063207—15063454 | 4 | 61.0 | 3.80 |
LDB_11_28584788 | 11 | 28584788—28784681 | 8 | 53.6 | 3.53 |
LDB_14_47245011 | 14 | 47245011 | 2 | 37.9 | 2.08 |
LDB_9_6122236 | 9 | 6122236 | 2 | 37.5 | 2.05 |
LDB_13_42639761 | 13 | 42639761 | 2 | 36.5 | 1.99 |
... | ... | ... | ... | ... | ... |
LDB_2_11741211 | 2 | 11741211—11741518 | 3 | 9.1 | 0.47 |
LDB_10_34650810 | 10 | 34650810—34706889 | 5 | 7.6 | 0.46 |
LDB_5_38249682 | 5 | 38249682—38278658 | 5 | 7.3 | 0.45 |
LDB_7_35863030 | 7 | 35863030—35901005 | 6 | 6.9 | 0.45 |
LDB_9_1954783 | 9 | 1954783 | 2 | 9.4 | 0.44 |
... | ... | ... | ... | ... | ... |
LDB_8_44667459 | 8 | 44667459 | 2 | 2.6 | 0.10 |
LDB_13_35141544 | 13 | 35141544 | 2 | 2.6 | 0.10 |
LDB_18_61536415 | 18 | 61536415 | 2 | 2.7 | 0.10 |
... | ... | ... | ... | ... | ... |
LDB_8_16362965 | 8 | 16362965 | 2 | 2.2 | 0.08 |
LDB_19_44814107 | 19 | 44814107 | 2 | 1.9 | 0.07 |
LC QTL (R2 ≥ 1%), 22 loci | | | 68 | | 61.8 |
SC QTL (R2 < 1%), 117 loci | | | 334 | | 36.4 |
Total, 139 loci | | | 402 | | 98.2 |
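2 Application to genetic dissection of quantitative traits in natural populations
Natural populations are rich in genetic variation, and crop germplasm populations are important gene resources for cultivar improvement. Comprehensive dissection of the many QTL and multiple alleles present in natural and germplasm populations helps to understand the inheritance of quantitative traits and supports plant genetic improvement. ZHANG et al.[21] used a germplasm population of 366 soybean landraces and applied both RTM-GWAS and MLM, currently the most common method, to GWAS of oil, oleic acid and linolenic acid content (Table 2). Under Bonferroni correction, MLM detected 3, 18 and 22 QTL for the three traits, explaining 19.69%, 138.76% and 206.52% of the phenotypic variation, respectively. MLM thus detected few QTL, and the variation explained for oleic and linolenic acid content far exceeded the trait heritability, which indicates substantial bias in the QTL effect estimates. RTM-GWAS detected 50, 98 and 50 QTL explaining 82.53%, 90.29% and 83.84% of the phenotypic variation, respectively, all below the trait heritability, which is more reasonable.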
Table 2 Comparison of genome-wide association analysis methods based on a soybean landrace germplasm population
Trait | Heritability h2 | RTM-GWAS QTL | RTM-GWAS R2 (%) | MLM QTL | MLM R2 (%) |
---|---|---|---|---|---|
Oil content | 0.91 | 50 | 82.53 | 3 | 16.69 |
Oleic acid content | 0.91 | 98 | 90.29 | 18 | 138.76 |
Linolenic acid content | 0.90 | 50 | 83.34 | 22 | 206.52 |
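HE et al.[14] performed GWAS of 100-seed weight in the Chinese soybean germplasm population of 1 024 accessions and compared RTM-GWAS with PCA and MLM. In the Q-Q plot (Fig. 1), all markers in the single-marker analysis without population structure control (Naive) deviate greatly from the expected values, which indicates a very high false positive rate. The reason is that the population includes wild soybeans, landraces and released cultivars collected from different soybean ecoregions, which creates complex population structure. By controlling population structure, PCA reduced false positives to some extent but still deviated far from expectation. Under MLM all loci lie close to the expected values; false positives were greatly reduced, but so was detection power. RTM-GWAS behaved more reasonably: most loci are close to the expected values while the detected QTL lie well above them, so false positives are reduced and detection power is maintained.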
Fig. 1 Q-Q plot of genome-wide association study of 100-seed weight in the Chinese soybean germplasm population
The black line is the reference line of the theoretical distribution
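In the Chinese soybean germplasm population, RTM-GWAS detected 139 QTL for 100-seed weight (Table 1 and Fig. 2), including 2 of the 3 QTL detected by MLM and covering 73% of previously reported 100-seed weight QTL[14]. RTM-GWAS also estimated the genetic effects of the 402 alleles at the 139 QTL. The QTL and their allele effects reflect the genetic constitution of 100-seed weight in the population. The genotypes of all accessions at all QTL, together with the allele effects, can be assembled into the QTL-allele matrix of the trait for the population (Fig. 3). The QTL-allele matrix contains all the genetic information on the trait in the population and can be further used for gene discovery and breeding by design.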
Fig. 2 Manhattan plot of genome-wide association analysis results of 100-seed weight in the Chinese soybean germplasm population using RTM-GWAS
Fig. 3 The QTL-allele matrix of 100-seed weight in the Chinese soybean germplasm population
The horizontal axis represents accessions arranged in ascending order of 100-seed weight (g); each column shows the allele constitution of an accession over all QTL. The vertical axis represents QTL; each row shows the distribution of the alleles of one QTL among accessions. Allele effects are shown as colored cells, with warm colors for positive effects and cool colors for negative effects, and the color depth indicates effect size
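3 Application to genetic dissection of quantitative traits in RIL and NAM populations
A growing number of RIL populations are now genotyped by resequencing to obtain genome-wide high-density SNPs. With such marker density, QTL can be detected without constructing a genetic linkage map, so GWAS methods can also be applied to RIL populations. RTM-GWAS is likewise applicable to bi-parental RIL populations and multi-parental NAM populations. However, SNPLDB markers in RTM-GWAS are built from haplotypes of genomic blocks, whereas in RIL and NAM populations the genotype of each individual at a locus comes directly from the parents, so the SNPLDB alleles should be built from the parental haplotypes. For RIL and NAM populations, SNPLDB construction is adjusted as follows. First, genomic blocks are still delimited genome-wide with the LD confidence interval method. Then the haplotypes formed in the parents by all SNPs within a block are taken as the alleles of the locus, and the genotype of each individual at the locus is determined from the parental haplotypes. PAN et al.[22], using marker and flowering time data from a soybean RIL population, compared different mapping methods (CIM, MLM and RTM-GWAS) and marker types (SSR, BIN and SNPLDB). The three methods detected 10, 36 and 67 BIN-QTL and 23, 14 and 86 SNPLDB-QTL, respectively. The phenotypic variation explained by the loci detected with CIM and MLM exceeded 100%, whereas that explained by the loci detected with RTM-GWAS was below but close to the trait heritability. RTM-GWAS therefore not only detects more QTL but also gives reasonable estimates of the variation they explain, and is better suited to QTL mapping in RIL populations.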
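As noted above, although the NAM design links multiple RIL populations through a common parent and increases the genetic variation of the population, previous methods did not analyze the NAM population as one integrated population. RTM-GWAS can analyze a NAM population as a whole by constructing SNPLDB markers. LI et al.[23], using a NAM population composed of four soybean RIL populations, compared the SNP-based JICIM[6] and MLM methods with the SNPLDB-based RTM-GWAS method (Table 3). The three methods detected 9, 7 and 139 QTL for soybean flowering time, explaining 74.0%, 40.6% and 81.7% of the phenotypic variation, respectively. The NAM population has five parents, so in theory at most five alleles exist at each locus. Yet JICIM assigned 8 alleles to every locus and MLM assigned 2, neither of which matches reality. With RTM-GWAS the number of alleles per locus ranged from 2 to 5, which fits the allelic variation in the population reasonably and makes the method better suited to NAM populations.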
Table 3 Comparison of the characteristics of five QTL mapping methods based on a soybean NAM population
Item | CIM[3] | MCIM[24] | JICIM[25] | MLM[26] | RTM-GWAS |
---|---|---|---|---|---|
Analysis type | Separate mapping | Separate mapping | Joint mapping | Joint mapping | Joint mapping |
Marker type | BIN | BIN | SNP | SNP | SNPLDB |
Mapping mechanism | Linkage mapping | Linkage mapping | Linkage mapping | Association mapping | Association mapping |
Number of QTLs | 8 | 16 | 9 | 7 | 139 |
Number of alleles | 2 | 2 | 8 | 2 | 2-5 |
Genetic contribution (%) | 73.2-96.1 | 48.4-94.5 | 74.0 | 40.6 | 81.7 |
Phenotype data | Entry mean | Single plot | Entry mean | Entry mean | Single plot |
QTL×Env. | No | Yes | No | No | Yes |
Software | QTL Cartographer | QTLNetwork | QTL IciMapping | TASSEL | RTM-GWAS |
Command line | Yes | No | No | Yes | Yes |
Platform | Windows/Linux | Windows | Windows | Windows/Linux/Mac | Windows/Linux/Mac |
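4 Application to genetic dissection of gene-by-environment interaction
Quantitative traits are affected not only by multiple QTL but also by interactions among QTL and between QTL and the environment. By maintaining genetic variation in populations, QTL-by-environment interaction plays an important role in the environmental adaptation of plants. For example, gene-by-environment interaction strongly affects drought tolerance in soybean, because the degree of drought depends heavily on environmental factors such as temperature, humidity and rainfall[27]. A better understanding of QTL main effects and QTL-by-environment interaction effects is therefore essential for designing breeding strategies for different environments. However, previous GWAS has usually been based on entry means (best linear unbiased estimates or best linear unbiased predictions) and cannot dissect QTL-by-environment interaction. RTM-GWAS instead works with plot-level observations and uses a multi-locus model with QTL-by-environment interaction, so it can detect not only main-effect QTL but also QTL that have no main effect and act only through interaction with the environment. The linear model with QTL-by-environment interaction is:
$y_{it}=\mu+e_t+\sum\limits_{j=1}^J w_{ij}\alpha_j+\sum\limits_{k=1}^K\left(\sum\limits_{l=1}^{L_k}x_{ikl}\beta_{kl}+\sum\limits_{l=1}^{L_k}x_{ikl}\gamma_{klt}\right)+\varepsilon_{it}$
where $e_t$ is the effect of the $t$-th environment and $\gamma_{klt}$ is the interaction effect between the $l$-th allele of the $k$-th locus and the $t$-th environment. The other symbols are as defined above. RTM-GWAS first detects QTL on the basis of the model effect, that is, it tests whether the sum of the QTL main effect and the QTL-by-environment interaction effect is significant. A QTL is thus detected whenever at least one of the two effects is significant. RTM-GWAS then tests the QTL main effect and the QTL-by-environment interaction effect separately to determine the specific model for each QTL.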
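The two QTL terms of this model correspond to two sets of indicator columns for each locus. The sketch below shows one way to build them for a single locus from plot-level records; it is an illustration only, and the function name `gxe_columns` and the toy environment labels are hypothetical.

```python
def gxe_columns(alleles, envs):
    """Indicator columns for one locus: main effects and allele x environment.

    alleles: allele code of the accession grown in each plot.
    envs: environment label of each plot.
    """
    levels = sorted(set(alleles))
    env_levels = sorted(set(envs))
    # x_ikl: one column per allele.
    main = [[int(a == lev) for lev in levels] for a in alleles]
    # x_ikl within environment t: one column per (allele, environment) pair.
    inter = [[int(a == lev and e == t) for lev in levels for t in env_levels]
             for a, e in zip(alleles, envs)]
    return main, inter

# Two accessions (alleles 1 and 2), each observed in two environments.
alleles = [1, 1, 2, 2]
envs = ["E1", "E2", "E1", "E2"]
main, inter = gxe_columns(alleles, envs)
```

Testing the `main` and `inter` columns together corresponds to the model-effect test described above, and testing each set separately corresponds to the follow-up tests of the main effect and the interaction effect.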
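KHAN et al.[27] evaluated seedling-stage drought tolerance in a soybean NAM population composed of two RIL populations. Genotype-by-environment interaction was highly significant for both relative root length and relative shoot length. RTM-GWAS detected 38 and 73 QTL for the two traits, respectively; the main effects of 30 and 55 QTL explained 26.11% and 40.43% of the phenotypic variation, and 16 and 53 QTL interacting with the environment explained 10.35% of the phenotypic variation. The results further show that gene-by-environment interaction plays an important role in drought tolerance in soybean.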
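5 Application to population genetic differentiation and breeding by design
RTM-GWAS detects QTL and their multiple alleles relatively thoroughly, and the QTL-allele matrix built from its results represents the genetic constitution of the target trait in the population. The matrix can therefore be used to analyze the genetic differentiation and evolution of the target trait among populations, as well as population-specific and newly emerged alleles. ZHANG et al.[28], using a QTL-allele matrix of 89 QTL for soybean protein content and their 255 alleles, analyzed the genetic differentiation of soybean landraces among ecoregions, found that 32.09% of the alleles were specific to an ecoregion, and summarized four patterns of genetic differentiation among ecoregions. Fig. 4 shows four QTL whose allele frequencies do not differ significantly among ecoregions and four whose frequencies do: the allele frequencies of the four non-significant QTL are relatively consistent across the six ecoregions (left), whereas those of the four significant QTL differ widely (right). This provides a reference for further clarifying how QTL and genes evolve.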
Fig. 4 Allele frequencies of protein content QTL among different ecoregions in soybean (ZHANG[29])
I: Northern Single Cropping, Spring Planting Ecoregion; II: Huang-Huai-Hai Double Cropping, Spring and Summer Planting Ecoregion; III: Middle and Lower Changjiang Valley Double Cropping, Spring and Summer Planting Ecoregion; IV: South Central Multiple Cropping, Spring, Summer and Autumn Planting Ecoregion; V: Southwest Plateau Double Cropping, Spring and Summer Planting Ecoregion; VI: South China Tropical Multiple Cropping, All Season Planting Ecoregion
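Parental crossing and progeny selection are the two main steps of conventional breeding, and the QTL-allele matrix provides a theoretical basis for both. HE et al.[14] used RTM-GWAS to dissect 100-seed weight in a germplasm population of 1 024 soybean accessions, obtained a QTL-allele matrix of 139 QTL and their 402 alleles, and used it to predict the homozygous progeny populations of all 523 776 possible single crosses (Fig. 5). Some crosses were predicted to produce transgressive progeny, and the predicted 100-seed weight of progeny from the best 20 crosses was 12.4%-19.9% higher than that of the parental population (Table 4). Optimal cross design based on the genome-wide QTL-allele matrix differs fundamentally from genomic selection. Genomic selection assumes that all genome-wide markers are associated with the target trait and builds a prediction model from all markers to predict and select progeny, so the breeding progeny must be genotyped genome-wide, which is costly. In addition, differences between the population used to build the model and the actual breeding population may seriously bias selection. Owing to these constraints, genomic selection is at present applied mainly in animal breeding. Selection based on the QTL-allele matrix targets the trait loci directly, which better matches practical breeding needs and is in theory more direct and efficient than genomic selection.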
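The prediction of cross progeny from a QTL-allele matrix can be sketched with a small simulation. The sketch rests on simplifying assumptions that are not taken from the source: QTL are treated as unlinked, effects are purely additive, progeny are fully homozygous, and the allele effects and the function name `predict_cross` are hypothetical.

```python
import random
from math import comb

def predict_cross(p1, p2, effects, n_progeny=2000, seed=1):
    """99th percentile of predicted genotypic values of homozygous progeny.

    p1, p2: allele code of each homozygous parent at every QTL.
    effects: one dict per QTL mapping allele code to its effect.
    """
    rng = random.Random(seed)
    values = []
    for _ in range(n_progeny):
        value = 0.0
        for a1, a2, eff in zip(p1, p2, effects):
            # Each QTL independently inherits the allele of either parent.
            value += eff[a1 if rng.random() < 0.5 else a2]
        values.append(value)
    values.sort()
    return values[int(0.99 * (n_progeny - 1))]

def genotypic_value(parent, effects):
    return sum(eff[a] for a, eff in zip(parent, effects))

n_crosses = comb(1024, 2)  # all possible single crosses among 1 024 accessions

effects = [{1: -1.0, 2: 1.0}, {1: 0.5}, {1: -0.5, 2: 0.5}]
p1, p2 = [1, 1, 2], [2, 1, 1]
q99 = predict_cross(p1, p2, effects)
best_parent = max(genotypic_value(p1, effects), genotypic_value(p2, effects))
```

In the toy cross the parents carry complementary favorable alleles, so the upper tail of the progeny distribution exceeds both parents, which is the transgressive segregation the cross design aims to exploit. A realistic version would use a genetic map to model linkage between QTL.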
Fig. 5 Distribution of predicted 100-seed weight of simulated progenies for all possible single crosses
The two dashed lines represent the maximum (top) and minimum (bottom) observed 100-seed weight of the parental lines, respectively. Min., P25, P50, P75 and Max. represent the minimum, 25th percentile, 50th percentile, 75th percentile and maximum of the predicted 100-seed weight of the progenies
Table 4 Predicted elite crosses for improving 100-seed weight in the Chinese soybean germplasm population
Cross | Observation P1 | Observation P2 | 99th percentile prediction |
---|---|---|---|
T78205-06×N23548 | 30.4 | 36.0 | 43.1 |
N23745.0×N23548 | 26.6 | 36.0 | 42.9 |
N6141×N23548 | 34.0 | 36.0 | 42.4 |
N04482.1×N23548 | 28.2 | 36.0 | 42.4 |
N25377×N23548 | 25.5 | 36.0 | 41.6 |
N23548×N24190 | 36.0 | 26.6 | 41.6 |
N23548×N05758 | 36.0 | 27.8 | 41.4 |
N24282×N23548 | 24.8 | 36.0 | 41.4 |
T78205-06×N05758 | 30.4 | 27.8 | 41.3 |
N25366×N23548 | 24.4 | 36.0 | 41.2 |
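6 Prospects
This paper has introduced the basic principles of RTM-GWAS and the results of its initial applications in plant genetics and breeding. The most important feature of RTM-GWAS is that it detects as many of the QTL in a population and their alleles as possible and reports the effect of each allele and its relative frequency in the population. It thus provides basic information for tracing the constitution and network structure of QTL and their alleles (genes and their alleles) within a population, and a new tool for studying their dynamics (population genetics). At present, RTM-GWAS considers only locus main effects and their interaction with the environment; the model does not include interactions between loci (epistasis) or epistasis-by-environment interaction. Studies have shown that epistasis contributes substantially to the genetic variation of quantitative traits and that models including epistasis fit phenotypic variation better[30]. However, when locus-by-locus interactions are included in a GWAS model, the millions of markers make computation difficult, so few GWAS have considered epistasis[31,32]. Efficient algorithms such as BOOST[33] and TEAM[34] have been proposed to address this difficulty, but they were generally developed for human case-control GWAS and cannot be applied directly to continuous quantitative traits, which limits their use in plant research. Exploring efficient models for epistasis is therefore the next issue for RTM-GWAS. Because SNPLDB markers are multi-allelic, a single marker cannot be represented by one variable, which further increases the complexity of an epistatic model. On the other hand, the rapid development of computing, especially the spread of graphics processing units in recent years, will help to dissect epistasis for quantitative traits at the genome-wide level[35,36].
Dissecting the genetic constitution of complex traits is the foundation of plant genetics and breeding research; the results can be used both to study the function of individual genes and to assist breeding. The optimal cross design described above is based on QTL and alleles dissected by RTM-GWAS, but so far it has addressed a single target trait, whereas practical breeding selects on multiple traits together. In that case, the QTL-allele matrices obtained by RTM-GWAS for several target traits can be used to predict the progeny populations of parental crosses and obtain predicted values for each trait. A comprehensive selection index over the traits can then be built by assigning trait weights according to the breeding objective, and crosses can be selected on the index. Optimal cross design is the first main step of conventional breeding; in practice, elite lines must also be selected from the segregating progeny. The genomic selection (GS) method proposed by MEUWISSEN et al.[37] first establishes a linear relationship between molecular markers and phenotype in a reference population and then uses the same marker set in the candidate population to predict genomic estimated breeding values (GEBVs) of individuals, thereby achieving progeny selection. In plant breeding, however, a single cross usually involves thousands of progeny, so using genomic selection for progeny selection is currently expensive. In this situation, progeny selection based on the QTL-allele matrix may be another effective approach. There are several possible ways to use a SNPLDB-based QTL-allele matrix in marker-assisted progeny selection: developing SNPLDB markers into gel electrophoresis markers; finding gel electrophoresis markers tightly linked to SNPLDB markers; and developing SNPLDB marker chips.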
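A multi-trait selection index of the kind described above could take the following form. This is only one possible construction and is not specified in the source: predicted values are standardized per trait across crosses and combined with user-chosen weights, and the cross names, trait values and weights are all hypothetical.

```python
from statistics import mean, pstdev

def selection_index(predictions, weights):
    """Weighted sum of per-trait standardized predictions for each cross.

    predictions: dict mapping cross name to a tuple of predicted trait values.
    weights: one weight per trait, chosen by the breeder.
    """
    traits = list(zip(*predictions.values()))
    stats = [(mean(t), pstdev(t)) for t in traits]
    index = {}
    for cross, values in predictions.items():
        z = [(v - mu) / sd for v, (mu, sd) in zip(values, stats)]
        index[cross] = sum(w * zi for w, zi in zip(weights, z))
    return index

# Hypothetical predictions: (100-seed weight in g, protein content in %).
predictions = {
    "cross_1": (43.1, 40.0),
    "cross_2": (41.0, 45.0),
    "cross_3": (39.0, 42.0),
}
index = selection_index(predictions, weights=(0.5, 0.5))
best = max(index, key=index.get)
```

Standardizing before weighting keeps traits measured on different scales comparable; a trait to be reduced rather than increased could be given a negative weight.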
References
[1] DOI: 10.1038/s41576-019-0127-1; PMID: 31068683
Genome-wide association studies (GWAS) involve testing genetic variants across the genomes of many individuals to identify genotype-phenotype associations. GWAS have revolutionized the field of complex disease genetics over the past decade, providing numerous compelling associations for human complex traits and diseases. Despite clear successes in identifying novel disease susceptibility genes and biological pathways and in translating these findings into clinical care, GWAS have not been without controversy. Prominent criticisms include concerns that GWAS will eventually implicate the entire genome in disease predisposition and that most association signals reflect variants and genes with no direct biological relevance to disease. In this Review, we comprehensively assess the benefits and limitations of GWAS in human populations and discuss the relevance of performing more GWAS.
[2] PMID: 2563713
The advent of complete genetic linkage maps consisting of codominant DNA markers [typically restriction fragment length polymorphisms (RFLPs)] has stimulated interest in the systematic genetic dissection of discrete Mendelian factors underlying quantitative traits in experimental organisms. We describe here a set of analytical methods that modify and extend the classical theory for mapping such quantitative trait loci (QTLs). These include: (i) a method of identifying promising crosses for QTL mapping by exploiting a classical formula of SEWALL WRIGHT; (ii) a method (interval mapping) for exploiting the full power of RFLP linkage maps by adapting the approach of LOD score analysis used in human genetics, to obtain accurate estimates of the genetic location and phenotypic effect of QTLs; and (iii) a method (selective genotyping) that allows a substantial reduction in the number of progeny that need to be scored with the DNA markers. In addition to the exposition of the methods, explicit graphs are provided that allow experimental geneticists to estimate, in any particular case, the number of progeny required to map QTLs underlying a quantitative trait.
[3] PMID: 8013918
Adequate separation of effects of possible multiple linked quantitative trait loci (QTLs) on mapping QTLs is the key to increasing the precision of QTL mapping. A new method of QTL mapping is proposed and analyzed in this paper by combining interval mapping with multiple regression. The basis of the proposed method is an interval test in which the test statistic on a marker interval is made to be unaffected by QTLs located outside a defined interval. This is achieved by fitting other genetic markers in the statistical model as a control when performing interval mapping. Compared with the current QTL mapping method (i.e., the interval mapping method which uses a pair or two pairs of markers for mapping QTLs), this method has several advantages. (1) By confining the test to one region at a time, it reduces a multiple dimensional search problem (for multiple QTLs) to a one dimensional search problem. (2) By conditioning linked markers in the test, the sensitivity of the test statistic to the position of individual QTLs is increased, and the precision of QTL mapping can be improved. (3) By selectively and simultaneously using other markers in the analysis, the efficiency of QTL mapping can be also improved. The behavior of the test statistic under the null hypothesis and appropriate critical value of the test statistic for an overall test in a genome are discussed and analyzed. A simulation study of QTL mapping is also presented which illustrates the utility, properties, advantages and disadvantages of the method.
[4] DOI: 10.1534/genetics.107.074245; PMID: 18202393
We investigated the genetic and statistical properties of the nested association mapping (NAM) design currently being implemented in maize (26 diverse founders and 5000 distinct immortal genotypes) to dissect the genetic basis of complex quantitative traits. The NAM design simultaneously exploits the advantages of both linkage analysis and association mapping. We demonstrated the power of NAM for high-power cost-effective genome scans through computer simulations based on empirical marker data and simulated traits with different complexities. With common-parent-specific (CPS) markers genotyped for the founders and the progenies, the inheritance of chromosome segments nested within two adjacent CPS markers was inferred through linkage. Genotyping the founders with additional high-density markers enabled the projection of genetic information, capturing linkage disequilibrium information, from founders to progenies. With 5000 genotypes, 30-79% of the simulated quantitative trait loci (QTL) were precisely identified. By integrating genetic design, natural diversity, and genomics technologies, this new complex trait dissection strategy should greatly facilitate endeavors to link molecular variation with phenotypic variation for various complex traits.
[5] DOI: 10.1126/science.1174320; PMID: 19661427
Maize genetic diversity has been used to understand the molecular basis of phenotypic variation and to improve agricultural efficiency and sustainability. We crossed 25 diverse inbred maize lines to the B73 reference line, capturing a total of 136,000 recombination events. Variation for recombination frequencies was observed among families, influenced by local (cis) genetic variation. We identified evidence for numerous minor single-locus effects but little two-locus linkage disequilibrium or segregation distortion, which indicated a limited role for genes with large effects and epistatic interactions on fitness. We observed excess residual heterozygosity in pericentromeric regions, which suggested that selection in inbred lines has been less efficient in these regions because of reduced recombination frequency. This implies that pericentromeric regions may contribute disproportionally to heterosis.
[6] DOI: 10.1126/science.1174276; PMID: 19661422
Flowering time is a complex trait that controls adaptation of plants to their local environment in the outcrossing species Zea mays (maize). We dissected variation for flowering time with a set of 5000 recombinant inbred lines (maize Nested Association Mapping population, NAM). Nearly a million plants were assayed in eight environments but showed no evidence for any single large-effect quantitative trait loci (QTLs). Instead, we identified evidence for numerous small-effect QTLs shared among families; however, allelic effects differ across founder lines. We identified no individual QTLs at which allelic effects are determined by geographic origin or large effects for epistasis or environmental interactions. Thus, a simple additive model accurately predicts flowering time for maize, in contrast to the genetic architecture observed in the selfing plant species rice and Arabidopsis.
[7] DOI: 10.1016/j.ajhg.2017.06.005; PMID: 28686856
Application of the experimental design of genome-wide association studies (GWASs) is now 10 years old (young), and here we review the remarkable range of discoveries it has facilitated in population and complex-trait genetics, the biology of diseases, and translation toward new therapeutics. We predict the likely discoveries in the next 10 years, when GWASs will be based on millions of samples with array data imputed to a large fully sequenced reference panel and on hundreds of thousands of samples with whole-genome sequencing data.
[8] DOI: 10.1146/annurev-arplant-050213-035715; PMID: 24274033
Natural variants of crops are generated from wild progenitor plants under both natural and human selection. Diverse crops that are able to adapt to various environmental conditions are valuable resources for crop improvements to meet the food demands of the increasing human population. With the completion of reference genome sequences, the advent of high-throughput sequencing technology now enables rapid and accurate resequencing of a large number of crop genomes to detect the genetic basis of phenotypic variations in crops. Comprehensive maps of genome variations facilitate genome-wide association studies of complex traits and functional investigations of evolutionary changes in crops. These advances will greatly accelerate studies on crop designs via genomics-assisted breeding. Here, we first discuss crop genome studies and describe the development of sequencing-based genotyping and genome-wide association studies in crops. We then review sequencing-based crop domestication studies and offer a perspective on genomics-driven crop designs.
[9] DOI: 10.1038/nrg2813; PMID: 20548291
Genome-wide association (GWA) studies are an effective approach for identifying genetic variants associated with disease risk. GWA studies can be confounded by population stratification--systematic ancestry differences between cases and controls--which has previously been addressed by methods that infer genetic ancestry. Those methods perform well in data sets in which population structure is the only kind of structure present but are inadequate in data sets that also contain family structure or cryptic relatedness. Here, we review recent progress on methods that correct for stratification while accounting for these additional complexities.
[10] DOI: 10.1086/302959; PMID: 10827107
The use, in association studies, of the forthcoming dense genomewide collection of single-nucleotide polymorphisms (SNPs) has been heralded as a potential breakthrough in the study of the genetic basis of common complex disorders. A serious problem with association mapping is that population structure can lead to spurious associations between a candidate marker and a phenotype. One common solution has been to abandon case-control studies in favor of family-based tests of association, such as the transmission/disequilibrium test (TDT), but this comes at a considerable cost in the need to collect DNA from close relatives of affected individuals. In this article we describe a novel, statistically valid, method for case-control association studies in structured populations. Our method uses a set of unlinked genetic markers to infer details of population structure, and to estimate the ancestry of sampled individuals, before using this information to test for associations within subpopulations. It provides power comparable with the TDT in many settings and may substantially outperform it if there are conflicting associations in different subpopulations.
[11] DOI: 10.1038/ng1847; PMID: 16862161
Population stratification--allele frequency differences between cases and controls due to systematic ancestry differences-can cause spurious associations in disease studies. We describe a method that enables explicit detection and correction of population stratification on a genome-wide scale. Our method uses principal components analysis to explicitly model ancestry differences between cases and controls. The resulting correction is specific to a candidate marker's variation in frequency across ancestral populations, minimizing spurious associations while maximizing power to detect true associations. Our simple, efficient approach can easily be applied to disease studies with hundreds of thousands of markers.
[12] DOI: 10.1038/ng1702; PMID: 16380716
As population structure can result in spurious associations, it has constrained the use of association studies in human and plant genetics. Association mapping, however, holds great promise if true signals of functional association can be separated from the vast number of false signals generated by population structure. We have developed a unified mixed-model approach to account for multiple levels of relatedness simultaneously as detected by random genetic markers. We applied this new approach to two samples: a family-based sample of 14 human families, for quantitative gene expression dissection, and a sample of 277 diverse maize inbred lines with complex familial relationships and population structure, for quantitative trait dissection. Our method demonstrates improved control of both type I and type II error rates over other methods. As this new method crosses the boundary between family-based and structured association samples, it provides a powerful complement to currently available methods for association mapping.
[13] PMID: 10835412
We describe a model-based clustering method for using multilocus genotype data to infer population structure and assign individuals to populations. We assume a model in which there are K populations (where K may be unknown), each of which is characterized by a set of allele frequencies at each locus. Individuals in the sample are assigned (probabilistically) to populations, or jointly to two or more populations if their genotypes indicate that they are admixed. Our model does not assume a particular mutation process, and it can be applied to most of the commonly used genetic markers, provided that they are not closely linked. Applications of our method include demonstrating the presence of population structure, assigning individuals to populations, studying hybrid zones, and identifying migrants and admixed individuals. We show that the method can produce highly accurate assignments using modest numbers of loci-e.g. , seven microsatellite loci in an example using genotype data from an endangered bird species. The software used for this article is available from http://www.stats.ox.ac.uk/ approximately pritch/home. html.
[14] DOI: 10.1007/s00122-017-2962-9; PMID: 28828506
The innovative RTM-GWAS procedure provides a relatively thorough detection of QTL and their multiple alleles for germplasm population characterization, gene network identification, and genomic selection strategy innovation in plant breeding. The previous genome-wide association studies (GWAS) have been concentrated on finding a handful of major quantitative trait loci (QTL), but plant breeders are interested in revealing the whole-genome QTL-allele constitution in breeding materials/germplasm (in which tremendous historical allelic variation has been accumulated) for genome-wide improvement. To match this requirement, two innovations were suggested for GWAS: first grouping tightly linked sequential SNPs into linkage disequilibrium blocks (SNPLDBs) to form markers with multi-allelic haplotypes, and second utilizing two-stage association analysis for QTL identification, where the markers were preselected by single-locus model followed by multi-locus multi-allele model stepwise regression. Our proposed GWAS procedure is characterized as a novel restricted two-stage multi-locus multi-allele GWAS (RTM-GWAS, https://github.com/njau-sri/rtm-gwas ). The Chinese soybean germplasm population (CSGP) composed of 1024 accessions with 36,952 SNPLDBs (generated from 145,558 SNPs, with reduced linkage disequilibrium decay distance) was used to demonstrate the power and efficiency of RTM-GWAS. Using the CSGP marker information, simulation studies demonstrated that RTM-GWAS achieved the highest QTL detection power and efficiency compared with the previous procedures, especially under large sample size and high trait heritability conditions. A relatively thorough detection of QTL with their multiple alleles was achieved by RTM-GWAS compared with the linear mixed model method on 100-seed weight in CSGP. A QTL-allele matrix (402 alleles of 139 QTL × 1024 accessions) was established as a compact form of the population genetic constitution. The 100-seed weight QTL-allele matrix was used for genetic characterization, candidate gene prediction, and genomic selection for optimal crosses in the germplasm population.
,
DOI: 10.3724/SP.J.1006.2018.01274 [Cited in this article: 1]
Genome-wide association studies (GWAS) have been widely used for genetic dissection of quantitative trait loci (QTL), and the previous GWAS procedures were concentrated on finding a handful of major loci, while the plant breeders are more likely interested in exploring the whole QTL system for both forward selection and background control. We proposed the restricted two-stage multi-locus genome-wide association analysis (RTM-GWAS, https://github.com/njau-sri/rtm-gwas/) for a relatively thorough detection of QTL and their multiple alleles. Firstly, RTM-GWAS groups the tightly linked sequential SNPs into linkage disequilibrium blocks (SNPLDBs) to form genomic markers with multiple haplotypes as alleles. Secondly, it utilizes two-stage association analysis based on a multi-locus multi-allele model to save computer space for focusing on genome-wide QTL identification along with their multiple alleles. Compared with the previous GWAS methods, RTM-GWAS takes the trait heritability as the upper limit of detected genetic contribution, which can avoid a large amount of false positives for a precise detection of the QTL system of the trait. The QTL-allele matrix as a compact form of the population genetic constitution can be used to design optimal genotypes, to predict optimal crosses in plant breeding, and to study the genetic properties of the population as well as the novel and newly emerged alleles. In the present study, we first introduced the function and usage of the RTM-GWAS analytical programs, and then used the experimental data from a research program on soybean to illustrate the application details of the RTM-GWAS.
,
DOI: 10.1126/science.1069424; PMID: 12029063 [Cited in this article: 1]
Haplotype-based methods offer a powerful approach to disease gene mapping, based on the association between causal mutations and the ancestral haplotypes on which they arose. As part of The SNP Consortium Allele Frequency Projects, we characterized haplotype patterns across 51 autosomal regions (spanning 13 megabases of the human genome) in samples from Africa, Europe, and Asia. We show that the human genome can be parsed objectively into haplotype blocks: sizable regions over which there is little evidence for historical recombination and within which only a few common haplotypes are observed. The boundaries of blocks and specific haplotypes they contain are highly correlated across populations. We demonstrate that such haplotype frameworks provide substantial statistical power in association studies of common genetic variation across each region. Our results provide a foundation for the construction of a haplotype map of the human genome, facilitating comprehensive genetic association studies of human disease.
,
DOI: 10.1270/jsbbs.61.495; PMID: 23136489 [Cited in this article: 1]
"Breeding by Design" as a concept described by Peleman and van der Voort aims to bring together superior alleles for all genes of agronomic importance from potential genetic resources. This might be achievable through high-resolution allele detection based on precise QTL (quantitative trait locus/loci) mapping of potential parental resources. The present paper reviews the works at the Chinese National Center for Soybean Improvement (NCSI) on exploration of QTL and their superior alleles of agronomic traits for genetic dissection of germplasm resources in soybeans towards practicing "Breeding by Design". Among the major germplasm resources, i.e. released commercial cultivar (RC), farmers' landrace (LR) and annual wild soybean accession (WS), the RC was recognized as the primary potential adapted parental sources, with a great number of new alleles (45.9%) having emerged and accumulated during the 90 years' scientific breeding processes. A mapping strategy, i.e. a full model procedure (including additive (A), epistasis (AA), A x environment (E) and AA x E effects), scanning with QTLNetwork2.0 and followed by verification with other procedures, was suggested and used for the experimental data when the underlying genetic model was usually unknown. In total, 110 data sets of 81 agronomically important traits were analyzed for their QTL, with 14.5% of the data sets showing major QTL (contribution rate more than 10.0% for each QTL), 55.5% showing a few major QTL but more small QTL, and 30.0% having only small QTL. In addition to the detected QTL, the collective unmapped minor QTL sometimes accounted for more than 50% of the genetic variation in a number of traits. Integrated with linkage mapping, association mappings were conducted on germplasm populations and validated to be able to provide complete information on multiple QTL and their multiple alleles. 
Accordingly, the QTL and their alleles of agronomic traits for large samples of RC, LR and WS were identified and then the QTL-allele matrices were established. Based on which the parental materials can be chosen for complementary recombination among loci and alleles to make the crossing plans genetically optimized. This approach has provided a way towards breeding by design, but the accuracy will depend on the precision of the loci and allele matrices.
,
DOI: 10.1371/journal.pgen.0020190; PMID: 17194218 [Cited in this article: 1]
Current methods for inferring population structure from genetic data do not provide formal significance tests for population differentiation. We discuss an approach to studying population structure (principal components analysis) that was first applied to genetic data by Cavalli-Sforza and colleagues. We place the method on a solid statistical footing, using results from modern statistics to develop formal significance tests. We also uncover a general "phase change" phenomenon about the ability to detect structure in genetic data, which emerges from the statistical theory we use, and has an important implication for the ability to discover structure in genetic data: for a fixed but large dataset size, divergence between two populations (as measured, for example, by a statistic like FST) below a threshold is essentially undetectable, but a little above threshold, detection will be easy. This means that we can predict the dataset size needed to detect structure.
,
DOI: 10.3168/jds.2007-0980; PMID: 18946147 [Cited in this article: 1]
Efficient methods for processing genomic data were developed to increase reliability of estimated breeding values and to estimate thousands of marker effects simultaneously. Algorithms were derived and computer programs tested with simulated data for 2,967 bulls and 50,000 markers distributed randomly across 30 chromosomes. Estimation of genomic inbreeding coefficients required accurate estimates of allele frequencies in the base population. Linear model predictions of breeding values were computed by 3 equivalent methods: 1) iteration for individual allele effects followed by summation across loci to obtain estimated breeding values, 2) selection index including a genomic relationship matrix, and 3) mixed model equations including the inverse of genomic relationships. A blend of first- and second-order Jacobi iteration using 2 separate relaxation factors converged well for allele frequencies and effects. Reliability of predicted net merit for young bulls was 63% compared with 32% using the traditional relationship matrix. Nonlinear predictions were also computed using iteration on data and nonlinear regression on marker deviations; an additional (about 3%) gain in reliability for young bulls increased average reliability to 66%. Computing times increased linearly with number of genotypes. Estimation of allele frequencies required 2 processor days, and genomic predictions required <1 d per trait, and traits were processed in parallel. Information from genotyping was equivalent to about 20 daughters with phenotypic records. Actual gains may differ because the simulation did not account for linkage disequilibrium in the base population or selection in subsequent generations.
,
DOI: 10.1126/science.273.5281.1516; PMID: 8801636 [Cited in this article: 1]
,
DOI: 10.3389/fpls.2018.01793; PMID: 30568668 [Cited in this article: 1]
Soybean is one of the world's major vegetative oil sources, while oleic acid and linolenic acid content are the major quality traits of soybean oil. The restricted two-stage multi-locus genome-wide association analysis (RTM-GWAS), characterized with error and false-positive control, has provided a potential approach for a relatively thorough detection of whole-genome QTL-alleles. The Chinese soybean landrace population (CSLRP) composed of 366 accessions was tested under four environments to identify the QTL-allele constitution of seed oil, oleic acid and linolenic acid content (SOC, OAC, and LAC). Using RTM-GWAS with 29,119 SNPLDBs (SNP linkage disequilibrium blocks) as genomic markers, 50, 98, and 50 QTLs with 136, 283, and 154 alleles (2-9 per locus) were detected, with their contribution 82.52, 90.31, and 83.86% to phenotypic variance, corresponding to their heritability 91.29, 90.97, and 90.24% for SOC, OAC, and LAC, respectively. The RTM-GWAS was shown to be more powerful and efficient than previous single-locus model GWAS procedures. For each trait, the detected QTL-alleles were organized into a QTL-allele matrix as the population genetic constitution. From which the genetic differentiation among 6 eco-populations was characterized as significant allele frequency differentiation on 28, 56, and 30 loci for the three traits, respectively. The QTL-allele matrices were also used for genomic selection for optimal crosses, which predicted transgressive potential up to 24.76, 40.30, and 2.37% for the respective traits, respectively. From the detected major QTLs, 38, 27, and 25 candidate genes were annotated for the respective traits, and two common QTL covering eight genes were identified for further study.
,
DOI: 10.1007/s00122-018-3174-7; PMID: 30167759 [Cited in this article: 1]
Eighty-six R1 QTLs accounting for 89.92% phenotypic variance in a soybean RIL population were identified using RTM-GWAS with SNPLDB marker which performed superior over CIM and MLM-GWAS with BIN/SNPLDB marker. A population (NJRIKY) composed of 427 recombinant inbred lines (RILs) derived from Kefeng-1 × NN1138-2 (MGII × MGV, MG maturity group) was applied for detecting flowering date (R1) quantitative trait locus (QTL) system in soybean. From a low-depth re-sequencing (~0.75×), 576,874 SNPs were detected and organized into 4737 BINs (recombination breakpoint determinations) and 3683 SNP linkage disequilibrium blocks (SNPLDBs), respectively. Using the association mapping procedures "Restricted Two-stage Multi-locus Genome-wide Association Study" (RTM-GWAS), "Mixed Linear Model Genome-wide Association Study" (MLM-GWAS) and the linkage mapping procedure "Composite Interval Mapping" (CIM), 67, 36 and 10 BIN-QTLs and 86, 14 and 23 SNPLDB-QTLs were detected with their phenotypic variance explained (PVE) 88.70-89.92% (within heritability 98.2%), 146.41-353.62% (overflowing) and 88.29-172.34% (overflowing), respectively. The RTM-GWAS with SNPLDBs which showed to be more efficient and reasonable than the others was used to identify the R1 QTL system in NJRIKY. The detected 86 SNPLDB-QTLs with their PVE from 0.02 to 30.66% in a total of 89.92% covered 51 out of 104 R1 QTLs in 18 crosses in SoyBase and 26 out of 139 QTLs in a nested association mapping population, while the rest 29 QTLs were novel ones. From the QTL system, 52 candidate genes were annotated, including the verified gene E1, E2, E9 and J, and grouped into 3 categories of biological processes, among which 24 genes were enriched into three protein-protein interaction networks, suggesting gene networks working together. Since NJRIKY involves only MGII and MGV, the QTL/gene system among MG000-MGX should be explored further.
,
DOI: 10.1007/s00122-017-2960-y; PMID: 28799029 [Cited in this article: 1]
The RTM-GWAS was chosen among five procedures to identify DTF QTL-allele constitution in a soybean NAM population; 139 QTLs with 496 alleles accounting for 81.7% of phenotypic variance were detected. Flowering date (days to flowering, DTF) is an ecological trait in soybean, closely related to its ability to adapt to areas. A nested association mapping (NAM) population consisting of four RIL populations (LM, ZM, MT and MW with M8206 as their common parent) was established and tested for their DTF under five environments. Using restriction-site-associated DNA sequencing the population was genotyped with SNP markers. The restricted two-stage multi-locus (RTM) genome-wide association study (GWAS) (RTM-GWAS) with SNP linkage disequilibrium block (SNPLDB) as multi-allele genomic markers performed the best among the five mapping procedures with software publicly available. It identified the greatest number of quantitative trait loci (QTLs) (139) and alleles (496) on 20 chromosomes covering almost all of the QTLs detected by four other mapping procedures. The RTM-GWAS provided the detected QTLs with highest genetic contribution but without overflowing and missing heritability problems (81.7% genetic contribution vs. heritability of 97.6%), while SNPLDB markers matched the NAM population property of multiple alleles per locus. The 139 QTLs with 496 alleles were organized into a QTL-allele matrix, showing the corresponding DTF genetic architecture of the five parents and the NAM population. All lines and parents comprised both positive and negative alleles, implying a great potential of recombination for early and late DTF improvement. From the detected QTL-allele system, 126 candidate genes were annotated and χ² tested as a DTF candidate gene system involving nine biological processes, indicating the trait a complex, involving several biological processes rather than only a handful of major genes.
,
DOI: 10.1093/bioinformatics/btm494; PMID: 18202029 [Cited in this article: 1]
QTLNetwork is a software package for mapping and visualizing the genetic architecture underlying complex traits for experimental populations derived from a cross between two inbred lines. It can simultaneously map quantitative trait loci (QTL) with individual effects, epistasis and QTL-environment interaction. Currently, it is able to handle data from F(2), backcross, recombinant inbred lines and double-haploid populations, as well as populations from specific mating designs (immortalized F(2) and BC(n)F(n) populations). The Windows version of QTLNetwork was developed with a graphical user interface. Alternatively, the command-line versions have the facility to be run in other prevalent operating systems, such as Linux, Unix and MacOS.
,
[Cited in this article: 1]
,
DOI: 10.1093/bioinformatics/btm308; PMID: 17586829 [Cited in this article: 1]
Association analyses that exploit the natural diversity of a genome to map at very high resolutions are becoming increasingly important. In most studies, however, researchers must contend with the confounding effects of both population and family structure. TASSEL (Trait Analysis by aSSociation, Evolution and Linkage) implements general linear model and mixed linear model approaches for controlling population and family structure. For result interpretation, the program allows for linkage disequilibrium statistics to be calculated and visualized graphically. Database browsing and data importation is facilitated by integrated middleware. Other features include analyzing insertions/deletions, calculating diversity statistics, integration of phenotypic and genotypic data, imputing missing data and calculating principal components.
,
DOI: 10.1007/s00425-018-2952-4; PMID: 29980855 [Cited in this article: 2]
RTM-GWAS identified 111 DT QTLs, 262 alleles with high proportion of QEI and genetic variation accounting for 88.55-95.92% PV in NAM, from which QTL-allele matrices were established and candidate genes annotated. Drought tolerance (DT) is one of the major challenges for world soybean production. A nested association mapping (NAM) population with 403 lines comprising two recombinant inbred line (RIL) populations: M8206 × TongShan and ZhengYang × M8206 was tested for DT using polyethylene-glycol (PEG) treatment under spring and summer environments. The population was sequenced using restriction-site-associated DNA sequencing (RAD-seq) filtered with minor allele frequency (MAF) ≥ 0.01, 55,936 single nucleotide polymorphisms (SNPs) were obtained and organized into 6137 SNP linkage disequilibrium blocks (SNPLDBs). The restricted two-stage multi-locus genome-wide association studies (RTM-GWAS) identified 73 and 38 QTLs with 174 and 88 alleles contributed main effect 40.43 and 26.11% to phenotypic variance (PV) and QTL-environment interaction (QEI) effect 24.64 and 10.35% to PV for relative root length (RRL) and relative shoot length (RSL), respectively. The DT traits were characterized with high proportion of QEI variation (37.52-41.65%), plus genetic variation (46.90-58.40%) in a total of 88.55-95.92% PV. The identified QTLs-alleles were organized into main-effect and QEI-effect QTL-allele matrices, showing the genetic and QEI architecture of the three parents/NAM population. From the matrices, the possible best genotype was predicted to have a weighted average value over two indicators (WAV) of 1.873, while the top ten optimal crosses among RILs with 95th percentile WAV 1.098-1.132, transgressive over the parents (0.651-0.773) but much less than 1.873, implying further pyramiding potential. From the matrices, 134 candidate genes were annotated involved in nine biological processes.
The present results provide a novel way for molecular breeding in QTL-allele-based genomic selection for optimal cross selection.
,
[Cited in this article: 1]
[D]. ,
[Cited in this article: 2]
[D]. ,
[Cited in this article: 2]
,
DOI: 10.1038/ng.3800; PMID: 28250458 [Cited in this article: 1]
Experiments in model organisms report abundant genetic interactions underlying biologically important traits, whereas quantitative genetics theory predicts, and data support, the notion that most genetic variance in populations is additive. Here we describe networks of capacitating genetic interactions that contribute to quantitative trait variation in a large yeast intercross population. The additive variance explained by individual loci in a network is highly dependent on the allele frequencies of the interacting loci. Modeling of phenotypes for multilocus genotype classes in the epistatic networks is often improved by accounting for the interactions. We discuss the implications of these results for attempts to dissect genetic architectures and to predict individual phenotypes and long-term responses to selection.
,
DOI: 10.1038/nrg3627; PMID: 24296533 [Cited in this article: 1]
The role of epistasis in the genetic architecture of quantitative traits is controversial, despite the biological plausibility that nonlinear molecular interactions underpin the genotype-phenotype map. This controversy arises because most genetic variation for quantitative traits is additive. However, additive variance is consistent with pervasive epistasis. In this Review, I discuss experimental designs to detect the contribution of epistasis to quantitative trait phenotypes in model organisms. These studies indicate that epistasis is common, and that additivity can be an emergent property of underlying genetic interaction networks. Epistasis causes hidden quantitative genetic variation in natural populations and could be responsible for the small additive effects, missing heritability and the lack of replication that are typically observed for human complex traits.
,
DOI: 10.1038/nrg3747; PMID: 25200660 [Cited in this article: 1]
Genome-wide association studies (GWASs) have become the focus of the statistical analysis of complex traits in humans, successfully shedding light on several aspects of genetic architecture and biological aetiology. Single-nucleotide polymorphisms (SNPs) are usually modelled as having additive, cumulative and independent effects on the phenotype. Although evidently a useful approach, it is often argued that this is not a realistic biological model and that epistasis (that is, the statistical interaction between SNPs) should be included. The purpose of this Review is to summarize recent directions in methodology for detecting epistasis and to discuss evidence of the role of epistasis in human complex trait variation. We also discuss the relevance of epistasis in the context of GWASs and potential hazards in the interpretation of statistical interaction terms.
,
DOI: 10.1016/j.ajhg.2010.07.021; PMID: 20817139 [Cited in this article: 1]
Gene-gene interactions have long been recognized to be fundamentally important for understanding genetic causes of complex disease traits. At present, identifying gene-gene interactions from genome-wide case-control studies is computationally and methodologically challenging. In this paper, we introduce a simple but powerful method, named "BOolean Operation-based Screening and Testing" (BOOST). For the discovery of unknown gene-gene interactions that underlie complex diseases, BOOST allows examination of all pairwise interactions in genome-wide case-control studies in a remarkably fast manner. We have carried out interaction analyses on seven data sets from the Wellcome Trust Case Control Consortium (WTCCC). Each analysis took less than 60 hr to completely evaluate all pairs of roughly 360,000 SNPs on a standard 3.0 GHz desktop with 4G memory running the Windows XP system. The interaction patterns identified from the type 1 diabetes data set display significant difference from those identified from the rheumatoid arthritis data set, although both data sets share a very similar hit region in the WTCCC report. BOOST has also identified some disease-associated interactions between genes in the major histocompatibility complex region in the type 1 diabetes data set. We believe that our method can serve as a computationally and statistically useful tool in the coming era of large-scale interaction mapping in genome-wide case-control studies.
,
DOI: 10.1093/bioinformatics/btq186; PMID: 20529910 [Cited in this article: 1]
As a promising tool for identifying genetic markers underlying phenotypic differences, genome-wide association study (GWAS) has been extensively investigated in recent years. In GWAS, detecting epistasis (or gene-gene interaction) is preferable over single locus study since many diseases are known to be complex traits. A brute force search is infeasible for epistasis detection in the genome-wide scale because of the intensive computational burden. Existing epistasis detection algorithms are designed for dataset consisting of homozygous markers and small sample size. In human study, however, the genotype may be heterozygous, and number of individuals can be up to thousands. Thus, existing methods are not readily applicable to human datasets. In this article, we propose an efficient algorithm, TEAM, which significantly speeds up epistasis detection for human GWAS. Our algorithm is exhaustive, i.e. it does not ignore any epistatic interaction. Utilizing the minimum spanning tree structure, the algorithm incrementally updates the contingency tables for epistatic tests without scanning all individuals. Our algorithm has broader applicability and is more efficient than existing methods for large sample study. It supports any statistical test that is based on contingency tables, and enables both family-wise error rate and false discovery rate controlling. Extensive experiments show that our algorithm only needs to examine a small portion of the individuals to update the contingency tables, and it achieves at least an order of magnitude speed up over the brute force approach.
,
DOI: 10.1038/nrg2857; PMID: 20717155 [Cited in this article: 1]
Today we can generate hundreds of gigabases of DNA and RNA sequencing data in a week for less than US$5,000. The astonishing rate of data generation by these low-cost, high-throughput technologies in genomics is being matched by that of other technologies, such as real-time imaging and mass spectrometry-based flow cytometry. Success in the life sciences will depend on our ability to properly interpret the large-scale, high-dimensional data sets that are generated by these technologies, which in turn requires us to adopt advances in informatics. Here we discuss how we can master the different types of computational environments that exist - such as cloud and heterogeneous computing - to successfully tackle our big data problems.
,
DOI: 10.1038/srep10298; PMID: 26223539 [Cited in this article: 1]
Precise prediction for genetic architecture of complex traits is impeded by the limited understanding on genetic effects of complex traits, especially on gene-by-gene (GxG) and gene-by-environment (GxE) interaction. In the past decades, an explosion of high throughput technologies enables omics studies at multiple levels (such as genomics, transcriptomics, proteomics, and metabolomics). The analyses of large omics data, especially two-loci interaction analysis, are very time intensive. Integrating the diverse omics data and environmental effects in the analyses also remain challenges. We proposed mixed linear model approaches using GPU (Graphic Processing Unit) computation to simultaneously dissect various genetic effects. Analyses can be performed for estimating genetic main effects, GxG epistasis effects, and GxE environment interaction effects on large-scale omics data for complex traits, and for estimating heritability of specific genetic effects. Both mouse data analyses and Monte Carlo simulations demonstrated that genetic effects and environment interaction effects could be unbiasedly estimated with high statistical power by using the proposed approaches.
,
PMID: 11290733 [Cited in this article: 1]
Recent advances in molecular genetic techniques will make dense marker maps available and genotyping many individuals for these markers feasible. Here we attempted to estimate the effects of approximately 50,000 marker haplotypes simultaneously from a limited number of phenotypic records. A genome of 1000 cM was simulated with a marker spacing of 1 cM. The markers surrounding every 1-cM region were combined into marker haplotypes. Due to finite population size N(e) = 100, the marker haplotypes were in linkage disequilibrium with the QTL located between the markers. Using least squares, all haplotype effects could not be estimated simultaneously. When only the biggest effects were included, they were overestimated and the accuracy of predicting genetic values of the offspring of the recorded animals was only 0.32. Best linear unbiased prediction of haplotype effects assumed equal variances associated to each 1-cM chromosomal segment, which yielded an accuracy of 0.73, although this assumption was far from true. Bayesian methods that assumed a prior distribution of the variance associated with each chromosome segment increased this accuracy to 0.85, even when the prior was not correct. It was concluded that selection on genetic values predicted from markers could substantially increase the rate of genetic gain in animals and plants, especially if combined with reproductive techniques to shorten the generation interval.