Research progress in machine learning methods for gene-gene interaction detection
Zheye Peng, Zijun Tang, Minzhu Xie, College of Physics and Information Science, Hunan Normal University, Changsha 410081, China
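Zijun Tang, master's student, specializing in bioinformatics.
Zheye Peng and Zijun Tang are co-first authors.
Editorial board member: Fangqing Zhao
Received: 2017-09-20; Revised: 2017-12-28; Published online: 2018-03-20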
Cite this article:
Zheye Peng, Zijun Tang, Minzhu Xie. Research progress in machine learning methods for gene-gene interaction detection. Hereditas (Beijing), 2018, 40(3): 218-226. doi:10.16288/j.yczz.17-254
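Genome-wide association studies (GWAS) test for associations between DNA variants and a specific disease or trait across the whole genome, and thereby identify the genetic variants involved. To date, GWAS have identified thousands of single nucleotide polymorphism (SNP) loci associated with various diseases or traits (phenotypes). For the vast majority of complex diseases, however, the variants at these SNP loci confer only small increases in disease risk; that is, they explain the disease status of only a small fraction of affected individuals. This phenomenon is known as "missing heritability" [1]. Several explanations have been proposed. A widely accepted one is that complex diseases arise from gene-gene and gene-environment interactions, where gene-gene interactions usually manifest as epistasis among SNP loci, i.e., two or more SNP loci jointly influence the phenotype [2]. A single SNP that alters the expression of a single gene usually has only a small effect on disease incidence, and emerging evidence suggests that interactions between many rare DNA variants and multiple risk alleles increase disease risk [3]. Current GWAS, however, mainly test the association of individual SNP loci with disease and lack the ability to detect interactions among multiple genes.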
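Detecting gene-gene interactions helps to identify gene function and is particularly important for discovering potential drug targets and the genetic mechanisms of complex human diseases [4]. The usual approach is to compute the statistical association between the phenotype and allele combinations at multiple SNP loci. As the number of interacting genes grows, however, the number of possible allele combinations at the corresponding SNP loci grows exponentially: assuming each SNP locus can take 3 distinct genotypes, n SNP loci yield as many as 3^n distinct genotype combinations. Detecting high-dimensional gene-gene interactions is therefore computationally very challenging [5]. Machine learning (ML) is an approach in which computers solve problems by imitating human cognitive processes. Its advantages for detecting gene-gene interactions are that no model of the interaction between loci or genes has to be assumed in advance, and that, instead of exhaustive search, algorithms learn from large amounts of data and thereby gain the ability to discover nonlinear, high-dimensional interactions [6]. Over the past 20 years, numerous machine learning methods have been applied to gene-gene interaction detection with some success [7]. However, genetic heterogeneity, population stratification, and the large number of SNP loci involved in interactions are the main factors limiting their performance. This paper reviews machine learning methods for detecting gene-gene interactions and discusses future research directions.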
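The combinatorial growth described above can be made concrete with a short calculation. The sketch below is only an illustration: the function names are hypothetical, and the figure of 500 000 SNPs is an assumed array size, not a value taken from a specific analysis.

```python
from math import comb


def genotype_combinations(n_snps: int, genotypes_per_snp: int = 3) -> int:
    """Number of distinct genotype combinations at n_snps loci (3^n by default)."""
    return genotypes_per_snp ** n_snps


def candidate_snp_sets(total_snps: int, order: int) -> int:
    """Number of SNP subsets of a given size that an exhaustive search must visit."""
    return comb(total_snps, order)


if __name__ == "__main__":
    for n in (2, 5, 10):
        print(n, genotype_combinations(n))
    # Pairs and triples among an assumed 500 000 SNPs.
    print(candidate_snp_sets(500_000, 2))
    print(candidate_snp_sets(500_000, 3))
```

Ten loci already give 59 049 genotype combinations, and 500 000 SNPs contain about 1.25 x 10^11 distinct pairs, which is why exhaustive search quickly becomes impractical.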
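1 Principles and characteristics of machine learning methods
Over the past 20 years, a series of machine learning methods have been used to detect gene-gene interactions. Those applied so far mainly include neural networks (NN), random forest (RF), support vector machines (SVM), and multifactor dimensionality reduction (MDR). This section reviews the principles and characteristics of these methods and some of the results they have achieved in gene-gene interaction detection.
1.1 Neural networks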
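Neural networks are based on a model of neurons. The feed-forward network trained by back propagation is the most common type; it has excellent pattern recognition and classification capabilities and can handle large amounts of data [8]. The network is structured as a multilayer directed graph: a feed-forward neural network consists of an input layer, a hidden layer, and an output layer [9]. The input and hidden layers contain many nodes, whereas the output layer has a single node. When used to detect gene-gene interactions, input nodes represent genetic variants, usually SNPs. Input nodes are connected to hidden nodes by arcs (directed edges), and hidden nodes are in turn connected by arcs to the output node and control its output. Each arc carries its own weight; different weight configurations correspond to different interactions among SNPs, and the weights are obtained by training the network on large amounts of data. During training, each arc is assigned a random initial weight, the known genotypes of the training data are fed to the input nodes, and the state of the output node is observed. The weights are then adjusted according to the difference between this state and the known phenotype, with the aim of minimizing the error rate of the network output [10]. Finally, the internal weight structure of the trained network is analyzed to identify the gene-gene interaction patterns hidden in the real data [11,12]. Tomita et al. [13] used a neural network to analyze gene-gene interactions among 25 SNPs in 17 genes in 172 children with allergic asthma and 172 healthy controls, and identified 10 susceptibility SNPs associated with allergic asthma in the Japanese population; the overall accuracy on test data reached 74.4%.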
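The training loop described above (random initial weights, a forward pass from genotypes to a single output node, and weight adjustment driven by the output error) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the setup of any cited study: the network size, learning rate, genotype coding, and toy phenotype rule are all hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, w_out):
    """Return the hidden activations and the single output for input vector x."""
    # zip stops at the shorter sequence, so the trailing bias weight is added separately.
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + ws[-1]) for ws in w_hidden]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + w_out[-1])
    return hidden, output


def mean_squared_error(data, w_hidden, w_out):
    return sum((forward(x, w_hidden, w_out)[1] - t) ** 2 for x, t in data) / len(data)


def train(data, n_hidden=3, epochs=2000, lr=0.5, seed=0):
    rng = random.Random(seed)
    n_in = len(data[0][0])
    # Every arc starts with a random weight; the last entry of each list is a bias.
    w_hidden = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_out = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
    losses = [mean_squared_error(data, w_hidden, w_out)]
    for _ in range(epochs):
        g_hidden = [[0.0] * (n_in + 1) for _ in range(n_hidden)]
        g_out = [0.0] * (n_hidden + 1)
        for x, target in data:
            hidden, y = forward(x, w_hidden, w_out)
            # Error signal at the output node (constant factor absorbed into lr).
            d_out = (y - target) * y * (1.0 - y)
            for j, h in enumerate(hidden):
                g_out[j] += d_out * h
                d_hid = d_out * w_out[j] * h * (1.0 - h)
                for i, xi in enumerate(x):
                    g_hidden[j][i] += d_hid * xi
                g_hidden[j][n_in] += d_hid
            g_out[n_hidden] += d_out
        # Full-batch gradient step on all weights (back propagation).
        for j in range(n_hidden):
            for i in range(n_in + 1):
                w_hidden[j][i] -= lr * g_hidden[j][i] / len(data)
        for j in range(n_hidden + 1):
            w_out[j] -= lr * g_out[j] / len(data)
        losses.append(mean_squared_error(data, w_hidden, w_out))
    return w_hidden, w_out, losses


# Toy data (assumption): genotypes coded 0/1/2 at two SNPs and scaled to [0, 1];
# an individual is a case (1) when exactly one of the two SNPs carries a minor allele.
data = [((a / 2, b / 2), int((a > 0) != (b > 0))) for a in range(3) for b in range(3)]
w_hidden, w_out, losses = train(data)
print(f"error before training: {losses[0]:.3f}, after: {losses[-1]:.3f}")
```

The toy phenotype has no effect attributable to either SNP alone in the carrier coding, so any reduction in error has to come from weights that combine both inputs, which is the kind of structure one would then inspect in the trained network.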
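Constructing a suitable internal weight structure is the key to successfully detecting gene-gene interactions with neural networks. Current methods for constructing the weights include back propagation (BP), genetic programming (GP), and grammatical evolution (GE) [14,15,16,17,18]. Ritchie et al. [15] compared the ability of genetic programming neural networks (GPNN) and back propagation neural networks (BPNN) to detect gene-gene interactions; their results showed that GPNN outperformed BPNN when the test data contained both functional and non-functional SNPs [16]. Motsinger et al. [17] also tested GPNN: in detecting interactions between 2 SNP loci in 1600 samples (half cases and half controls, 10 SNP loci in total), GPNN achieved a power of 86% even for gene-gene interaction models with heritability as low as 0.5%. On real Parkinson's disease data, GPNN also detected an interaction between a mitochondrial gene and sex that significantly increased the incidence of Parkinson's disease. Campos et al. [18] improved GPNN with grammatical evolution and proposed the GENN neural network for detecting gene-gene interactions in the presence of noise. GENN uses an evolutionary search strategy and Boolean operators in its grammar; tests on simulated data showed that GENN is highly robust to problems such as genotyping errors and missing data.
1.2 Random forest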
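Random forest, proposed by Leo Breiman [19], is a high-dimensional nonparametric prediction model consisting of a collection of classification or regression trees generated from random vectors. It has 4 main components: random selection of samples, random selection of features, construction of decision trees, and classification by voting across the forest. Samples are drawn by bootstrap resampling: given a training set of N samples, N samples are drawn with replacement to form a new training set. An advantage of random forests is that they do not overfit the data: as the number of trees increases, the prediction error does not exceed a given bound [20]. Random forest provides an importance score for each SNP, which allows it to identify SNPs associated with the phenotype and, in turn, to detect interacting SNPs [21]. The method has been applied successfully in many gene-gene interaction studies [21,22,23,24,25]. Chen et al. [23] used random forest to analyze data on hereditary spherocytosis (HS); they recovered 41 genes known to be associated with HS and found 150 new HS-associated genes, as well as the core genes of the interaction network formed by these genes. Bureau et al. [24] used random forest on data for 42 SNPs from 131 asthma patients and 217 healthy individuals and found the SNP pair ST+4 and BC+1, which effectively predicts asthma.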
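The 4 components listed above can be sketched as follows. This is a deliberately simplified illustration rather than Breiman's full algorithm: each "tree" is a one-level stump, features are sampled once per tree instead of at every split, and the toy data, function names, and parameter values are assumptions.

```python
import random
from collections import Counter


def bootstrap_sample(samples, rng):
    """Draw len(samples) items with replacement (bootstrap resampling)."""
    n = len(samples)
    return [samples[rng.randrange(n)] for _ in range(n)]


def fit_stump(sample, feature_ids):
    """One-level tree: pick the SNP whose per-genotype majority rule errs least."""
    default = Counter(y for _, y in sample).most_common(1)[0][0]
    best = None
    for f in feature_ids:
        counts = {}
        for x, y in sample:
            counts.setdefault(x[f], Counter())[y] += 1
        rule = {g: c.most_common(1)[0][0] for g, c in counts.items()}
        errors = sum(rule[x[f]] != y for x, y in sample)
        if best is None or errors < best[0]:
            best = (errors, f, rule)
    return best[1], best[2], default


def predict_stump(stump, x):
    feature, rule, default = stump
    # Genotypes unseen in the bootstrap sample fall back to the majority class.
    return rule.get(x[feature], default)


def fit_forest(data, n_trees=25, n_features_per_tree=2, seed=0):
    rng = random.Random(seed)
    n_features = len(data[0][0])
    forest = []
    for _ in range(n_trees):
        sample = bootstrap_sample(data, rng)                            # random samples
        features = rng.sample(range(n_features), n_features_per_tree)  # random features
        forest.append(fit_stump(sample, features))                     # build a tree
    return forest


def predict_forest(forest, x):
    votes = Counter(predict_stump(stump, x) for stump in forest)       # majority vote
    return votes.most_common(1)[0][0]


# Toy data (assumption): 4 SNPs coded 0/1/2; phenotype is 1 when SNP 0 has genotype 2.
toy_rng = random.Random(1)
genotypes = [tuple(toy_rng.randrange(3) for _ in range(4)) for _ in range(60)]
data = [(g, int(g[0] == 2)) for g in genotypes]
forest = fit_forest(data)
print(predict_forest(forest, (2, 0, 1, 0)))
```

Counting how often each SNP is chosen by the stumps gives a crude analogue of the per-SNP importance score mentioned above; production implementations compute importance differently.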
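Random forest is an effective classification tool with the potential to discover interactions between genes that lack strong main effects, and it has shown good performance on low-dimensional data (100 SNPs and 10 000 observations). However, its ability to detect interactions actually depends on the presence of main effects, however weak they are. The method may therefore be unable to discover interactions between genes that have no main effects at all [26].
SNPInterForest is an improved version of random forest. It is better than random forest at discovering disease-associated SNPs, can identify multiple interactions simultaneously, and is more sensitive than random forest to SNPs with main effects [27]. Pan et al. [28] integrated random forest with a mutual information network (MIN) and proposed the MIN-guided random forest (MINGRF), which aims to reduce the influence of marginal effects on RF.
1.3 Support vector machines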
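The support vector machine, also known as the support vector network [29], is a supervised machine learning method for binary classification that is widely used in classification and regression analysis. An SVM typically relies on a suitably designed kernel function to transform the data; it is trained on data with known class labels to find a hyperplane in the transformed space that separates the classes on its two sides as well as possible. Learning amounts to finding a hyperplane that both minimizes the empirical loss and maximizes the geometric margin between classes, which is why the SVM is also called a maximum-margin classifier. An SVM can learn the characteristics of genes known to interact and then predict which genes interact genetically. For this purpose, the training data consist of two groups of feature vectors labeled positive (genetic interaction present) and negative (no genetic interaction). Tests on both simulated and real datasets have shown that SVMs have good power to detect gene-gene interactions [30,31,32,33,34]. As early as 2004, Listgarten et al. [32] used SVMs to identify many genetic variants associated with breast cancer risk; their results showed that, with a quadratic-kernel SVM, combinations of multiple SNP loci predicted breast cancer patients more accurately than any single SNP locus. Chen et al. [33] combined SVMs with local search and a genetic algorithm to build a platform for detecting gene-gene interactions. Tests on a large amount of simulated data showed that, although the platform requires substantial computing resources, it can effectively detect high-dimensional gene-gene interactions even when the numbers of cases and controls are severely unbalanced.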
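One way to see why a quadratic kernel can capture SNP combinations is to write out the feature space it implicitly uses. The sketch below assumes the common form K(x, z) = (x . z + 1)^2 and two SNPs coded as minor-allele counts; the exact kernel and coding used in the cited studies are not specified here, so both are illustrative assumptions.

```python
import math


def quadratic_kernel(x, z):
    """K(x, z) = (x . z + 1)^2 for two genotype vectors."""
    return (sum(a * b for a, b in zip(x, z)) + 1) ** 2


def quadratic_features(x):
    """Explicit feature map for two SNPs whose inner product equals the kernel."""
    x1, x2 = x
    r2 = math.sqrt(2)
    return (1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


x, z = (2, 1), (1, 2)  # genotypes coded as minor-allele counts (assumption)
print(quadratic_kernel(x, z))
print(dot(quadratic_features(x), quadratic_features(z)))
```

Both lines print the same value up to floating-point rounding. The last coordinate of the feature map is proportional to x1 * x2, the product term through which a pairwise SNP interaction can enter a linear decision rule in the transformed space.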
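Shen et al. [34] proposed a two-stage method for detecting gene-gene interactions. In the first stage, an L1-penalized SVM (a model selection method) identifies the SNP loci most likely to interact. In the second stage, logistic regression with Bonferroni correction is applied to the SNP loci identified in the first stage to exclude non-candidate SNPs. The results showed that the L1-penalized SVM is effective for detecting SNP interactions in case-control data, and that multivariate logistic regression analyzes SNP interactions better than conventional logistic regression. Ban et al. [35] applied an SVM to genotype data at 408 SNP loci in 87 genes from 462 Korean patients with type 2 diabetes and 456 healthy individuals, and obtained a combination of 14 interacting SNPs that identified diabetes with an accuracy above 70%.
1.4 Multifactor dimensionality reduction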
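In 2001, Ritchie et al. [36] proposed multifactor dimensionality reduction, a method for analyzing gene-gene interactions. MDR is a nonparametric method suited to case-control studies; it requires only genetic data at the variant loci (e.g., SNPs). In the first stage, x variant loci (SNP loci in GWAS) are selected from the dataset, where x is the order of the interaction to be analyzed. For SNP genotype data, these x loci have 3^x distinct genotype combinations. In the second stage, a contingency table with 3^x rows and 2 columns records the numbers of cases and controls for every genotype combination at the x loci. In the third stage, the ratio of cases to controls is computed for each genotype combination from the contingency table. If the ratio exceeds a threshold t (for example, t = total number of cases / total number of controls), the combination is labeled high risk; otherwise it is labeled low risk. This reduces the x-dimensional data to a single dimension with two levels (high risk or low risk) and yields a gene-gene interaction model that predicts disease status from the x loci. The accuracy of each model is then estimated by cross-validation, and the model with the smallest prediction error is selected as the final model. Finally, the statistical significance of the final model is assessed by a permutation test. MDR is model-free: no disease model has to be assumed in advance. It has therefore been used extensively to analyze genetic data for complex diseases of unknown pathogenesis and has produced many gene-gene interaction models associated with complex diseases [37,38,39,40,41,42,43,44,45]. For example, Tsai et al. [40] used MDR to find an interacting gene pair (RAS-ACE) in atrial fibrillation. The best MDR model consisted of 3 SNPs, 2 from the RAS gene and 1 from the ACE gene. These 3 SNPs showed good consistency in 10-fold cross-validation, and 100 permutation tests gave a P-value of 0.001.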
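The first three stages of MDR can be sketched directly from this description. The code below is a minimal illustration with hypothetical function names and toy data; it scores each candidate SNP set by its error on the full dataset and omits the cross-validation and permutation test that a real analysis would add.

```python
from collections import defaultdict
from itertools import combinations


def contingency_table(genotypes, labels, snps):
    """Map each genotype combination at the chosen SNPs to [cases, controls]."""
    table = defaultdict(lambda: [0, 0])
    for g, y in zip(genotypes, labels):
        combo = tuple(g[s] for s in snps)
        table[combo][0 if y == 1 else 1] += 1
    return table


def high_risk_cells(table, threshold):
    """Cells whose case/control ratio exceeds the threshold t."""
    # cases / controls > t is written without division so that cells
    # with zero controls need no special handling.
    return {combo for combo, (cases, controls) in table.items()
            if cases > threshold * controls}


def classification_error(genotypes, labels, snps, high):
    wrong = 0
    for g, y in zip(genotypes, labels):
        predicted = 1 if tuple(g[s] for s in snps) in high else 0
        wrong += predicted != y
    return wrong / len(labels)


def mdr(genotypes, labels, order):
    """Return (error, snps, high_risk_cells) for the best SNP set of the given order."""
    n_cases = sum(labels)
    threshold = n_cases / (len(labels) - n_cases)  # t = total cases / total controls
    best = None
    for snps in combinations(range(len(genotypes[0])), order):
        table = contingency_table(genotypes, labels, snps)
        high = high_risk_cells(table, threshold)
        error = classification_error(genotypes, labels, snps, high)
        if best is None or error < best[0]:
            best = (error, snps, high)
    return best


# Toy data (assumption): 3 SNPs restricted to genotypes 0/1 for brevity;
# disease status depends only on the combination of SNP 0 and SNP 2.
genotypes = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
labels = [a ^ c for a, b, c in genotypes]
error, snps, high = mdr(genotypes, labels, order=2)
print(error, snps, sorted(high))
```

On this toy dataset the search selects SNPs 0 and 2 with zero error and labels the cells (0, 1) and (1, 0) as high risk, while every pair that includes the irrelevant SNP 1 misclassifies half of the individuals.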
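However, when the genetic dataset has a high rate of phenotypic or genetic heterogeneity (>50%), the power of MDR to discover gene-gene interactions drops sharply. Moreover, although genotype combinations are classified as "high risk" or "low risk", their degree of risk is not quantified, so the final model is hard to interpret [42]. MDR can find interactions conveniently but cannot reveal main effects [43]. When the case-control ratio of a genotype combination is close to that of the whole dataset, MDR has high false positive and false negative rates [44]. To address this problem, Leem et al. [44] used maximum likelihood to determine the risk level of genotype combinations and proposed empirical fuzzy MDR (EF-MDR). EF-MDR detected some interesting multi-SNP interactions in the WTCCC datasets for Crohn's disease (CD) and bipolar disorder (BD).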
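Gui et al. [45] divided the genotype combinations at x loci into 3 groups: high risk, low risk, and unknown risk. If the ratio of cases to controls for a combination is equal or close to the overall ratio of cases to controls, the combination is labeled unknown risk and excluded from the model. On this basis they proposed robust MDR (RMDR). Gui et al. tested RMDR and MDR on a bladder cancer dataset; the results showed that the interaction models found by RMDR are easier to interpret and that RMDR is also faster.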
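The three-way labeling can be illustrated with a small function. The text above does not say how "close" is decided, so the relative tolerance band used here is purely an assumption for illustration, as are the function name and the handling of cells without controls; it should not be read as the criterion used in RMDR itself.

```python
def risk_label(cases, controls, total_cases, total_controls, tolerance=0.2):
    """Label one genotype combination as 'high', 'low', or 'unknown' risk.

    tolerance is a hypothetical relative band around the overall
    case/control ratio inside which a cell is treated as uninformative.
    """
    overall = total_cases / total_controls
    if controls == 0:
        # No controls in the cell: high risk if any cases were seen, else no evidence.
        return "high" if cases > 0 else "unknown"
    ratio = cases / controls
    if abs(ratio - overall) <= tolerance * overall:
        return "unknown"
    return "high" if ratio > overall else "low"


# Example with 100 cases and 100 controls overall (ratio 1.0).
for cell in [(30, 10), (10, 30), (11, 10)]:
    print(cell, risk_label(*cell, total_cases=100, total_controls=100))
```

Cells labeled "unknown" would simply be left out of the high-risk and low-risk groups when the model is evaluated, which is what makes the resulting model smaller and easier to read.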
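To enable MDR to handle continuous phenotypes, Lou et al. [46] extended MDR and proposed generalized MDR (GMDR). GMDR models the phenotype with a generalized linear model and uses maximum likelihood estimation to determine the risk class of genotype combinations at multiple loci. When the data contain covariates in addition to genotypes, GMDR improves the power to detect gene-gene interactions, and it is also applicable to datasets obtained by random sampling. Building on this, Chen et al. [47] proposed unified GMDR (UGMDR) to handle population stratification in the data.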
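Table 1 summarizes the machine learning methods currently used to detect gene-gene interactions, together with their advantages and limitations.
Table 1 Advantages and limitations of machine learning methods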
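Method | Advantages | Limitations | References |
---|---|---|---|
Neural networks (NNs) | 1. Excellent pattern recognition and classification 2. Able to handle large datasets 3. Accommodate genetic heterogeneity, polygenic inheritance, high phenocopy rates, and incomplete penetrance | Not all possible network architectures can be enumerated; changing the architecture changes the analysis results, and there is no way to know whether the architecture in use is optimal | [8] |
GPNN | 1. NN architecture optimized by GP 2. High power to detect interactions in the presence of non-functional SNPs 3. Preferred when the functional SNPs are unknown and both variable selection and model fitting are required 4. Does not overfit the data 5. High power for epistatic models with weak marginal effects 6. Flexible model: no need to choose optimal inputs, weights, connections, or hidden layers in advance | 1. High false positive rate for three-locus models 2. Requires a parallel computing environment 3. The output is a binary expression tree that can be large (up to 500 nodes) and hard to interpret | [15] |
GENN | 1. NN architecture optimized by GE 2. Can discover gene-gene interactions in noisy high-dimensional genetic epidemiology data (e.g., genotyping errors, missing data, phenocopies, genetic heterogeneity) | 1. The presence of phenocopies in the dataset greatly reduces the performance of GENN | [18] |
RF | 1. Can discover interactions between genes without strong main effects 2. Does not overfit the data, and the error converges to an upper bound 3. Can identify SNPs that predict the phenotype | 1. Power to detect interactions depends on main effects 2. Cannot detect interactions between genes without marginal effects 3. Extracting useful biological information from a random forest is relatively difficult | [19] |
SNPInterForest | 1. Identifies multiple interactions simultaneously 2. Does not underestimate SNP importance scores in the absence of marginal effects 3. In the absence of marginal effects, selecting multiple SNPs at each node improves the power to detect disease-associated SNPs 4. Can assess the interaction strength of SNP combinations 5. High recall and low false positive rate 6. Can discover interactions in the presence of genetic heterogeneity | Computationally very expensive | [27] |
SVM | 1. More interpretable output than MDR 2. Applicable to new data structures 3. No user-defined settings needed for classification | 1. Cannot handle incomplete data 2. Reduced power on data with genetic heterogeneity | [33] |
MDR | 1. Tests multiple loci simultaneously while keeping a low false positive rate 2. Model-free; suitable for genetic data with unknown mechanisms | 1. Power drops markedly under high (50%) phenotypic/genetic heterogeneity 2. Requires substantial computing resources when the number of SNPs exceeds 10 | [36] |
RMDR | 1. Interaction models are easier to interpret 2. Multilocus genotype combinations are classified as high risk, unknown risk, or low risk, which lowers the false positive rate | Requires more computing resources than MDR | [45] |
GMDR | 1. Classifies genotype combinations by maximum likelihood 2. Can account for covariates when classifying genotype combinations, which improves classification accuracy | Requires more computing resources than MDR | [46~48] |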
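2 Current applications of the models
GWAS have produced many results in detecting disease-associated SNPs, but detecting multi-gene interactions remains difficult. The reasons include the high heterogeneity of genomic data as well as phenocopies, phenotypic variability, and incomplete penetrance [49]. Machine learning methods can address some of these limitations. For example, random forest handles certain types of heterogeneity successfully, and some properties of neural networks help with genetic heterogeneity, polygenic inheritance, high phenocopy rates, and incomplete penetrance [50]. Parkinson's disease (PD) is a common neurodegenerative disease of the elderly, affecting about 2% of people over 65 and about 5% of those over 85. Its pathogenesis is still unclear, but one hypothesis is that PD results from complex gene-environment interactions affecting energy metabolism and protein synthesis. Mellick et al. [51] genotyped 70 SNPs in 31 genes related to mitochondrial complex I in 306 PD patients and 321 healthy individuals; their analysis found no single SNP significantly associated with PD, whereas the genetic programming neural network (GPNN) detected an interaction between the DLST gene and sex in the same dataset [17].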
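Cleft lip with or without cleft palate (CL/P) is one of the most common congenital facial defects in humans. Non-syndromic CL/P has been studied extensively, and many candidate genomic regions associated with CL/P have been found. Li et al. [52] analyzed SNP data from 891 Asian trios and 681 European trios (a trio consists of a father, a mother, and a child with non-syndromic CL/P). They used random forest (RF) to detect interactions between 360 SNPs in 18 genes of the WNT signaling pathway and 153 SNP loci in other candidate genomic regions, and found a significant interaction between WNT5B and MAFB (P = 0.0076 in Asian trios, P = 0.018 in European trios). Rheumatoid arthritis (RA) is a chronic systemic disease characterized mainly by inflammatory synovitis. The WTCCC provides an RA dataset containing 500K SNPs for 3499 individuals (1999 RA patients and 2000 healthy individuals). Yoshida et al. [27] first used single-locus association analysis to select 10K SNP loci from the 500K SNPs and then used SNPInterForest to detect interactions among them. After running for 98 hours on a single computer with 6 GB of memory, SNPInterForest found two new SNP interactions, (rs17665418, rs2121526) and (rs17665418, rs4799934). rs17665418 is located at 3p13, rs2121526 at 10q21.1, and rs4799934 at 18q12.2.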
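In Europe and North America, prostate cancer has the highest incidence of all male tumors, and its mortality is second only to that of lung cancer and colorectal cancer. Chen et al. [33] used an SVM to analyze a Swedish prostate cancer dataset containing genotypes at 57 SNP loci in 18 genes for 1355 cases and 765 controls, with a missing data rate below 5%. Because there were fewer controls than cases, they randomly selected 590 controls and added them to the original controls to obtain a balanced dataset. The analysis showed that the SVM method retained good power to detect gene-gene interactions even with 5% genotyping errors, 5% missing data, or both, and that it also performed well for interactions of order 4 or 5.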
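MDR, RMDR, and GMDR have also been applied successfully to real biological data, but because of their high computational complexity they are usually used when the number of SNPs is modest. The most common form of breast cancer is sporadic breast cancer; its etiology remains unclear, but clinical evidence shows that estrogen affects its incidence. Ritchie et al. [36] applied MDR to a case-control dataset for sporadic breast cancer containing genotypes at 10 SNP loci in the genes COMT, CYP1A1, CYP1B1, GSTM1, and GSTT1 for 200 white cases and controls. The analysis showed a strong interaction among 4 SNP loci located in 3 different estrogen metabolism genes, COMT, CYP1A1, and CYP1B1, that was significantly associated with the risk of sporadic breast cancer. Bladder cancer is a common malignant tumor of the urinary system with a very complex pathogenesis. Gui et al. [45] applied MDR and RMDR to a dataset of 355 bladder cancer cases and 559 controls from New Hampshire, USA, containing genotypes at 7 SNP loci in 5 genes related to DNA repair. MDR and RMDR found the same best multilocus interaction model, but RMDR labeled far fewer genotype combinations as high risk or low risk than MDR, which made the model easier to interpret; RMDR thus provided a clearer multilocus interaction model than MDR.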
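Lou et al. [46] used GMDR and MDR to analyze genotypes at 23 SNP loci in 4 genes in 191 smokers and 191 non-smokers: brain-derived neurotrophic factor (BDNF [MIM 113505]), neurotrophic tyrosine kinase receptor type 2 (NTRK2 [MIM 600456]), cholinergic receptor nicotinic alpha 4 (CHRNA4 [MIM 118504]), and cholinergic receptor nicotinic beta 2 (CHRNB2 [MIM 118507]). The analysis found that the interaction between 1 SNP in CHRNA4 (rs2229959) and 3 SNPs in NTRK2 (rs993315, rs1122530, and rs736744) was significantly associated with nicotine dependence. Both GMDR and MDR found this four-locus interaction model, but tests on simulated data showed that GMDR has better predictive ability.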
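Table 2 summarizes the applications of the above machine learning methods to real genetic data and the corresponding results.
Table 2 Applications of machine learning methods to real genetic data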
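Method | Application | Reference |
---|---|---|
GPNN | Applied to a Parkinson's disease dataset containing 70 SNPs in genes related to mitochondrial complex I; detected an interaction between the DLST gene and sex | [17] |
RF | Applied to real data on non-syndromic cleft lip with or without cleft palate (CL/P); found statistically significant gene-gene interactions such as WNT5B-MAFB | [52] |
SNPInterForest | Applied to GWAS data on rheumatoid arthritis (about 500 000 SNPs); found two new interactions | [27] |
SVM | Applied to 57 SNP loci in 18 genes in a prostate cancer study; identified high-order interactions among up to 5 SNPs | [33] |
MDR | Applied to 10 SNP loci in 5 genes related to estrogen metabolism in breast tissue; identified a four-locus interaction associated with breast cancer risk | [36] |
RMDR | Tested on 7 SNP loci in 5 genes related to DNA repair; results agreed with the MDR analysis of the same data but gave a clearer high-risk interaction model | [45] |
GMDR | Applied to 23 SNP loci in 4 genes to identify susceptibility genes for nicotine dependence; GMDR and MDR identified the same interaction | [46] |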
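3 Conclusions and perspectives
In genome-wide association studies, a variety of machine learning methods have been used to detect gene-gene interactions. These methods successfully find gene-gene interactions in simulated data, and some have also been used to analyze real genetic data and have found relevant multi-gene interactions (Table 2). Machine learning algorithms are good at identifying nonlinear, complex relationships, but they share several limitations, such as high demand for computing resources, poor scalability, and best models that are hard to interpret. The computation required to detect gene-gene interactions grows exponentially with the number of SNP loci considered and the order of the interaction. Most of the methods discussed here can detect multi-gene interactions in datasets of a few hundred SNPs but cannot scale to datasets of hundreds of thousands of SNP loci, and the power of many methods drops markedly for interactions of order higher than 2. In addition, it is hard to give a sound biological interpretation of the interactions found by neural networks, random forest, or support vector machines. One way to address these problems is a multi-stage strategy that uses different machine learning methods at different stages: in the early stages, neural networks, random forest, or support vector machines are used to find candidate sets of SNP loci likely to interact; in later stages, MDR-based methods are applied to these candidate sets to discover high-order gene-gene interactions. This would yield a framework for detecting gene-gene interactions that is scalable and produces easily interpretable results.
References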
PMID: 19812666
Genome-wide association studies have identified hundreds of genetic variants associated with complex human diseases and traits, and have provided valuable insights into their genetic architecture. Most variants identified so far confer relatively small increments in risk, and explain only a small proportion of familial clustering, leading many to question how the remaining, 'missing' heritability can be explained. Here we examine potential sources of missing heritability and propose research strategies, including and extending beyond current genome-wide association approaches, to illuminate the genetics of complex diseases and enhance its potential to enable effective disease prevention or treatment.
PMID: 28334077
For over a decade functional gene-to-gene interaction (epistasis) has been suspected to be a determinant in the "missing heritability" of complex traits. However, searching for epistasis on the genome-wide scale has been challenging due to the prohibitively large number of tests which result in a serious loss of statistical power as well as computational challenges. In this article, we propose a two-stage method applicable to existing case-control data sets, which aims to lessen both of these problems by pre-assessing whether a candidate pair of genetic loci is involved in epistasis before it is actually tested for interaction with respect to a complex phenotype. The pre-assessment is based on a two-locus genotype independence test performed in the sample of cases. Only the pairs of loci that exhibit non-equilibrium frequencies are analyzed via a logistic regression score test, thereby reducing the multiple testing burden. Since only the computationally simple independence tests are performed for all pairs of loci while the more demanding score tests are restricted to the most promising pairs, genome-wide association study (GWAS) for epistasis becomes feasible. By design our method provides strong control of the type I error. Its favourable power properties especially under the practically relevant misspecification of the interaction model are illustrated. Ready-to-use software is available. Using the method we analyzed Parkinson's disease in four cohorts and identified possible interactions within several SNP pairs in multiple cohorts.
Using high-density single nucleotide polymorphism (SNP) markers to detect, across the whole genome, the chromosomal segments or genes that influence complex diseases and traits has become one of the new breakthroughs in genetics. After genome-wide association studies (GWAS) produced a wealth of results, researchers developed great enthusiasm for studying interactions at the genome-wide scale. In recent years, research on interactions has advanced rapidly in method development, practical application, the translation of statistical interaction into biological interaction, and the integration of omics information. Many strategies and methods have been tried for genome-wide interaction analysis, and these studies have furthered the understanding of the genetic mechanisms of complex diseases and traits. Based on the similarities and differences in theory and algorithms among the data processing methods currently used in genome-wide interaction analysis, this article reviews 5 widely used classes of methods: regression-based methods, machine learning methods, Bayesian model methods, SNP screening methods, and methods based on parallel programs. It focuses on their algorithmic principles, computational efficiency, and differences, with the aim of providing a reference for researchers in related fields.
PMID: 4822295
Background Identifying gene-gene interactions is essential to understand disease susceptibility and to detect genetic architectures underlying complex diseases. Here, we aimed at developing a...
PMID: 28007839
Characterizing genetic interactions is crucial to understanding cellular and organismal response to gene-level perturbations. Such knowledge can inform the selection of candidate disease therapy targets, yet experimentally determining whether genes interact is technically nontrivial and time-consuming. High-fidelity prediction of different classes of genetic interactions in multiple organisms would substantially alleviate this experimental burden. Under the hypothesis that functionally related genes tend to share common genetic interaction partners, we evaluate a computational approach to predict genetic interactions in Homo sapiens, Drosophila melanogaster, and Saccharomyces cerevisiae. By leveraging knowledge of functional relationships between genes, we cross-validate predictions on known genetic interactions and observe high predictive power of multiple classes of genetic interactions in all three organisms. Additionally, our method suggests high-confidence candidate interaction pairs that can be directly experimentally tested. A web application is provided for users to query genes for predicted novel genetic interaction partners. Finally, by subsampling the known yeast genetic interaction network, we found that novel genetic interactions are predictable even when knowledge of currently known interactions is minimal.
SNP-SNP interactions between genes may provide better power for disease risk prediction than single SNPs. This study investigated the value of SNP interactions among nucleotide excision repair (NER) genes for predicting the risk of transplant rejection. Genotyping was performed on the Sequenom MassARRAY platform for 38 polymorphisms in 8 NER genes: XPA, XPC, DDB2, XPB (ERCC3), XPD (ERCC2), ERCC1, XPF (ERCC4), and XPG (ERCC5). Haplotype analysis showed that the XPA rs3176629-rs2808668 C-T haplotype and the ERCC5 G-C-C-T and G-C-T-C haplotypes increased the risk of transplant rejection (OR=1.81, OR=7.72, and OR=3.46, respectively), whereas the ERCC5 rs2094258-rs751402-rs2296147-rs1047768 A-C-T-T haplotype decreased it (OR=0.35). Multivariate logistic regression and multifactor dimensionality reduction (MDR) analysis both indicated an SNP-SNP interaction among the ERCC2 rs50871, ERCC5 rs1047768, and XPC rs2228001 polymorphisms with respect to transplant rejection. The interaction among XPC rs2228001, ERCC2 rs50871, and ERCC5 rs1047768 is therefore associated with the risk of transplant rejection.
PMID: 26173972
The critical barrier in interaction analysis for next-generation sequencing (NGS) data is that the traditional pairwise interaction analysis that is suitable for common variants is difficult to apply to rare variants because of their prohibitive computational time, large number of tests and low power. The great challenges for successful detection of interactions with NGS data are (1) the demands in the paradigm of changes in interaction analysis; (2) severe multiple testing; and (3) heavy computations. To meet these challenges, we shift the paradigm of interaction analysis between two SNPs to interaction analysis between two genomic regions. In other words, we take a gene as a unit of analysis and use functional data analysis techniques as dimensional reduction tools to develop a novel statistic to collectively test interaction between all possible pairs of SNPs within two genome regions. By intensive simulations, we demonstrate that the functional logistic regression for interaction analysis has the correct type 1 error rates and higher power to detect interaction than the currently used methods. The proposed method was applied to a coronary artery disease dataset from the Wellcome Trust Case Control Consortium (WTCCC) study and the Framingham Heart Study (FHS) dataset, and the early-onset myocardial infarction (EOMI) exome sequence datasets with European origin from the NHLBI's Exome Sequencing Project. We discovered that 6 of 27 pairs of significantly interacted genes in the FHS were replicated in the independent WTCCC study and 24 pairs of significantly interacted genes after applying Bonferroni correction in the EOMI study.
PMID: 4862166
The future of medicine is moving towards the phase of precision medicine, with the goal to prevent and treat diseases by taking inter-individual variability into account. A large part of the variability lies in our genetic makeup. With the fast paced improvement of high-throughput methods for genome sequencing, a tremendous amount of genetics data have already been generated. The next hurdle for precision medicine is to have sufficient computational tools for analyzing large sets of data. Genome-Wide Association Studies (GWAS) have been the primary method to assess the relationship between single nucleotide polymorphisms (SNPs) and disease traits. While GWAS is sufficient in finding individual SNPs with strong main effects, it does not capture potential interactions among multiple SNPs. In many traits, a large proportion of variation remain unexplained by using main effects alone, leaving the door open for exploring the role of genetic interactions. However, identifying genetic interactions in large-scale genomics data poses a challenge even for modern computing. For this study, we present a new algorithm, Grammatical Evolution Bayesian Network (GEBN) that utilizes Bayesian Networks to identify interactions in the data, and at the same time, uses an evolutionary algorithm to reduce the computational cost associated with network optimization. GEBN excelled in simulation studies where the data contained main effects and interaction effects. We also applied GEBN to a Type 2 diabetes (T2D) dataset obtained from the Marshfield Personalized Medicine Research Project (PMRP). We were able to identify genetic interactions for T2D cases and controls and use information from those interactions to classify T2D samples. We obtained an average testing area under the curve (AUC) of 86.8%. We also identified several interacting genes such as INADL and LPP that are known to be associated with T2D.
Developing the computational tools to explore genetic associations beyond main effects remains a critically important challenge in human genetics. Methods, such as GEBN, demonstrate the utility of considering genetic interactions, as they likely explain some of the missing heritability.
PMID: 4099183
Objective To model the potential interaction between previously identified biomarkers in children sarcomas using artificial neural network inference (ANNI). Method To concisely demonstrate the biological interactions between correlated genes in an interaction network map, only 2 types of sarcomas in the children small round blue cell tumors (SRBCTs) dataset are discussed in this paper. A backpropagation neural network was used to model the potential interaction between genes. The prediction weights and signal directions were used to model the strengths of the interaction signals and the direction of the interaction link between genes. The ANN model was validated using Monte Carlo cross-validation to minimize the risk of over-fitting and to optimize generalization ability of the model. Results Strong connection links on certain genes (TNNT1 and FNDC5 in rhabdomyosarcoma (RMS); FCGRT and OLFM1 in Ewing sarcoma (EWS)) suggested their potency as central hubs in the interconnection of genes with different functionalities. The results showed that the RMS patients in this dataset are likely to be congenital and at low risk of cardiomyopathy development. The EWS patients are likely to be complicated by EWS-FLI fusion and deficiency in various signaling pathways, including Wnt, Fas/Rho and intracellular oxygen. Conclusions The ANN network inference approach and the examination of identified genes in the published literature within the context of the disease highlights the substantial influence of certain genes in sarcomas.
PMID: 25873079
Bioinformatics has emerged as an important tool to analyze the large amount of data generated by research in different diseases. In this study, gene expression for radicular cysts (RCs) and periapical granulomas (PGs) was characterized based on a leader gene approach. A validated bioinformatics algorithm was applied to identify leader genes for RCs and PGs. Genes related to RCs and PGs were first identified in PubMed, GenBank, GeneAtlas, and GeneCards databases. The Web-available STRING software (The European Molecular Biology Laboratory [EMBL], Heidelberg, Baden-Württemberg, Germany) was used in order to build the interaction map among the identified genes by a significance score named weighted number of links. Based on the weighted number of links, genes were clustered using k-means. The genes in the highest cluster were considered leader genes. Multilayer perceptron neural network analysis was used as a complementary supplement for gene classification. For RCs, the suggested leader genes were TP53 and EP300, whereas PGs were associated with IL2RG, CCL2, CCL4, CCL5, CCR1, CCR3, and CCR5 genes. Our data revealed different gene expression for RCs and PGs, suggesting that not only the inflammatory nature but also other biological processes might differentiate RCs and PGs.
PMID: 3222218265411
The detection of genotypes that predict common, complex disease is a challenge for human geneticists. The phenomenon of epistasis, or gene-gene interactions, is particularly problematic for traditional statistical techniques. Additionally, the explosion of genetic information makes exhaustive searches of multilocus combinations computationally infeasible. To address these challenges, neural networks (NN), a pattern recognition method, have been used. One limitation of the NN approach is that its success is dependent on the architecture of the network. To solve this, machine-learning approaches have been suggested to evolve the best NN architecture for a particular data set. In this study we provide a detailed technical description of the use of grammatical evolution to optimize neural networks (GENN) for use in genetic association studies. We compare the performance of GENN to that of a previous machine-learning NN application - genetic programming neural networks in both simulated and real data. We show that GENN greatly outperforms genetic programming neural networks in data sets with a large number of single nucleotide polymorphisms. Additionally, we demonstrate that GENN has high power to detect disease-risk loci in a range of high-order epistatic models. Finally, we demonstrate the scalability of the GENN method with increasing numbers of variables - as many as 500,000 single nucleotide polymorphisms. Genet. Epidemiol. 2008. © 2008 Wiley-Liss, Inc.
PMID: 15339344
Background Screening of various gene markers such as single nucleotide polymorphism (SNP) and correlation between these markers and development of multifactorial disease have previously been studied. Here, we propose a susceptible marker-selectable artificial neural network (ANN) for predicting development of allergic disease. Results To predict development of childhood allergic asthma (CAA) and select susceptible SNPs, we used an ANN with a parameter decreasing method (PDM) to analyze 25 SNPs of 17 genes in 344 Japanese people, and select 10 susceptible SNPs of CAA. The accuracy of the ANN model with 10 SNPs was 97.7% for learning data and 74.4% for evaluation data. Important combinations were determined by effective combination value (ECV) defined in the present paper. Effective 2-SNP or 3-SNP combinations were found to be concentrated among the 10 selected SNPs. Conclusion ANN can reliably select SNP combinations that are associated with CAA. Thus, the ANN can be used to characterize development of complex diseases caused by multiple factors. This is the first report of automatic selection of SNPs related to development of multifactorial disease from SNP data of more than 300 patients.
PMID: 18237992
This paper presents the tuning of the structure and parameters of a neural network using an improved genetic algorithm (GA). It is also shown that the improved GA performs better than the standard GA based on some benchmark test functions. A neural network with switches introduced to its links is proposed. By doing this, the proposed neural network can learn both the input-output relationships of an application and the network structure using the improved GA. The number of hidden nodes is chosen manually by increasing it from a small number until the learning performance in terms of fitness value is good enough. Application examples on sunspot forecasting and associative memory are given to show the merits of the improved GA and the proposed neural network.
PMID: 12846935
BACKGROUND: Appropriate definition of neural network architecture prior to data analysis is crucial for successful data mining. This can be challenging when the underlying model of the data is unknown. The goal of this study was to determine whether optimizing neural network architecture using genetic programming as a machine learning strategy would improve the ability of neural networks to model and detect nonlinear interactions among genes in studies of common human diseases. RESULTS: Using simulated data, we show that a genetic programming optimized neural network approach is able to model gene-gene interactions as well as a traditional back propagation neural network. Furthermore, the genetic programming optimized neural network is better than the traditional back propagation neural network approach in terms of predictive ability and power to detect gene-gene interactions when non-functional polymorphisms are present. CONCLUSION: This study suggests that a machine learning strategy for optimizing neural network architecture may be preferable to traditional trial-and-error approaches for the identification and characterization of gene-gene interactions in common, complex human diseases.
PMID: 16436204
Background The identification and characterization of genes that influence the risk of common, complex multifactorial disease primarily through interactions with other genes and environmental factors remains a statistical and computational challenge in genetic epidemiology. We have previously introduced a genetic programming optimized neural network (GPNN) as a method for optimizing the architecture of a neural network to improve the identification of gene combinations associated with disease risk. The goal of this study was to evaluate the power of GPNN for identifying high-order gene-gene interactions. We were also interested in applying GPNN to a real data analysis in Parkinson's disease. Results We show that GPNN has high power to detect even relatively small genetic effects (2-3% heritability) in simulated data models involving two and three locus interactions. The limits of detection were reached under conditions with very small heritability (<1%) or when interactions involved more than three loci. We tested GPNN on a real dataset comprised of Parkinson's disease cases and controls and found a two locus interaction between the DLST gene and sex. Conclusion These results indicate that GPNN may be a useful pattern recognition approach for detecting gene-gene and gene-environment interactions.
This paper proposes a hybrid neuro-evolutive algorithm (NEA) that uses a compact indirect encoding scheme (IES) for representing its genotypes (a set of ten production rules of a Lindenmayer System with memory), moreover has the ability to reuse the genotypes and automatically build modular, hierarchical and recurrent neural networks. A genetic algorithm (GA) evolves a Lindenmayer System (L-System) that is used to design the neural network architecture. This basic neural codification confers scalability and search space reduction in relation to other methods. Furthermore, the system uses a parallel genome scan engine that increases both the implicit parallelism and convergence of the GA. The fitness function of the NEA rewards economical artificial neural networks (ANNs) that are easily implemented. The NEA was tested on five real-world classification datasets and three well-known datasets for time series forecasting (TSF). The results are statistically compared against established state-of-the-art algorithms and various forecasting methods (ADANN, ARIMA, UCM, and Forecast Pro). In most cases, our NEA outperformed the other methods, delivering the most accurate classification and time series forecasting with the least computational effort. These superior results are attributed to the improved effectiveness and efficiency of NEA in the decision-making process. The result is an optimized neural network architecture for solving classification problems and simulating dynamical systems.
PMID: 23795347
Genome wide association studies (GWAS) have identified numerous single nucleotide polymorphisms (SNPs) that are associated with a variety of common human diseases. Due to the weak marginal effect of most disease-associated SNPs, attention has recently turned to evaluating the combined effect of multiple disease-associated SNPs on the risk of disease. Several recent multigenic studies show potential evidence of applying multigenic approaches in association studies of various diseases including lung cancer. But the question remains as to the best methodology to analyze single nucleotide polymorphisms in multiple genes. In this work, we consider four methods-logistic regression, logic regression, classification tree, and random forests-to compare results for identifying important genes or gene-gene and gene-environmental interactions. To evaluate the performance of four methods, the cross-validation misclassification error and areas under the curves are provided. We performed a simulation study and applied them to the data from a large-scale, population-based, case-control study.
PMID:23384592 [Cited in this article: 2]
The use of tree-based methods offers superior performance over conventional classification and regression trees for predicting and classifying HF subtypes in a population-based sample of patients from Ontario, Canada. However, these methods do not offer substantial improvements over logistic regression for predicting the presence of HFPEF.
PMID:25708662 [Cited in this article: 1]
Abstract BACKGROUND: Single-nucleotide polymorphisms (SNPs) selection and identification are the most important tasks in Genome-wide association data analysis. The problem is difficult because genome-wide association data is very high dimensional and a large portion of SNPs in the data is irrelevant to the disease. Advanced machine learning methods have been successfully used in Genome-wide association studies (GWAS) for identification of genetic variants that have relatively big effects in some common, complex diseases. Among them, the most successful one is Random Forests (RF). Despite of performing well in terms of prediction accuracy in some data sets with moderate size, RF still suffers from working in GWAS for selecting informative SNPs and building accurate prediction models. In this paper, we propose to use a new two-stage quality-based sampling method in random forests, named ts-RF, for SNP subspace selection for GWAS. The method first applies p-value assessment to find a cut-off point that separates informative and irrelevant SNPs in two groups. The informative SNPs group is further divided into two sub-groups: highly informative and weak informative SNPs. When sampling the SNP subspace for building trees for the forest, only those SNPs from the two sub-groups are taken into account. The feature subspaces always contain highly informative SNPs when used to split a node at a tree. RESULTS: This approach enables one to generate more accurate trees with a lower prediction error, meanwhile possibly avoiding overfitting. It allows one to detect interactions of multiple SNPs with the diseases, and to reduce the dimensionality and the amount of Genome-wide association data needed for learning the RF model. 
Extensive experiments on two genome-wide SNP data sets (Parkinson case-control data comprised of 408,803 SNPs and Alzheimer case-control data comprised of 380,157 SNPs) and 10 gene data sets have demonstrated that the proposed model significantly reduced prediction errors and outperformed most existing the-state-of-the-art random forests. The top 25 SNPs in Parkinson data set were identified by the proposed model including four interesting genes associated with neurological disorders. CONCLUSION: The presented approach has shown to be effective in selecting informative sub-groups of SNPs potentially associated with diseases that traditional statistical approaches might fail. The new RF works well for the data where the number of case-control objects is much smaller than the number of SNPs, which is a typical problem in gene data and GWAS. Experiment results demonstrated the effectiveness of the proposed RF model that outperformed the state-of-the-art RFs, including Breiman's RF, GRRF and wsRF methods.
[Cited in this article: 2]
PMID:15593090 [Cited in this article: 2]
Abstract There has been a great interest and a few successes in the identification of complex disease susceptibility genes in recent years. Association studies, where a large number of single-nucleotide polymorphisms (SNPs) are typed in a sample of cases and controls to determine which genes are associated with a specific disease, provide a powerful approach for complex disease gene mapping. Genes of interest in those studies may contain large numbers of SNPs that classical statistical methods cannot handle simultaneously without requiring prohibitively large sample sizes. By contrast, high-dimensional nonparametric methods thrive on large numbers of predictors. This work explores the application of one such method, random forests, to the problem of identifying SNPs predictive of the phenotype in the case-control study design. A random forest is a collection of classification trees grown on bootstrap samples of observations, using a random subset of predictors to define the best split at each node. The observations left out of the bootstrap samples are used to estimate prediction error. The importance of a predictor is quantified by the increase in misclassification occurring when the values of the predictor are randomly permuted. We extend the concept of importance to pairs of predictors, to capture joint effects, and we explore the behavior of importance measures over a range of two-locus disease models in the presence of a varying number of SNPs unassociated with the phenotype. We illustrate the application of random forests with a data set of asthma cases and unaffected controls genotyped at 42 SNPs in ADAM33, a previously identified asthma susceptibility gene. SNPs and SNP pairs highly associated with asthma tend to have the highest importance index value, but predictive importance and association do not always coincide. Genet. Epidemiol. 28:171-182, 2005. © 2004 Wiley-Liss, Inc.
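The permutation importance described in the abstract above can be illustrated with a minimal sketch. This is not the cited authors' implementation: it does not grow a random forest or use out-of-bag samples. It assumes that some already-fitted classifier is available as a `predict` callable, and it measures the increase in misclassification when one column, or a set of columns permuted jointly, is shuffled across rows. The names `error_rate`, `permutation_importance`, and `predict` are hypothetical, and the joint shuffling of a pair is an assumption about how a pairwise importance might be computed.

```python
import random


def error_rate(predict, rows, labels):
    """Fraction of rows whose predicted label differs from the true label."""
    wrong = sum(predict(row) != y for row, y in zip(rows, labels))
    return wrong / len(labels)


def permutation_importance(predict, rows, labels, columns, seed=0):
    """Increase in misclassification after permuting the given columns.

    All columns in `columns` take their values from the same donor row,
    so a pair of SNPs is shuffled jointly (an assumption of this sketch).
    """
    rng = random.Random(seed)
    baseline = error_rate(predict, rows, labels)
    order = list(range(len(rows)))
    rng.shuffle(order)
    permuted = []
    for i, row in enumerate(rows):
        new_row = list(row)
        for c in columns:
            new_row[c] = rows[order[i]][c]
        permuted.append(new_row)
    return error_rate(predict, permuted, labels) - baseline


if __name__ == "__main__":
    # Toy data: the phenotype is the XOR of SNP 0 and SNP 1; SNP 2 is noise.
    rows = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)] * 25
    labels = [a ^ b for a, b, _ in rows]
    predict = lambda row: row[0] ^ row[1]  # stand-in for a fitted classifier
    for cols in ([0], [2], [0, 1]):
        print(cols, permutation_importance(predict, rows, labels, cols))
```

In a real random forest the same quantity would be computed per tree on its out-of-bag observations and then averaged over the trees.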
PMID:23129299 [Cited in this article: 1]
Pathway or gene set analysis has been widely applied to genomic data. Many current pathway testing methods use univariate test statistics calculated from individual genomic markers, which ignores the correlations and interactions between candidate markers. Random forests-based pathway analysis is a promising approach for incorporating complex correlation and interaction patterns, but one limitation of previous approaches is that pathways have been considered separately, thus pathway cross-talk information was not considered. In this article, we develop a new pathway hunting algorithm for survival outcomes using random survival forests, which prioritize important pathways by accounting for gene correlation and genomic interactions. We show that the proposed method performs favourably compared with five popular pathway testing methods using both synthetic and real data. We find that the proposed methodology provides an efficient and powerful pathway modelling framework for high-dimensional genomic data. The R code for the analysis used in this article is available upon request.
PMID:3463421 [Cited in this article: 1]
Background Identifying variants associated with complex human traits in high-dimensional data is a central goal of genome-wide association studies. However, complicated etiologies such as gene-gene interactions are ignored by the univariate analysis usually applied in these studies. Random Forests (RF) are a popular data-mining technique that can accommodate a large number of predictor variables and allow for complex models with interactions. RF analysis produces measures of variable importance that can be used to rank the predictor variables. Thus, single nucleotide polymorphism (SNP) analysis using RFs is gaining popularity as a potential filter approach that considers interactions in high-dimensional data. However, the impact of data dimensionality on the power of RF to identify interactions has not been thoroughly explored. We investigate the ability of rankings from variable importance measures to detect gene-gene interaction effects and their potential effectiveness as filters compared to p-values from univariate logistic regression, particularly as the data becomes increasingly high-dimensional. Results RF effectively identifies interactions in low dimensional data. As the total number of predictor variables increases, probability of detection declines more rapidly for interacting SNPs than for non-interacting SNPs, indicating that in high-dimensional data the RF variable importance measures are capturing marginal effects rather than capturing the effects of interactions. Conclusions While RF remains a promising data-mining technique that extends univariate methods to condition on multiple variables simultaneously, RF variable importance measures fail to detect interaction effects in high-dimensional data in the absence of a strong marginal component, and therefore may not be useful as a filter technique that allows for interaction effects in genome-wide data.
[Cited in this article: 2]
[Cited in this article: 1]
[Cited in this article: 1]
PMID:24098862 [Cited in this article: 1]
Numerous studies used microarray gene expression data to extract metastasis-driving gene signatures for the prediction of breast cancer relapse. However, the accuracy and generality of the previously introduced biomarkers are not acceptable for reliable usage in independent datasets. This inadequacy is attributed to ignoring gene interactions by simple feature selection methods, due to their computational burden. In this study, an integrated approach with low computational cost was proposed for identifying a more predictive gene signature, for prediction of breast cancer recurrence. First, a small set of genes was primarily selected as signature by an appropriate filter feature selection (FFS) method. Then, a binary sub-class of protein-protein interaction (PPI) network was used to expand the primary set by adding adjacent proteins of each gene signature from the PPI-network. Subsequently, the support vector machine-based recursive feature elimination (SVMRFE) method was applied to the expression level of all the genes in the expanded set. Finally, the genes with the highest score by SVMRFE were selected as the new biomarkers. Accuracy of the final selected biomarkers was evaluated to classify four datasets on breast cancer patients, including 800 cases, into two cohorts of poor and good prognosis. The results of the five-fold cross validation test, using the support vector machine as a classifier, showed more than 13% improvement in the average accuracy, after modifying the primary selected signatures. Moreover, the method used in this study showed a lower computational cost compared to the other PPI-based methods. The proposed method demonstrated more robust and accurate biomarkers using the PPI network, at a low computational cost. This approach could be used as a supplementary procedure in microarray studies after applying various gene selection methods.
[Cited in this article: 1]
In this paper, we proposed a new robust twin support vector machine (called R-TWSVM) via second order cone programming formulations for classification, which can deal with data with measurement noise efficiently. Preliminary experiments confirm the robustness of the proposed method and its superiority to the traditional robust SVM in both computation time and classification accuracy. Remarkably, since only inner products of the inputs appear in our dual problems, the kernel trick can be applied directly to nonlinear cases. At the same time, we do not need to compute extra matrix inverses, which is entirely different from existing TWSVMs. In addition, we also show that the TWSVMs are a special case of our robust model and give a new dual form of TWSVM by degenerating R-TWSVM, which successfully overcomes the existing shortcomings of TWSVM.
PMID:15102677 [Cited in this article: 2]
Abstract Hereditary predisposition and causative environmental exposures have long been recognized in human malignancies. In most instances, cancer cases occur sporadically, suggesting that environmental influences are critical in determining cancer risk. To test the influence of genetic polymorphisms on breast cancer risk, we have measured 98 single nucleotide polymorphisms (SNPs) distributed over 45 genes of potential relevance to breast cancer etiology in 174 patients and have compared these with matched normal controls. Using machine learning techniques such as support vector machines (SVMs), decision trees, and naive Bayes, we identified a subset of three SNPs as key discriminators between breast cancer and controls. The SVMs performed maximally among predictive models, achieving 69% predictive power in distinguishing between the two groups, compared with a 50% baseline predictive power obtained from the data after repeated random permutation of class labels (individuals with cancer or controls). However, the simpler naive Bayes model as well as the decision tree model performed quite similarly to the SVM. The three SNP sites most useful in this model were (a) the +4536T/C site of the aldosterone synthase gene CYP11B2 at amino acid residue 386 Val/Ala (T/C) (rs4541); (b) the +4328C/G site of the aryl hydrocarbon hydroxylase CYP1B1 at amino acid residue 293 Leu/Val (C/G) (rs5292); and (c) the +4449C/T site of the transcription factor BCL6 at amino acid 387 Asp/Asp (rs1056932). No single SNP site on its own could achieve more than 60% in predictive accuracy. We have shown that multiple SNP sites from different genes over distant parts of the genome are better at identifying breast cancer patients than any one SNP alone. As high-throughput technology for SNPs improves and as more SNPs are identified, it is likely that much higher predictive accuracy will be achieved and a useful clinical tool developed.
PMID:17968988 [Cited in this article: 3]
Although genetic factors play an important role in most human diseases, multiple genes or genes and environmental factors may influence individual risk. In order to understand the underlying biological mechanisms of complex diseases, it is important to understand the complex relationships that control the process. In this paper, we consider different perspectives, from each optimization, complexity analysis, and algorithmic design, which allows us to describe a reasonable and applicable computational framework for detecting gene-gene interactions. Accordingly, support vector machine and combinatorial optimization techniques (local search and genetic algorithm) were tailored to fit within this framework. Although the proposed approach is computationally expensive, our results indicate this is a promising tool for the identification and characterization of high order gene-gene and gene-environment interactions. We have demonstrated several advantages of this method, including the strong power for classification, less concern for overfitting, and the ability to handle unbalanced data and achieve more stable models. We would like to make the support vector machine and combinatorial optimization techniques more accessible to genetic epidemiologists, and to promote the use and extension of these powerful approaches. Genet. Epidemiol. 2008. © 2007 Wiley-Liss, Inc.
[Cited in this article: 2]
[Cited in this article: 1]
PMID:11404819 [Cited in this article: 2]
One of the greatest challenges facing human geneticists is the identification and characterization of susceptibility genes for common complex multifactorial human diseases. This challenge is partly due to the limitations of parametric-statistical methods for detection of gene effects that are dependent solely or partially on interactions with other genes and with environmental exposures. We introduce multifactor-dimensionality reduction (MDR) as a method for reducing the dimensionality of multilocus information, to improve the identification of polymorphism combinations associated with disease risk. The MDR method is nonparametric (i.e., no hypothesis about the value of a statistical parameter is made), is model-free (i.e., it assumes no particular inheritance model), and is directly applicable to case-control and discordant-sib-pair studies. Using simulated case-control data, we demonstrate that MDR has reasonable power to identify interactions among two or more loci in relatively small samples. When it was applied to a sporadic breast cancer case-control data set, in the absence of any statistically significant independent main effects, MDR identified a statistically significant high-order interaction among four polymorphisms from three different estrogen-metabolism genes. To our knowledge, this is the first report of a four-locus interaction associated with a common complex multifactorial disease.
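The core MDR idea summarized in the abstract above (pooling multilocus genotype cells into a high-risk and a low-risk group, then scoring the resulting one-dimensional classifier) can be sketched as follows. This is a simplified illustration rather than the published software: it omits the cross-validation and permutation testing used in practice, assumes a binary phenotype coded 0 (control) or 1 (case) with both classes present, and uses hypothetical function names `mdr_score` and `best_model`.

```python
from collections import defaultdict
from itertools import combinations


def mdr_score(genotypes, labels, snps):
    """Balanced accuracy of the high/low-risk rule built from one SNP subset."""
    n_case = sum(labels)
    n_ctrl = len(labels) - n_case
    threshold = n_case / n_ctrl  # overall case:control ratio
    counts = defaultdict(lambda: [0, 0])  # genotype cell -> [controls, cases]
    for row, y in zip(genotypes, labels):
        counts[tuple(row[s] for s in snps)][y] += 1
    true_pos = true_neg = 0
    for ctrl, case in counts.values():
        # Constructive induction: a cell is high-risk when its
        # case:control ratio exceeds the overall ratio.
        if case > threshold * ctrl:
            true_pos += case  # cases in high-risk cells are classified correctly
        else:
            true_neg += ctrl  # controls in low-risk cells are classified correctly
    return 0.5 * (true_pos / n_case + true_neg / n_ctrl)


def best_model(genotypes, labels, order=2):
    """Exhaustively search all SNP subsets of the given size."""
    n_snps = len(genotypes[0])
    return max(combinations(range(n_snps), order),
               key=lambda snps: mdr_score(genotypes, labels, snps))


if __name__ == "__main__":
    # Toy data with pure epistasis: phenotype = SNP 0 XOR SNP 1, SNP 2 is noise.
    genotypes = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
    labels = [a ^ b for a, b, _ in genotypes]
    print(best_model(genotypes, labels, order=2))
```

The exhaustive loop over `combinations` makes the cost of the search explicit: the number of candidate models grows combinatorially with the interaction order, which is why the surveyed extensions focus on cheaper significance testing and filtering.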
PMID:28154507 [Cited in this article: 1]
Abstract Although a large number of genetic variants have been identified to be associated with common diseases through genome-wide association studies, there still exist limitations in explaining the missing heritability. One approach to solving this missing heritability problem is to investigate gene-gene interactions, rather than a single-locus approach. For gene-gene interaction analysis, the multifactor dimensionality reduction (MDR) method has been widely applied, since the constructive induction algorithm of MDR efficiently reduces high-order dimensions into one dimension by classifying multi-level genotypes into high- and low-risk groups. The MDR method has been extended to various phenotypes and has been improved to provide a significance test for gene-gene interactions. In this paper, we propose a simple method, called accelerated failure time (AFT) UM-MDR, in which the idea of a unified model-based MDR is extended to the survival phenotype by incorporating AFT-MDR into the classification step. The proposed AFT UM-MDR method is compared with AFT-MDR through simulation studies, and a short discussion is given.
PMID:17429103 [Cited in this article: 1]
We propose using a variant of logistic regression (LR) with L2-regularization to fit gene-gene and gene-environment interaction models. Studies have shown that many common diseases are influenced by interaction of certain genes. LR models with quadratic penalization not only correctly characterize the influential genes along with their interaction structures but also yield additional benefits in handling high-dimensional, discrete factors with a binary response. We illustrate the advantages of using an L2-regularization scheme and compare its performance with that of "multifactor dimensionality reduction" and "FlexTree," 2 recent tools for identifying gene-gene interactions. Through simulated and real data sets, we demonstrate that our method outperforms other methods in the identification of the interaction structures as well as prediction accuracy. In addition, we validate the significance of the factors selected through bootstrap analyses.
PMID:23805232 [Cited in this article: 1]
We present an extension of the two-class multifactor dimensionality reduction (MDR) algorithm that enables detection and characterization of epistatic SNP-SNP interactions in the context of a quantitative trait. The proposed Quantitative MDR (QMDR) method handles continuous data by modifying MDR's constructive induction algorithm to use a T-test. QMDR replaces the balanced accuracy metric with a T-test statistic as the score to determine the best interaction model. We used a simulation to identify the empirical distribution of QMDR's testing score. We then applied QMDR to genetic data from the ongoing prospective Prevention of Renal and Vascular End-Stage Disease (PREVEND) study.
PMID:15023884 [Cited in this article: 2]
Background: The activated local atrial renin-angiotensin system (RAS) has been reported to play an important role in the pathogenesis of atrial fibrillation (AF). We hypothesized that RAS genes might be among the susceptibility genes of nonfamilial structural AF and conducted a genetic case-control study to demonstrate this. Methods and Results: A total of 250 patients with documented nonfamilial structural AF and 250 controls were selected. The controls were matched to cases on a 1-to-1 basis with regard to age, gender, presence of left ventricular dysfunction, and presence of significant valvular heart disease. The ACE gene insertion/deletion polymorphism, the T174M, M235T, G-6A, A-20C, G-152A, and G-217A polymorphisms of the angiotensinogen gene, and the A1166C polymorphism of the angiotensin II type I receptor gene were genotyped. In multilocus haplotype analysis, the angiotensinogen gene haplotype profile was significantly different between cases and controls (χ²=62.5, P=0.0002). In single-locus analysis, M235T, G-6A, and G-217A were significantly associated with AF. Frequencies of the M235, G-6, and G-217 alleles were significantly higher in cases than in controls (P=0.000, 0.005, and 0.002, respectively). The odds ratios for AF were 2.5 (95% CI 1.7 to 3.3) with M235/M235 plus M235/T235 genotype, 3.3 (95% CI 1.3 to 10.0) with G-6/G-6 genotype, and 2.0 (95% CI 1.3 to 2.5) with G-217/G-217 genotype. Furthermore, significant gene-gene interactions were detected by the multifactor-dimensionality reduction method and multilocus linkage disequilibrium tests. Conclusions: This study demonstrates the association of RAS gene polymorphisms with nonfamilial structural AF and may provide the rationale for clinical trials to investigate the use of ACE inhibitor or angiotensin II antagonist in the treatment of structural AF.
PMID:22355322 [Cited in this article: 1]
Background The importance of gene-gene and gene-environment interactions on asthma is well documented in literature, but a systematic analysis on the interaction between various genetic and environmental factors is still lacking. Methodology/Principal Findings We conducted a population-based, case-control study comprised of seventh-grade children from 14 Taiwanese communities. A total of 235 asthmatic cases and 1,310 non-asthmatic controls were selected for DNA collection and genotyping. We examined the gene-gene and gene-environment interactions between 17 single-nucleotide polymorphisms in antioxidative, inflammatory and obesity-related genes, and childhood asthma. Environmental exposures and disease status were obtained from parental questionnaires. The model-free and non-parametrical multifactor dimensionality reduction (MDR) method was used for the analysis. A three-way gene-gene interaction was elucidated between the gene coding glutathione S-transferase P (GSTP1), the gene coding interleukin-4 receptor alpha chain (IL4Ra) and the gene coding insulin induced gene 2 (INSIG2) on the risk of lifetime asthma. The testing-balanced accuracy on asthma was 57.83% with a cross-validation consistency of 10 out of 10. The interaction of preterm birth and indoor dampness had the highest training-balanced accuracy at 59.09%. Indoor dampness also interacted with many genes, including IL13, beta-2 adrenergic receptor (ADRB2), signal transducer and activator of transcription 6 (STAT6). We also used likelihood ratio tests for interaction and chi-square tests to validate our results and all tests showed statistical significance. Conclusions/Significance The results of this study suggest that GSTP1, INSIG2 and IL4Ra may influence the lifetime asthma susceptibility through gene-gene interactions in schoolchildren. Home dampness combined with each one of the genes STAT6, IL13 and ADRB2 could raise the asthma risk.
PMID:2800840 [Cited in this article: 2]
Background There is a growing awareness that interaction between multiple genes plays an important role in the risk of common, complex multi-factorial diseases. Many common diseases are affected by certain genotype combinations (associated with some genes and their interactions). The identification and characterization of these susceptibility genes and gene-gene interaction have been limited by small sample size and large number of potential interactions between genes. Several methods have been proposed to detect gene-gene interaction in a case control study. The penalized logistic regression (PLR), a variant of logistic regression with L2 regularization, is a parametric approach to detect gene-gene interaction. On the other hand, the Multifactor Dimensionality Reduction (MDR) is a nonparametric and genetic model-free approach to detect genotype combinations associated with disease risk. Methods We compared the power of MDR and PLR for detecting two-way and three-way interactions in a case-control study through extensive simulations. We generated several interaction models with different magnitudes of interaction effect. For each model, we simulated 100 datasets, each with 200 cases and 200 controls and 20 SNPs. We considered a wide variety of models such as models with just main effects, models with only interaction effects or models with both main and interaction effects. We also compared the performance of MDR and PLR to detect gene-gene interaction associated with acute rejection (AR) in kidney transplant patients. Results In this paper, we have studied the power of MDR and PLR for detecting gene-gene interaction in a case-control study through extensive simulation. We have compared their performances for different two-way and three-way interaction models. We have studied the effect of different allele frequencies on these methods. We have also implemented their performance on a real dataset.
As expected, none of these methods were consistently better for all data scenarios, but, generally MDR outperformed PLR for more complex models. The ROC analysis on the real dataset suggests that MDR outperforms PLR in detecting gene-gene interaction on the real dataset. Conclusion As one might expect, the relative success of each method is context dependent. This study demonstrates the strengths and weaknesses of the methods to detect gene-gene interaction.
PMID:27587680 [Cited in this article: 2]
Abstract Motivation: Gene-gene interaction (GGI) is one of the most popular approaches for finding and explaining the missing heritability of common complex traits in genome-wide association studies. The multifactor dimensionality reduction (MDR) method has been widely studied for detecting GGI effects. However, there are several disadvantages of the existing MDR-based approaches, such as the lack of an efficient way of evaluating the significance of multi-locus models and the high computational burden due to intensive permutation. Furthermore, the MDR method does not distinguish marginal effects from pure interaction effects. Methods: We propose a two-step unified model based MDR approach (UM-MDR), in which the significance of a multi-locus model, even a high-order model, can be easily obtained through a regression framework with a semi-parametric correction procedure for controlling Type I error rates. In comparison to the conventional permutation approach, the proposed semi-parametric correction procedure avoids heavy computation in order to achieve the significance of a multi-locus model. The proposed UM-MDR approach is flexible in the sense that it is able to incorporate different types of traits and evaluate significances of the existing MDR extensions. Results: The simulation studies and the analysis of a real example are provided to demonstrate the utility of the proposed method. UM-MDR can achieve at least the same power as MDR for most scenarios, and it outperforms MDR especially when there are some single nucleotide polymorphisms that only have marginal effects, which masks the detection of causal epistasis for the existing MDR approaches. Conclusions: UM-MDR provides a very good supplement of existing MDR method due to its efficiency in achieving significance for every multi-locus model, its power and its flexibility of handling different types of traits.
Availability and implementation: An R package mMDR and other source codes are freely available at http://statgen.snu.ac.kr/software/umMDR/ . Contact: tspark@stats.snu.ac.kr Supplementary information: Supplementary data are available at Bioinformatics online.
PMID:28361694 [Cited in this article: 3]
Abstract BACKGROUND: Detection of gene-gene interaction (GGI) is a key challenge towards solving the problem of missing heritability in genetics. The multifactor dimensionality reduction (MDR) method has been widely studied for detecting GGIs. MDR reduces the dimensionality of multi-factor by means of binary classification into high-risk (H) or low-risk (L) groups. Unfortunately, this simple binary classification does not reflect the uncertainty of H/L classification. Thus, we proposed Fuzzy MDR to overcome limitations of binary classification by introducing the degree of membership of two fuzzy sets H/L. While Fuzzy MDR demonstrated higher power than that of MDR, its performance is highly dependent on the several tuning parameters. In real applications, it is not easy to choose appropriate tuning parameter values. RESULT: In this work, we propose an empirical fuzzy MDR (EF-MDR) which does not require specifying tuning parameters values. Here, we propose an empirical approach to estimating the membership degree that can be directly estimated from the data. In EF-MDR, the membership degree is estimated by the maximum likelihood estimator of the proportion of cases(controls) in each genotype combination. We also show that the balanced accuracy measure derived from this new membership function is a linear function of the standard chi-square statistics. This relationship allows us to perform the standard significance test using p-values in the MDR framework without permutation. Through two simulation studies, the power of the proposed EF-MDR is shown to be higher than those of MDR and Fuzzy MDR. We illustrate the proposed EF-MDR by analyzing Crohn's disease (CD) and bipolar disorder (BD) in the Wellcome Trust Case Control Consortium (WTCCC) dataset. CONCLUSION: We propose an empirical Fuzzy MDR for detecting GGI using the maximum likelihood of the proportion of cases(controls) as the membership degree of the genotype combination. 
The program written in R for EF-MDR is available at http://statgen.snu.ac.kr/software/EF-MDR .
PMID:3057873 [Cited in this article: 3]
A central goal of human genetics is to identify susceptibility genes for common human diseases. An important challenge is modelling gene-gene interaction or epistasis that can result in nonadditivity of genetic effects. The multifactor dimensionality reduction (MDR) method was developed as a machine learning alternative to parametric logistic regression for detecting interactions in the absence of significant marginal effects. The goal of MDR is to reduce the dimensionality inherent in modelling combinations of polymorphisms using a computational approach called constructive induction. Here, we propose a Robust Multifactor Dimensionality Reduction (RMDR) method that performs constructive induction using a Fisher's Exact Test rather than a predetermined threshold. The advantage of this approach is that only statistically significant genotype combinations are considered in the MDR analysis. We use simulation studies to demonstrate that this approach will increase the success rate of MDR when there are only a few genotype combinations that are significantly associated with case-control status. We show that there is no loss of success rate when this is not the case. We then apply the RMDR method to the detection of gene-gene interactions in genotype data from a population-based study of bladder cancer in New Hampshire.
PMID:17503330 [Cited in this article: 2]
The determination of gene-by-gene and gene-by-environment interactions has long been one of the greatest challenges in genetics. The traditional methods are typically inadequate because of the problem referred to as the "curse of dimensionality." Recent combinatorial approaches, such as the multifactor dimensionality reduction (MDR) method, the combinatorial partitioning method, and the restricted partition method, have a straightforward correspondence to the concept of the phenotypic landscape that unifies biological, statistical genetics, and evolutionary theories. However, the existing approaches have several limitations, such as not allowing for covariates, that restrict their practical use. In this study, we report a generalized MDR (GMDR) method that permits adjustment for discrete and quantitative covariates and is applicable to both dichotomous and continuous phenotypes in various population-based study designs. Computer simulations indicated that the GMDR method has superior performance in its ability to identify epistatic loci, compared with current methods in the literature. We applied our proposed method to a genetics study of four genes that were reported to be associated with nicotine dependence and found significant joint action between CHRNB4 and NTRK2. Moreover, our example illustrates that the newly proposed GMDR approach can increase prediction ability, suggesting that its use is justified in practice. In summary, GMDR serves the purpose of identifying contributors to population variation better than do the other existing methods.
PMID: 24057800
Gene-gene and gene-environment interactions govern a substantial portion of the variation in complex traits and diseases. In convention, a set of either unrelated or family samples are used in detection of such interactions; even when both kinds of data are available, the unrelated and the family samples are analyzed separately, potentially leading to loss in statistical power. In this report, to detect gene-gene interactions we propose a generalized multifactor dimensionality reduction method that unifies analyses of nuclear families and unrelated subjects within the same statistical framework. We used principal components as genetic background controls against population stratification, and when sibling data are included, within-family controls were used to correct for potential spurious association at the tested loci. Through comprehensive simulations, we demonstrate that the proposed method can remarkably increase power by pooling unrelated and offspring's samples together as compared with individual analysis strategies and the Fisher's combining p value method while it retains a controlled type I error rate in the presence of population structure. In application to a real dataset, we detected one significant tetragenic interaction among CHRNA4, CHRNB2, BDNF, and NTRK2 associated with nicotine dependence in the Study of Addiction: Genetics and Environment sample, suggesting the biological role of these genes in nicotine dependence development.
Much of the natural variation for a complex trait can be explained by variation in DNA sequence levels. As part of sequence variation, gene-gene interaction has been ubiquitously observed in nature, where its role in shaping the development of an organism has been broadly recognized. The identification of interactions between genetic factors has been progressively pursued via statistical or machine learning approaches. A large body of currently adopted methods, either parametrically or nonparametrically, predominantly focus on pairwise single marker interaction analysis. As genes are the functional units in living organisms, analysis by focusing on a gene as a system could potentially yield more biologically meaningful results. In this work, we conceptually propose a gene-centric framework for genome-wide gene-gene interaction detection. We treat each gene as a testing unit and derive a model-based kernel machine method for two-dimensional genome-wide scanning of gene-gene interactions. In addition to the biological advantage, our method is statistically appealing because it reduces the number of hypotheses tested in a genome-wide scan. Extensive simulation studies are conducted to evaluate the performance of the method. The utility of the method is further demonstrated with applications to two real data sets. Our method provides a conceptual framework for the identification of gene-gene interactions which could shed novel light on the etiology of complex diseases.
PMID: 14767722
Genetic factors play an important role in the aetiology of Parkinson's disease (PD). We have screened nuclear genes encoding subunits of mitochondrial complex I for associations between single nucleotide polymorphisms (SNPs) and PD. Abnormal functioning of complex I is well documented in human PD. Moreover, toxicological inhibition of complex I can lead to parkinsonism in animals. Thus, commonly occurring variants in these genes could potentially influence complex I function and the risk of developing PD. A sub-set of 70 potential SNPs in 31 nuclear complex I genes were selected and association analysis was performed on 306 PD patients plus 321 unaffected control subjects. Genotyping was performed using the DASH method. There was no evidence that the examined SNPs were significant genetic risk factors for PD, although this initial screen could not exclude the possibility that other disease-influencing variations exist within these genes.