Applications of machine learning in clinical decision support in the omic era
Xuetong Zhao1,2, Yadong Yang1,2, Hongzhu Qu1,2, Xiangdong Fang1,2 (corresponding author)
Editorial board member: Fangqing Zhao
Received: 2018-05-17; Revised: 2018-07-23; Online: 2018-09-20
About the authors: Xuetong Zhao, PhD candidate. Research interest: integration and analysis of multi-omics data in complex diseases.
Cite this article: Xuetong Zhao, Yadong Yang, Hongzhu Qu, Xiangdong Fang. Applications of machine learning in clinical decision support in the omic era. Hereditas (Beijing), 2018, 40(9): 693-703. doi: 10.16288/j.yczz.18-139
With the rapid development of high-throughput sequencing technologies, large amounts of multi-omics data have emerged, including genomic, transcriptomic, epigenomic, metabolomic and proteomic data, along with landmark international projects such as the Encyclopedia of DNA Elements (ENCODE) [1] and the International Haplotype Map Project (HapMap) [2]. With the continuous accumulation of data and breakthroughs in basic research, the diagnosis and treatment of human disease have entered the era of precision medicine. Physicians can combine a patient's omics data [3], phenotypic data, clinical records, electronic health records and imaging data to manage disease precisely. Against this background, many excellent clinical decision support systems have emerged, driving the rapid development of artificial intelligence methods, represented by machine learning, in basic medical research. Machine learning generates models from data and then uses large amounts of data to improve the models' performance, serving purposes such as disease prediction and classification, medication guidance and disease diagnosis, and providing the technical foundation for clinical decision support. This review surveys the machine learning methods commonly used in clinical decision support, summarizes and categorizes their applications, and points out their shortcomings, aiming to provide a reference for choosing machine learning and other artificial intelligence methods in clinical decision support.
1 Types and algorithms of machine learning
Artificial intelligence has multiple branches, including computer vision, natural language processing and machine learning. Computer vision [4] achieves automatic visual understanding by automatically extracting and analyzing information from images. Natural language processing (NLP) [5] concerns the interaction between computers and human (natural) languages, in particular programming computers to successfully process large amounts of natural language data. Machine learning is an important branch of artificial intelligence. It generally refers to solving optimization problems with statistical methods: by learning the structure and intrinsic patterns of the input data, an appropriate learning paradigm and training procedure are chosen to build an optimal mathematical model; the model parameters are then adjusted iteratively and the optimized model's feedback is obtained by mathematical methods, so as to improve generalization and prevent overfitting. Figure 1 shows the basic workflow of data processing with machine learning methods.
Fig. 1 The basic flow of data processing using machine learning methods
Depending on whether the training data are labeled, machine learning can be divided into supervised learning, unsupervised learning, semi-supervised learning and weakly supervised learning. Supervised learning generally learns an optimal function from a labeled dataset and then applies this function to new inputs to predict outcomes; common supervised methods include regression analysis and statistical classification [6]. In unsupervised learning the training set carries no labels; clustering is a typical example [7]. Semi-supervised learning combines the two: unlabeled data are incorporated into a supervised classification algorithm to achieve semi-supervised classification, with the aim of improving classification performance. Weakly supervised learning refers to settings where the labels are incomplete, inexact or inaccurate, yet models are still built from these weaker labels. Reinforcement learning, ensemble learning and deep learning are machine learning approaches that have attracted wide attention in recent years. Reinforcement learning finds an optimal policy by continuously adjusting parameters according to the current state; commonly used algorithms include dynamic programming and Monte Carlo methods. Ensemble learning combines several weak classifiers into one strong classifier; it has low generalization error, is not prone to overfitting, and can handle imbalanced classification and regression problems, although the ensemble does not necessarily outperform a single classifier. Deep learning originated from artificial neural networks; common examples are convolutional neural networks and recurrent neural networks. Built on the concepts of artificial neural networks and artificial neurons, deep learning uses multiple processing layers to perform high-level abstraction of data [8].
In terms of algorithms, commonly used supervised learning algorithms include regression algorithms, such as linear regression, least squares and Bayesian regression, and classification algorithms, such as naive Bayes, support vector machines (SVM) and nearest neighbors. Both regression and classification fit an optimal function and then make decisions on input data, but they differ in the type of input variables, the goal of the algorithm and the evaluation methods: (1) the output of regression is quantitative and continuous, whereas the output of classification is qualitative and discrete; (2) regression seeks the best fit, whereas classification seeks a decision boundary; (3) regression is evaluated by goodness of fit and the error sum of squares, whereas classification is evaluated by accuracy and the confusion matrix. Commonly used unsupervised learning algorithms are clustering algorithms, such as hierarchical clustering, K-means and the expectation-maximization (EM) algorithm.
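As a minimal illustration of this contrast, the base-R sketch below scores the same simulated data once as a regression (error sum of squares) and once as a classification (confusion matrix); the data and variable names are hypothetical, not from any cited study.

```r
set.seed(4)
x <- rnorm(100)
y_cont  <- 2 * x + rnorm(100)   # continuous target -> regression problem
y_class <- factor(y_cont > 0)   # discretized target -> classification problem

reg <- lm(y_cont ~ x)           # least-squares linear regression
sum(resid(reg)^2)               # error sum of squares (regression criterion)

pred_class <- factor(fitted(reg) > 0)
table(predicted = pred_class, truth = y_class)  # confusion matrix (classification criterion)
```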
When machine learning and other artificial intelligence methods are applied to clinical decision support, the input is generally data from a disease group and a control group. The data can be one or several types, such as gene expression data [9,10], methylation levels at relevant sites, single nucleotide polymorphisms, insertions/deletions, structural variants, or medical imaging data. Before data mining, the input data exist in various forms and are to some extent inconsistent and incomplete, so they cannot be fed directly into the intended algorithm for meaningful mining; the data must first be preprocessed to filter out unqualified records. Next, because sample sizes are usually small and variables numerous, the data are high-dimensional and hard to analyze directly, so feature selection methods are used to reduce the number of variables, such as LASSO (least absolute shrinkage and selection operator) [11], ridge regression, bridge regression, elastic net, principal component analysis (PCA), partial least squares, isometric mapping, linear discriminant analysis and independent component analysis [12,13,14,15,16]. These methods remove irrelevant or redundant features, thereby reducing the number of features, improving model accuracy and effectively reducing running time. The selected features are then fed into a model for training to build the clinical decision support model. The data used for training form the training set; likewise there are test and validation sets, whose data are fed into the clinical decision support model to test and validate it. A common metric for evaluating model quality is the AUC (area under the curve), i.e., the area under the ROC (receiver operating characteristic) curve [17], which usually lies between 0.5 and 1.0; the closer the AUC is to 1.0, the better the model. In patient survival analysis, the concordance index (C-index) is commonly used to quantify the discrimination between the predictions of a Cox model and the observed outcomes, indicating the predictive accuracy of prognostic models for tumor patients [18].
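As a hedged sketch of such feature screening, the example below runs a LASSO-penalized logistic regression on simulated high-dimensional data. The text names LASSO but no package, so the use of glmnet is an assumption, and the data and variable names are hypothetical.

```r
library(glmnet)

set.seed(2)
n <- 80; p <- 500                            # few samples, many features
x <- matrix(rnorm(n * p), nrow = n)
colnames(x) <- paste0("feature", seq_len(p))
y <- rbinom(n, 1, plogis(x[, 1] - x[, 2]))   # only 2 features carry real signal

cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1)  # alpha = 1 -> LASSO
sel <- as.matrix(coef(cvfit, s = "lambda.min"))
rownames(sel)[sel != 0]                      # retained (non-zero) features
```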
Table 1 lists several machine learning methods currently applied to clinical decision support.
Table 1 Machine learning methods used in clinical decision support
| Method | Description | R packages | References |
| --- | --- | --- | --- |
| SVM | Support vector machines are supervised learning models used for classification and regression analysis. | e1071 | [8, 19-23] |
| Logistic regression | Commonly used for prediction and discrimination problems; the dependent variable is binary or multi-class. | glm function | [24-26] |
| Cluster analysis | Clustering groups a set of objects so that objects in the same group are more similar to one another than to objects in other groups. | mclust, cluster | [21, 27-29] |
| Bagging | An ensemble learning method that attempts to reduce the correlation between estimators in the ensemble by training on random samples of certain features rather than the entire feature set. | ipred | [25, 30] |
| Random forest | An ensemble learning method commonly used for classification and regression problems. | randomForest, party | [31, 32] |
| Deep learning | Uses a cascade of multiple layers of nonlinear processing units for feature extraction and transformation; each successive layer takes the output of the previous layer as its input. | nnet, AMORE, neuralnet, RSNNS | [33-36] |
2 Research progress of machine learning and other artificial intelligence methods in clinical decision support
2.1 Classification algorithms
The classification algorithms commonly used for clinical decision support include naive Bayes, SVM, decision trees, logistic regression and nearest neighbors, among which SVM and logistic regression are the most representative.

2.1.1 SVM
SVM is a supervised learning algorithm commonly used for binary classification, multi-class classification and regression. Its optimization objective is to find an optimal separating hyperplane that maximizes the minimum distance from the data points to the hyperplane; the training samples closest to the hyperplane are called support vectors. SVM is readily kernelized and can therefore solve nonlinear classification problems: samples are projected from the original sample space into a high-dimensional feature space via a kernel function, in which data that were not linearly separable become linearly separable. The choice of kernel function directly determines the performance of the SVM and of kernel methods in general. Commonly used kernels include the linear, polynomial, Gaussian and sigmoid kernels, of which the linear and Gaussian kernels are the most widely used. SVM generalizes well and has low structural risk, but it is expensive in both memory and time: as the amount of data grows, SVM training time increases markedly. SVM has become a standard method in machine learning in recent years and is frequently applied to precision medicine data mining problems such as clinical decision support.
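A minimal sketch of such a classifier, using the e1071 package named in Table 1 with a Gaussian (radial) kernel; the simulated expression matrix, group labels and variable names are hypothetical stand-ins for real omics data.

```r
library(e1071)

set.seed(42)
n <- 100; p <- 20
expr <- matrix(rnorm(n * p), nrow = n)                   # simulated expression matrix
colnames(expr) <- paste0("gene", seq_len(p))
group <- factor(rep(c("tumor", "healthy"), each = n / 2))
expr[group == "tumor", 1:3] <- expr[group == "tumor", 1:3] + 1.5  # signal in 3 genes

dat <- data.frame(expr, group = group)
train <- sample(seq_len(n), 70)

fit <- svm(group ~ ., data = dat[train, ], kernel = "radial")  # Gaussian (RBF) kernel
pred <- predict(fit, dat[-train, ])
table(predicted = pred, truth = dat$group[-train])       # test-set confusion matrix
```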
In a study of cancer detection and classification, Ramaswamy et al. [8] used mRNA obtained from platelets for multi-class cancer prediction across multiple tumor types and healthy donors. Using an SVM/LOOCV classification algorithm, multi-class cancer prediction achieved an average accuracy of 71%, demonstrating that platelet mRNA has substantial power to discriminate among multiple cancer classes. They then selected healthy donors (n = 55) and patients with primary or metastatic tumor burden in the lung (n = 154), brain (n = 114) or liver (n = 127), and used the SVM/LOOCV algorithm to determine whether cancer was present in the lung, brain or liver, obtaining accuracies of 96%, 91% and 96%, respectively. With SVM, platelet mRNA can thus indicate the location of a cancer fairly accurately, providing a reliable approach for diagnostic applications [21].
CANScript is a functional assay platform based on an SVM algorithm that optimizes the partial area under the ROC curve (the SVMpAUC learning algorithm) and can be used to predict the clinical efficacy of anticancer drugs. SVMpAUC optimizes a partial AUC [23], requiring at least 75% specificity, i.e., a false positive rate of at most 25%. The researchers trained a predictive model with SVMpAUC, tested the CANScript model on a new dataset, and validated it with clinical data, with excellent results: on a test set of 55 patients used to predict clinical response, it reached 100% sensitivity and 91.67% specificity. By capturing intratumoral heterogeneity [23], the CANScript platform simulates the patient's tumor microenvironment, measures tumor response under different therapies, predicts the outcome of chemotherapy regimens, and provides targeted diagnostic and treatment options. Its key feature is the ability to capture intratumoral heterogeneity, helping physicians reach sound diagnoses.
2.1.2 Logistic regression
Logistic regression is a supervised learning method commonly used for probability prediction and for binary or multi-class classification; its dependent variable takes values between 0 and 1. The basic learning algorithm of logistic regression is maximum likelihood: the loss function is derived from the likelihood and then optimized by gradient descent, coordinate descent or similar methods. Logistic regression is computationally cheap, fast, and easy to understand and implement, but it is only suitable for linear problems.
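A minimal sketch of maximum-likelihood logistic regression with base R's glm(), the function listed in Table 1; the data are simulated and all names are hypothetical. Note that the model returns probabilities as well as class labels, a point revisited below.

```r
set.seed(1)
n <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
prob <- plogis(-0.5 + 1.2 * x1 - 0.8 * x2)  # true log-odds linear in x1, x2
y <- rbinom(n, 1, prob)

fit <- glm(y ~ x1 + x2, family = binomial)  # fitted by maximum likelihood
summary(fit)$coefficients                   # estimated log-odds coefficients
p_hat <- predict(fit, type = "response")    # predicted probabilities in (0, 1)
table(predicted = p_hat > 0.5, truth = y)   # thresholding yields class labels
```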
In a pan-cancer analysis of pseudogene expression and clinically relevant tumor subtypes, researchers used several machine learning algorithms to assess the power of pseudogene expression profiles to predict the two histological subtypes of endometrioid adenocarcinoma. The pseudogene expression profiles obtained with logistic regression accurately distinguished the two histological subtypes, with an AUC of 0.892. On an independent test set, logistic regression showed the best performance, indicating that pseudogene expression profiles derived with logistic regression can effectively capture clinically relevant information and yield meaningful tumor subtypes, helping physicians and patients choose appropriate treatment options [24].
In a study predicting the association between lymphatic invasion and patient survival in epithelial ovarian cancer, logistic regression, linear discriminant analysis and support vector machine models with 200-fold cross-validation were used to predict lymphatic invasion from omics markers and to build prognostic models. Logistic regression achieved the highest C-index and predicted the molecular signature of lymphatic invasion in epithelial ovarian cancer well, supporting the subsequent development of a prognostic model for lymphatic invasion in this cancer [25].
Logistic regression is easy to implement and classifies linearly separable data well; it is suited to predicting the probability of future outcomes from historical data. It not only predicts the class but also outputs a concrete probability value.
2.2 Cluster analysis
Clustering is a common unsupervised learning method. When training samples carry no label information, the samples are partitioned into several disjoint subsets, each called a cluster, according to the principle of minimizing within-group distances while maximizing between-group distances. Representative algorithms include K-means clustering, hierarchical clustering, Gaussian mixture clustering and density-based clustering. K-means requires the number of clusters K to be specified: each data point is assigned to the nearest of K randomly initialized cluster centers, a greedy procedure iteratively updates each center to the mean of the points in its cluster, and iteration continues toward an optimum; however, the result is constrained by the choice of K and easily falls into local optima. Hierarchical clustering comes in "bottom-up" and "top-down" variants. The bottom-up variant treats each sample as its own cluster, merges the two closest clusters, and iteratively recomputes the distances between clusters until all samples belong to one cluster; a distance threshold can be set to decide whether merging continues. The top-down variant proceeds in the opposite direction. Hierarchical clustering does not require the final number of clusters to be specified, but because many distance measures are possible, the clustering result depends heavily on this choice, and the computational complexity is high. Density-based clustering forms clusters by iteratively updating cluster cores given a distance radius and a minimum number of points per cluster; it can handle irregularly shaped clusters and is notably robust to noise, but it performs poorly on sparse, high-dimensional data.

Cluster analysis has become a research hotspot in machine learning in recent years, with extremely broad applications; in the life sciences in particular it is used to provide clinical decision support [28,29]. Comparing differential platelet mRNAs from healthy volunteers with those from patients with non-small cell lung cancer, colorectal cancer, glioblastoma, pancreatic cancer, hepatobiliary carcinoma or breast cancer, researchers used unsupervised hierarchical clustering to successfully separate healthy donors from patients with specific cancer types (Fisher's test, P < 0.0001), demonstrating that unsupervised hierarchical clustering can distinguish healthy donors from patients with particular tumor types. The same approach also yielded tumor-specific genes, which can be used to train and validate subsequent tumor-specific models [21,27]. This laid the groundwork for follow-up studies and is an essential step in diagnostic decision making.
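A minimal sketch of the bottom-up (agglomerative) hierarchical clustering described above, using base R; the two simulated groups are hypothetical stand-ins for, e.g., healthy and tumor samples.

```r
set.seed(7)
x <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
           matrix(rnorm(50, mean = 3), ncol = 2))  # two latent groups

d  <- dist(x)                        # pairwise Euclidean distances
hc <- hclust(d, method = "average")  # agglomerative ("bottom-up") merging
clusters <- cutree(hc, k = 2)        # cut the dendrogram into 2 clusters
table(clusters)                      # cluster sizes
# plot(hc) draws the dendrogram; cutree(hc, h = ...) uses a height threshold
# instead of a fixed number of clusters
```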
2.3 Ensemble learning
Ensemble learning methods also perform well in clinical decision support. Common ensemble methods include Bagging, random forest, voting, XGBoost, stacking and blending, of which Bagging and random forest are used most often. These two methods are described below.

2.3.1 Bagging
Bagging (bootstrap aggregating) is a parallel ensemble learning method that trains base learners on bootstrap samples and then combines the trained base learners. For a dataset of p samples, Bagging randomly draws one sample, places it in the sampling set, and returns it to the original dataset so that it may be drawn again; after p such random draws, a sampling set of p samples is obtained. From the perspective of the bias-variance decomposition, Bagging focuses on reducing variance, so its effect is more pronounced for learners that are sensitive to perturbations of the training data.
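A minimal sketch of bagged classification trees with the ipred package named in Table 1; the built-in iris data stand in for, say, image-derived tumor features, and the parameter values are illustrative.

```r
library(ipred)

set.seed(3)
train <- sample(nrow(iris), 100)
fit <- bagging(Species ~ ., data = iris[train, ],
               nbagg = 50,   # 50 bootstrap replicates / trees
               coob = TRUE)  # estimate error on out-of-bag samples
fit$err                                  # out-of-bag misclassification error
pred <- predict(fit, iris[-train, ])
mean(pred == iris$Species[-train])       # accuracy on the held-out samples
```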
Researchers used cancer tissue images and surrounding benign tissue images from 2,186 lung adenocarcinoma and squamous cell carcinoma patients in the TCGA database [37], together with 294 tumor images from the Stanford Tissue Microarray Database, combined with information such as each tumor's grade and each patient's survival time [38], to make diagnostic predictions for tumor patients. Comparing several machine learning methods, they found that Bagging classification trees could use the resulting cancer-specific features to accurately distinguish malignant tumors from adjacent dense normal tissue. These cancer-specific features include the size and shape of tumor cells and their spatial relationships with neighboring tumor cells; the model achieved an AUC of 0.87. Bagging classification trees could also distinguish histopathological images of lung adenocarcinoma from those of lung squamous cell carcinoma, with an AUC of 0.75 [39]. Using Bagging classification trees to evaluate lung cancer tissue sections from tumor images was more accurate than pathologists, which is clinically important for distinguishing adenocarcinoma from squamous carcinoma of the lung; the method can also be extended to histopathological image analysis of other organs [30,39].
2.3.2 Random forest
Random forest is a variant of Bagging that builds a Bagging ensemble of decision trees and additionally introduces random feature selection into tree training. A random forest model integrates many decision trees [40]. Random forests classify well, scale well and are simple to use.
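A minimal sketch of random forest classification with the randomForest package from Table 1, reporting the out-of-bag (OOB) error used as a selection criterion in the study below and the variable importances that underpin marker screening; iris again stands in for omics features such as methylation levels.

```r
library(randomForest)

set.seed(5)
fit <- randomForest(Species ~ ., data = iris,
                    ntree = 500,        # number of trees in the forest
                    importance = TRUE)  # track permutation importance
fit$err.rate[500, "OOB"]   # out-of-bag error after all 500 trees
importance(fit)            # per-feature importance, a basis for marker screening
```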
For an early screening model of liver cancer, researchers used DNA methylation data and survival data from blood samples of liver cancer patients and healthy individuals, and applied machine learning methods including random forest and LASSO to obtain models for early screening, risk assessment and prognostic monitoring of liver cancer [31]. In the random forest analysis, variables were eliminated from the forest using the OOB (out-of-bag) error [41] as the minimization criterion; with the fraction of variables dropped per iteration set to 0.3, the variable set was progressively reduced, and 10 methylation sites were finally selected from 450,000 DNA methylation sites as biomarkers. Logistic regression was then used to build a diagnostic prediction model for liver cancer to assist clinical decision making. The model can also distinguish liver cancer from other liver diseases, such as fatty liver and alcoholic liver disease, which are major risk factors for liver cancer [31].
Random forests can also assist physicians in diagnosing chronic diseases and assessing their risk. Using metagenomic sequencing, researchers analyzed the microbial composition of 405 stool samples with bioinformatic methods, including clustering into metagenomic linkage groups (MLGs) and random forest, and screened a total of 47 MLG markers that distinguish coronary heart disease patients from healthy individuals; the ROC curve drawn with these markers can be used to judge their classification power [32]. The researchers also compared these markers with existing ones [42] and found that the 47 biomarkers classified better than trimethylamine, a recognized marker of atherosclerotic cardiovascular disease. These newly discovered biomarkers may well develop into diagnostic markers for coronary heart disease. The researchers also used random forests to study drugs for treating atherosclerotic cardiovascular disease [43]. In distinguishing atherosclerotic cardiovascular disease patients with and without drug treatment from healthy controls, the random forest classifier performed best; when the two drugs were analyzed together, the random forest classifier likewise reached a high level. The results indicate that the metagenomic linkage groups found by random forest to be significantly enriched in the disease group can, to some extent, serve as biomarkers for the diagnosis and prevention of atherosclerotic cardiovascular disease [32].
2.4 Deep learning
Deep learning is a machine learning approach that acquires features through algorithms such as unsupervised or (semi-)supervised feature learning and hierarchical feature extraction. A deep network comprises an input layer, an output layer and one or more hidden layers. The input layer receives the input data and passes them to the first hidden layer; the hidden layers perform mathematical operations on their inputs and propagate them internally; the last hidden layer passes its result to the output layer, which produces the final output. Each layer contains a number of neurons; connections between neurons carry weights that determine the importance of the inputs, and each neuron applies an activation function that introduces nonlinearity. Common activation functions include sigmoid, ReLU and tanh. Several deep learning frameworks now exist, such as TensorFlow, Caffe, Deeplearning4j and Theano; they are widely used in image processing, natural language processing and bioinformatics, with very good results [44].

Common deep learning models include convolutional neural networks (CNN), recurrent neural networks (RNN), auto-encoders (AE), generative adversarial networks (GAN) and restricted Boltzmann machines (RBM). CNNs compress high-dimensional data through convolution, pooling and fully connected layers and are typically used for structured or grid-like data such as images. A typical RNN is the long short-term memory network (LSTM), with forward and backward propagation and three gates controlling the cell state; it suits sequential problems such as time series and speech analysis. AEs compress data into low dimensions via an encoder and a decoder and are often used for feature extraction and dimensionality reduction. GANs contain two networks, a generator and a discriminator, trained simultaneously in an unsupervised fashion to capture, respectively, the distribution of the sample data and the probability that a sample came from the training set; they are often used for image, audio and video data. RBMs are generative stochastic neural networks with visible units describing the data and hidden units extracting features; they are usually trained by maximum likelihood and used for predictive analysis. CNNs, AEs and deep belief networks (DBN) are the most common in cancer detection, where they are typically applied to imaging data (e.g., X-rays, CT images) and molecular data (e.g., gene mutation and gene expression data).
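A minimal sketch of a neural network classifier with the nnet package listed in Table 1; nnet fits only a single hidden layer, so this is a shallow stand-in for the deep architectures discussed above, and the data and parameter values are illustrative.

```r
library(nnet)

set.seed(9)
fit <- nnet(Species ~ ., data = iris,
            size = 5,      # one hidden layer with 5 neurons
            decay = 0.01,  # weight-decay regularization against overfitting
            maxit = 200, trace = FALSE)
pred <- predict(fit, iris, type = "class")
mean(pred == iris$Species)  # training accuracy of the fitted network
```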
In February 2017, researchers at Stanford University successfully used deep learning for diagnostic prediction of melanoma. After learning from 129,450 skin images, their model diagnosed melanoma with an accuracy above 94%, beating 21 professional dermatologists [33]. Other researchers used CNNs to detect breast cancer, training a CNN model on more than 45,000 mammograms to reach a diagnostic accuracy close to that of human experts [34]. Also with a CNN model, a Google team used computer vision to detect and localize 100×100-pixel tumors in megapixel tissue microscopy images, with a sensitivity of 92.4% and an average of only 8 false positives per image, another breakthrough in breast cancer detection [35]. Researchers have also used multi-layer CNNs to identify pulmonary nodules in chest CT scans, with an accuracy of 86.84% [36]. Subsequently, a 3D-CNN was designed that lets a computer locate intestinal polyps by analyzing video captured during colonoscopy [45].
Researchers have also introduced the "DeepPatient" system [46], which uses unsupervised deep feature learning: information in electronic medical records is captured with natural language processing and a three-layer denoising auto-encoder, and a random forest classifier then produces a diagnostic classification for each disease, so that reading in a single electronic record can predict the record owner's health status one year later. The system was evaluated on 76,214 test patients covering 78 diseases from different clinical domains and time periods, with good predictive performance for diabetes, schizophrenia and several cancers. For colorectal cancer diagnosis its accuracy exceeded 88%, but for some complex diseases the prediction accuracy still needs improvement. These results indicate that applying deep learning to electronic health record data can assist clinical diagnostic prediction.
Watson for Oncology [47] used natural language processing and advanced cognitive algorithms to collect about 15 million items of medical material, such as literature, guidelines, electronic medical records and patient data. It can generate a patient's case report within minutes and visualize the data, helping physicians quickly identify key information in the record, supplying the corresponding evidence and proposing treatment options for the physician's reference. Watson has been applied to lymphoma, melanoma, pancreatic cancer, ovarian cancer, brain cancer, lung cancer [48], breast cancer [49], colorectal cancer and other diseases.
The IDx-DR system [50] used artificial intelligence to read one million eye images and learn the signs of retinopathy; given a new retinal image, it can detect diabetic retinopathy, supporting ophthalmologists' clinical decisions and enabling early intervention in patients, and it can even diagnose eye disease independently, achieving real-time detection of ophthalmic disease.
In addition, after years of research the American College of Cardiology and the American Heart Association (ACC/AHA) identified a series of factors, including cholesterol, hypertension, smoking, age and diabetes, as high-risk factors for cardiovascular disease and introduced the ACC/AHA diagnostic prediction model [51,52]. In these cardiovascular risk assessment models, the relationship between each risk factor and cardiovascular disease is linear, which simplifies the relationship between disease and risk factors. To find more suitable machine learning algorithms, researchers built cardiovascular disease prediction models from 295,267 electronic medical records using four machine learning algorithms: random forest, logistic regression, gradient boosting and neural networks. Feeding standardized validation data into the four new models yielded the risk factors most strongly associated with cardiovascular disease. The researchers then compared the models built with these four algorithms against the ACC/AHA model, each predicting which of 82,989 residents would develop cardiovascular disease within 10 years; the sensitivities of the random forest, logistic regression, gradient boosting and neural network models exceeded that of the ACC/AHA model by 2.6%, 4.4%, 4.8% and 4.8%, respectively [53]. Even with machine learning, however, roughly 30% of high-risk residents were still not identified, indicating that these methods still need substantial improvement.
3 Conclusions and perspectives
Current applications of machine learning and other artificial intelligence methods in clinical decision support fall mainly into three areas: image recognition based on medical imaging data [33,34,35], data organization based on electronic medical records [46], and data mining based on omics data [24,27,31]. With medical imaging data, large numbers of patient X-rays, retinal images, nodule images, skin lesion images and the like are fed into an AI model to train its ability to screen and diagnose diseased tissue, after which the model can provide diagnostic information on new data. With electronic medical record data, record information is captured using deep learning methods such as weighted recurrent neural networks, after which machine learning and other AI methods predict the record owner's disease status and suggest suitable treatment. With omics data, differential markers are screened and subjected to feature selection, after which an appropriate machine learning model is chosen to assist physicians in disease diagnosis and in evaluating prognostic models.

From the perspective of the available data: for labeled data, classification and regression with supervised learning can be used, for example to assist the multi-class diagnosis of multiple tumor types versus healthy donors [8]; for unlabeled data, unsupervised methods can be used, for example unsupervised clustering of differential platelet mRNAs to distinguish patients from non-patients [27]; when both labeled and unlabeled data are present, semi-supervised learning can be used, as in computer-assisted medical image analysis where only part of the image data carry labels; and when a single machine learning method models the data poorly, ensemble learning can be used.

Among the many excellent machine learning methods, the recently popular deep learning deserves special mention. Deep learning is not an independent learning method: it uses supervised and unsupervised learning to train deep neural networks, as in the diagnostic prediction of melanoma [33], disease class prediction [21], and patient prognosis and survival prediction [31,53]. The rise of deep learning does not mean the end of machine learning. Compared with traditional algorithms, deep learning is efficient, its models are highly malleable, and it can build models automatically for a given problem, giving it a certain universality. But its training cost is high and it demands large amounts of data; clinical decision support does not always involve large samples, and with small samples deep learning performs poorly and easily overfits, whereas simple machine learning may solve the problem. Deep learning models take long to train and are complex to validate, and as depth increases the non-convex objective function yields local optima. Deep learning currently handles image and speech data well, but it lacks an adequate feedback mechanism; reinforcement learning and transfer learning need to be incorporated to better solve the corresponding problems.
In biomedicine, machine learning and other artificial intelligence methods have solved many problems, but disease onset and progression are extremely complex, and many issues still require continued exploration and methodological refinement, such as improving model accuracy, strengthening generalization ability and addressing the weak interpretability of models. Achieving clinical decision support with artificial intelligence methods therefore remains a long-term endeavor.
(Responsible editorial board member: Fangqing Zhao)
References

Entries [1]-[53]: see the published article, Hereditas (Beijing), 2018, 40(9): 693-703, doi: 10.16288/j.yczz.18-139.
Abstract BACKGROUND: Current approaches to predict cardiovascular risk fail to identify many people who would benefit from preventive treatment, while others receive unnecessary intervention. Machine-learning offers opportunity to improve accuracy by exploiting complex interactions between risk factors. We assessed whether machine-learning can improve cardiovascular risk prediction. METHODS: Prospective cohort study using routine clinical data of 378,256 patients from UK family practices, free from cardiovascular disease at outset. Four machine-learning algorithms (random forest, logistic regression, gradient boosting machines, neural networks) were compared to an established algorithm (American College of Cardiology guidelines) to predict first cardiovascular event over 10-years. Predictive accuracy was assessed by area under the 'receiver operating curve' (AUC); and sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV) to predict 7.5% cardiovascular risk (threshold for initiating statins). FINDINGS: 24,970 incident cardiovascular events (6.6%) occurred. Compared to the established risk prediction algorithm (AUC 0.728, 95% CI 0.723-0.735), machine-learning algorithms improved prediction: random forest +1.7% (AUC 0.745, 95% CI 0.739-0.750), logistic regression +3.2% (AUC 0.760, 95% CI 0.755-0.766), gradient boosting +3.3% (AUC 0.761, 95% CI 0.755-0.766), neural networks +3.6% (AUC 0.764, 95% CI 0.759-0.769). The highest achieving (neural networks) algorithm predicted 4,998/7,404 cases (sensitivity 67.5%, PPV 18.4%) and 53,458/75,585 non-cases (specificity 70.7%, NPV 95.7%), correctly predicting 355 (+7.6%) more patients who developed cardiovascular disease compared to the established algorithm. CONCLUSIONS: Machine-learning significantly improves accuracy of cardiovascular risk prediction, increasing the number of patients identified who could benefit from preventive treatment, while avoiding unnecessary treatment of others.