摘要/Abstract
摘要: 目的 ·根据粪便样本宏基因组学数据建立肠道菌群标签,探索用于筛查与诊断大肠癌的非侵入性方法。方法 ·共纳入 285例样本,根据随机森林分类算法筛选出与大肠癌发生密切相关的特征细菌;利用 6种机器学习分类模型建立大肠癌的诊断模型,并进行内部和外部验证。结果 ·首先筛选出了 9种与大肠癌发生密切相关的特征细菌,利用这 9种细菌建立了 6种诊断模型。其中随机森林模型准确率最高(达 0.847 7),其在内部验证集和外部验证集中的准确率分别为 0.815 8和 0.734 4,在全集中受试者工作特征(receiver operating characteristic,ROC)曲线下面积(area under curve,AUC)为 0.894。结论 ·根据粪便样本的宏基因组学数据,利用随机森林算法建立了由 9种细菌组成的诊断大肠癌的菌群标签,能够有效对健康者与大肠癌患者进行区分。
关键词: 大肠癌, 诊断, 肠道菌群, 机器学习, 随机森林
Abstract:
Objective · To construct a gut bacterial signature from fecal metagenomic data for the non-invasive screening and diagnosis of colorectal cancer (CRC). Methods · A total of 285 samples were included. Featured bacteria closely associated with CRC were selected with the random forest algorithm; diagnostic models for CRC were then built with six machine learning classification algorithms and validated internally and externally. Results · Nine bacteria that differentiated CRC from controls were identified and used to build six diagnostic models. The random forest model performed best, with an accuracy of 0.8477 in the training set and accuracies of 0.8158 and 0.7344 in the internal and external validation sets, respectively. Its area under the receiver operating characteristic (ROC) curve (AUC) across all samples was 0.894. Conclusion · A nine-bacteria signature built with the random forest algorithm from fecal metagenomic data can effectively distinguish patients with CRC from healthy controls, suggesting potential clinical value for the diagnosis of CRC.
Key words: colorectal cancer, diagnosis, intestinal bacteria, machine learning, random forest
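For readers less familiar with the two metrics reported in the abstract, the following is a minimal sketch of how accuracy and ROC AUC can be computed for a binary classifier. It assumes labels coded as 1 (CRC) and 0 (control) and one classifier score per sample. The function names and the toy data are hypothetical illustrations and are not taken from the study; the study's own software and evaluation code are not described in the abstract.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def roc_auc(y_true, scores):
    """Area under the ROC curve, computed as the probability that a randomly
    chosen positive sample scores higher than a randomly chosen negative one
    (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = CRC, 0 = control.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]

# Assumed decision threshold of 0.5 for turning scores into labels.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print("accuracy:", accuracy(y_true, y_pred))
print("AUC:", roc_auc(y_true, scores))
```

Accuracy depends on the chosen decision threshold, whereas AUC summarizes ranking quality across all thresholds, which is why the two are usually reported together.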