Author: LIN Qianmin
Abstract: Gene chip technology produces massive volumes of gene expression data. To mine the biological information and underlying biological mechanisms in these data more fully, this paper proposes a gene clustering algorithm based on canonical correlation analysis (CCA) and hierarchical clustering, called CCA-Hc. The algorithm introduces CCA into hierarchical clustering to improve how the similarity matrix is computed. First, CCA is used to measure the correlation between genes by combining multiple features of each gene, which yields a gene similarity matrix. This similarity matrix is then used as the proximity matrix for agglomerative hierarchical clustering. CCA-Hc was evaluated on a gene expression dataset of Oryza sativa L. (rice). The results show that, compared with the traditional hierarchical clustering algorithm using Euclidean distance (EUC-Hc), CCA-Hc achieves better internal stability and biological function indices, offers better robustness and clustering accuracy, and is more conducive to discovering co-expression relationships among genes.
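The two steps described in the abstract (a CCA-based similarity matrix, then agglomerative clustering on it) can be sketched roughly as follows. This is an illustrative sketch and not the paper's implementation. The abstract does not say how the gene features are arranged, which canonical correlation is used, or which linkage is applied, so the following are assumptions: each gene is a (samples x features) array, similarity is the first (largest) canonical correlation, distance is 1 minus similarity, and linkage is average. The function names `cca_similarity_matrix` and `cca_hc` are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def _orthonormal_basis(block, tol=1e-10):
    """Orthonormal basis for the column space of a centered feature block."""
    centered = block - block.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    # Drop directions with (numerically) zero variance.
    return u[:, s > tol * max(s.max(), 1.0)]


def cca_similarity_matrix(gene_blocks):
    """Pairwise similarity = first canonical correlation (an assumption).

    gene_blocks: list of arrays, each (n_samples, n_features) for one gene.
    All blocks must share n_samples, and n_samples should clearly exceed
    the combined feature count, otherwise the correlation is trivially 1.
    """
    bases = [_orthonormal_basis(np.asarray(b, dtype=float)) for b in gene_blocks]
    n = len(bases)
    sim = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            if bases[i].shape[1] == 0 or bases[j].shape[1] == 0:
                rho = 0.0  # a constant block carries no correlation signal
            else:
                # Canonical correlations are the singular values of Qx^T Qy.
                rho = np.linalg.svd(bases[i].T @ bases[j], compute_uv=False)[0]
            sim[i, j] = sim[j, i] = float(np.clip(rho, 0.0, 1.0))
    return sim


def cca_hc(gene_blocks, n_clusters, method="average"):
    """Agglomerative clustering on the CCA-derived proximity matrix."""
    dist = 1.0 - cca_similarity_matrix(gene_blocks)  # assumed conversion
    np.fill_diagonal(dist, 0.0)
    condensed = squareform(dist, checks=False)
    tree = linkage(condensed, method=method)
    return fcluster(tree, t=n_clusters, criterion="maxclust")
```

Calling `cca_hc(gene_blocks, n_clusters=k)` returns one integer cluster label per gene. Replacing the distance with plain Euclidean distance between flattened gene profiles would give a baseline in the spirit of EUC-Hc, again only as an approximation of what the paper compares against.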