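Tingting HE1
1. National Engineering Research Center for E-Learning, Central China Normal University, Wuhan 430079
2. Computer and Information Engineering College, Henan University, Kaifeng 475001
Funds: The Science and Technology Research Plan in Henan Province (162102210168)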
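About the authors: Min LI: male, born in 1976, associate professor. His main research interests include data mining, natural language processing, and educational information technology.
Tingting HE: female, born in 1964, professor. Her main research interests include network media monitoring, natural language processing, and educational information technology.
Corresponding author: Min LI, limin_ha139@139.com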
CLC number: TP391; TP181
Publication history
Received: 2020-01-13
Revised: 2020-07-28
Published online: 2020-08-21
Issue date: 2021-04-20
An Efficient and Robust Algorithm to Generate Initial Center of Bisecting K-means for High-dimensional Big Data Based on Random Integer Triangular Matrix Mappings
Min LI1, 2, Tingting HE1
1. National Engineering Research Center for E-Learning (Central China Normal University), Wuhan 430079, China
2. Computer and Information Engineering College, Henan University, Kaifeng 475001, China
Funds: The Science and Technology Research Plan in Henan Province (162102210168)
Abstract: Bisecting K-means splits a cluster with a set of initial center pairs, obtains one two-way clustering result per pair, and selects the best of them, which mitigates the adverse effect of convergence to local optima on clustering quality. However, the existing random-sampling methods for generating these initial center pairs suffer from low efficiency, poor stability, and missing values, which makes them ill-suited to big data clustering. To address these problems, this paper first constructs two lower triangular matrices, one holding the candidate initial center pairs and one holding the serial numbers of those pairs. By establishing several mappings between the elements of the two matrices and their positions, it then derives a linear-complexity algorithm that generates initial center pairs for bisecting clustering from a set of random integers. Both theoretical analysis and experimental results show that the time efficiency of this method, and the stability of that efficiency, are significantly better than those of the commonly used random-sampling methods, so the method is particularly suitable for clustering high-dimensional big data.
Key words: Bisecting K-means; Initial center generation; Triangular matrix mapping; Random integer; High-dimensional big data clustering; Linear algorithm
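The abstract does not spell out the mappings themselves, so the following Python sketch only illustrates the general idea of turning random integers into initial center pairs through a triangular numbering. It assumes a strictly lower triangular layout in which row i holds the pairs (i, 0) … (i, i−1), numbered row by row from 0. The function names `pair_from_serial` and `random_center_pairs` and this numbering are illustrative assumptions, not the paper's definitions.

```python
import math
import random


def pair_from_serial(k):
    """Map a 0-based serial number k to an index pair (i, j) with i > j >= 0.

    Assumed numbering (not taken from the paper): row i of the strictly
    lower triangular matrix holds serials i*(i-1)//2 .. i*(i-1)//2 + i - 1.
    """
    # Largest i with i*(i-1)//2 <= k, using an exact integer square root.
    i = (1 + math.isqrt(1 + 8 * k)) // 2
    # Offset of k inside row i gives the column.
    j = k - i * (i - 1) // 2
    return i, j


def random_center_pairs(n, m, seed=None):
    """Draw m distinct index pairs from n points, one mapping per pair."""
    total = n * (n - 1) // 2  # number of unordered pairs of n points
    if not 0 <= m <= total:
        raise ValueError("m must be between 0 and n*(n-1)/2")
    rng = random.Random(seed)
    # Distinct random integers; sampling from a range avoids building
    # the full list of candidate pairs.
    serials = rng.sample(range(total), m)
    return [pair_from_serial(k) for k in serials]


if __name__ == "__main__":
    # Indices of 5 candidate center pairs among 1000 points.
    print(random_center_pairs(1000, 5, seed=0))
```

Because distinct serial numbers map to distinct pairs, no rejection loop or duplicate check is needed, and the work grows linearly with the number of pairs requested rather than with the number of possible pairs.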
Full-text PDF download:
https://jeit.ac.cn/article/exportPdf?id=b3c130cb-3cb4-4be7-8a13-dbb0d20e1856