Authors: LI Cheng yan, LI Xin yu, ZHANG Lei, WANG Guang ze
Abstract: Association rule mining is mainly used to discover knowledge hidden in data, and weighted association rule mining can more effectively find rules whose items differ in importance. Manually assigned weights, however, are subjective and arbitrary and do not make full use of the characteristics of the data itself, and serial algorithms cannot handle large data sets. To address these problems, a parallel mining algorithm for independent probability fully weighted association rules is proposed. The algorithm builds a fully weighted model based on the probability with which each item appears in the current data set, so that more of the association rules users expect can be mined. Prefix partitioning and bitmap storage are used to reduce the high time cost of weighted frequent item-set filtering and candidate weighted frequent item-set generation, respectively. Distributed parallel computing is introduced and the algorithm is implemented on the Spark framework, which allows weighted association rules to be mined efficiently in a big data environment. Numerical examples are used to verify the proposed model and algorithm. The results show that the algorithm obtains more hidden information while maintaining good time efficiency.
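As a rough illustration of the bitmap storage idea mentioned in the abstract, the following Python sketch stores each item as an integer bitmap over transactions, derives each item's occurrence probability in the data set, and counts the support of an item-set with a bitwise AND. The function names and the toy transactions are assumptions made for this example. The paper's actual weighting model, prefix partitioning scheme, and Spark implementation are not given in the abstract and are not reproduced here.

```python
from collections import defaultdict


def build_bitmaps(transactions):
    """Map each item to an int whose bit t is set when transaction t contains it."""
    bitmaps = defaultdict(int)
    for t, items in enumerate(transactions):
        for item in set(items):
            bitmaps[item] |= 1 << t
    return dict(bitmaps)


def item_probabilities(bitmaps, n):
    """Occurrence probability of each item: transactions containing it / n."""
    return {item: bin(bits).count("1") / n for item, bits in bitmaps.items()}


def itemset_support(itemset, bitmaps, n):
    """Plain (unweighted) support of an item-set via bitwise AND of item bitmaps."""
    bits = (1 << n) - 1  # start with every transaction selected
    for item in itemset:
        bits &= bitmaps.get(item, 0)
    return bin(bits).count("1") / n


# Hypothetical toy data set, not taken from the paper.
transactions = [
    ["a", "b", "c"],
    ["a", "c"],
    ["a", "d"],
    ["b", "c"],
]
n = len(transactions)
bitmaps = build_bitmaps(transactions)
probs = item_probabilities(bitmaps, n)
support_ac = itemset_support(["a", "c"], bitmaps, n)
```

Because the support of a candidate item-set reduces to one AND per item plus a bit count, candidates can be checked without rescanning the transactions, which is the kind of saving bitmap storage is generally used for.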