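YIN Jun1,
HU Zhaoling2,
ZHANG Xuezhen1,3
1. Key Laboratory of Land Surface Pattern and Simulation, Institute of Geographical Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing 100101
2. Jiangsu Normal University, Xuzhou 221116, Jiangsu
3. University of Chinese Academy of Sciences, Beijing 100049
Funding: Jointly supported by the National Key Research and Development Program of China (Grant No. 2017YFA0603301) and the Strategic Priority Research Program (Category A) of the Chinese Academy of Sciences (Grant No. XDA19040101)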
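About the author: HUA Mengmeng, female, 24 years old, master's student, mainly engaged in data mining research, E-mail: meng970127@163.com
Corresponding author: ZHANG Xuezhen, E-mail: xzzhang@igsnrr.ac.cn
CLC number: P467; P468
Received: 2020-11-02
Revised: 2021-01-16
Published: 2021-03-30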
Preliminary study on machine learning-based intelligent recognition of historical climate reconstruction papers and data mining
HUA Mengmeng1,2,
YIN Jun1,
HU Zhaoling2,
ZHANG Xuezhen1,3
1. Key Laboratory of Land Surface Pattern and Simulation, Institute of Geographical Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing 100101
2. Jiangsu Normal University, Xuzhou 221116, Jiangsu
3. University of Chinese Academy of Sciences, Beijing 100049
Corresponding author: ZHANG Xuezhen, E-mail: xzzhang@igsnrr.ac.cn
CLC number: P467; P468
Received Date: 02 November 2020
Revised Date: 16 January 2021
Publish Date: 30 March 2021
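Abstract
This study uses machine learning to develop a technique for intelligently identifying historical climate reconstruction papers within the vast climate change literature and for extracting key information from them. First, 1450 manually labeled abstracts of paleoclimate reconstruction papers were used as sample data to train nine classification models commonly used in machine learning and to test their accuracy; the Extra Trees (extremely randomized trees) model was found to achieve relatively high classification accuracy on this type of text. Second, this model was used to classify more than 700,000 abstracts of climate change papers from ResearchGate, from which 6039 abstracts of millennium-scale climate reconstruction papers were selected, and the reliability of the classification was verified using word clouds. On this basis, named-entity recognition was applied to the 6039 abstracts to mine the text along three dimensions: reconstructed climate element, proxy data type, and target region (country). The results show that temperature and precipitation are the two main reconstructed elements, and that tree rings, historical documents, and sediments (including pollen) are the three most common proxy data types, which is broadly consistent with the experience of domain experts. In addition, the reconstructed climate elements, the proxy data types, and the combinations of the two show marked geographical differences, which are closely related to regional climate characteristics.
Keywords: historical climate/
climate reconstruction/
text classification/
data mining/
machine learning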
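The classification step summarized in the abstract can be illustrated with a minimal sketch. It assumes scikit-learn and TF-IDF features, neither of which is stated in the abstract; the toy abstracts, labels, and hyperparameters below are hypothetical and only show the general shape of training an Extra Trees model on labeled abstracts and applying it to unlabeled ones.

```python
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical toy sample; the study used 1450 manually labeled abstracts.
# Label 1 = millennium-scale reconstruction, 0 = other reconstruction.
train_texts = [
    "Tree-ring temperature reconstruction for the past 1000 years in Asia.",
    "Historical documents show precipitation changes over the last millennium.",
    "A 1200-year summer temperature record derived from lake sediments.",
    "Pollen evidence of Holocene vegetation change over the last 10000 years.",
    "Glacial-interglacial climate cycles recorded in deep-sea sediment cores.",
    "Pliocene sea surface temperature estimated from marine microfossils.",
]
train_labels = [1, 1, 1, 0, 0, 0]

# Assumed feature extraction: TF-IDF over words, English stop words removed.
model = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    ExtraTreesClassifier(n_estimators=200, random_state=0),
)
model.fit(train_texts, train_labels)

# Apply the trained model to unlabeled abstracts and keep the positives.
new_texts = [
    "Documentary records of drought and flood during the past millennium.",
    "Miocene climate inferred from fossil leaf assemblages.",
]
predictions = model.predict(new_texts)
selected = [text for text, label in zip(new_texts, predictions) if label == 1]
```

With a sample this small the predictions are not meaningful. In practice the accuracy of each candidate model would be estimated on held-out labeled abstracts before the model is applied to the full corpus.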
Abstract: Integrating the many existing single-proxy reconstructions into composite reconstructions of historical climate change is an active research topic. Such integration first requires collecting the papers that report those reconstructions. Against this background, this study explores a machine learning-based technique for intelligently recognizing historical climate reconstruction papers and mining key information from them. First, we prepared a set of 1450 abstracts of published paleoclimate reconstruction papers and manually labeled each one as either a millennium-scale reconstruction or another type of reconstruction. We used this set as the sample dataset to train and test nine machine learning classification models, and found that the Extra Trees model achieved higher classification accuracy than the other models. We then applied the Extra Trees model to a set of more than 700,000 abstracts of climate change research papers from the ResearchGate website, from which 6039 abstracts of millennium-scale climate reconstruction papers were selected. The reliability of this selection was confirmed by comparing its word cloud with that of the sample dataset. Finally, by applying named-entity recognition to the 6039 abstracts, we mined three dimensions of information: reconstructed climate elements, proxy data categories, and target regions (countries). Keyword frequencies show that temperature and precipitation are the two most frequently reconstructed climate elements, and that tree rings, historical documents, and sediments (including pollen) are the three most frequently used proxy data types. These results are consistent with expert experience in the field. The results also show that the frequencies of reconstructed climate elements, proxy data categories, and their combinations exhibit distinct geographical differences, which may be related to regional climatic characteristics.
Key words: historical climate/
climate reconstruction/
text classification/
data mining/
machine learning
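The three-dimensional keyword mining described in the abstract can be sketched as follows. The abstract names named-entity recognition as the technique; this sketch substitutes plain dictionary (gazetteer) matching as a simplified stand-in, and the lexicon, the function names, and the example abstracts are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical lexicon: dimension -> canonical label -> surface variants.
LEXICON = {
    "element": {
        "temperature": ["temperature"],
        "precipitation": ["precipitation", "rainfall"],
    },
    "proxy": {
        "tree ring": ["tree ring", "tree-ring"],
        "historical documents": ["historical document", "documentary"],
        "sediment": ["sediment", "pollen"],
    },
    "region": {
        "China": ["china"],
        "Sweden": ["sweden"],
    },
}


def find_entities(text, lexicon):
    """Return, for each dimension, the set of labels mentioned in the text."""
    lowered = text.lower()
    found = {}
    for dimension, labels in lexicon.items():
        found[dimension] = {
            label
            for label, variants in labels.items()
            if any(re.search(r"\b" + re.escape(v), lowered) for v in variants)
        }
    return found


def count_entities(abstracts, lexicon):
    """Count, per dimension, how many abstracts mention each label."""
    counts = {dimension: Counter() for dimension in lexicon}
    for text in abstracts:
        for dimension, labels in find_entities(text, lexicon).items():
            counts[dimension].update(labels)
    return counts


abstracts = [
    "A tree-ring based temperature reconstruction for northern China.",
    "Historical documents reveal precipitation variability in China.",
    "Pollen records from lake sediments indicate temperature changes in Sweden.",
]
counts = count_entities(abstracts, LEXICON)
print(counts["element"].most_common())
```

Each label is counted at most once per abstract, so the totals are document frequencies. Counting label combinations per region (for example, element and proxy pairs) would follow the same pattern with tuples as Counter keys.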
Full-text PDF download:
http://www.dsjyj.com.cn/data/article/export-pdf?id=60768b47c23e6710c26d5b5e