Natural language processing and machine learning-based suspected soil contamination enterprise identification

HUANG Guoxin¹, ZHU Shouxin¹,², WANG Xiahui¹, TIAN Zi¹, JI Guohua¹, LU Ran¹, CUI Xuan¹, CHEN Xi¹
1. Chinese Academy for Environmental Planning, Beijing 100012, China
2. School of Water Resources and Environment, China University of Geosciences (Beijing), Beijing 100083, China

About the first author: HUANG Guoxin (b. 1980), male, PhD, associate research fellow. Research interests: prevention and control of soil and groundwater pollution. E-mail: huanggx@caep.org.cn

Corresponding author: WANG Xiahui, wangxh@caep.org.cn

CLC number: X322

Abstract: The identification of contaminated sites suffers from low precision, weak scientific grounding, incomplete coverage, and difficult data sharing. To address these problems, a prefecture-level city in South China was taken as the study area. On a big data platform, natural language processing and machine learning were used to build an improved naive Bayesian model by introducing weights for hot words in the abstract, and the model was applied to point of interest (POI) data to predict middle-class industry categories and identify contaminating enterprises. The results showed that the naive Bayesian algorithm outperformed the random forest and XGBoost algorithms. When the semantic vocabulary was built from enterprise name + business scope, the precision, recall, and F₁ of the naive Bayesian algorithm each rose by 0.23. With a weight of 1.27 and a smoothing parameter α of 1.10, the improved naive Bayesian model predicted industry categories with precision, recall, and F₁ of 0.63, 0.62, and 0.63, respectively, and 1774 enterprises in 26 suspected soil-contaminating industries were identified in the study area. The improved naive Bayesian model can effectively predict suspected soil-contaminating enterprises with good precision and recall, and can provide a theoretical basis and design parameters for site contamination identification and risk management.
Key words: soil contamination; natural language processing; machine learning; middle-class industries; contamination enterprise identification; improved naive Bayesian model

Figure 1. Big data platform framework
Figure 2. Industry category prediction model based on the improved naive Bayesian algorithm
Figure 3. Word clouds of eight key soil-contamination industries built from multi-source data
Figure 4. Performance of the naive Bayesian algorithm under different weights
Figure 5. Performance of the naive Bayesian algorithm under different values of the smoothing parameter α
Figure 6. Spatial distribution of the industry enterprises in the study area

Table 1. Self-referencing category table

Category ID	Category name	Description	Parent category ID
193	Fur tanning and fur product processing	-	-
1931	Fur tanning	Production in which raw hides of fur-bearing animals are treated by tanning and other chemical and physical methods to yield fur that retains the form and characteristics of the original hair	193
Note: "Fur tanning" is a small-class name; "Fur tanning and fur product processing" is a middle-class name.
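The parent-pointer layout of Table 1 can be illustrated with a small lookup. This is a minimal sketch, not the authors' implementation: the dictionary `CATEGORIES` and the function `middle_class_of` are hypothetical names, and the rule that a middle-class code has three digits is an assumption read off the two rows shown in the table.

```python
# Hypothetical in-memory copy of the two rows of Table 1.
# Each row stores the ID of its parent category (None where the table shows no parent).
CATEGORIES = {
    "193": {"name": "Fur tanning and fur product processing", "parent": None},
    "1931": {"name": "Fur tanning", "parent": "193"},
}


def middle_class_of(category_id, table=CATEGORIES):
    """Follow parent links until a 3-digit code is reached (assumed to be a middle class)."""
    current = category_id
    while current is not None:
        if len(current) == 3:
            return current
        current = table[current]["parent"]
    return None  # no middle-class ancestor found


print(middle_class_of("1931"))
```

Storing the parent ID in the same table keeps small classes and middle classes in one structure, so a label predicted at either level can be mapped to its middle class with the same lookup.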
Table 2. Performance comparison of the industry category prediction algorithms

Algorithm	Precision (P)	Recall (R)	F₁
Random forest	0.28	0.28	0.28
XGBoost	0.31	0.29	0.30
Naive Bayes	0.35	0.36	0.35
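As a quick consistency check, the F₁ column of Table 2 can be recomputed from the P and R columns. The sketch below assumes F₁ is the harmonic mean of the reported precision and recall; the source does not state how the per-class scores were averaged, so agreement here is only a sanity check. The names `f1_score` and `TABLE_2` are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F1) copied from Table 2.
TABLE_2 = {
    "Random forest": (0.28, 0.28, 0.28),
    "XGBoost": (0.31, 0.29, 0.30),
    "Naive Bayes": (0.35, 0.36, 0.35),
}

for name, (p, r, reported) in TABLE_2.items():
    computed = f1_score(p, r)
    print(f"{name}: computed F1 = {computed:.2f}, reported F1 = {reported:.2f}")
```

Under this assumption, all three recomputed values round to the reported ones.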
Table 3. Performance of the naive Bayesian algorithm under different semantic vocabulary construction methods

Semantic vocabulary source	Precision (P)	Recall (R)	F₁
Enterprise name	0.35	0.38	0.36
Enterprise name + business scope	0.58	0.61	0.59
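The improvement of 0.23 quoted in the abstract follows directly from the two rows of Table 3. A minimal sketch of that arithmetic, with illustrative variable names:

```python
# Metric values copied from Table 3.
NAME_ONLY = {"P": 0.35, "R": 0.38, "F1": 0.36}
NAME_PLUS_SCOPE = {"P": 0.58, "R": 0.61, "F1": 0.59}


def gains(before, after):
    """Absolute improvement per metric, rounded to two decimals."""
    return {metric: round(after[metric] - before[metric], 2) for metric in before}


print(gains(NAME_ONLY, NAME_PLUS_SCOPE))
```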
Table 4. Prediction results of the improved naive Bayesian model

No.	Middle-class industry	Enterprises	No.	Middle-class industry	Enterprises
1	Metal surface treatment and heat treatment	207	14	Other warehousing	51
2	Ferroalloy smelting	196	15	Ironmaking	48
3	Special chemical product manufacturing	167	16	Battery manufacturing	46
4	Pesticide manufacturing	118	17	Leather tanning	47
5	Common non-ferrous metal smelting	113	18	Environmental sanitation management	40
6	Basic chemical raw material manufacturing	102	19	Precious metal smelting	23
7	Synthetic material manufacturing	100	20	Explosives, pyrotechnic and fireworks product manufacturing	11
8	Fur tanning and fur product processing	94	21	Common non-ferrous metal ore mining and dressing	10
9	Paint, ink, pigment and similar product manufacturing	85	22	Iron ore mining and dressing	9
10	Environmental remediation	82	23	Cotton textile and printing and dyeing finishing	5
11	Pulp manufacturing	80	24	Rare and rare-earth metal ore mining and dressing	1
12	Steelmaking	73	25	Precious metal ore mining and dressing	1
13	Rare and rare-earth metal smelting	64	26	Chemical active pharmaceutical ingredient manufacturing	1
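The per-industry counts in Table 4 can be checked against the totals reported in the abstract (26 industries, 1774 enterprises). The sketch below only re-adds the published numbers; the top-five share is a derived illustration, not a figure from the source.

```python
# Enterprise counts per middle-class industry, in the row order of Table 4 (No. 1 to 26).
COUNTS = [207, 196, 167, 118, 113, 102, 100, 94, 85, 82, 80, 73, 64,
          51, 48, 46, 47, 40, 23, 11, 10, 9, 5, 1, 1, 1]

total = sum(COUNTS)
# Share of all identified enterprises that fall in the five largest industries.
top5_share = sum(sorted(COUNTS, reverse=True)[:5]) / total

print(len(COUNTS), total, round(top5_share, 2))
```

The counts sum to 1774 across 26 industries, which matches the abstract.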

References

[1]	宋昕, 林娜, 殷鹏华. 中国污染场地修复现状及产业前景分析[J]. 土壤, 2015, 47(1): 1-7.
[2]	李梦瑶. 中国污染场地环境管理存在的问题及对策[J]. 中国农学通报, 2010, 26(24): 338-342.
[3]	王夏晖. 大数据: 场地污染智能识别与风险精准管控驱动力[J]. 环境保护, 2019, 47(3): 14-16.
[4]	FAZIO M, CELESTI A, PULIAFITO A, et al. Big data storage in the cloud for smart environment monitoring[J]. Procedia Computer Science, 2015, 52: 500-506. doi: 10.1016/j.procs.2015.05.023
[5]	李赛. 大数据环境下突发事件应急决策支持系统研究[D]. 武汉: 华中师范大学, 2016.
[6]	周煜申, 康望星, 沈存, 等. 大数据在水环境综合评价预警中的应用研究[J]. 江苏科技信息, 2017, 34(35): 52-54. doi: 10.3969/j.issn.1004-7530.2017.35.018
[7]	HENGL T, DE JESUS J M, HEUVELINK G B M, et al. SoilGrids250m: Global gridded soil information based on machine learning[J]. PLoS ONE, 2017, 12(2): 1-40.
[8]	马丽萍, 曹国良, 郝国朝. 基于大数据的大气污染防治方式优化探究-以西安市为例[J]. 环境与可持续发展, 2018, 43(2): 54-56. doi: 10.3969/j.issn.1673-288X.2018.02.014
[9]	铁晓波. 大数据平台下基于人工免疫系统的MBR膜污染研究[D]. 天津: 天津工业大学, 2017.
[10]	赵苗苗, 赵师成, 张丽云, 等. 大数据在生态环境领域的应用进展与展望[J]. 应用生态学报, 2017, 28(5): 1727-1734.
[11]	WANG D S, LIU J Z, ZHU A X, et al. Automatic extraction and structuration of soil-environment relationship information from soil survey reports[J]. Journal of Integrative Agriculture, 2019, 18(2): 328-339. doi: 10.1016/S2095-3119(18)62071-4
[12]	CHEN S, LIANG Z, WEBSTER R, et al. A high-resolution map of soil pH in China made by hybrid modelling of sparse soil data and environmental covariates and its implications for pollution[J]. Science of the Total Environment, 2019, 655: 273-283. doi: 10.1016/j.scitotenv.2018.11.230
[13]	JIA X, HU B, MARCHANT B P, et al. A methodological framework for identifying potential sources of soil heavy metal pollution based on machine learning: A case study in the Yangtze Delta, China[J]. Environmental Pollution, 2019, 250: 601-609. doi: 10.1016/j.envpol.2019.04.047
[14]	NASFI R, AMAYRI M, BOUGUILA N. A novel approach for modeling positive vectors with inverted Dirichlet-based hidden Markov models[J]. Knowledge-Based Systems, 2020, 192: 1-17.
[15]	ARPAIA P, CESARO U, CHADLI M, et al. Fault detection on fluid machinery using Hidden Markov Models[J]. Measurement, 2020, 151: 1-7.
[16]	黄春梅, 王松磊. 基于词袋模型和TF-IDF的短文本分类研究[J]. 软件工程, 2020, 23(3): 1-3.
[17]	王方伟, 杨少杰, 赵冬梅, 等. 基于改进TF-IDF的多态蠕虫特征自动提取算法[J]. 华中科技大学学报(自然科学版), 2020, 48(2): 79-84.
[18]	何敏, 武德安, 吴磊. 基于MapReduce的平均多项朴素贝叶斯文本分类[J]. 计算机应用研究, 2016, 33(1): 115-117. doi: 10.3969/j.issn.1001-3695.2016.01.027
[19]	赵博文, 王灵矫, 郭华. 基于泊松分布的加权朴素贝叶斯文本分类算法[J]. 计算机工程, 2020, 46(4): 91-96.
[20]	徐光美, 刘宏哲, 张敬尊, 等. 用平滑方法改进多关系朴素贝叶斯分类[J]. 计算机工程与应用, 2017, 53(5): 69-72. doi: 10.3778/j.issn.1002-8331.1507-0161
[21]	陈凯, 黄英来, 高文韬, 等. 一种基于属性加权补集的朴素贝叶斯文本分类算法[J]. 哈尔滨理工大学学报, 2018, 23(4): 69-74.

Publication history

Received: 2020-07-11
Accepted: 2020-10-26
Available online: 2020-11-11
Publication date: 2020-11-10

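In recent years, soil contamination at sites has drawn growing attention from the public and society^[1-2]. Drawing on nearly 40 years of remediation experience abroad, China has put forward a site soil contamination prevention and control strategy of "prevention first, protection prioritized, risk management and control", and has begun to form a relatively complete risk management system for site soil comprising laws, regulations, guidelines, guidance documents, and rules. Even so, risk management of site soil contamination in China is still at an early stage, and in particular the baseline extent of soil contamination remains unclear. At present, suspected contaminated sites are identified mainly through site reconnaissance, interviews, and document analysis combined with routine supervision. These traditional approaches lack precision, scientific rigor, and coverage, and they are inefficient.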
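Research on and application of big data in ecological and environmental protection have also developed rapidly in recent years^[3-10], and the use of big data for soil contamination risk identification and risk management has drawn increasing attention from researchers^[11-13]. For unstructured survey reports, natural language processing has been used to automatically extract and generate structured soil contamination information for soil data analysis^[11]. Other researchers combined data from the second land survey with 17 environmental covariates such as elevation, landform, and land type, and used random forest, extreme gradient boosting, and other methods to produce a high-accuracy national map of soil pH and to infer the environmental capacity of soil for heavy metals^[12]. Notably, JIA et al.^[13], considering the data silos between government departments and the difficulty of data sharing, took the Yangtze River Delta as the study area and applied the multinomial naive Bayesian algorithm to unstructured point of interest (POI) text data to identify suspected soil-contaminating enterprises, which provided good decision support for environmental management tasks such as site investigation and assessment and risk management. However, that study could only identify enterprises at the level of major-category industries in the Industrial Classification for National Economic Activities (GB/T 4754-2017), built its semantic vocabulary from enterprise names alone, and did not build a non-semantic vocabulary^[13]. Three problems therefore need urgent attention in identifying suspected soil-contaminating enterprises: identifying middle-class or even small-class industries to improve prediction accuracy; enlarging the semantic vocabulary to overcome the overfitting and zero-probability problems of the naive Bayesian algorithm; and building a non-semantic vocabulary to reduce dimensionality and speed up computation.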
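In view of this, the present study took a prefecture-level city in South China as the study area and, using a big data platform together with natural language processing and machine learning, applied an improved naive Bayesian algorithm to predict the middle-class industry category of enterprises in POI data and to identify suspected soil-contaminating enterprises, with the aim of providing a theoretical basis and design parameters for site contamination identification and risk management.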
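The zero-probability problem mentioned in the introduction can be made concrete with the standard additive (Lidstone) smoothing estimate for multinomial naive Bayes. The notation is an assumption introduced here, not taken from the source: $N_{wc}$ is the count of word $w$ in training texts of class $c$, $N_c$ is the total word count of class $c$, $|V|$ is the vocabulary size, and $\alpha$ is the smoothing parameter.

```latex
\hat{P}(w \mid c) = \frac{N_{wc} + \alpha}{N_c + \alpha\,|V|},
\qquad
\hat{c} = \arg\max_{c}\Big[\log P(c) + \sum_{w \in d} n_{wd}\,\log \hat{P}(w \mid c)\Big]
```

Here $n_{wd}$ is the count of word $w$ in the text $d$ being classified. Without smoothing ($\alpha = 0$), a word never seen in class $c$ gives $\hat{P}(w \mid c) = 0$ and removes that class from consideration regardless of the other words. With $\alpha > 0$, the same word receives the small positive probability $\alpha / (N_c + \alpha\,|V|)$.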

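3. Conclusions

1) For industry category prediction, the naive Bayesian algorithm outperformed the random forest and XGBoost algorithms.
2) Compared with using enterprise name alone, building the semantic vocabulary from enterprise name + business scope substantially improved the precision, recall, and F₁ of the naive Bayesian algorithm, so this can be taken as the best construction method for the semantic vocabulary.
3) With a weight of 1.27 and a smoothing parameter α of 1.10, the improved naive Bayesian model achieved precision, recall, and F₁ of 0.63, 0.62, and 0.63, respectively, the best classification performance obtained.
4) The improved naive Bayesian model identified 1774 enterprises in 26 suspected soil-contaminating industries in the study area. Clusters were present in every district (city, county), with the highest concentration in districts A, B, and C.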
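The following is a minimal sketch of how a hot-word weight and a smoothing parameter could enter a multinomial naive Bayesian classifier. The text available here gives only the selected values (weight 1.27, α 1.10) and not the formula, so the weighting rule is an assumption: each occurrence of a hot word is counted `weight` times instead of once. The class `WeightedMultinomialNB`, the toy token lists, the hot-word set, and the label strings are all hypothetical.

```python
import math
from collections import Counter


class WeightedMultinomialNB:
    """Multinomial naive Bayes with additive smoothing and a hot-word count weight.

    Assumption (not from the source): a hot-word occurrence counts `weight`
    times, both when training and when scoring a new text.
    """

    def __init__(self, alpha=1.10, weight=1.27, hot_words=()):
        self.alpha = alpha
        self.weight = weight
        self.hot_words = set(hot_words)

    def _weighted_counts(self, tokens):
        counts = Counter()
        for tok in tokens:
            counts[tok] += self.weight if tok in self.hot_words else 1.0
        return counts

    def fit(self, docs, labels):
        self.class_docs = Counter(labels)            # documents per class
        self.word_counts = {c: Counter() for c in self.class_docs}
        self.vocab = set()
        for tokens, label in zip(docs, labels):
            wc = self._weighted_counts(tokens)
            self.word_counts[label].update(wc)       # adds the weighted counts
            self.vocab.update(wc)
        self.class_totals = {c: sum(wc.values()) for c, wc in self.word_counts.items()}
        return self

    def log_scores(self, tokens):
        n_docs = sum(self.class_docs.values())
        vocab_size = len(self.vocab)
        scores = {}
        for c in self.class_docs:
            score = math.log(self.class_docs[c] / n_docs)   # log prior
            denom = self.class_totals[c] + self.alpha * vocab_size
            for tok, cnt in self._weighted_counts(tokens).items():
                if tok not in self.vocab:
                    continue  # words never seen in training are ignored
                score += cnt * math.log((self.word_counts[c][tok] + self.alpha) / denom)
            scores[c] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)


# Toy, pre-tokenized "enterprise name + business scope" texts (made up for illustration).
train_docs = [
    ["electroplating", "metal", "surface", "treatment"],
    ["metal", "heat", "treatment", "plant"],
    ["pesticide", "chemical", "factory"],
    ["pesticide", "herbicide", "production"],
]
train_labels = ["metal_surface_treatment", "metal_surface_treatment",
                "pesticide_manufacturing", "pesticide_manufacturing"]

model = WeightedMultinomialNB(hot_words={"electroplating", "pesticide"})
model.fit(train_docs, train_labels)
print(model.predict(["electroplating", "workshop"]))
print(model.predict(["pesticide", "warehouse"]))
```

Because α is positive, a word that is absent from one class still yields a finite log score for that class, which is the zero-probability safeguard referred to in the text. In practice the weight and α would be tuned on held-out data, as Figures 4 and 5 suggest was done for the reported values.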

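Full-text sections

1.1. Basic data and preprocessing
1.2. Big data software and hardware environment
1.3. Big data technical architecture
1.4. Middle-class industry category prediction and contaminating enterprise identification based on the improved naive Bayesian algorithm
1.5. Experimental design
1.6. Data analysis methods
2.1. Word clouds of key soil-contamination industries
2.2. Selection of the industry classification prediction algorithm
2.3. Construction method for the semantic vocabulary
2.4. Optimization of the naive Bayesian model
2.5. Spatial distribution of the industry enterprises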
