
A social media-based dataset of typhoon disasters, 2017

Abstract & Keywords
Abstract: Typhoons are natural disasters that cause major loss of life and property in the Northwestern Pacific region every year. During typhoon events, social media serve as an effective tool for transmitting and acquiring disaster information in real time. Texts and photos from social media can be used as a form of crowdsourcing to extract disaster loss information, analyze human behavior and formulate responses. The dataset presented here consists of social media-based data collected from "Sina-Weibo" microblogs, "WeChat" articles, and "Baidu" news about typhoon events in 2017, covering Typhoons "Merbok", "Roke", "Khanun", "Haitang", "Mawar", "Hato", "Nesat" and "Pakhar". We mainly collected text data from these platforms and websites and then cleaned them to remove redundant and irrelevant entries. This dataset can be used for deeper mining of disaster information on typhoon events.
Keywords: typhoon; social media; disaster reduction; data mining

Dataset Profile
Chinese title | 2017年台风灾害社交媒体数据集
English title | A social media-based dataset of typhoon disasters, 2017
Data corresponding author | Xie Jibo (xiejb@radi.ac.cn)
Data authors | Yang Tengfei, Xie Jibo, Li Guoqing
Time range | 2017
Geographical scope | 15°N – 30°N, 101°E – 132°E; specific areas include: southeast China and surrounding area
Data volume | 1.70 GB (9749 texts from "Baidu" news and "WeChat" Subscription; 9601 records from "Sina-Weibo")
Data format | .html, .xls, .sql
Data service system | <http://www.sciencedb.cn/dataSet/handle/547>
Sources of funding | National Key R&D Program of China (2016YFE0122600); International Partnership Program of Chinese Academy of Sciences (131C11KYSB20160061)
Dataset composition | This dataset consists of two compressed (ZIP) files, "Data.zip" and "Classification example.zip". "Data.zip" is made up of eight subfolders: "Haitang", "Hato", "Khanun", "Mawar", "Merbok", "Nesat", "Pakhar", and "Roke". Social media data are stored in these subfolders in different formats, which include .html, .xls and .sql. "Classification example.zip" is made up of seven subfolders, which represent seven large categories of disaster losses. Each subfolder contains a few subfolders that represent the small categories under the corresponding large category. These data are saved in XLS format.
Data.zip:
● XLS file: Texts from social media are stored in XLS format in a structured form.
● SQL file: Users can execute the SQL file in their own MySQL database to import the data, which contain structured texts from social media.
● HTML file: It is used to store original web pages retrieved from "Baidu" news and "WeChat" Subscription.
Classification example.zip:
● XLS file: It is used to store data of disaster loss. Each file corresponds to a specific category of disaster loss.
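
As a rough illustration of how an unpacked copy of "Data.zip" might be inspected, the sketch below counts files per format under a directory. The function name `inventory` and the idea of a local extraction folder are assumptions made for this example, not part of the dataset. The walk is recursive, so the count does not depend on how the typhoon and platform folders are nested.

```python
from collections import Counter
from pathlib import Path

# File formats listed in the dataset profile.
DATA_EXTENSIONS = {".html", ".xls", ".sql"}


def inventory(root):
    """Count dataset files under `root`, grouped by lowercase extension."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        suffix = path.suffix.lower()
        if path.is_file() and suffix in DATA_EXTENSIONS:
            counts[suffix] += 1
    return counts
```

Calling `inventory("Data")`, where `Data` is a hypothetical extraction folder, returns a `Counter` keyed by extension that can be compared with the record counts given in the profile.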



1. Introduction
Typhoons cause major losses of life and property each year in the Northwestern Pacific region. Collecting information quickly and responding appropriately is an urgent problem for disaster relief departments. Crowdsourcing and citizen observation have become effective ways to obtain disaster information. Social media in particular, such as Twitter [1], Facebook [2] and microblogs [3], provide near real-time information during a disaster. By making full use of the dynamic information collected from social media, disaster relief departments can obtain timely information about disaster events and people's responses to them. Research has been done on mining disaster information from social media data. Evidence shows that people's behavior is greatly influenced by social media when disasters occur [4]. A study commissioned by the American Red Cross [5] found that more than half of the respondents believed that government agencies should monitor social media to acquire timely and effective disaster information. As to how social media data can be mined for valuable disaster information, Chae et al. [6] used Twitter data for hurricane disaster analysis, and the results supported policy decision-making by government departments. Some studies [7, 8] built disaster event classifiers based on microblog data, which detected disasters through citizen observation. In addition, progress has been made in the spatio-temporal analysis of disasters [9, 10], the characterization of social responses to disasters [11], and the prediction and simulation of disaster trends [12, 13], all of which have greatly improved the efficiency of disaster relief.
Collecting useful information about disaster events from social media is time-consuming and complicated because the content is unstructured. Although some social media platforms provide an API (application programming interface) for public information access, they also restrict the information that can be acquired. For example, the API cannot directly return microblogs related to a specific disaster event, nor microblogs from a specified historical period. In other words, the APIs of these platforms do not provide the corresponding retrieval functions, which increases the workload of subsequent data processing. Therefore, in our research project, we developed a toolkit to automatically harvest and process social media-based disaster information, and used it to generate a typhoon disaster dataset for 2017 based on several social media platforms. The dataset is mainly composed of text data from "Sina-Weibo" microblogs, "WeChat" Subscription and "Baidu" news. Figure 1 shows typhoon disaster data from "Sina-Weibo". The data contain textual descriptions and pictures of the disasters, as well as the time and location of upload. They give disaster relief departments data support for following the progress of a disaster in a timely manner.




Figure 1. Disaster information from "Sina-Weibo" microblogs

2. Data collection and processing
2.1 Overview
The dataset records information on the following eight typhoon events: "Merbok", "Roke", "Khanun", "Haitang", "Mawar", "Hato", "Nesat" and "Pakhar" (Table 1).
Table 1. The list of typhoons in 2017
No. | Name | Landfall time
1 | Merbok | 2017/6/12
2 | Roke | 2017/7/23
3 | Nesat | 2017/7/29
4 | Haitang | 2017/7/30
5 | Hato | 2017/8/23
6 | Pakhar | 2017/8/27
7 | Mawar | 2017/9/3
8 | Khanun | 2017/10/16

The data from "WeChat" Subscription and "Sina-Weibo" come mostly from unofficial media and public uploads, and mainly describe the progression of a disaster based on public observation. To give a more comprehensive picture of each disaster, we added data from "Baidu" news released by official media, which mainly contain disaster loss statistics, relief measures, etc. We used different methods for different data sources. Keyword search was used to retrieve data from "WeChat" Subscription and "Baidu" news. For example, when "Typhoon Hato 2017" was entered, the "Baidu" search engine returned news related to Typhoon "Hato" in 2017. The toolkit we developed conducted the search and automatically retrieved the relevant contents. We then parsed and cleaned these texts and stored them in the database in a structured form. The same method was used to obtain data from "WeChat" Subscription. For "Sina-Weibo", we used the advanced search function of the platform to obtain data related to the typhoon events. According to the track of each typhoon event (Figure 2), we selected the name of the typhoon plus the characters "台风 (Typhoon)" as the keywords for setting retrieval conditions.
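
The keyword patterns described above can be sketched as follows. This is only an illustration: the function names are hypothetical, the query form follows the "Typhoon Hato 2017" example in the text, and the language in which the queries were actually submitted is not stated in the source. The Chinese name passed to `weibo_keyword` is supplied by the caller.

```python
# The eight typhoon events covered by the dataset (Table 1).
TYPHOONS_2017 = ["Merbok", "Roke", "Nesat", "Haitang",
                 "Hato", "Pakhar", "Mawar", "Khanun"]


def news_query(name, year=2017):
    """Build a news query in the form of the example "Typhoon Hato 2017"."""
    return f"Typhoon {name} {year}"


def weibo_keyword(chinese_name):
    """Typhoon name followed by the characters for "typhoon"."""
    return chinese_name + "台风"


# One query per typhoon event.
queries = [news_query(name) for name in TYPHOONS_2017]
```

For instance, `weibo_keyword("天鸽")` (the Chinese name of Typhoon "Hato", used here only as an illustrative input) yields the keyword `"天鸽台风"`.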




Figure 2. Tracks of the typhoon events in 2017. Source: "Tianditu" (http://map.tianditu.com/).


2.2 Data collection process
We developed a social media data harvesting system with functions for data collection, parsing, cleaning, and management, as shown in Figure 3. We acquired data from the different platforms with the collection module and then parsed them into a structured form. The pages from "WeChat" Subscription and "Baidu" news were also stored in their original HTML format. Cleaning the data involved removing duplicated information, converting traditional Chinese into simplified Chinese, converting full-width characters into half-width characters, etc. Finally, the data were stored in a structured form. The structure of the data is shown in Table 2.




Figure 3. Flowchart of the social media data harvesting system
Table 2. Structure of the data
File (.zip) | Folder | Folder | File (.xls, .sql, .html)
Data.zip | baidu | Haitang, Hato, Khanun, Mawar, Merbok, Nesat, Pakhar, Roke | .xls, .sql, .html
Data.zip | wechat | Haitang, Hato, Khanun, Mawar, Merbok, Nesat, Pakhar, Roke | .xls, .sql, .html
Data.zip | weibo | Haitang, Hato, Khanun, Mawar, Merbok, Nesat, Pakhar, Roke | .xls, .sql
Notes:
.html: Users can parse the pages themselves according to their research needs.
.sql: Users can execute the SQL file in their own MySQL database to import the data into it.
.xls: Users can use the data directly through the XLS file.
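
Two of the cleaning steps described in Section 2.2, width conversion and duplicate removal, can be sketched as below. This is a minimal illustration rather than the authors' toolkit: the function names are hypothetical, and the conversion from traditional to simplified Chinese is left out because it needs a character mapping (for example from a library such as OpenCC).

```python
def to_halfwidth(text):
    """Map full-width ASCII variants (U+FF01..U+FF5E) and the ideographic
    space (U+3000) to their half-width equivalents."""
    chars = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:
            code = 0x20           # ideographic space -> ASCII space
        elif 0xFF01 <= code <= 0xFF5E:
            code -= 0xFEE0        # fixed offset between the two blocks
        chars.append(chr(code))
    return "".join(chars)


def clean(texts):
    """Normalize width, trim whitespace, and drop empty or duplicate texts,
    keeping the first occurrence of each."""
    seen = set()
    cleaned = []
    for text in texts:
        normalized = to_halfwidth(text).strip()
        if normalized and normalized not in seen:
            seen.add(normalized)
            cleaned.append(normalized)
    return cleaned
```

Normalizing before comparing means that two posts differing only in character width are treated as the same text.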


2.3 Data classification
Social media data contain a large amount of disaster loss information, and a single entry may describe several types of damage. For example, one text from "Sina-Weibo" reads, "After the typhoon, many trees were blown down and many cars were smashed." This text contains disaster loss information about the destruction of both trees and cars, so we assigned this information to different categories of disaster losses. Below we provide a classification example organized by the type of reported damage caused by the disaster. The raw data in this classification example are all from "Sina-Weibo" microblogs related to Typhoon "Hato" in Zhuhai. Users can classify the rest of the data in the dataset by referring to the classification example or according to their specific research needs. The seven large categories are social effects, forestry, fisheries, traffic, electric power, communication and infrastructure damage. Each large category contains one or more small categories, as shown in Figure 4. For example, the category of social effects contains injuries and deaths, water shortage, building damage, and market shutdown. The classification example is shown in Table 3.




Figure 4. Category of disaster loss
Table 3. An example of disaster classification
Large category | Small category | Number of posts
Social effects | Injuries and deaths | 12
Social effects | Water shortage | 258
Social effects | Building damage | 78
Social effects | Market shutdown | 3
Forestry | Destruction of trees and plants | 119
Fisheries | Loss of fishing ground | 1
Fisheries | Damage of fishing boats | 1
Traffic | Traffic congestion | 101
Traffic | Vehicle damage | 38
Electric power | Electric power cutoff | 287
Electric power | Damage of electric power equipment | 4
Communication | Interruption of networks and signals | 123
Infrastructure damage | Damage of street lamps, billboards, bridges, roads, and so on | 34
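
Because one post can report several kinds of damage, classification is a multi-label task. The sketch below shows the idea with simple keyword matching. It is not the authors' procedure (the classification example was produced manually), and the English trigger words are hypothetical stand-ins for whatever Chinese terms a user would choose.

```python
# Hypothetical English trigger words for four of the seven large categories.
CATEGORY_KEYWORDS = {
    "Forestry": ["tree", "plant"],
    "Traffic": ["car", "traffic", "vehicle"],
    "Electric power": ["power cut", "blackout", "electricity"],
    "Communication": ["network", "signal"],
}


def classify(text):
    """Return every large category whose keywords occur in the text."""
    lowered = text.lower()
    return [category
            for category, keywords in CATEGORY_KEYWORDS.items()
            if any(keyword in lowered for keyword in keywords)]


example = ("After the typhoon, many trees were blown down "
           "and many cars were smashed.")
labels = classify(example)  # one post, two categories
```

Plain substring matching is crude (for example, "car" also matches "scared"), so a real workflow would need word segmentation and manual checking.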



3. Sample description
Data fields for "Sina-Weibo" include ID, keyword, province, city, content, picture, location, release time, platform, number of forwards, comments, and number of likes, as shown in Table 4. Each column has a limit of no more than 140 characters. The topics of the dataset include property loss, traffic impact, casualties, power supply, communication impact, rescue arrangements, response measures, and public attitudes toward the typhoon, among others.
Table 4. Data from "Sina-Weibo"
ID | 210
Keyword | Typhoon
Province | Guangdong Province
City | Zhuhai City
Content | After the typhoon, Mr. Liu asked me out for a walk to experience the post-disaster Zhuhai. Almost no restaurant was open. Having looked for a long time, finally we found a restaurant which was open. We saw so many cars smashed, trees blown down, and yachts blown ashore. My little white car was scratched by the branches. How can I go to work tomorrow, since Hengqin is so far away? The last picture, as a tribute to our soldiers!
Picture | http://ww2.sinaimg.cn/square/005WuHsBgy1fiu0v3h5b8j30qo0zkdvg.jpg ; http://ww4.sinaimg.cn/square/005WuHsBgy1fiu0ul1m9aj30qo0zktks.jpg ; http://ww3.sinaimg.cn/square/005WuHsBgy1fiu0wqzd5nj30qo0zk4ap.jpg ; http://ww4.sinaimg.cn/square/005WuHsBgy1fiu0y2re2bj30qo0z
Location | Zhuhai
Release time | 2017-08-23 22:25
Platform | iPhone 7
Number of forwards | (empty)
Comments | (empty)
Number of likes | 1
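
A "Sina-Weibo" record like the one in Table 4 needs light parsing before analysis: the Picture field holds several URLs separated by semicolons, and some count fields are empty. The helpers below are a sketch under those observations. Their names are hypothetical, and treating an empty count as unknown (`None`) rather than zero is an assumption.

```python
def split_pictures(field):
    """Split the semicolon-separated Picture field into a list of URLs."""
    return [url.strip() for url in field.split(";") if url.strip()]


def parse_count(field):
    """Convert a count field to int; an empty cell becomes None (unknown)."""
    field = field.strip()
    return int(field) if field else None


# First two picture URLs of the sample record in Table 4.
picture_field = (
    "http://ww2.sinaimg.cn/square/005WuHsBgy1fiu0v3h5b8j30qo0zkdvg.jpg ; "
    "http://ww4.sinaimg.cn/square/005WuHsBgy1fiu0ul1m9aj30qo0zktks.jpg ;"
)
urls = split_pictures(picture_field)
```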

Data fields for "Baidu" news include ID, title, link, source, release time, and keyword, as shown in Table 5. The fields for "WeChat" Subscription include ID, title, content, source, release time, and keyword, as shown in Table 6. The themes of the data include typhoon tracks, disaster loss statistics, government announcements, emergency measures, etc.
Table 5. Data from "Baidu" news
ID | 51
Title | 95 thousand people in Fujian to be relocated under Typhoon "Nesat" and "Haitang"
Link | http://www.huaxia.com/xw/dlxw/2017/07/5415198.html
Source | huaxia.com
Release time | 2017-07-31, 15:11
Keyword | Typhoon Haitang 2017

Table 6. Data from "WeChat" Subscription
ID | 31
Title | Typhoon "Haitang" has come! Weihai has become a sea!
Content | Typhoon "Haitang" has come! Weihai has become a sea!
Source | Neurologist [神经科专家] (name of a WeChat account)
Release time | 2017-08-04
Keyword | Typhoon Haitang 2017
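
The sample records in Tables 4 to 6 write the release time in three different ways ("2017-08-23 22:25", "2017-07-31, 15:11" and "2017-08-04"). A small normalizer like the sketch below could bring them to a common `datetime` before any time-series analysis. The function name is hypothetical, and the three formats are only those visible in the samples; the full dataset may contain others.

```python
from datetime import datetime

# Formats seen in the sample records of Tables 4 to 6.
TIME_FORMATS = ("%Y-%m-%d %H:%M", "%Y-%m-%d, %H:%M", "%Y-%m-%d")


def parse_release_time(value):
    """Try each known format in turn and return a datetime."""
    value = value.strip()
    for fmt in TIME_FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognized release time: {value!r}")
```

Date-only values are parsed with the time set to midnight, so downstream code should not read a meaningful hour from them.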


4. Quality control and assessment
Keywords related to each designated typhoon event were diversified and optimized to retrieve as much related information as possible from each social media platform. After data collection was completed, we manually checked the validity of the data and removed incomplete entries as well as entries irrelevant to the typhoon disaster. In addition, we established a database index system to avoid duplicate data. For disaster classification, three colleagues classified the original data to ensure the accuracy of the final classification results. Before this, classification standards had been set up to minimize possible discrepancies. Finally, we randomly sampled 500 data entries from each platform and found an accuracy rate of nearly 100%.
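
One common way to let a database index prevent duplicates is a unique index combined with an insert that skips conflicting rows. The sketch below illustrates this with Python's built-in `sqlite3` module as a stand-in, whereas the dataset itself is distributed for MySQL (where the analogous statement is `INSERT IGNORE`). The table layout and the choice of (platform, content) as the uniqueness key are assumptions; the source does not say which columns were indexed.

```python
import sqlite3


def build_store():
    """Create an in-memory table whose unique index rejects repeated texts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE posts ("
        "id INTEGER PRIMARY KEY, platform TEXT NOT NULL, content TEXT NOT NULL)"
    )
    # Assumed key: the same text on the same platform counts as a duplicate.
    conn.execute(
        "CREATE UNIQUE INDEX idx_posts_unique ON posts (platform, content)"
    )
    return conn


def insert_post(conn, platform, content):
    """Insert a post; return True if stored, False if it was a duplicate."""
    cursor = conn.execute(
        "INSERT OR IGNORE INTO posts (platform, content) VALUES (?, ?)",
        (platform, content),
    )
    return cursor.rowcount == 1
```

With this layout the database itself enforces uniqueness, so the collection code does not need to look up each text before inserting it.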

5. Value and significance
To our knowledge, no social media-based datasets existed for these typhoons before, and our dataset fills this gap. The data can be analyzed to meet different needs of disaster research. For example, the disaster loss data presented here can be re-classified into different categories to support real-time evaluations of disaster losses. The data can also be used for further analysis of typhoon disasters, such as sentiment analysis of victims in the affected area or the extraction of buzzwords during typhoon transits. In follow-up studies, we have used the texts in this dataset as a training corpus for the automatic identification of typhoon disaster information, with satisfactory results.

Acknowledgments
This work is supported by the National Key R&D Program of China (2016YFE0122600). We thank Edward T.-H. Chu, Associate Professor at National Yunlin University of Science and Technology, Taiwan, China for his advice on data collection. We thank Li Zhenyu from Shandong University of Science and Technology and Dr. Tian Chuanzhao from the Institute of Remote Sensing and Digital Earth, Chinese Academy of Sciences for their careful examination of our dataset.


1. Sakaki T, Okazaki M & Matsuo Y. Twitter analysis for real-time event detection and earthquake reporting system development. IEEE Transactions on Knowledge & Data Engineering 25 (2013): 919 – 931.

2. Bird D, Ling M & Haynes K. Flooding Facebook – the use of social media during the Queensland and Victorian floods. Australian Journal of Emergency Management 27 (2012): 27 – 33.

3. Wang YD, Li H, Wang T et al. Emergency information mining and analysis of emergency based on social media. Journal of Wuhan University 41 (2016): 290 – 297.

4. National Research Council (U.S.). Public Response to Alerts and Warnings Using Social Media: Report of a Workshop on Current Knowledge and Research Gaps. Washington, DC: The National Academies Press, 2013.

5. American Red Cross. Social media in disasters and emergencies. Available at: <http://i.dell.com/sites/content/shared-content/campaigns/en/Documents/red-cross-survey-social-media-in-disasters-aug-2010.pdf> [Accessed December 11, 2017].

6. Chae J, Thom D, Yun J et al. Public behavior response analysis in disaster events utilizing visual analytics of microblog data. Computers & Graphics 38 (2014): 51 – 60.

7. Zhou Y, Yang L, Walle BVD et al. Classification of microblogs for support[ing] emergency responses: Case Study [of] Yushu Earthquake in China, 2014. Proceedings of the 47th Hawaii International Conference on System Sciences, 2013: 1553 – 1562.

8. Qu Y, Huang C, Zhang P et al. Microblogging after a major disaster in China: a case study of the 2010 Yushu earthquake. Proceedings of ACM Conference on Computer Supported Cooperative Work, 2011: 25 – 34.

9. Chae J, Thom D, Jang Y et al. Special section on visual analytics: Public behavior response analysis in disaster events utilizing visual analytics of microblog data. Computers & Graphics 38 (2014): 51 – 60.

10. Chen Z, Gao T, Luo NX et al. Social media effectiveness to reflect the spatial and temporal distribution of natural disasters. Science of Surveying and Mapping 42 (2017): 44 – 48.

11. Liu HB & Zhai GF. A comparative study of the social response characteristics of different disasters based on social media information. Journal of Catastrophology 32 (2017): 187 – 193.

12. Stoové MA & Pedrana AE. Making the most of a brave new world: Opportunities and considerations for using Twitter as a public health monitoring tool. Preventive Medicine 63 (2014): 109 – 111.

13. Velardi P, Stilo G, Tozzi AE et al. Twitter mining for fine-grained syndromic surveillance. Artificial Intelligence in Medicine 61 (2014): 153 – 163.

Data citation
1. Yang T, Xie J & Li G. A social media-based dataset of typhoon disasters, 2017. Science Data Bank. DOI: 10.11922/sciencedb.547

Manuscript and author information

How to cite this article
Yang T, Xie J & Li G. A social media-based dataset of typhoon disasters, 2017. China Scientific Data 3 (2018), DOI: 10.11922/scdata.2017.0014.en
Yang Tengfei
social media data collection and analysis, writing.
PhD student, research area: natural language processing, disaster information mining.

Xie Jibo
motivation of the research, writing.
xiejb@radi.ac.cn
PhD, Associate Professor, research area: geospatial data infrastructure, remote sensing, geo-computation.

Li Guoqing
advice on dataset design and data check, writing.
PhD, Professor, research area: geospatial data infrastructure, remote sensing, big data.

National Key R&D Program of China (2016YFE0122600)

