
A dataset of scientific literature on floods, 1990 – 2017


Abstract & Keywords
Abstract: With an increasing number of scientific achievements published, it is particularly important to conduct literature-based knowledge discovery and data mining. Floods, among the most destructive natural disasters, have been the subject of numerous scientific publications. On January 1, 2018, we collected and processed literature data on flood research and categorized the retrieved paper records into the Whole SCI Dataset (WS) and the High-Citation SCI Dataset (HCS). These datasets can serve as basic data for bibliometric analysis to identify the status of global flood research during 1990 – 2017. Our study shows that the Chinese Academy of Sciences was the most productive institution during this period, while the United States was the most productive country. In addition, our keyword analysis reveals potential popular issues and future trends of flood research.
Keywords: literature data sets; flood; WS; HCS

Dataset Profile
Chinese title: 1990–2017年全球洪涝灾害研究SCIE文献数据
English title: A dataset of scientific literature on floods, 1990 – 2017
Data corresponding author: Li Guoqing (ligq@radi.ac.cn)
Data authors: Zhang Hongyue, Li Guoqing, Huang Mingrui, and Qing Xiuling
Time range: 1990 – 2017
Data volume: 6.46 MB (including 8,115 records in the Whole SCI Dataset and 150 records in the High-Citation SCI Dataset)
Data format: .xls
Data service system: <http://www.sciencedb.cn/dataSet/handle/637>
Sources of funding: National Key Research and Development Program of China (2016YFE0122600)
Dataset composition: This dataset consists of two compressed ZIP files, namely, “WS.zip” and “HCS.zip”. The data are saved in XLS format.
● “WS.zip” refers to the Whole SCI Dataset, which stores a full list of the papers collected.
● “HCS.zip” represents the High-Citation SCI Dataset, which stores a collection of papers, each with over 100 citations.



1. Introduction
In recent years, floods have been among the most frequent disasters worldwide. According to statistics from the German Reinsurance Company, floods are among the most significant natural disasters in the world.1 Because floods span long periods and wide geographical areas, it is difficult to define a flood event by a specific time and location. In most cases, a series of flood events occurs together, so scientific publications need to be examined as a group to illustrate empirical findings.
Literature-based knowledge discovery has been applied in several research domains, such as medical and biological research,2,3 as well as information and science development studies.4,5 Owing to the growing body of texts and the open-access policies of numerous journals, literature mining is becoming useful for both hypothesis generation and scientific discovery.3 However, semantic heterogeneity restricts literature-based data acquisition. To the best of our knowledge, few studies have collected scientific publications on disaster events.
To unearth the hidden knowledge on flood research, our study adopts literature-based knowledge discovery to build the Whole SCI Dataset (WS) and the High-Citation SCI Dataset (HCS) by means of data retrieval and processing.

2. Data collection and processing
2.1 Overview
Among the most popular Web-based literature databases, such as Google Scholar, Web of Science (WoS), Scopus, and PubMed, WoS (Thomson Reuters) has been one of the most frequently used by natural scientists in recent years.6 The WoS Core Collection has several citation indexes, including the Science Citation Index (SCI), the Social Sciences Citation Index, and the Arts and Humanities Citation Index.7 Despite the continued emergence of bibliometric databases, SCI remains arguably the most reliable one for retrieving scientific output.8
In this study, we collected data from the SCI database in the Web of Science (WoS, formerly known as ISI Web of Knowledge) Core Collection. Although WoS guarantees a relatively stable search environment with clearly defined lists of indexed journals,9 its search conditions are restricted to metadata such as titles, keywords, and abstracts.9 To obtain a full and accurate list of texts, we narrowed down our queries by setting four parameters.
The first parameter is research topic. In advanced search, the “topic” field tag was set as “TS = (((flood near (event or hazard or disaster)) or (flood near/3 (inundation or damage or risk or zone))) not (volcano or basalt)),” where “near/n” finds records in which the terms occur within a certain number of words (n) of each other (i.e., up to n words can be inserted between two terms). Because volcano and basalt papers often appear under the flood topic, “not (volcano or basalt)” was added to the search query to exclude irrelevant records.
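The logic of the topic query can be approximated locally on a title or abstract. The following is a minimal sketch, not the WoS search engine: it matches exact lowercase words only (WoS also handles word variants), and the 15-word window used for the bare “near” operator is an assumption about the WoS default. The function names are hypothetical.

```python
import re

def tokens(text):
    """Lowercase word tokens; punctuation and digits are dropped."""
    return re.findall(r"[a-z]+", text.lower())

def near(words, left, right_terms, max_gap):
    """True if `left` and any right term have at most `max_gap` words between them."""
    left_pos = [i for i, w in enumerate(words) if w == left]
    right_pos = [i for i, w in enumerate(words) if w in right_terms]
    return any(abs(i - j) - 1 <= max_gap for i in left_pos for j in right_pos)

def matches_topic(text, default_gap=15):
    """Approximate the TS query: two proximity clauses, minus volcano/basalt records."""
    words = tokens(text)
    if {"volcano", "basalt"} & set(words):   # "not (volcano or basalt)"
        return False
    return (near(words, "flood", {"event", "hazard", "disaster"}, default_gap)
            or near(words, "flood", {"inundation", "damage", "risk", "zone"}, 3))
```

For instance, `matches_topic("Flood risk mapping for urban areas")` is true because “risk” directly follows “flood”, while any text mentioning “basalt” is rejected regardless of the other terms.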
The second parameter is time span. A time span was set to restrict the publication time from January 1990 to December 2017.
The third parameter is publication type. “Article” was selected as the only publication type of this search. The reason is that other publication types (e.g., discussion, biographical item, and editorial) do not provide sufficient attribute information and thus cannot meet our data requirements.
The last parameter is additional citation indexes. Science Citation Index Expanded was included in the scope of our search.
For a full description of the papers, full records of each paper, including cited references, were downloaded and stored in TXT format. Descriptive fields include abstract, authors, country, publication year, institution, research area, WoS subject category, journal, title, and source. An example of the description fields is presented in Figure 1.




Figure 1 Descriptive fields for the articles collected
Notes: In the figure, the fields are named according to the criteria of Web of Science. PT – publication type (J = journal; B = book; S = series; P = patent); AU – authors; AF – authors’ full names; TI – document title; SO – publication name; LA – language; DT – document type; DE – author keywords, i.e., keywords given by the authors; ID – keywords plus, i.e., keywords generated by ISI; AB – abstract; C1 – authors’ address; RP – reprint address; EM – e-mail address.
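Records stored in this tagged TXT form can be read into dictionaries keyed by field tag. The sketch below assumes the usual WoS plain-text layout: a two-letter tag at the start of a line, indented continuation lines, `ER` closing each record, and `EF` closing the file. The sample records are invented for illustration and are not taken from the dataset; `parse_wos` is a hypothetical helper name.

```python
def parse_wos(text):
    """Parse WoS-style tagged text into a list of dicts mapping tag -> list of lines."""
    records, current, tag = [], {}, None
    for line in text.splitlines():
        if not line.strip():
            continue                      # skip blank separator lines
        head, value = line[:2], line[3:].strip()
        if head == "ER":                  # end of record
            records.append(current)
            current, tag = {}, None
        elif head == "EF":                # end of file
            break
        elif head.strip():                # a new two-letter field tag
            tag = head
            current.setdefault(tag, []).append(value)
        elif tag is not None:             # indented continuation of the previous field
            current[tag].append(value)
    return records

# Invented sample input, for illustration only.
sample = """PT J
AU Smith, J
   Lee, K
TI Flood risk assessment in
   urban catchments
DE flood risk; urban hydrology
ER

PT J
AU Wang, L
TI Flood inundation modelling
ER
EF
"""

records = parse_wos(sample)
title = " ".join(records[0]["TI"])
```

Keeping each field as a list of lines preserves multi-valued fields such as AU, while wrapped fields such as TI or AB can be joined with a space when needed.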

A total of 15,935 records were obtained on January 1, 2018.

2.2 Data processing
The processing tools adopted by this study included Thomson Data Analyzer (TDA)10 and Microsoft Excel. TDA is a powerful text mining tool that can mine text information from multiple fields and offer comprehensive visual analyses. Moreover, TDA enables a systematic organization of the literature information retrieved.11 Microsoft Excel was used to rearrange the exported data.
Figure 2 presents the data processing flowchart used by this study, which includes two major steps: record removal and field cleaning. Natural language processing (NLP) was adopted to perform data cleaning. NLP mainly involved tokenization, stop word removal, stemming, lemmatization, and field merging.




Figure 2 Processing flowchart of initial datasets (raw data)
Tokenization is the process of breaking up a given string into a series of subsequences, such as words, keywords, and phrases.12 Each subsequence is called a “token”. In this process, special symbols such as punctuation are removed. Some common words are of little value in selecting documents that match a user’s needs and are therefore removed from the vocabulary entirely; these words are called stop words.12 In literature retrieval, stemming and lemmatization differ in meaning. Stemming usually refers to a crude heuristic process that removes the affixes at the ends of a word, often including derivational affixes.12 Lemmatization usually refers to the use of a vocabulary and morphological analysis to remove inflectional affixes,12 thereby returning the base or dictionary form of a word, which is called a lemma.
Raw data of this dataset were downloaded from the WoS database and stored in text format, and were then organized into several attribute fields, including title, authors, abstract, keywords, journal, publication year, and country. As each attribute field reflects distinct paper information, researchers can select specific fields to perform knowledge mining.
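The first three steps described above can be illustrated with a small sketch. It is not the TDA implementation: the stop-word list and suffix rules are tiny illustrative assumptions, and lemmatization is omitted because it needs a dictionary and morphological analysis.

```python
import re

STOP_WORDS = {"the", "of", "and", "in", "a", "an", "to", "for", "on", "is", "are"}  # illustrative
SUFFIXES = ("ing", "ed", "es", "s")  # crude, illustrative suffix rules

def tokenize(text):
    """Split text into lowercase word tokens, dropping punctuation and digits."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def crude_stem(token):
    """Strip the first matching suffix, keeping a stem of at least three letters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

stems = [crude_stem(t) for t in remove_stop_words(tokenize("Flooding of the rivers, 1990-2017!"))]
```

Such a heuristic also produces non-words (for example, “damages” becomes “damag”), which is the kind of crudeness that distinguishes stemming from lemmatization.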
The raw records obtained were first imported into TDA by using an import filter, and the records were then segmented and stored in their respective fields. A literature analysis was performed by using NLP and statistical methods. NLP modules in TDA were used to process the metadata fields of the initial literature datasets, including title, abstract, authors, keywords, and keywords plus. The goal of NLP is to use rules to process text for specific purposes (such as translation, extraction of assertions, and summarization), where the rules may be predefined or learned through supervised or unsupervised methods.13 First, long text, including titles and abstracts, was segmented, and part-of-speech tagging was conducted on the segmented words and phrases, which were then further processed. Regular expressions are an efficient tool for extending retrieval, and the Fuzzy Matching Editor allows the user to tailor TDA’s cleanup algorithms to suit the requirements of the data sources. These two modules were adopted to perform the processing task.
Preprocessing includes de-duplication and empty record removal. De-duplication refers to the removal of duplicate records so that the abstract of each record can be used as a unique identifier: if two records shared exactly the same abstract, one of them was removed from the dataset. De-duplication ensures the uniqueness of each record. Empty record removal refers to the exclusion of paper records with null fields: if a record did not contain full information on title, abstract, or keywords, the record was removed.
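The two preprocessing rules can be sketched as a single pass over the records. This is an illustration under stated assumptions rather than the TDA workflow: records are assumed to be dictionaries with string fields named `title`, `abstract`, and `keywords`, and abstracts are compared after normalizing case and whitespace, which is slightly looser than an exact match.

```python
def clean_records(records, required=("title", "abstract", "keywords")):
    """Drop records with an empty required field, then drop repeated abstracts."""
    seen, kept = set(), []
    for rec in records:
        # Empty record removal: every required field must be non-blank.
        if any(not (rec.get(field) or "").strip() for field in required):
            continue
        # De-duplication: the (normalized) abstract acts as the unique identifier.
        key = " ".join(rec["abstract"].lower().split())
        if key in seen:
            continue
        seen.add(key)
        kept.append(rec)
    return kept
```

The first occurrence of each abstract is kept, so the order of the input determines which of two duplicates survives.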
A total of 15,935 papers were initially retrieved from the ISI WoS Core Collection, of which 7,820 were removed and the remaining 8,115 records were included. Key attribute fields include title, abstract, keywords, country, document type, journal, publication year, research area, and times cited.
However, the field data included inconsistencies ranging from spelling differences (whether intentional or accidental) to synonyms (e.g., “happy” and “glad”). As accurate analysis relies on minimizing these inconsistencies, the keywords were first preprocessed through data cleaning, using tools such as a number filter, punctuation eraser, stop word filter, English stemmer, and self-defined regex filter. Machine-assisted and rule-based recognition was then adopted to merge synonyms and reduce the size of the keyword list.
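A rule-based version of this keyword cleaning might look as follows. It is a minimal sketch: the regular expressions stand in for the number filter and punctuation eraser named above, and the `SYNONYMS` table is a hypothetical hand-made mapping, not the one used to build the dataset.

```python
import re

SYNONYMS = {"floods": "flood", "flooding": "flood"}  # hypothetical merge rules

def normalize_keyword(keyword):
    """Lowercase a keyword, remove digits and punctuation, and apply merge rules."""
    k = keyword.lower()
    k = re.sub(r"\d+", " ", k)        # number filter
    k = re.sub(r"[^\w\s]", " ", k)    # punctuation eraser (hyphens become spaces)
    k = " ".join(k.split())           # collapse repeated whitespace
    return SYNONYMS.get(k, k)

def merge_keywords(keywords):
    """Return the distinct normalized keywords in first-seen order."""
    merged = []
    for kw in keywords:
        norm = normalize_keyword(kw)
        if norm and norm not in merged:
            merged.append(norm)
    return merged
```

With these rules, `["Floods", "flood", "Flooding!", "Climate change", "climate-change 2017"]` collapses to two keywords, “flood” and “climate change”.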

2.3 Whole SCI (WS) and High-Citation SCI (HCS)
To emphasize high-citation papers among all the articles retrieved, we grouped the papers into two datasets: WS and HCS (Figure 3). WS data refers to all the retrieved records after data processing, while HCS data contains selected papers, each with over 100 citations.
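The relationship between the two datasets amounts to a threshold filter on the citation count. In the sketch below, the field name `times_cited` and the function name are assumptions; “over 100 citations” is read as strictly greater than 100.

```python
def select_hcs(ws_records, threshold=100):
    """Return the WS records whose citation count exceeds the threshold."""
    return [rec for rec in ws_records if rec["times_cited"] > threshold]
```

HCS is therefore always a subset of WS, and a paper with exactly 100 citations stays in WS only.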
The user community of a research paper includes three groups of stakeholders: authors who write the paper, editors who review it and decide on its publication, and readers of the published article. A published paper reflects the authors’ research interests and the editors’ recommendations. As citation counts reflect the impact of a published paper, high-citation papers are those with the greatest popularity among readers. In this sense, WS represents authors’ interests and editors’ views, whereas HCS shows readers’ preferences.




Figure 3 Relationship between HCS and WS


3. Data field analysis
The WS and HCS datasets were stored as WS.xls and HCS.xls, with 8,115 and 150 records, respectively.
3.1 Statistics of attribute fields
Each record consists of 11 attribute fields, including article ID, title, abstract, publication country, times cited, keywords (authors), keywords plus, research area, document type, journal, and publication year. Table 1 shows the field statistics of WS.xls.
Table 1 Field statistics of WS paper records
| Field | Number of items | Coverage (%) | Data type | Meta tags |
| --- | --- | --- | --- | --- |
| ISI unique article identifier | 8,115 | 100 | Number | Identity number |
| Title | 8,114 | 100 | | Paper title |
| Abstract | 8,115 | 100 | | |
| Authors 1st | 6,085 | 100 | | |
| Number of authors | 29 | 100 | Number | |
| Countries | 133 | 98 | | Country |
| Times cited | 188 | 100 | Number | |
| Keywords (authors) | 16,692 | 84 | | |
| Keywords plus | 10,898 | 87 | | |
| Keywords (authors and plus) | 21,975 | 100 | | |
| Research area | 119 | 100 | | |
| Document type | 6 | 100 | | Document type |
| Journal | 1,280 | 100 | | |
| Publication year | 28 | 100 | Year | Date |


3.2 Most productive journals and institutions
The flood papers of this database mainly came from the following 10 journals, each of which exceeded 100 publications: Natural Hazards (499 papers), Journal of Hydrology (385), Natural Hazards and Earth System Sciences (256), Journal of Flood Risk Management (212), Hydrological Processes (206), Geomorphology (154), Hydrology and Earth System Sciences (142), Water Resources Research (139), Hydrological Sciences Journal (114), and Water (109).
Each of the following four institutions published more than 100 papers: Chinese Academy of Sciences (163), Vrije Universiteit Amsterdam (131), University of Bristol (130), and Delft University of Technology (127).

3.3 Keyword analysis
“Author keywords” was used to denote keywords provided by authors in an article, while “keywords plus” referred to those generated by ISI on the basis of each article’s citations and references. 84% and 87% of the total 8,115 publication records retrieved contained “author keywords” and “keywords plus”, respectively. A total of 24,927 keywords were obtained after the two types of keywords were merged, which can be used to illustrate the trends of flood research.
21,975 keywords were obtained after data cleaning. Among the most frequently used keywords, “flood” and “climate change” were ranked first with 27 records for each, followed by “model” (18 records), “flood risk” (14 records), and “precipitation”, “rainfall”, “river” and “uncertainty” (13 records for each). Figure 4 shows a word cloud of the top 50 keywords generated based on their frequency.
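A frequency ranking of this kind can be reproduced from the keyword field with a counter. The sketch assumes each record carries a `keywords` list of already cleaned strings (a hypothetical layout) and counts a keyword once per record, so that the totals are numbers of records.

```python
from collections import Counter

def top_keywords(records, n=50):
    """Return the n most frequent keywords as (keyword, record count) pairs."""
    counts = Counter()
    for rec in records:
        counts.update(set(rec["keywords"]))   # count each keyword once per record
    return counts.most_common(n)
```

The resulting pairs can be passed to any word cloud tool, with the count controlling the size of each term.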




Figure 4 Word cloud of the top 50 keywords


4. Quality control and assessment
To guarantee the relevance of each record, we excluded those whose titles or keywords did not contain “flood.” In addition, duplicate records and records with empty fields were removed from the dataset. Stop words, punctuation, and numbers were deleted during processing. After data collection, we manually checked data validity and removed incomplete entries as well as entries irrelevant to flood disasters.
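The relevance rule can be expressed as a simple predicate. The sketch below assumes a record is kept when “flood” appears as a substring of either its title or any of its keywords; the source wording could also be read more strictly (both fields), so this reading and the field names are assumptions.

```python
def is_relevant(record, term="flood"):
    """True if the term occurs in the title or in any keyword (case-insensitive)."""
    title = record.get("title", "").lower()
    keywords = " ".join(record.get("keywords", [])).lower()
    return term in title or term in keywords
```

Substring matching means that variants such as “flooding” and “floodplain” also pass the check; the manual validity check described above remains necessary for borderline cases.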

5. Value and significance
With an ever-growing volume of scientific literature, it is more important than ever to perform bibliometric analysis on specific research themes. In recent years, scientometric methods have been applied in global remote sensing,14,15 night-time light remote sensing,16 and the remote sensing of human health.17 To the best of our knowledge, no literature-based dataset for flood research has been available so far, and our dataset fills this gap. The literature-based knowledge mining model can be applied in disaster research, where the keywords can be used as core knowledge for topic analysis. The dataset presented here can be used to analyze major issues of flood research. A comparative analysis of the WS and HCS datasets is helpful in illustrating the potential issues and future trends of flood research.

Acknowledgments
This work is supported by the Hainan Provincial Department of Science and Technology under Grant No. ZDKJ2016021. We thank Dr. Huang Mingrui from the Institute of Remote Sensing and Digital Earth, Chinese Academy of Sciences for her support in collecting this dataset. We also thank Qing Xiuling from the National Science Library, Chinese Academy of Sciences for her suggestions on data retrieval and processing.


References
1. Syvitski JPM, Overeem I, Brakenridge GR et al. Floods, floodplains, delta plains – A satellite imaging approach. Sedimentary Geology 267 – 268 (2012): 1 – 14.
2. Hristovski D, Peterlin B, Mitchell JA et al. Using literature-based discovery to identify disease candidate genes. International Journal of Medical Informatics 74 (2005): 289 – 298.
3. Jensen LJ, Saric J & Bork P. Literature mining for the biologist: from information retrieval to biological discovery. Nature Reviews Genetics 7 (2006): 119 – 129.
4. He L & Li F. Topic discovery and trend analysis in scientific literature based on topic model. Journal of Chinese Information Processing 26 (2012): 109 – 115.
5. Zins C. Conceptual approaches for defining data, information, and knowledge. Journal of the American Society for Information Science and Technology 58 (2007): 479 – 493.
6. Vieira E & Gomes J. A comparison of Scopus and Web of Science for a typical university. Scientometrics 81 (2009): 587 – 600.
7. Bakkalbasi N, Bauer K, Glover J et al. Three options for citation tracking: Google Scholar, Scopus and Web of Science. Biomedical Digital Libraries 3 (2006): 7.
8. Kostoff R. The underpublishing of science and technology results. Scientist 14 (2000): 6.
9. Perianes-Rodriguez A, Waltman L & van Eck NJ. Constructing bibliometric networks: A comparison between full and fractional counting. Journal of Informetrics 10 (2016): 1178 – 1195. DOI: 10.1016/j.joi.2016.10.006.
10. Feng H & Fang S. Research on the application of Thomson data analyzer to analyze the patent intelligence of scientific institutions. Information Science 26 (2008): 1833 – 1843.
11. Yang Y, Akers L, Klose T et al. Text mining and visualization tools – Impressions of emerging capabilities. World Patent Information 30 (2008): 280 – 293.
12. Manning CD, Raghavan P & Schütze H. Introduction to Information Retrieval. Vol. 39. Cambridge: Cambridge University Press, 2008.
13. Saffer JD & Burnett VL. Introduction to biomedical literature text mining: Context and objectives. Biomedical Literature Mining. NY: Humana Press, 2014: 1 – 7.
14. Zhang H, Huang M, Qing X et al. Bibliometric analysis of global remote sensing research during 2010 – 2015. ISPRS International Journal of Geo-Information 6 (2017): 332.
15. Zhuang Y, Liu X, Nguyen T et al. Global remote sensing research trends during 1991 – 2010: a bibliometric analysis. Scientometrics 96 (2013): 203 – 219.
16. Hu K, Qi K, Guan Q et al. A scientometric visualization analysis for night-time light remote sensing research from 1991 to 2016. Remote Sensing 9 (2017): 802 – 809.
17. Viana J, Santos JV, Neiva RM et al. Remote sensing in human health: A 10-year bibliometric analysis. Remote Sensing 9 (2017): 1225 – 1235.


Data citation
1. Zhang HY, Li GQ, Huang MR et al. A dataset of scientific literature on floods, 1990 – 2017. Science Data Bank. DOI: 10.11922/sciencedb.591

Manuscript and author information

How to cite this article
Zhang HY, Li GQ, Huang MR et al. A dataset of scientific literature on floods, 1990 – 2017. China Scientific Data 3 (2018), DOI: 10.11922/csdata.2018.0020.en
Zhang Hongyue
literature data collection and processing, manuscript writing.
PhD, research area: natural language processing, disaster information mining.

Li Guoqing
advice on dataset design and data check, manuscript writing.
ligq@radi.ac.cn
PhD, Professor, research area: geospatial data infrastructure, remote sensing, big data.

Huang Mingrui
advice on data collection and processing, manuscript writing.
PhD student; research area: literature retrieval, bibliometric analysis.

Qing Xiuling
advice on literature retrieval and dataset processing, manuscript writing.
PhD, Associate Professor; research area: literature retrieval, bibliometric analysis.

Zhang Huarong
manuscript editing.
MSc; research area: drone mapping, geographic information processing.

National Key Research and Development Program of China (2016YFE0122600); Hainan Provincial Department of Science and Technology under Grant No.ZDKJ2016021

