A multi-source 120-year US flood database with a unified common format and public access

Although several flood databases are available for the United States, combining and reconciling these diverse sources into a comprehensive database with a unified format and easy public access would facilitate flood-related research and applications. Traditionally, floods are reported by specialists or the media according to their socioeconomic impacts. More recently, data-driven analyses have made it possible to reconstruct flood events from in situ and/or remote-sensing data, and the increasing engagement of citizen scientists offers the potential to enhance flood reporting in near-real time. The central objective of this study is to integrate information from seven popular multi-sourced flood databases into a comprehensive flood database for the United States, made readily available to the public in a common data format. Natural language processing, geocoding, and harmonization steps are undertaken to facilitate this development. In total, the database contains 698,507 flood records in the United States from 1900 to the present, making it the longest and most comprehensive record of flooding across the country. The database features event locations, durations, date/times, socioeconomic impacts (e.g., fatalities and economic damages), and geographic information (e.g., elevation, slope, contributing area, and land cover types retrieved from ancillary data for given flood locations). Finally, this study uses the flood database to analyze flood seasonality within major basins and socioeconomic impacts over time. It is anticipated that this database, the most comprehensive unified one thus far, can support a variety of flood-related research, such as serving as a validation resource for hydrologic or hydraulic simulations, hydroclimatic studies concerning spatiotemporal patterns of floods, and flood susceptibility analysis for vulnerable geophysical locations.
The dataset is publicly available with the following DOI: https://doi.org/10.5281/zenodo.4547036 (Li, 2020).

losses beyond $100,000 (U.S. dollars) per event (Gourley et al., 2017). Flood-producing storms and hurricanes frequently strike the coastal regions with devastating socioeconomic impacts; the most damaging of these, Hurricane Katrina, affected nine states and resulted in monetary losses of over 168 billion USD. Moreover, under the influence of climate change, the intensifying hydrologic cycle and sea level rise pose growing threats to coastal areas (Alfieri et al., 2016; Tabari, 2020). IPCC AR5 (2013) reported that the frequency and intensity of floods in the U.S. are changing, which challenges current water-related infrastructure and water management principles. In light of these flood risks, compiling a comprehensive flood database can provide insights into both national and regional flood characteristics.
A brief list of data publications on flood disasters is summarized in Table 1 for different regions around the world.
Many published works have hitherto been limited to developed countries, such as European countries and the U.S. Developing countries either restrict data sharing or lack the resources to collect and assemble flood events. With respect to the available period, few works continuously offer up-to-date flood data accessible to the public or research communities (e.g., Fiorillo et al., 2018; He et al., 2018; Luu et al., 2019; Petrucci et al., 2019; Shi, 2003). Nevertheless, there are several means of collecting flood information. Conventionally, flood reports are produced by local specialists with limited and sometimes delayed information (e.g., Fiorillo et al., 2018; He et al., 2018; Luu et al., 2019; Petrucci et al., 2019). Later on, media outlets (e.g., newspapers) started to participate in timely flood reporting, but typically only for high-impact floods (e.g., Hilker et al., 2009; Shi, 2013; Smith et al., 2012; Vos et al., 2010). Insurance companies collectively offer valuable information on flood damages and people affected from a financial perspective (Swiss Re, 2010). More recently, the increasing engagement of citizen scientists has greatly supported near-real-time flood reporting through web or mobile applications (Chen et al., 2016; de Bruijn et al., 2019), although these reports are often confined to populated, urban areas. In addition to human-led reporting, stream gages and opportunistic sensors (e.g., surveillance cameras, ground radars, and satellites) can also augment flood monitoring in real time (Hall et al., 2015; Shen et al., 2019).
Despite long-established flood records (reports), few studies have attempted to merge multi-source flood databases, even as the number and diversity of available flood databases increase. The motivations for a merged dataset are primarily two-fold. First and foremost, the available flood information that can be used for model validation and flood risk analysis remains under-utilized (Scotti et al., 2020). Second, each individual dataset has its own limitations, and thereby no single database holistically describes flooding in a given region (Gourley et al., 2013). For instance, flood reports by government agencies or the media are skewed towards high-impact events, whereas local community-level, low-end floods are oftentimes ignored. In light of these motivations, efforts should be undertaken to merge all possible sources to provide off-the-shelf data support for flood-related research. Gourley et al. (2013) assembled a georeferenced U.S. database from three primary sources: 1) discharge observations from the U.S. Geological Survey (USGS), 2) flood reports by the National Weather Service from 2006 to 2013, and 3) witness reports from the public. Amponsah et al. (2018) merged a high-resolution flash flood database in Europe with a set of spatial data, rainfall data, and discharge data from 1991 to 2015. Petrucci et al. (2019) collaboratively harmonized five regional flood databases from 1980 to 2015 in the Mediterranean region to investigate the causes of deaths in flood events. These merged datasets cover relatively short periods and are not complete. In this study, we introduce a comprehensive United States Flood Database (USFD), which compiles seven individual databases and converts them into a common data format. Sources used to compile this database include 1) reports from news media, 2) reconstructed flood events from gage and satellite instruments, and 3) crowdsourced data queried from web and mobile applications. As a result, a 120-yr flood database in the U.S. is assembled, unified, and published for public access, along with an interactive web interface for immediate use. It is anticipated that this database can support a variety of flood-related research, such as a validation source for hydrologic/hydraulic simulation, climatic studies concerning spatiotemporal patterns of floods given this long-term and U.S.-wide coverage, and flood risk analysis for vulnerable geophysical locations. Primary assessments of flood occurrences across the U.S., flood seasonality within major basins, and socioeconomic impacts across time are carried out to share insights on U.S. floods. This article is structured as follows. Sect. 2 details the seven individual databases and ancillary datasets used to form our database. Sect. 3 describes methods to retrieve (query), clean, and unify these datasets in a processing pipeline. Lastly, Sect. 4 serves as a preliminary assessment of floods in the U.S. over the past 120 years, spatially aggregated by geopolitical boundaries and for major U.S. river basins.

Individual databases
In this section, we detail seven individual databases: the NOAA National Weather Service (NWS) storm reports, the Emergency Events Database (EM-DAT), the Dartmouth Flood Observatory database (DFO), the University of Connecticut Flood Events Database (FEDB), the cyber-infrastructure flood database (CyberFlood), the meteorological Phenomena Identification Near the Ground data (mPing), and Global Flood Monitoring (GFM).

National Weather Service storm reports
The NOAA NWS routinely publishes post-event reports of floods from trained spotters, local authorities, and emergency management officials. This dataset is arguably the most exhaustive meteorology-driven reporting in the U.S. The descriptors can be broadly categorized into geophysical location (e.g., begin and end location), time period (e.g., begin and end time), causes (e.g., heavy rain), impacts (e.g., fatalities and damages), and narratives (see technical documentation for details: https://www.nws.noaa.gov/directives/sym/pd01016005curr.pdf). Limitations of this database for flood events are summarized in Gourley et al. (2013) and include 1) imprecise event locations, 2) times tied to meteorological events, 3) reliance on in-person witness accounts, and 4) limited information about site exposure to antecedent conditions. We retrieve all flood records from 1950 to the present, which total 144,313 reports.

Emergency Events Database (EM-DAT)
The EM-DAT database is produced and maintained by the Centre for Research on the Epidemiology of Disasters (CRED) in Belgium and contains all types of natural disasters worldwide from 1900 to the present. Recorded events must meet at least one of the following criteria: 1) >10 people dead, 2) >100 people affected, 3) declaration of a state of emergency, or 4) a call for international assistance. The sources of information include government agencies, nongovernment organizations, insurance companies, research institutes, and press agencies. EM-DAT provides information including geographic location, time entry, fatalities, and economic damages. All the flood-related data entries are collected via public access at https://www.emdat.be/. Due to its reporting criteria, only 189 events are recorded in the U.S.
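The four inclusion criteria explain why so few U.S. events pass into EM-DAT. A minimal sketch of that screening is given below, using the thresholds exactly as quoted above. The record class and its field names are hypothetical and are not EM-DAT's own column names.

```python
from dataclasses import dataclass


@dataclass
class DisasterRecord:
    # Hypothetical field names, chosen only for this illustration.
    deaths: int = 0
    affected: int = 0
    emergency_declared: bool = False
    international_call: bool = False


def meets_inclusion_criteria(rec: DisasterRecord) -> bool:
    """Return True if at least one of the four criteria quoted in the text holds."""
    return (
        rec.deaths > 10
        or rec.affected > 100
        or rec.emergency_declared
        or rec.international_call
    )
```

A local flood with no fatalities, a few dozen people affected, and no emergency declaration fails all four tests, which is consistent with the low event count relative to the other sources.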

Dartmouth Flood Observatory (DFO)
The DFO data, regarded as one of the most popular flood databases in the world, collects flood events from news, government agencies, stream gauges, and remote-sensing instruments from 1985 to the present (Brakenridge, 2020). Unlike the other databases, the DFO retrieves spatial flood information from satellite remote-sensing products, such as the Moderate Resolution Imaging Spectroradiometer (MODIS), Sentinel-1, and Landsat. Flood extent is accordingly provided as shapefiles for easy integration into Geographic Information Systems software. The tabular data includes geophysical location, date/time, fatalities, affected area, displaced people, flood severity, and primary causes. However, events without significant river flooding are not included in this database, and the records are subject to uncertainties in satellite-derived flood extent, such as water-like echoes in urban areas and limitations due to cloudiness. In total, 469 events have been retrieved from its tabular data in the U.S.

University of Connecticut Flood Events Database (FEDB)
Taking advantage of nation-wide flow records at 6,301 stations operated by the U.S. Geological Survey (USGS) and radar rainfall measurements, a comprehensive flood database was reconstructed from 2002 to 2013 using the characteristic point method (Shen et al., 2017). At each gauge site, flood events are identified by baseflow separation, and events with nonsignificant peaks (i.e., below the 95th percentile) are filtered out. Additionally, flood-producing rainfall events are traced within a certain time window to portray an event. The FEDB provides shapefiles of stream gauges with a series of flood event attributes (e.g., flow peak, flow period, rainfall event period, base flow, rainfall-runoff coefficient, and spatial moments-based characteristics). The original dataset is retrieved from https://ucwater.engr.uconn.edu/fedb. Limitations of this database are that the reconstructed events may not necessarily lead to damages, which may undermine its role in flood impact-related research, and that the flood events must occur in USGS-gauged basins. Over 542,000 events have been reconstructed in the U.S., making this flood database the biggest contributor to the combined database.
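The percentile screening of flow peaks can be illustrated with a short sketch. This is not the characteristic point method of Shen et al. (2017): baseflow separation and rainfall tracing are omitted, peaks are taken as simple local maxima, and the function name `significant_peaks` and the inclusive quantile rule are assumptions made purely for illustration.

```python
import statistics


def significant_peaks(flow, percentile=95):
    """Return indices of local maxima in `flow` at or above the given percentile.

    Assumption: the percentile is computed over the whole flow series with the
    'inclusive' rule of statistics.quantiles; the source does not specify this.
    """
    # quantiles(n=100) returns 99 cut points; element p-1 is the p-th percentile.
    cut_points = statistics.quantiles(flow, n=100, method="inclusive")
    threshold = cut_points[percentile - 1]
    peaks = []
    for i in range(1, len(flow) - 1):
        is_local_max = flow[i] > flow[i - 1] and flow[i] >= flow[i + 1]
        if is_local_max and flow[i] >= threshold:
            peaks.append(i)
    return peaks
```

Applied to a series with one large event and a few minor bumps, only the large event survives the threshold, which mirrors the intent of discarding nonsignificant peaks.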

CyberFlood
CyberFlood is a crowdsourced flood database that collects event reports through a web application developed at the University of Oklahoma (Wan et al., 2014). It is regarded as one of the first integrated systems that collect, organize, visualize, and manage a flood database globally. We queried the latest results of CyberFlood, which contain flood events, geographic locations, date/time, country code, causes, and fatalities. The latest version of CyberFlood has 203 flood records from 1998 to 2008. To facilitate data unification, we convert all the code-based descriptors (i.e., country and causes) to strings with key matching methods.
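The key matching step amounts to a dictionary lookup from codes to readable strings. The sketch below assumes hypothetical lookup tables and field names, since the actual CyberFlood code lists are not given here.

```python
# Hypothetical lookup tables; the real CyberFlood code lists are not reproduced here.
CAUSE_BY_CODE = {1: "heavy rain", 2: "snowmelt", 3: "dam break"}
COUNTRY_BY_CODE = {"US": "United States"}


def decode_record(record):
    """Return a copy of `record` with code-based fields replaced by strings.

    Unknown or missing codes map to None so they can be treated as missing
    values later in the merge.
    """
    decoded = dict(record)
    decoded["country"] = COUNTRY_BY_CODE.get(record.get("country_code"))
    decoded["cause"] = CAUSE_BY_CODE.get(record.get("cause_code"))
    return decoded
```

Returning None for unmatched codes, rather than raising an error, keeps a single unknown code from blocking the unification of the remaining fields.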

meteorological Phenomena Identification Near the Ground (mPing)
The mPing app is a crowdsourced weather-reporting application jointly developed by the NOAA National Severe Storms Laboratory (NSSL) and the University of Oklahoma (Elmore et al., 2014). Members of the public who have downloaded this app to their GPS-enabled smartphones can report the weather events at their locations. Each report contains the time, geophysical coordinates, and a standard event type (e.g., flood events classified into four severity levels, tornado, precipitation type, wind). The four flood severity levels are based on the Flash Flood Severity Index (FFSI) proposed by Schroeder et al. (2016). One major limitation of crowdsourced data lies in data validation, as some events are misreported or even hacked. Chen et al. (2013) compared these reports to ground radar observations with respect to precipitation types and found a satisfactory correspondence between the two. mPing provides a REST API for research purposes, and we queried flood-related events from 2013 to the present, obtaining 5000 flood events.

Global Flood Monitoring (GFM)
The GFM data, produced and managed by de Bruijn et al. (2019), draws on over 88 million tweets posted across the globe since 2014. Contents tied to flood observations are filtered with the Natural Language Processing (NLP) tool BERT, which extracts the time of observation and toponyms (as tokens) and assigns reports to the database attributes after a quality assessment. De Bruijn et al. (2019) found that around 90% of the events are correctly detected when compared to another disaster database. Table attributes include event_id, location_id, location_ID_url, country_ID, country_ISO3, and the time of detection. Due to privacy issues, all the locations are archived as tokens, which require further decoding. Data is publicly accessible at https://www.globalfloodmonitor.org/download. From the latest database, we retrieved 6315 flood events in the U.S. and subsequently processed them as described in Sect. 3.

Ancillary datasets
Since one purpose of this database is flood susceptibility analysis, contributing factors to flooding are also incorporated for a given location. Land Use Land Cover (LULC), Digital Elevation Model (DEM), slope, distance to a major river, drainage area, and 500-yr flood depth are factored into the data attributes. The LULC value is retrieved from the Copernicus Global Land Service (CGLS) at 100-m resolution, covering urban, cultivated land, forest, vegetation, wetland, water, and ice. The topographic inputs (i.e., DEM and slope) are acquired from the NASA Shuttle Radar Topography Mission (SRTM) at 90-m spatial resolution, and the hydrography datasets (i.e., river networks, drainage area) are acquired from MERIT Hydro at the same resolution (Yamazaki et al., 2019). The 500-yr flood depth is downloaded from the Joint Research Centre Data Catalogue at https://data.jrc.ec.europa.eu/collection/floods at 1-km spatial resolution. All the extensive computations (i.e., sampling) are processed using the Google Earth Engine platform (Gorelick et al., 2017).

Processing methods
It is challenging to completely retrieve the location of events from some databases. The NWS storm report has some missing entries in geographic coordinates, but instead, it has detailed narratives. To compensate, we use the NLP toolkit provided by the spaCy package, which contains pre-trained multi-task models for English. spaCy first tokenizes the event
narratives and subsequently parses and tags each word with its respective entity. Then, we geocode the extracted locations into geographic coordinates by calling the Google Maps API. The GFM also does not contain precise geophysical locations, in order to protect user privacy. Therefore, the geographic coordinates are inferred by first converting location tokens into administrative locations (e.g., cities and villages) via the GeoNames API and then geocoding them into geographic coordinates. The state names in the merged dataset are first validated against the geophysical locations through reverse geocoding. If they do not match, a new name from GeoNames is assigned to replace the original one. Empty fields are also filled during this process. We also processed supplementary flood information such as affected areas and damages from the original databases. The affected area for each event is calculated by assuming a circular area whose radius is approximated by the recorded range, if available. For economic damages, we sum all available sub-category damages (e.g., agriculture, property, and structure) to give a holistic view. For individual databases that do not provide a given attribute, or for which it cannot be inferred from the corresponding header, we uniformly assign Not a Number (NaN) values. As a result, we merged 698,507 total flood records in the U.S. from 1900 to the present. At the DOI link, the merged USFD database, along with the seven individual databases, is provided in either comma-separated or Excel format for general readability. Additionally, an interactive web interface is built
FEDB contributes a major portion of the unified database because of the length of its flow records, and states with higher gauge densities undoubtedly yield more events than gauge-sparse regions. This nonuniformity constitutes a major limitation of this specific dataset. Regions with more exposure to observational sources (e.g., densely populated and gauge-dense areas) likely have more recorded events. However, it is expected that by including more observations from remote-sensing sources, this gap can potentially be narrowed. In terms of composition, NWS storm reports contribute the second largest number of events because of their long record (i.e., 70 years). The other databases are more limited: EM-DAT, though it has the longest record, only includes very high-impact events, and the crowdsourced databases are limited by their short record lengths.
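The harmonization rules from the processing methods (reverse-geocoded state names, circular affected area, summed damage sub-categories, and NaN for missing attributes) can be sketched as follows. All function names, field names, and units are hypothetical, and the fallback to the reported state when reverse geocoding returns nothing is an assumption not stated in the text.

```python
import math

# Hypothetical sub-category field names used only for this sketch.
DAMAGE_KEYS = ("damage_agriculture", "damage_property", "damage_structure")


def reconcile_state(reported, geocoded):
    """Prefer the reverse-geocoded state name; keep the reported one as a fallback."""
    return geocoded if geocoded else reported


def affected_area_km2(range_km):
    """Area of a circle whose radius is the recorded range; NaN if no range exists."""
    if range_km is None:
        return math.nan
    return math.pi * range_km ** 2


def total_damage(record, keys=DAMAGE_KEYS):
    """Sum the available damage sub-categories; NaN if none are reported."""
    values = [record[k] for k in keys if record.get(k) is not None]
    return sum(values) if values else math.nan
```

Using NaN rather than zero for absent attributes keeps "no information" distinct from "no damage", which matters when damages are later aggregated by year or by state.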

Flood seasonality in major water basins
Flood variability is highly associated with seasonal atmospheric pathways of moisture delivery and basin attributes (Dickinson et al., 2019). In this regard, we segregate the nation-wide events by major basin and month. The Hydrologic Unit Code (HUC) 4-digit basins, as shown in Fig. 3, are obtained from the National Hydrography Dataset. basins, similar to that of other studies (e.g., Brunner et al., 2020; Dickinson et al., 2019; Villarini, 2016). The basins are clustered into several regions according to local hydroclimatology. On the West Coast (e.g., western Washington, Oregon, and California), flood events are dominant in winter months, driven mainly by atmospheric rivers (ARs), which carry water vapor from the tropics (Ralph et al., 2006). Moving east, floods in the Rockies (i.e., Upper Colorado and Great Basin) are characterized by spring snowmelt in snow-fed rivers, whereas in the Desert Southwest (e.g., Lower Colorado and Rio Grande regions), floods likely occur in late summer, which is ascribed to the North American monsoon and North Pacific tropical cyclones. Closer to the Gulf of Mexico, flooding events during late spring and summer are due to severe thunderstorm activity and mesoscale convective systems. The lower Mississippi, Ohio, and Tennessee river basins experience their biggest floods during the spring from extratropical cyclones (Lavers and Villarini, 2013). The lower Florida Peninsula features high numbers of summer flood events, which are tied to North Atlantic tropical cyclones (Villarini et al., 2014). In the northeastern U.S., tropical cyclones, winter-spring extratropical cyclones, and warm-season thunderstorms are the primary flood agents, yet winter-spring extratropical cyclones account for larger fractions, similar to the study of Smith et al. (2011). Figure 3b displays the number of flood events within each basin, grouped by month. The Mid-Atlantic region (HUC2 02) takes seven places out of the top twenty basins, with the Delaware river basin near the coast (HUC2-0204) being the highest one. In terms of flood seasonality, events in these listed basins are relatively evenly distributed across seasons and months, which suggests susceptibility to widespread floods. This symmetric feature around the Appalachian Mountains across seasons is also highlighted in Villarini (2016), who suggests that flow regulation plays an essential role in weakening the seasonal cycle. In a recent study by Brunner et al. (2020), these basins are identified as having severe or moderate widespread flooding in space, and our results indicate these regions also have widespread flooding in time (month).
snowmelt occurring in early spring, in conjunction with the rain-on-snow effect. The Great Plains feature an earlier maximum flood frequency, transitioning from early summer to late spring, which relates to enhanced and earlier thunderstorm activity due to spring warming. The South Atlantic Coast shows a delayed maximum flood frequency from winter to spring. The lower Florida Peninsula, however, does not present a clear monthly shift and is still controlled by tropical cyclones.
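The basin-by-month segregation behind this seasonality analysis can be sketched as a simple count over (basin, month) pairs. The sketch assumes each event has already been assigned a HUC4 code and a date. The function names and the sample events in the usage note are hypothetical.

```python
from collections import Counter
from datetime import date


def monthly_counts(events):
    """Count events per (HUC4 code, calendar month).

    `events` is an iterable of (huc4, date) pairs.
    """
    return Counter((huc4, d.month) for huc4, d in events)


def peak_month(counts, huc4):
    """Return the month with the most events in a basin, or None if it has none."""
    per_month = {m: c for (basin, m), c in counts.items() if basin == huc4}
    if not per_month:
        return None
    return max(per_month, key=per_month.get)
```

For example, calling `monthly_counts` on a list such as `[("0204", date(2011, 8, 28)), ("0204", date(2011, 9, 8))]` yields one count per month for basin 0204. Comparing `peak_month` between an early and a late period is one simple way to look for the shifts in timing of maximum flood frequency discussed above.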

Flood impact assessment
In the USFD, flood impacts are based on affected areas, economic damages, and fatalities. Since affected areas are relatively subjective, they are not analyzed in this study. All economic damages (US dollars) are adjusted for inflation with GDP deflators obtained from the World Bank. Figure 5 depicts the fatalities and damages by year. Although events continuously span from 1900 to the present, impact assessments were not provided in the earlier years (before 1980). The 10-yr running mean generally represents the long-term trend. Both fatalities and damages begin at high rates in the early years,
We showcase three analyses based on the developed flood database. For flood occurrences across the U.S., Texas, Pennsylvania, and Missouri stand out with high exposure to floods in terms of total number of events, which could raise awareness among policy-makers and the public. Flood seasonality in major river basins generally follows the large-scale synoptic weather patterns. In addition, delayed timings of maximum flood frequency are observed in the West Coast and Atlantic River Basin, possibly due to earlier snowmelt than in prior decades that now contributes to spring floods. Floods in the Great Plains, on the contrary, feature an earlier month of maximum flood frequency, which is possibly tied to intensified thunderstorm activity because of earlier spring warming. Lastly, flood impacts are assessed in terms of economic damages and fatalities, and we found a slight increasing trend in damages in recent years. Especially in Texas and Louisiana, a consistent increase in damages is evident, which relates to intensified storm activity and expanding urban zones. Under a warming climate, storms are projected to occur more frequently in the future, which challenges current water infrastructure and water management principles. Notwithstanding, there are some limitations associated with the current version of USFD. First, the individual databases contribute disproportionately to the merged one. FEDB, derived from streamflow records over a long history, constitutes the majority of the flood events. The emerging crowdsourced databases are expected to play a significant role with the increasing engagement of citizen scientists. Space-based observations could markedly bridge the gap between well-observed urban areas and gauged basins on the one hand and gauge-sparse rural areas on the other. For instance, complete use of imagery from MODIS onboard the Terra and Aqua satellites, in association with Landsat or SAR data, can reconstruct global flood events at daily resolution. In addition to flood extent, flood depth can also be approximated through the use of high-resolution DEM data. Flood reports from insurance companies offer another angle, not only to record events but also to relate flood hazards to societal impacts comprehensively. In the future, we hope to incorporate such datasets to enrich our database with a global view. Second, the data processing framework needs to be automated to run in real time. We plan to migrate the processing codes to the cloud, so that they can query records from the child databases and update the parent database on a regular basis.

Data and code availability
The USFD open-access dataset and all individual datasets are available at https://doi.org/10.5281/zenodo.4547036 (Li et al., 2020b). The Python codes to process, merge, and analyze are publicly accessible at https://github.com/chrimerss/USFD.

Competing interests
The authors declare that they have no conflict of interest.