This study proposes a comprehensive benchmark dataset for
streamflow forecasting, WaterBench-Iowa, that follows FAIR (findability, accessibility, interoperability, and reuse) data principles
and is prepared with a focus on convenience for use in data-driven
and machine learning studies, and provides benchmark performance for
state-of-the-art deep learning architectures on the dataset for comparative
analysis. By aggregating the datasets of streamflow, precipitation,
watershed area, slope, soil types, and evapotranspiration from federal
agencies and state organizations (i.e., NASA, NOAA, USGS, and Iowa Flood
Center), we provide WaterBench-Iowa for hourly streamflow forecast
studies. This dataset has a high temporal and spatial resolution with rich
metadata and relational information, which can be used for a variety of deep learning and machine learning research. We defined a sample benchmark task
of predicting the hourly streamflow for the next 5 d for future
comparative studies, and provided benchmark results on this task with sample
linear regression and deep learning models, including long short-term memory
(LSTM), gated recurrent units (GRU), and sequence-to-sequence (S2S). Our
benchmark model results show a median Nash-Sutcliffe efficiency (NSE) of 0.74 and a median Kling-Gupta efficiency (KGE) of 0.79
among 125 watersheds for the 120 h ahead streamflow prediction task.
WaterBench-Iowa makes up for the lack of unified benchmarks in earth science research and can be accessed at Zenodo
Deep learning, a set of algorithms based on artificial neural networks (ANN) for supervised and unsupervised modeling, has been widely used and recognized as a powerful approach within many scientific disciplines for technological and predictive progress (Goodfellow et al., 2016). Conventional machine learning techniques were deemed limited in learning representations of high-dimensional datasets from their raw form. By providing universal approximator models (Cybenko, 1989; Hornik et al., 1989; Leshno et al., 1993), deep neural networks increased scientists' ability to model both linear and non-linear problems without time-intensive data engineering by domain experts (LeCun et al., 2015). Deep learning's predictive modeling capabilities have led to improvements in various fields, including image recognition and synthesis (Demiray et al., 2021), speech recognition, language modeling, and time-series prediction.
Flooding is a significant concern for many areas in the world as it is on an upward trend due to climate change. The 1998 Bangladesh flood, the Iowa flood of 2008, and the 2013 North India floods show how catastrophic and both economically and psychologically devastating floods can be for populations in the respective regions. In order to maximize the preparedness for floods and minimize their effects after the disaster (Yildirim and Demir, 2021), weather and flood forecasting stands as a perennial research interest for hydrologists and data scientists. Streamflow prediction and runoff modeling are research efforts where the water from the land or channel over time is modeled and forecasted using previous data points for a location or nearby locations with similar characteristics. Although this effort is conventionally carried out with physically based models that require extensive computational (Agliamzanov et al., 2020) and data resources, it is critical for flood mitigation and decision support (Xu et al., 2020).
Being in essence a time-series prediction task, flood forecasting takes advantage of the practicality and efficacy that deep learning brings to predictive modeling. Both time-series adaptations of deep learning models intended for natural language processing and time-series focused deep neural network implementations make this possible by proposing methodologies that put the sequential nature of time-series datasets to good use. Recurrent neural network (RNN) architectures such as long short-term memory (LSTM) networks (Hochreiter and Schmidhuber, 1997), gated recurrent unit (GRU) networks (Chung et al., 2014), and attention-based sequence-to-sequence (S2S) networks (Vaswani et al., 2017) are prominent starting points for deep neural network architectures in most time-series forecasting tasks.
Supervised learning, whether it be deep or not, is the most common form of
machine learning (LeCun et al., 2015), and supervised learning tasks, such
as flood forecasting, need a dataset of previously recorded or labeled
entries for the task. That dataset typically consists of
The number of studies in hydrology and water resources, and particularly in flood forecasting, that employ deep learning has been growing in the last several years (Sit et al., 2020). Flood forecasting studies in the literature, due to the aforementioned sequential nature, have widely employed RNNs and LSTMs. Kratzert et al. (2018) utilized LSTM networks for daily runoff prediction using meteorological datasets. Furthermore, Kratzert et al. (2019) applied a similar approach for ungauged US locations. Bai et al. (2019) incorporated a stacked autoencoder with LSTM for daily streamflow measurements from data for a week. Xiang et al. (2020) predicted the next 24 h of hourly streamflow rate by utilizing an encoder-decoder S2S neural network that also uses rainfall products. Xiang and Demir (2020), moreover, extended their study and developed a model that forecasts the hourly streamflow rate for the next 5 d using 3 d of historic data. They also incorporated upstream sensors into their proposed network. Using the same dataset, Xiang et al. (2021) explored the generalization of S2S encoder-decoder networks in flood forecasting. Sit and Demir (2019) predicted hourly sensor measurements for 24 h using data from the upstream sensor network and historic stage height measurements. Finally, Sit et al. (2021a) utilized graph neural networks for streamflow forecasting for a small watershed in Iowa. To sum up, deep learning models such as LSTM have been used in meteorology and hydrology studies of soil moisture modeling (Seeger et al., 2016), water table depth prediction (Zhang et al., 2018), rainfall runoff modeling (Hu et al., 2018; Kratzert et al., 2018), and streamflow forecasting (Xiang et al., 2020). As represented in perspective studies (Reichstein et al., 2019), deep learning models such as LSTM can extract spatiotemporal features automatically to gain further process understanding of earth system science problems.
Therefore, this research focuses on the application of LSTM and its variant models.
Most of the studies mentioned here acquired several raw data products,
whether rainfall measurements, physical features of the studied
area, stage height, or discharge measurements, from authorities and built
their own datasets, benefiting from their expertise in the area. There are
several datasets and benchmarks in other earth science studies, i.e., air
quality forecast dataset, 3D cloud detection dataset, and LANL earthquake
prediction dataset. One of the early user-friendly datasets in earth science
is the Beijing PM
For improved generic deep learning-based flood forecasting models, scientists must expand on previous work, and this can be done with the same testing set-up and evaluation mechanism. A limited number of studies in the hydrology literature construct the neural network architecture around the CAMELS dataset (Newman et al., 2014). CAMELS is a vast dataset that includes meteorological and observed streamflow data points for the USA, albeit not in an easy-to-use format that is ideal for deep learning research. It contains 671 catchments in the contiguous USA that are minimally impacted by human activities. It includes features such as topography, climate, streamflow, land cover, soil, and geology on a watershed scale, and the hydrometeorological time-series data range from 1980 to 2014 on a daily basis. The data are generated from different sources, including Daymet, NLDAS, and Maurer. CAMELS aggregated these datasets at the watershed level. The researchers also performed model simulations using physically based models such as the NWS model (McEnery et al., 2005) and SNOW-17/SAC-SMA (Franz et al., 2008), which are two popular traditional models of the past decades; however, these modeling results are not shared as a benchmark. Even though there is a dataset that could be used for predictive deep learning rainfall runoff modeling, there is still a lack of accessible datasets for benchmarking purposes (Maskey et al., 2020). There remains a need for a dataset that is more convenient to use in deep learning research, given that most deep learning researchers are not domain experts. The limited usage of CAMELS in the literature also indicates the challenges the CAMELS dataset presents for deep learning research.
Another dataset for flood forecasting is FlowDB (Godfried et al., 2020). Unlike CAMELS, there are not many studies that report their performance over FlowDB yet as the dataset was only recently published. FlowDB is an hourly precipitation and river flow dataset that also includes a subset dataset for flash floods. The subset dataset includes injury costs and damage estimations for flash flood events. FlowDB gathers river flow data from the USGS and precipitation data from many agencies, including the USGS, NOAA, and ASOS. Additionally, the data FlowDB provides regarding flash floods uses NSSL Flash by NOAA.
This study proposes a flood forecasting dataset that is prepared with a focus on convenience for use in data-driven and machine learning studies and provides benchmark performance for state-of-the-art deep learning architectures on the dataset for comparative analysis. Our dataset follows FAIR data principles (Wilkinson et al., 2016), which means it is findable and accessible through a DOI, and the data are richly described with references. WaterBench provides data from 125 catchments in the state of Iowa. The precipitation time-series data range from October 2011 to September 2018 and are accompanied by catchment-based features, such as topography, soil type, and slopes. Even though the dataset was designed to eliminate most of the preprocessing and data engineering tasks for machine learning applications and research, it could be used in other studies with similar goals, such as physically based modeling. Similarly, the dataset could be combined with other benchmark datasets such as IowaRain (Sit et al., 2021b) utilizing cloud-based rainfall products (Seo et al., 2019). WaterBench differs from CAMELS in its higher temporal resolution. In addition, it focuses on the state of Iowa, and many large catchments in WaterBench contain multiple USGS gauges, which helps to better represent the river structure and upstream-downstream relations in deep learning algorithms. The rest of this paper is structured as follows: the dataset preparation phase and the methodology employed in that phase are discussed in Sect. 2. Section 3 gives a list of tasks that could be tackled using this dataset and presents the performance of several neural network implementations in flood forecasting tasks. In the last section, conclusions are discussed.
The State of Iowa is located in the Midwest of the USA. It has abundant and diversified water resources with 115 318 km of rivers and streams from border to border (Iowa Department of Natural Resources, 2022). In 2008, eastern Iowa was devastated by flooding which caused over USD 6 billion in property losses. Streamflow monitoring and forecasting are consequently critical for Iowa for better water resources and disaster management. In addition, the agriculture-based land use in Iowa has a low pavement rate and limited human influence, which makes it a suitable area for rainfall runoff studies.
The location of 125 USGS gauges in the State of Iowa for upstream sub-basins (green dot) and large downstream basins (red dot).
The United States Geological Survey (USGS) has over 100 streamflow gauges in the State of Iowa for monitoring the streamflow rate in different streams. The measurements from the USGS are typically recorded at 15–60 min intervals in Iowa. Due to site maintenance or shutdowns the coverage of the USGS streamflow gauges changes over the years. In this dataset, we selected all USGS gauges in the State of Iowa with available data from 1 October 2011 (the water year 2012) to 30 September 2018 (the water year 2018).
As shown in Fig. 1, red dots are located at the outlets of larger basins with multiple USGS gauges, which are divided into several smaller upstream sub-basins. The green dots are located at the outlets of the most upstream sub-basins. Thus, considering the connectivity of the streams, the relationship of these gauges in one watershed can be represented as a tree structure.
WaterBench includes detailed metadata and time-series features for each catchment. These datasets are available in .csv format for each catchment. The details of the datasets with data source, type, resolution and units are shown in Table 1. The statistics of the data, including the watershed size, concentration time (the longest streamflow path in the catchment), slope, and four soil types, are shown in Table 2 and Fig. 2. For each catchment, we provide static data (area, slope, travel time, etc.) as well as time series for streamflow, precipitation, and evapotranspiration (ET).
The details of datasets with data source, type, resolution and units.
The minimum, maximum, mean, median, and standard deviation (SD) of the watershed area, concentration time, average slope, and percentage of soil types including loam, silt, sandy clay loam, and silty clay loam among 125 USGS gauges in the State of Iowa.
Histograms of the catchment area
Summary statistics for precipitation and streamflow among 125 catchments from water year 2012–2018. Missing rate as a limitation.
As shown in Table 3, all 125 catchments share similar precipitation, ranging from 794 to 1056 mm with a small standard deviation of 57 mm. Geographically, all the catchments are located in two HUC (hydrologic unit code) watersheds, the Upper Mississippi and Missouri rivers, and the study results may not be applicable to other regions in the USA. However, the modeling algorithms and the neural network architectures normally apply to a broad spectrum of problems, and they would be useful in other regions. WaterBench-Iowa is also subject to a relatively high missing data rate for streamflow, as reliable hourly data from the USGS are limited for some of the watersheds in Iowa. In the following sections, we will discuss the details of specific datasets and features.
In the water cycle, precipitation is the main driving force of the
streamflow. Based on the 90 m digital elevation model (DEM), only the
precipitation in a certain area will contribute to a stream. Each measuring
station has its corresponding area, which can be calculated from the
watershed boundary shapefiles. Since the total precipitation amount is the
product of precipitation intensity and area, in the same watersheds
upstream sub-basins typically have lower streamflow rates than the larger
basins. In WaterBench, the boundary shapefiles of each watershed are
obtained from the Iowa Flood Information System (IFIS), a system operated by
the Iowa Flood Center (IFC). Moreover, the area is calculated from the
shapefiles in the unit of km
The time of concentration provides the dimension of stream length for a
watershed. In WaterBench, the time of concentration is defined as the
longest flow path length divided by the velocity, i.e., the time water takes to travel
from the most distant point to the watershed outlet. The velocity used in
this study is a constant value of 0.75 m s
time of concentration contains one value per station, and it is available in the column of “travel_time” in the “{station_id}_data.csv” files.
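The definition above (longest flow path length divided by a constant velocity) can be illustrated with a minimal sketch. The function name, the metre input, and the conversion to hours are assumptions made for illustration; the units actually stored in the “travel_time” column should be checked against the dataset documentation.

```python
# Illustrative sketch only: the function name and the choice of hours as the
# output unit are assumptions, not taken from the WaterBench-Iowa code.
def time_of_concentration_hours(longest_path_m: float,
                                velocity_m_per_s: float = 0.75) -> float:
    """Return travel time in hours for the longest flow path in a catchment."""
    if velocity_m_per_s <= 0:
        raise ValueError("velocity must be positive")
    seconds = longest_path_m / velocity_m_per_s
    return seconds / 3600.0


# Hypothetical catchment whose longest flow path is 27 km.
tc = time_of_concentration_hours(27_000.0)
print(f"time of concentration: {tc:.1f} h")
```

Because the velocity is constant, the value is simply a rescaled flow path length, which is why the text describes it as providing the dimension of stream length.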
The slope is one of the topographic features that represents the slope gradient in percentage. A steep slope may cause a higher velocity and lower infiltration rate, which normally causes a larger streamflow rate during a precipitation event. The original file, hillslope map, is calculated by IFC (Sit et al., 2019), which splits the land of Iowa into over 600 000 hydrologic units using the algorithm developed by Mantilla and Gupta (2005). In WaterBench, the average slope is calculated from the mean value of the hillslopes in each catchment (Gericke and Du, 2012). Thus, the slope is a constant value per watershed, and it is available in the column of “slope” in the “{station_id}_data.csv” files.
Soil type is one of the topographic features that represents the proportions
of 12 different soil types on the land. Normally, the sandy soil has the
largest infiltration rate, and the clay has the least infiltration rate. The
original file, global soil types, is available from NASA (Post and Zobler,
2000). It is a 2-D map with a spatial resolution of 0.5
The streamflow rate is a variable measured by the USGS in the unit of cubic
feet per second. The data was acquired from the USGS National Water
Information System. There are nearly 200 real-time streamflow measuring
stations in Iowa. After removing the stations established after 2011 or
permanently closed before 2018, a total of 125 stations are selected, as
shown in Fig. 1. For each station, streamflow data was aggregated to
hourly values. The original data contains a few missing values due to
station system failures or internet outages. For the stations located in the
northern part of Iowa, the river may freeze and have no flow rate
measurement over the winter, and all missing values were reported as
Many station-based and satellite datasets have been measuring precipitation over the years. After comparisons, it is found that NOAA's Stage IV multi-sensor measurement is the most accurate (Seo et al., 2018) in the state of Iowa. The Stage IV multi-sensor provides the hourly precipitation amount with a 4 km-grid spatial resolution. The catchment level average precipitation is then calculated at each hour. Since there is no rainfall or snowfall most of the time, most precipitation values in the dataset are 0. In the dataset, we provide the hourly catchment-averaged precipitation data for each station from 1 October 2011 00:00 to 30 September 2018 23:00. Thus, the precipitation data contains 61 368 values per station, and they are available in the column of “precipitation” in the “{station_id}_data.csv” files.
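The count of 61 368 hourly values per station follows directly from the stated period. The sketch below reproduces it with the standard library and also shows the water-year convention used in this paper (a water year starts on 1 October and is named after the calendar year in which it ends). The helper names are illustrative and are not part of the dataset code.

```python
from datetime import datetime, timedelta

START = datetime(2011, 10, 1, 0)   # first hourly time stamp in the dataset
END = datetime(2018, 9, 30, 23)    # last hourly time stamp in the dataset


def hourly_timestamps(start: datetime, end: datetime) -> list:
    """Return every hourly time stamp from start to end, both inclusive."""
    n_hours = (end - start) // timedelta(hours=1) + 1
    return [start + timedelta(hours=i) for i in range(n_hours)]


def water_year(ts: datetime) -> int:
    """Water year of a time stamp: October to December belong to the next year."""
    return ts.year + 1 if ts.month >= 10 else ts.year


stamps = hourly_timestamps(START, END)
last_year = [t for t in stamps if water_year(t) == 2018]
print(len(stamps), len(last_year))
```

The period spans seven water years, two of which contain a leap day (2012 and 2016), giving 2557 days of 24 hourly values each.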
The ET represents the evaporation and plant transpiration from the land in the water cycle. It is one of the major losses of precipitated water. As no high-resolution real-time ET dataset is available, we used the monthly estimation from the historical measurement data in the past decades (Krajewski et al., 2017) as an empirical dataset. This is a monthly based dataset for the entire state of Iowa, and successfully captures the seasonal effects in the state of Iowa. In the dataset, we applied the ET value for each time stamp from 1 October 2011 00:00 to 30 September 2018 23:00. Thus, the ET data contains 61 368 values for all stations, and they are available in the column of “et” in the “{station_id}_data.csv” files.
The median NSE and KGE among 125 watersheds in 125 different models at the prediction of the next 1–120 h.
The median (standard deviation) NSE and KGE among 125 watersheds at the prediction hours 1, 6, 12, 24, 48, 72, 96, and 120 in 125 different models.
As many USGS measurement gauges are in the same watershed, many catchments in WaterBench-Iowa are not independent, and a relationship tree is given in “catchment_relationship.csv”. The csv file represents a disconnected directed graph, with each row representing an edge. Out of 125 catchments, 63 have one or more upstream catchments, as shown in the relationship file; these are relatively large catchments. The remaining 62 catchments are the most upstream catchments, which have only one stream gauge. As these catchments have no overlapping areas, the catchments in our dataset form a disconnected graph. Among the catchments that have overlapping areas, watershed ID 646 has the largest connected subgraph, with 27 upstream catchments. With upstream-downstream relationships, WaterBench-Iowa supports cutting-edge approaches such as graph neural networks.
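A minimal sketch of how such an edge list could be turned into upstream sets is given below. The toy edges, the (upstream, downstream) ordering of each pair, and the function names are assumptions for illustration; the actual column layout of “catchment_relationship.csv” should be checked before adapting it.

```python
from collections import defaultdict


def build_upstream_map(edges):
    """Map each downstream catchment to the set of its direct upstream catchments.

    Each edge is assumed to be an (upstream_id, downstream_id) pair.
    """
    upstream = defaultdict(set)
    for up, down in edges:
        upstream[down].add(up)
    return upstream


def all_upstream(node, upstream):
    """Return every catchment that drains, directly or indirectly, into node."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for up in upstream.get(current, ()):
            if up not in seen:
                seen.add(up)
                stack.append(up)
    return seen


# Toy edge list with two disconnected trees (hypothetical IDs).
edges = [("A", "C"), ("B", "C"), ("C", "E"), ("D", "E"), ("F", "G")]
upstream = build_upstream_map(edges)
nodes = {n for edge in edges for n in edge}
headwaters = sorted(n for n in nodes if not upstream.get(n))
print(sorted(all_upstream("E", upstream)), headwaters)
```

In this toy graph, catchments without any upstream neighbor play the role of the most upstream catchments described above, and the size of `all_upstream` for an outlet corresponds to the number of upstream catchments in its subgraph.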
In this section, we define a sample benchmark task of predicting the hourly
streamflow for the next 5 d for future comparative studies. This task
forecasts the future hourly streamflow at each hour, as the National Water Model
does. At each hour
Histogram of the GRU model performance.
Cumulative probability curve of the NSE and KGE at the 120 h forward predictions.
We take two separate approaches to tackle this problem. The first approach
involves a separate deep learning model for each of the available
watersheds, while the second one involves building a single large regional model
that carries out the same task for all available watersheds. For this
specific task, we selected the last water year as the test set, and the rest
as the training set. We further formatted the original dataset into a
ready-to-use structure for each watershed with four files named
train_x, train_y, test_x,
test_y. Thus, a total of 500 files for 125 watersheds are
provided for this specific task. Because general statistics such as mean
square error (MSE) and root mean square error (RMSE) are not
dimensionless, the metrics for this study are Nash-Sutcliffe efficiency
(NSE) and Kling-Gupta efficiency (KGE). Both are dimensionless
statistics that are widely used in hydrological studies and can be used to
compare between watersheds. Both NSE and KGE range from negative infinity to
1, and the closer to 1 the better. Equations (1) and (2) for NSE and KGE are
shown below:
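As an illustration of the two metrics, the following sketch implements the commonly used definitions: NSE as one minus the ratio of the squared error to the variance of the observations, and KGE in the form of Gupta et al. (2009), which combines the correlation r, the variability ratio α, and the bias ratio β. Whether the benchmark code uses exactly this KGE variant is an assumption, and the function names are illustrative.

```python
import math


def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2)."""
    mean_obs = sum(obs) / len(obs)
    squared_error = sum((o - s) ** 2 for o, s in zip(obs, sim))
    obs_variance_sum = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - squared_error / obs_variance_sum


def kge(obs, sim):
    """Kling-Gupta efficiency, assuming the Gupta et al. (2009) formulation."""
    n = len(obs)
    mean_obs = sum(obs) / n
    mean_sim = sum(sim) / n
    std_obs = math.sqrt(sum((o - mean_obs) ** 2 for o in obs) / n)
    std_sim = math.sqrt(sum((s - mean_sim) ** 2 for s in sim) / n)
    cov = sum((o - mean_obs) * (s - mean_sim) for o, s in zip(obs, sim)) / n
    r = cov / (std_obs * std_sim)   # linear correlation
    alpha = std_sim / std_obs       # variability ratio
    beta = mean_sim / mean_obs      # bias ratio
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)


observed = [1.0, 2.0, 3.0, 4.0]
doubled = [2.0 * q for q in observed]   # perfectly correlated but biased forecast
print(nse(observed, observed), kge(observed, doubled))
```

A forecast equal to the observed mean gives an NSE of 0, so positive NSE values indicate more skill than that trivial predictor. Note that this sketch does not handle constant series, for which the standard deviations are zero.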
Both NSE and KGE are dimensionless and in the range of (
To provide baseline results over the sample benchmark task and two approaches defined in the previous section, we employed a linear regression model using ridge regression, and three deep learning models using LSTM, GRU, and S2S network architectures. For the first approach, we considered each watershed independently and trained one model for each watershed. Thus, the relationship between the watersheds is not used in this benchmark. The median NSE and KGE scores among 125 watersheds at each hour are shown in Fig. 3 and Table 4. As shown in the figure and the table, the ridge regression has a high accuracy in the first 24 h as the streamflow rates normally do not change too much in 1 d, and they are relatively easy to predict. The metrics for the medium-range show that the model using GRU has the best performance. The NSE and KGE histograms of GRU show that for most of the watersheds the GRU model performs well and the GRU model gives negative scores only in a limited number of watersheds. The standard deviations show relatively stable results in all prediction hours using deep learning models. However, the ridge model shows higher standard deviations and lower model performance than deep learning models over 48 h.
Figure 5 shows the cumulative distribution of the NSE and KGE among the 125 catchments at the lead time of 120 h in addition to the median value for all 125 catchments. The results suggest that there is a large standard deviation between catchments, and that negative NSE and KGE values occur in 10 % of the catchments. These catchments with negative NSE or KGE values are small (Fig. 7), so it is very challenging to predict the streamflow over 5 d.
As for the second approach, we attempted to develop single regional models for all 125 watersheds as they share similar physical attributes. As shown in Fig. 6, a single model of all 125 watersheds is possible with the physical features including area, slope, travel time, and soil types using the customized NSE loss function (Xiang et al., 2021). Among four models, similar to the first approach, the performance of ridge regression is hard to beat at first. Nevertheless, the deep learning model S2S starts to show a better performance starting the second day. Table 5 shows the detailed results of the regional model. Regional modeling using deep learning is more difficult as seen by the decline in model performance and greater standard deviations compared to the watershed modeling results in Table 4.
The median NSE and KGE among 125 watersheds using one regional model at the prediction of the next 1–120 h.
The median (standard deviation) NSE and KGE among 125 watersheds at the prediction hour 1, 6, 12, 24, 48, 72, 96, and 120 using one regional model.
The distribution of the 120 h ahead prediction using the best model in our benchmark (GRU for the single station).
As shown in the results, there are two major limitations. First, the model
efficiency is low on the first day. It is shown in Fig. 3 and Table 4 that
the deep learning models do not show a higher accuracy in the first several
hours compared to the ridge model. Some hydrological studies have also shown
that the basic persistence model (Streamflow
Although a lot of metadata are provided in our dataset, as a benchmark our study does not consider complex pretreatment or models with domain knowledge in hydrology. Some recent studies have shown that moving averages for smoothing, the consideration of time lag, the consideration of watershed upstream-downstream connections, and other deep learning model architectures may be effective for better prediction. However, these studies were based on their own datasets, and the results cannot be directly compared. We encourage researchers to conduct comparisons based on WaterBench-Iowa.
The data and codes that support this study are openly available in Zenodo at
In this study, by aggregating the datasets of watershed area, slope, soil types, streamflow, precipitation, and ET from NASA, NOAA, USGS, and IFC, we present a dataset, namely WaterBench-Iowa, that is prepared for an hourly streamflow forecast task. This dataset has a high temporal resolution with abundant geographic and relational information, which can be used for a variety of deep learning and machine learning application research. We defined a sample streamflow forecasting task for the next 120 h and provided example benchmark results on this task with a traditional linear model and three customized deep learning models.
WaterBench-Iowa is not filtered and thus represents an actual streamflow
forecast problem as much as possible. Although the data are limited to the
Midwest, we believe that any studies on this dataset could provide insights
into other streamflow forecasting and rainfall runoff modeling studies in
other watersheds. With the open-source release of WaterBench-Iowa
(
ID contributed to the conceptualization, project supervision and administration, funding acquisition, and review and editing. ZX contributed to the data curation, methodology, writing of method, results, and discussion. BD contributed to the software, validation, data analysis, dataset submission, and review and editing. MS contributed to conceptualization, investigation, supervision, writing of introduction and conclusion, and review and editing.
The contact author has declared that none of the authors has any competing interests.
Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This article is part of the special issue “Benchmark datasets and machine learning algorithms for Earth system science data (ESSD/GMD inter-journal SI)”. It is not associated with a conference.
The work reported in this study was made possible by the support of members of the Iowa Flood Center and the Department of Civil and Environmental Engineering at the University of Iowa. This research received no external funding. We sincerely appreciate all the valuable comments and suggestions from the editors and reviewers, which helped us improve the quality of the manuscript.
This paper was edited by Martin Schultz and reviewed by three anonymous referees.