GRUN: An observations-based global gridded runoff dataset from 1902 to 2014

Freshwater resources are of high societal relevance and understanding their past variability is vital to water management in the context of current and future climaticon-going climate change. This study introduces a global gridded monthly reconstruction 10 of runoff covering the period from 1902 to 2014. In-situ streamflow observations are used to train a machine learning algorithm that predicts monthly runoff rates based on antecedent precipitation and temperature from an atmospheric reanalysis. The accuracy of this reconstruction is assessed with cross-validation and compared with an independent set of discharge observations for large river basins. The presented dataset agrees on average better with the streamflow observations than an ensemble of 13 state-of-the art global hydrological model runoff simulations. We estimate a global long-term mean runoff of 15 3741938’452 km yr in agreement with previous assessments. The temporal coverage of the reconstruction offers an unprecedented view on large-scale features of runoff variability also in regions with limited data coverage, making it an ideal candidate for large-scale hydro-climatic process studies, water resources assessments and for evaluating and refining existing hydrological models. The paper closes with example applications fostering the understanding of global freshwater dynamics, interannual variability, drought propagation and the response of runoff to atmospheric teleconnections. The GRUN dataset is 20 available from the ETHZ Research Collection at https://doi.org/10.6084/m9.figshare.9228176 https://doi.org/10.3929/ethz-b000324386 (Ghiggi et al., 2019)(Ghiggi et al., 2019). (NOW TEMPORARY AT https://figshare.com/s/db241b4e0baf4fdb8430 BEFORE FINAL PUBLICATION).


Introduction
Water is one of the most important natural resources for human development and its availability affects water supplies, agricultural yields, energy production, and infrastructure safety and operation.Two-thirds of the global population is currently exposed to severe water scarcity (Vörösmarty et al., 2010;Kummu et al., 2016;Mekonnen and Hoekstra, 2016), and a recent annual risk report of the World Economic Forum (WEF, 2018) lists the water crisis as one of the largest global risks in terms of potential impact and likelihood.While river flow is regularly used to assess regional renewable freshwater resources (Vörösmarty et al., 2000;Oki and Kanae, 2006;Veldkamp et al., 2017;Munia et al., 2018), there is to date no publicly available global dataset providing observation-based estimates of the evolution of runoff and river flow throughout the 20th and the early 21st centuries.In the last decades, several international initiatives promoted the launch of modelling inter-comparison projects with the aim to improve the representation of the terrestrial water cycle in global hydrological models (Dirmeyer et al., 2006;Dirmeyer, 2011;Haddeland et al., 2011;Harding et al., 2011;Van Den Hurk et al., 2011;Warszawski et al., 2014;Van Den Hurk et al., 2016;Schellekens et al., 2017) as well as to develop tools to refine regional hydrological predictions in data-sparse regions (Sivapalan, 2003;Blöschl et al., 2013;Hrachowitz et al., 2013).In the meantime, a widespread decline in the number of streamflow monitoring stations has also been reported (Shiklomanov et al., 2002;Fekete and Vörösmarty, 2007;Fekete et al., 2012Fekete et al., , 2015;;Laudon et al., 2017) and alternative estimates of streamflow are thus needed for reconstructing past large-scale runoff variability, not only during the beginning of the century but also in recent decades.
G. Ghiggi et al.: GRUN In this contribution, we use a recently published collection of in situ streamflow data (Do et al., 2018;Gudmundsson et al., 2018b) in combination with a century-long reanalysis (Compo et al., 2011;Kim et al., 2017) to fill this gap.This study introduces a global gridded reconstruction of monthly runoff covering the period from 1902 to 2014 at a 0.5 • spatial resolution.Runoff is defined here as the amount of water drained from a given land unit (i.e.grid cell) eventually entering the river system, including groundwater flow and snowmelt.The methodology builds upon previous work where gridded runoff rate estimates were obtained for Europe (Gudmundsson andSeneviratne, 2015, 2016).Hereafter, these two papers are referred to as GS15 and GS16.Monthly observations of precipitation, temperature and observed runoff rates from small catchments are used as input for a machine learning (ML) algorithm to learn the runoff generation process without the explicit description of the involved hydrological processes.Gridded precipitation and temperature data are then used to predict runoff rates in ungauged regions as well.The reconstruction accuracy is evaluated using runoff observations at the grid-cell scale as well as river discharge measurements in large river basins, both not used for model training.It is also benchmarked against an ensemble of global hydrological model simulations forced with the same precipitation and temperature data.
The paper concludes with a section illustrating the potential of the newly established data product (GRUN) for climatological, hydrological and environmental studies.

Atmospheric forcing
Gridded observations of precipitation and temperature data are obtained from the Global Soil Wetness Project Phase 3 (GSWP3) dataset (version 1.05) (Kim et al., 2017).GSWP3 is a dynamically downscaled and bias-corrected version of the 20th Century Reanalysis (20CR) (Compo et al., 2011).The dataset covers the period 1901 to 2014 and is available on a regular 0.5 • × 0.5 • grid at 3-hourly resolution.The subdaily data are aggregated to monthly means and bilinearly interpolated to a cylindrical equal-area (CEA) grid composed of cells with an area of 2500 km 2 and a spatial resolution of approximately 50 km.

Runoff observations
Monthly runoff observations are derived from the Global Streamflow Indices and Metadata Archive (GSIM) (Do et al., 2018;Gudmundsson et al., 2018b).This dataset includes a collection of 35 002 streamflow stations obtained by merging existing international and national databases.GSIM provides a wide range of time series indices at monthly, seasonal and yearly resolution.Here time series of monthly mean streamflow are considered.The data selection and preprocessing of these observations is detailed in Sect.3.1.

Observed continental-scale river discharge
Observed monthly river discharge from 718 large river basins is taken from the Global Runoff Data Centre (GRDC) Reference Dataset (https://www.bafg.de/GRDC/EN/04_spcldtbss/43_GRfN/refDataset_node.html, last access: 31 October 2019).The dataset contains a selection of streamflow stations with a basin area greater than 10 000 km 2 and corresponding catchment shapefiles.These time series are removed from GSIM to ensure that independent observations are used for model evaluation (see Sect. 3.3).

Global hydrological models' simulations
The Inter-Sectoral Impact Model Intercomparison Project (ISIMIP) offers a framework to compare simulations and to quantify the uncertainty across hydrological and land surface models forced with equal inputs (Warszawski et al., 2014).The accuracy of GRUN is benchmarked against runoff simulations for the period 1971-2010 from an ensemble of stateof-the-art global hydrological models (GHMs) participating in the second phase of ISIMIP2a Water (Gosling et al., 2017).The GHM simulations used in the main text are driven with the GSWP3 forcing and do not account for human impacts on river flow ("nosoc" scenario).In the Supplement, we also provide the results based on simulations that account for direct human impacts (i.e. the "pressoc" and "varsoc" scenarios from ISIMIP2a).Further details on the ISIMIP2a simulation setup can be found at https://www.isimip.org/protocol/#isimip2a (last access: 31 October 2019).

GSIM time series selection and preprocessing
Step 1. Sub-setting GSIM stations and conversion of flow volumes to runoff rates Runoff is defined here as all the water draining from a small land area.Runoff cannot be observed directly, but at a monthly timescale the average catchment runoff can be assumed to equal the monthly streamflow measured at the outlet divided by the catchment area, provided storage of river water (e.g. in dams, reservoirs) and/or river water losses (e.g.river channel and lake evaporation, irrigation) are minimal.Thus, runoff rates (millimetres per month) are obtained by dividing the GSIM river discharge (cubic metres per month) with the station's upstream catchment area (km 2 ).We then select catchments with an area comparable to the grid-cell size of the atmospheric forcing data in order to derive observational estimates of the runoff rate response to changes in atmospheric forcing.
To retrieve accurate estimates of grid-cell runoff, only GSIM stations fulfilling the following criteria have been selected for further analysis.
1.The time series has observations within the period 1902-2014 (when GSWP3 forcing is available).
2. The original data provider reports an estimate of the drainage area.This choice is made to have the possibility to verify the geographic location of the station as well as to assess the reliability of the automated delineation of the drainage area using a digital elevation model as provided in GSIM (Do et al., 2018).
3. GSIM provides the shape of the drainage area and the quality of the catchment delineation is flagged as "medium" or "high".This criterion imposes that the difference between the drainage area reported by the data provider and the one estimated by GSIM is less than 10 % (Do et al., 2018).
4. The drainage area is between 10 and 2500 km 2 .Very small catchments (< 10 km 2 ) are discarded because the uncertainty in the drainage area can significantly affect the magnitude of the runoff rates.On the other hand, catchments larger than 2500 km 2 are removed because their drainage area spans too many grid cells of the atmospheric forcing.
Based on these criteria, 10 042 GSIM stations are selected for further analysis.
Step 2. Correction for mislabelled missing values Manual investigation of monthly river discharge time series revealed the occurrence of multiple consecutive months with streamflow volumes exactly equal to 0 m 3 per month, in disagreement with the observed regional runoff pattern.These artefacts likely stem from a misleading treatment of missing values (e.g.due to damaged sensors).To identify such likely missing values, all time series are screened for the presence of more than 3 consecutive months with values of zero.If this pattern occurs, all zero values in the monthly time series are set to "missing".
Step 3. Remove time series with unrealistic runoff rates and short temporal coverage The following criteria have been adopted to remove observations that are too sparse or physically very unlikely: 1. Remove time series with only missing values.
2. Remove time series with negative monthly runoff rates.
3. Remove time series with less than 2 years of observations.
4. Remove time series with monthly runoff rates higher than 2000 mm per month.
This screening step gives a selection of 8211 stations.
Step 4. Homogeneity testing River discharge time series can show temporal changes in the hydrological behaviour because of changing instrumentation, recalibration of streamflow rating curves, flow regulation (i.e.dam construction) and other human activities (i.e.irrigation).Automated identification of such break points is usually done using statistical tests (Gudmundsson et al., 2018b).GSIM used a general-purpose procedure that was applied to all indices/timescales.In this study, the following two target-oriented change-point detection methods are applied after log-transformation of the time series: 1. univariate normal change point in mean (Chen and Gupta, 2012); 2. univariate normal change point in variance (Chen and Gupta, 2012); 3. univariate normal change point in normal mean and variance (Chen and Gupta, 2012).
Runoff time series are discarded when any of these tests detects a change point.
Figure 1 shows three river flow time series with different types of detected change points.Figure 1a illustrates the ability of the tests to identify gradual changes in low flow regulation or low flow measurement precision.Figure 1b displays the detection of sudden changes in the mean of the time series, e.g.caused by dam construction, river diversion or measurement errors, while Fig. 1c shows the potential in spotting subtle changes in river discharge variability possibly induced by reservoir operations.
The homogeneity testing procedure resulted in a final selection of 7264 stations.
The file "GSIM_training_stations.csv" provided in the Supplement lists this subset of GSIM stations, while Fig. S1 in the Supplement shows the catchment area distribution of these stations.

Retrieving runoff rates at the grid-cell scale of atmospheric forcing data
To give equal importance to high-latitude and tropical observations, the entire modelling procedure is conducted on a cylindrical equal-area (CEA) grid composed of cells with an area of 2500 km 2 and a spatial resolution of approximately 50 km.The final gridded runoff product is however projected back onto the WGS84 grid of the atmospheric forcing data.Because of the high density of stations in some regions and the typically elongated shape of the drainage area, many runoff observations span multiple cells of the CEA grid.Thus, an observational runoff time series representative of each cell is retrieved as follows: 1. project the GSIM catchment shape to the CEA grid, and 2. for each grid cell a. select those catchments of which the drainage area intersects the grid cell, and b. at each time step, take the median runoff rate of the selected catchments.
In addition to reducing the oversampling in high-stationdensity regions, this preprocessing step also smooths out some sub-grid variability.Additionally, it can also reduce the effect of potential outliers (i.e.stations that have exceptionally high or low runoff rates compared to their neighbours).
To avoid inhomogeneities arising from the concatenation of different runoff time series, the observational runoff time series at each grid cell is submitted to another homogeneity testing run (as described in Sect.3.1, step 4).
The procedure resulted in 5094 grid cells usable for model training, covering 8.5 % of the total land area and yielding 2 703 902 monthly runoff rate observations.Hereinafter, the grid-cell runoff time series are referred to as the runoff observations and Fig. 2 shows their spatio-temporal coverage.

Selection and preprocessing of GRDC time series
To obtain an independent dataset for assessing the accuracy of GRUN in large river basins, streamflow stations with a catchment area larger than 10 000 km 2 are selected from the GRDC reference dataset.Although most of these stations are included in the GSIM collection, they are not used for model training because only catchments with an area smaller than 2500 km 2 are used to derive grid-cell runoff observations (Sect.3.1, step 1).
The GRDC time series are subject to the preprocessing steps 1 to 4 detailed in Sect.3.1 to discard streamflow records of low quality.This procedure results in a selection of 379 large river basins.

Model setup
For the first time, GS15 and GS16 have used a ML algorithm to estimate monthly runoff at continental scale, and Ghiggi (2018) explored the utility of a wide range of algorithms to improve the task.Based on these findings, the present study employs the random forest (RF) algorithm (Breiman, 2001).RF is a ML algorithm which averages a set of randomized regression trees (Breiman et al., 1984) trained on different subsets of the original data.A regression tree divides the predictor space into high-dimensional rectangles by means of recursive binary splits.The predicted value of a new observation is the average of the observations used in the training process located in the region of the predictor space to which the new predictor values belong.By averaging the predictions of several randomized regression trees built on different training data, RF improves the final accuracy of the runoff estimates.
The monthly runoff rate (R) is modelled as a function of monthly precipitation (P ) and monthly near-surface temperature (T ) as where f corresponds to the RF model (RFM), s represents the identifier of the CEA grid cell, t is the time step, and τ is a time lag operator that provides information about meteorological conditions of the past 6 months to allow the RFM to approximate water storage effects that influence the runoff generation process.This differs from GS15 and GS16, which used a time lag operator of 12 months.The reasons behind this change are a reduction in training time of RFM and a de- crease in collinearity between predictors (caused by the seasonal cycle).Both precipitation and runoff observations are logtransformed before model training to adjust the skewed distribution of the data and to avoid only a small number of high-flow events dominating the optimization of the squared error loss.Once the RFM is trained, gridded precipitation and temperature data are fed to the model to obtain the final runoff reconstruction.Finally, the log-transformation of the predicted runoff values is inverted to derive runoff rates in conventional units.
The decision to only consider precipitation and temperature as explanatory variables is motivated by GS15, who found that the inclusion of other atmospheric variables as well as selected land parameters (topography and soil texture) did not significantly improve the overall accuracy of the estimate.Furthermore, reducing the number of predictor variables also helped to reduce computational costs significantly.While a more extensive screening of other land parameters is beyond the scope of this study, this could be the subject of potential future research.

The GRUN reconstruction
Accurate predictions of a machine learning algorithm are conditioned to training of the model with observations.The use of different training observations has the potential to generate different outcomes if the model is not able to generalize the relationship between the response (i.e.runoff) and the predictors (i.e.precipitation and temperature) adequately.This situation occurs when the statistical model adapts too much to the training data (overfitting).To test the sensitivity of the RFM to the training data, 50 runoff reconstructions are generated using a Monte Carlo approach in which the RFM is trained using a random 60 % subset of the grid cells with observations.
The ensemble mean of the realizations is referred to hereinafter as the GRUN reconstruction (Ghiggi et al., 2019).The ensemble of realizations is in turn used to investigate the model sensitivity to the training data at multiple spatiotemporal scales in Sect.5.4.Within a cross-validation framework (Hastie et al., 2009), the available data are split into a training set and test set.Training data are used to build the statistical model, while the test data are employed to assess the ability of the algorithm to predict information unavailable during the training process.To evaluate the agreement of runoff predictions with observations, two target-oriented (Meyer et al., 2018) cross-validation (CV) experiments named CV-SREX and CV-SPACE are set up, which help to avoid an overoptimistic view of model performance.
CV-SREX aims to evaluate the ability of the model to extrapolate in the situation where no nearby runoff observations are available at all.For this purpose, the globe is divided into 26 subcontinental regions (Fig. 4) as defined in the Special Report for Managing the Risks of Extreme Events and Disasters to Advance Climate Change Adaptation (SREX) of the Intergovernmental Panel for Climate Change (Seneviratne et al., 2012).Successively, at each cross-validation step, all observations within a SREX region are removed from the training dataset and subsequently used to test the performance of the RFM.This implies that the rainfall-runoff relationship is learned and transferred from regions far away as local information is not available to calibrate the model.
CV-SPACE follows the approach of GS15 and aims to assess the effective prediction accuracy in data-rich regions, where nearby runoff observations can provide information to refine the runoff estimates.In this case, the grid cells are randomly divided into 10 folds.Then, at each cross-validation step, one fold is used as test set, while the observations of the remaining folds are used as training data.

Validation at the basin scale
The selection of 379 GRDC river discharge observations detailed in Sect.3.3 is used to assess the accuracy of the GRUN reconstruction in large river basins (area larger than 10 000 km 2 ).GRUN-based 1st-order river discharge estimates are obtained by spatially averaging the grid-cell runoff times series within the basin and multiplying by the drainage area.At a monthly timescale, the effect of water routing is considered negligible except for a few very large basins.

Performance metrics
Six performance metrics are used to assess the accuracy of the RFM in reproducing different aspects of the runoff time series.Model skill is determined for each cross-validated grid cell and for each selected large GRDC river basin.The terms p t and o t refer to the predicted and observed time series respectively.
The relative bias (relBIAS) has an optimal value of 0 and allows us to investigate the presence of systematic errors.A positive (negative) value indicates a general overestimation (underestimation).It is defined as The ratio of standard deviations (rSD) has an optimal value of 1. Values lower than 1 indicate underestimation, while values higher than 1 indicate overestimation of the observed variability.It is defined as The squared correlation coefficient, R 2 , ranges between 0 and 1.It measures the degree of the linear association between the predicted time series and the observed one.It is insensitive to the bias.The optimal value is 1.The Nash-Sutcliffe efficiency (NSE), also called model efficiency (Nash and Sutcliffe, 1970), is a measure of the overall skill of the model.NSE = 1 corresponds to a perfect match between predicted and observed data, while a value lower than 0 indicates that model predictions are on average less accurate than using the long-term mean of the observed time series (mean(o t )).It is defined as The squared correlation coefficient between the observed and predicted monthly standardized anomalies (i.e.monthly time series with the monthly climatology removed, divided by the long-term standard deviation of each month) is R 2 anom .It ranges from 0 to 1 (best value).
The squared correlation coefficient between the observed and predicted monthly climatology is R 2 clim .It ranges between 0 and 1 (best values).

Grid-cell-scale validation
To evaluate the runoff reconstruction at different timescales, Fig. 3 shows scatterplots between observations and the CV-SPACE predictions for monthly, annual and long-term mean values.Overall, the agreement is satisfactory, although there is a tendency to underestimate runoff rates when the magnitude increases.
Figure 4 shows the spatial distribution of the considered skill scores based on the two cross-validation experiments CV-SREX and CV-SPACE, while Table 1 reports the median values of the grid-cell skill score distribution.The spatial patterns emerging from the two cross-validation experiments are very similar, with CV-SPACE displaying better scores because of the RFM ability to exploit local information to improve runoff estimates.On average, the relBIAS of the RFM is slightly negative, indicating a tendency to underestimate monthly runoff rates (Fig. 4a-b).However, in arid regions such as the southwestern USA, northeastern Brazil and southern Africa, the RFM tends to overestimate the runoff (relBIAS is positive).Fig- ure 4c-d show that when runoff is overestimated the variability tends to also be exaggerated (rSD is higher than 1).Oppositely, in the other areas, the variability is generally underestimated.
Overall, the runoff dynamics are well reproduced as indicated by high values of R 2 (Fig. 4e-f).The NSE skill scores (Fig. 4g-h) show that in most regions of the world, RFM predictions are more skilful than the observed runoff long-term mean (NSE > 0).The accuracy in reproducing runoff anomalies shows a more complex spatial pattern (Fig. 4i-j): humid regions and lowlands have quite high R 2 anom values, while decreasing skill is observed in mountainous regions and arid regions.Finally, Fig. 4k-l illustrate that the seasonal cycle of runoff is excellently reproduced across the whole globe.Figures S5 and S6 show the distribution of CV-SPACE skills of relBIAS, NSE and R 2 clim for each Köppen-Geiger (KG) climate zone (Peel et al., 2007) as well as the various SREX regions.The 25th, 50th and 75th quantiles of these skill distributions are reported in the "KG_CV_SPACE_Skills.csv" and "SREX_CV_SPACE_Skills.csv"files provided in the Supplement.Figures S7 and S8 show the cumulative distribution function of the monthly runoff rates and the monthly stan-dardized anomalies for different climate zones respectively.In dry climates (group B of KG) we note an overestimation of GRUN in the lower part of the runoff rate distribution compared to the observation, although in terms of standardized anomalies the cumulative distribution of the estimate agrees very well with the observations.

Basin-scale validation
Figure 5a evaluates the accuracy of GRUN using the selected GRDC reference stations (Sect.3.3), while Fig. 5b shows the observational agreement of river flow time series for some selected basins displayed in Fig. 5a.The temporal evolution of river flow is in general well captured and an underestimation of the peak flow volume is only evident for the Mekong River.For the Ebro the agreement between observations and GRUN starts to decrease from 1965 on.The dynamics are no longer well captured, and GRUN estimates are constantly higher than the GRDC observations.These discrepancies might be caused by the intensive irrigation and reservoir activities which have altered the natural hydrological regime of the basin.In that respect, it is interesting to notice that the NSE spatial pattern in Fig. 5a shows many similitudes with the estimated amount of runoff stored by engineered impoundments reported in Vörösmarty et al. (2004): low NSE scores tend to correspond to higher fractions of water impoundment.Both the Nile and Colorado river basins are exceptional examples of human-induced river flow alterations.
However, human activities are not the only cause of discrepancies between GRUN-based river discharge estimates and the observations.In the Amazon River, the negative NSE value and the visible phase lag between the estimated and the observed time series might not be caused by an inaccurate runoff reconstruction, but rather related to the fact that river discharge is simply estimated using the average runoff within the basin without taking water travel times into account.Indeed, for such an enormous river basin, a routing model accounting for water travel times would be necessary to correctly reproduce the river flow dynamics at monthly timescales.Figure S8 shows the spatial distribution of the remaining skill scores (e.g.other than NSE) for the GRDC basins.

Benchmarking against global hydrological models
In this section, we have benchmarked the performance of GRUN against well-established GHMs at two different scales.timate runoff, although the skill spread of relBIAS is reduced compared to the ISIMIP2a models.Among the GHMs there is not a clear tendency to under-or overestimate runoff.The same applies for the variability (rSD).The dynamics of runoff (R 2 ) are better reproduced by GRUN than by the considered GHMs, and the overall NSE skill score distribution is better for GRUN than for the ISIMIP2a GHM simulations.The anomalies (R 2 anom ) are also better reproduced by GRUN, and CV-SREX outperforms all the single GHMs.Finally, R 2 clim demonstrates that GRUN reproduces the seasonal cycle much better than the GHMs.Previous studies already showed that GHMs struggle in reproducing the sea-  (Gudmundsson et al., 2012;Gudmundsson and Seneviratne, 2015).Similar conclusions can be drawn by benchmarking GRUN against the pressoc and varsoc experiments of the ISIMIP2a runoff simulations (Figs. S2 and S3 respectively).Because GHMs are typically not calibrated at the gridcell scale (unlike GRUN), we have also benchmarked GRUN against ISIMIP2a GHM simulations in large river basins using the selection of GRDC reference stations with a catchment area larger than 10 000 km 2 detailed in Sect.3.3.The results for the ISIMIP2a nosoc, pressoc and varsoc scenarios are reported in Figs.S9, S10 and S11 respectively.The dynamics of runoff (R 2 ), the anomalies and the climatology (R 2 clim ) are still better reproduced by GRUN than by the ISIMIP2a GHMs across all scenarios.The average rel-BIAS of GRUN is close to 0 while the variability is slightly overestimated: this contrasts the results obtained at the gridcell scale where GRUN tends to underestimate the variability (rSD) compared to the observations.Figure S12 also provides a comparison of simulated river discharge from ISIMIP2a against 50 GRUN realizations (see Sect. 4.2) for the same time series displayed in Fig. 5, highlighting the larger scatter of conventional GHMs, likely due to structural and parameter uncertainties.

Sensitivity of the runoff estimates to the observations used for training
An ensemble of 50 runoff reconstructions trained on different subsets of observations (Sect.4.2) is used to assess the sensitivity of GRUN to the observations used for training.
Figure 7 shows the long-term mean of the monthly ensemble standard deviation and coefficient of variation (defined as the standard deviation divided by the mean).Regions characterized by higher runoff rates show higher standard deviation (Fig. 7a), but this variability across the realizations is small (< 20 %) compared to the runoff magnitude (Fig. 7b).With the exception of arid regions, the coefficient of variation is generally below 0.2 (Fig. 7b).
To put the sensitivity of GRUN in relation to the observations used for training, Fig. 8 compares the annual runoff volumes of the GRUN realizations against the state-of-theart GHMs participating in ISIMIP2a.The global long-term mean runoff volume estimated by GRUN (38 452 km 3 yr −1 ) lies within the lower range of the ISIMIP2a GHMs (Fig. 8a) and generally agrees with other global terrestrial discharge estimates (Table 3).The uncertainty attributed to the selec-Figure 6. Benchmarking the performance of GRUN against ISIMIP2a GHM runoff simulations ("nosoc" experiment).Box plot whiskers cover the 0.1 to 0.9 quantiles of the skill score distribution.The dark green vertical lines indicate the optimal score.GRUN cross-validation results are displayed in orange, while the multi-model mean (MMM) of ISIMIP2a GHM runoff simulations is displayed in dark blue.In most of the cases, the order of the boxes follows the rank of the median skill score.However, to avoid the compensatory effect with relBIAS and rSD scores, the individual boxes are ranked based on the absolute median value of the skill score minus the optimal score.The x axis of relBIAS is left and right truncated, for rSD it is right truncated and for NSE it is left truncated.tion of training observations (shaded area in Fig. 8a) of the global GRUN runoff volume is far smaller than the spread introduced by different physical representations of the hydrological processes in the GHMs.The uncertainty introduced by the selection of training observations increases proportionally to the magnitude of the runoff rates and is highest in the tropics (Fig. 8b).Reversely, the spread of the GHMs tends to be constant across all latitudes.GRUN almost always shows latitudinal mean runoff rates lower than the MMM and goes beyond the GHM range only between 20 and 30 • latitude north.This pattern is mainly related to the relatively low runoff estimates in GRUN in northeastern India and Bangladesh compared to those of the GHMs (Fig. 8c).

Limitations of GRUN
The streamflow observations used for model training underwent careful preprocessing and screening steps to remove time series presenting sudden changes in the hydrological signature.Therefore, and because the product is solely forced with precipitation and temperature, GRUN is not able to explicitly account for the effects of local human river flow regulation (dam operations in particular) on the reconstructed hydrological regimes.However, we note that some streamflow observations impacted by irrigation or other land and water management practices have likely not been removed, especially if the magnitude of water abstraction/returns did not alter the monthly hydrograph sufficiently to identify a change point or if the time series is not long enough to cover Earth Syst.Sci.Data, 11, 1655Data, 11, -1674Data, 11, , 2019 www.earth-syst-sci-data.net/11/1655/2019/   past periods of near-natural streamflow.This may be one of the reasons for the overestimation of runoff rates in several arid regions (Fig. 4a-b) known for intensive-irrigation activities (Wriedt et al., 2009;Siebert et al., 2015).To some extent, the impact of past land-use changes on water availability might be implicitly accounted for in GRUN, for example if the GSWP3 bias-corrected reanalysis captured regional changes in precipitation and temperature induced by human activities (e.g.Davin et al., 2007;DeAngelis et al., 2010;Luyssaert et al., 2014;Alter et al., 2015;Thiery et al., 2017) or if water management practices are altered gradually together with a climate change signal (e.g.irrigation may increase with decreasing precipitation).Any changing pattern in water availability emerging from GRUN is however solely conditioned by trends of the GSWP3 forcing and the runoff observations used for model training.Thus, our evaluation is that GRUN estimates likely lie closer to nearnatural runoff conditions than to human-regulated conditions (e.g.see the Nile River estimate in Fig. 5b), even though we cannot exclude that GRUN implicitly includes some human effects due to the various reasons mentioned above.Finally, we note that the accuracy of the runoff rates in mountainous regions is likely not optimal.The coarse resolution of the considered meteorological forcing does not allow capture of the sub-grid variability of precipitation and temperature that governs snowmelt volume and timing in such regions.
Although the statistical model could implicitly account for homogenous biases in the forcing dataset and streamflow observations, the reader must be aware of possibly inconsistent www.earth-syst-sci-data.net/11/1655/2019/ Earth Syst.Sci.Data, 11, 1655Data, 11, -1674Data, 11, , 2019 water balance in such regions.Glacier melting is also not explicitly accounted for in GRUN.
6 Example applications

Runoff climatology
Figure 9a displays the annual runoff climatology derived as the long-term mean of the GRUN reconstruction covering the 1902-2014 period.Long-term mean runoff rates differ by 3 orders of magnitude across the globe, with the highest rates in the tropics and large mountain ranges and lowest rates in the extratropics and major world deserts such as the Sahara.Monthly climatologies are provided in the Supplement (Fig. S13). Figure 9b and c show the months with the minimum and maximum of the mean seasonal runoff cycle.In the Northern Hemisphere, regions exposed to snow accumulation have the lowest runoff in winter and a runoff peak toward the end of spring as a result of snowpack melting and decreasing terrestrial water storage (Humphrey et al., 2016).In the humid mid-latitudes, evapotranspiration follows the seasonal cycle of temperature, causing the lowest (highest) runoff to occur prevalently during the summer (winter) months.In the tropics, maximum runoff tends to occur during the rainy season, which follows the migration of the Intertropical Convergence Zone (Schneider et al., 2014).

Trends in reconstructed runoff
GRUN can be used to investigate changing freshwater availability.Trends in observed (Fig. 10a) and estimated (Fig. 10b) annual runoff for the period 1971-2010 are computed using Sen's slope (Sen, 1968) and expressed in absolute and relative terms.Overall, the reconstructed trends are in line with other reported findings (Gudmundsson et al., 2018a) and closely resemble the observed trends.
In Europe, the Mediterranean region exhibits a decrease in annual runoff, while in central and northern Europe there is a tendency towards increasing runoff rates.This pattern is in agreement with previous studies (Stahl et al., 2010(Stahl et al., , 2012) ) and was recently attributed to anthropogenic climate change (Gudmundsson et al., 2017).In the eastern and western USA negative trends occur, while large portions of the Mississippi River basin show increasing runoff.
In the tropics, the Amazon basin shows a substantial decrease in annual runoff rates.Studies have shown that a reduction of freshwater discharge to the Atlantic Ocean has the potential to impact the Atlantic and the Northern Hemisphere climate (Vizy and Cook, 2010;Jahfer et al., 2017).Considering the increasing human pressure to which this basin is currently exposed (Castello and Macedo, 2016;Latrubesse et al., 2017) and the uncertain impact of deforestation on river flow (D'Almeida et al., 2007;Spracklen et al., 2012;Lawrence and Vandecar, 2015;Spracklen and Garcia-Carreras, 2015), the causes and consequences of such trends should be investigated in more detail.Similarly, the drying tendency observed in many regions of the Congo Basin could affect the eastern equatorial Atlantic climate variability (Materia et al., 2012).Reversely, tropical areas in Southeast Asia experience an increase in runoff.
The monthly resolution of GRUN also allows investigation of these changes at sub-seasonal timescales (Fig. S14), which might be relevant for water resource assessments because neglecting the seasonal fluctuations can cause underestimation of water scarcity (Mekonnen and Hoekstra, 2016).In addition to changes in magnitude, the GSIM dataset also offers the possibility to analyse shifts in the seasonality of the hydrological regimes.Figure S15 provides an overview of the months in which the minimum and maximum runoff volumes occurred at the beginning and at the end of the 20th century.Over Europe for example, Fig. S15 shows evidence for earlier occurrence of maximum runoff, which is consistent with changes in snowmelt timing already reported in recent studies (Blöschl et al., 2017;Hall and Blöschl, 2018).

Interannual variability and teleconnections
The long temporal coverage of GRUN combined with its high skill in reproducing runoff dynamics provide an unprecedented opportunity to study the response of runoff to the modes of climate variability throughout the 20th and the early 21st centuries.The Hovmöller diagram in Fig. 11a illustrates the interannual runoff variability by showing the time evolution of the latitudinal mean of monthly runoff standard anomalies.The occurrence of El Niño events, defined here as the periods in which the multivariate El Niño-Southern Oscillation (ENSO) index (MEI) (Wolter and Timlin, 2011) is larger than 1, coincides with negative anomalies in the tropical regions.A correlation analysis between monthly standard anomalies of GRUN and the MEI time series reveals that during El Niño events, the Amazon basin, the Southeast Asia, Australia and southern Africa tend to experience lower runoff rates (Fig. 11c), which is consistent with previous assessments (Ward et al., 2010;Wanders and Wada, 2015).The opposite occurs during La Niña events, and drier conditions are observed in the western United States, which is also consistent with previous work (Tang et al., 2016).
As an additional example, Fig. 11d shows the influence of the North Atlantic Oscillation (NAO) on the European continent.The analysis confirms the previous finding that when NAO is positive, England and Scandinavia exhibit higher runoff rates, while southern Europe experiences drier conditions (Bouwer et al., 2006;Bierkens and van Beek, 2009;Lorenzo-Lacruz et al., 2011;Steirou et al., 2017).

Drought and agricultural productivity
GRUN can be used to study the spatio-temporal development of slowly evolving phenomena such as droughts.Since runoff can be defined as the excess of water available to ecosystems, negative runoff standard anomalies can be used as an indicator for droughts and potentially lower agricultural and vegetation productivity (GS15, GS16, Humphrey et al., 2018).
Figure 12 shows three drought events that are known for their exceptional severity and devastating impact on agricultural production.Figure 12a displays the monthly runoff standard anomalies in August 1976 over Europe that according to our results ranks in the top five driest months (in terms of runoff anomaly) in large parts of England, northern France, central Europe and southern Sweden.Studies have shown that the drought mainly developed because of severe precipitation deficits (Zaidman et al., 2002;Spinoni et al., 2015) rather than extremely hot temperatures as it occurred during the 2015 drought (Ionita et al., 2017).Figure 12b reports the annual runoff standard anomalies in North America in the year 1934.This drought is known as the "Dust Bowl" and is unique for its spatial extent and duration.The negative runoff anomalies spanned the entire United States.Several studies have suggested that initial drying caused by La Niña conditions was amplified by human-induced land degradation of the US Great Plains (Schubert et al., 2004;Cook et al., 2009Cook et al., , 2014)).During this event, dust storms severely damaged the American prairies by destroying millions of hectares of cultivated land.Finally, Fig. 12c illustrates the Horn of Africa drought conditions in 1984.The event also ranks in the most extreme events of the region and resulted in a widespread famine, which killed as many as 700 000 people in Ethiopia (Kidane, 1990).The drought was linked to El Niño conditions and a strong reduction in annual precipitation (Viste et al., 2013;Lanckriet et al., 2015).

Conclusion and outlook
This study presents an observationally driven global gridded reconstruction of monthly runoff rates derived using a machine learning algorithm.The dataset covers the period from 1902 to 2014 and is provided on a 0.5 • × 0.5 • WGS84 grid.The machine learning algorithm is trained with runoff observations from a global collection of in situ streamflow ob-Earth Syst.Sci.Data, 11, 1655Data, 11, -1674Data, 11, , 2019 www.earth-syst-sci-data.net/11/1655/2019/ servations of relatively small catchments (< 2500 km 2 ) and uses gridded precipitation and temperature from a centurylong reanalysis product as predictors.Model validation based on cross-validation experiments shows that the accuracy of the reconstruction is reasonable.On average GRUN shows higher predictive skills than a collection of state-of-the-art global hydrological models, especially with respect to the reproduction of the seasonality, dynamics and anomalies of runoff.At the monthly timescale, we find that a restricted number of predictors (i.e.precipitation and temperature) is sufficient to reproduce important aspects of terrestrial water dynamics.GRUN is thus an interesting candidate to evaluate and refine current parametrizations of global hydrological models as well as to potentially constrain fluxes of fineresolution models (in space and time) through the adoption of multi-scale optimization techniques (Samaniego et al., 2010(Samaniego et al., , 2017)).Since the GRUN reconstruction does not explicitly account for human flow regulation, differences between this reconstruction and in situ observations may help to identify heavily regulated locations on a global scale (Jaramillo and Destouni, 2015;Arheimer et al., 2017).GRUN offers a unique view of large-scale features of runoff variability in regions with limited or no observational coverage.The new dataset can be exploited (i) to study the onset and development of large-scale extreme events such as droughts, (ii) to investigate links between runoff and modes of climate variability, (iii) to conduct large-scale water resource assess-ments, (iv) to detect changes in water availability and dynamics, (v) to reconstruct droughts in the last millennium in combination with tree rings (Nicault et al., 2008;Cook et al., 2010aCook et al., , b, 2015;;Meko et al., 2012), (vi) to benchmark regional streamflow archives and hydrological reconstructions (Wang et al., 2009;Wu et al., 2011;Caillouet et al., 2017;Mishra et al., 2018;Moravec et al., 2019;Smith et al., 2019), and (vii) to address other scientific challenges in water cycle research (Wagener et al., 2010;Montanari et al., 2013;Greve et al., 2014;Trenberth and Asrar, 2014;Hegerl et al., 2015).
We conclude by remarking that this dataset would not have been possible without the mobilization of national and international hydrological archives.This study shows the benefit of a wider access to hydrological data collected by various institutions worldwide.We call for a continuation of the international efforts to reduce political and technical barriers for the exchange of hydrometeorological data across the scientific community.

Figure 1 .
Figure 1.Detection of change points in runoff time series.The vertical red line indicates the change point in variance detected by the univariate normal change point in variance test, while the horizontal blue dashed lines illustrate the change in mean identified by the test of univariate normal change point in normal mean and variance.The title of the individual panels corresponds to the station identifier as used in GSIM.

Figure 2 .
Figure 2. Spatio-temporal coverage of grid-cell runoff observations.Panels (a) and (b) display the start and end year of the time series respectively.(c) Total number of runoff observations at each month between 1902 and 2014.(d) Number of years with at least 10 runoff observations in each year.(e) Month with most missing values.
G.Ghiggi et al.: GRUN   4.3 Model validation 4.3.1 Cross-validation at the grid-cell scale

Figure 3 .
Figure 3. Scatterplots of observed versus predicted runoff values.The colour intensity is related to the point density.

Figure 4 .
Figure 4. Spatial distribution of the skill scores obtained from the CV-SREX (left) and CV-SPACE (right) experiments.SREX region boundaries are superimposed over the skill maps of CV-SREX.

Figure 6 Figure 5 .
Figure 6 compares the distribution of the skill scores for the CV-SREX and CV-SPACE experiments against the skill of the ISIMIP2a GHM runoff simulations of the nosoc experiment at the grid-cell scale.CV-SPACE always has higher skill than CV-SREX and outperforms all ISIMIP2a GHM runoff simulations and their multi-model ensemble mean (MMM) except for relBIAS and rSD.Overall, the GRUN cross-validation experiments show a tendency to underes-

Figure 7 .
Figure 7.Long-term mean of the monthly standard deviation of the runoff reconstruction ensemble (a) and the corresponding coefficient of variation (b).

a
The long-term mean is obtained by extrapolation from continental-scale river discharge observations or water balance.b Antarctica is never included in the GRUN estimate of the global long-term mean runoff.Greenland is considered only if included in the reference dataset.c GRUN long-term mean runoff is computed for the period1902-2002.d Haddeland et al. (2011) report that the CRU land mask used to rescale global mean runoff (excluding Antarctica and Greenland) has an area of 1.44 × 10 8 km 2 , while the correct area value should be around 1.33 × 10 8 km 2 .

Figure 8 .
Figure 8.The uncertainty of GRUN, attributed to the finite sample of training data, compared to the spread introduced by different physical representations of the hydrological processes in the ISIMIP2a GHMs.The shaded area around GRUN lines shows the entire distribution of the 50 GRUN simulations.(a) Global annual runoff.(b) Latitudinal average of long-term mean runoff.(c) Difference between GRUN and the MMM long-term mean runoff.Grey cells represent missing values caused by missing data in some of the ISIMIP2a GHM simulations.

Figure 9 .
Figure 9. Runoff climatology (1902-2014).(a) Long-term mean annual runoff rates.(b) Month with the minimum and (c) the maximum long-term mean monthly runoff.

Figure 10 .
Figure 10.Changes in annual runoff rates (1971-2010) expressed in absolute terms (left) and percentage change relative to long-term mean (right).(a) Observed trends.(b) Trends based on GRUN.

Figure 11 .
Figure 11.Interannual variability of runoff and its relation to modes of climate variability.(a) Hovmöller diagram of standard runoff anomalies (reference period 1902-2014).Vertical dashed lines indicate onset of El Niño events.(b) Time series of the multivariate ENSO index (MEI).Red and blue shades characterize the intensity of El Niño and La Niña conditions respectively.(c) Correlation of the MEI with monthly runoff anomalies.(d) Relationship of European runoff anomalies with the North Atlantic Oscillation (NAO).

Figure 12 .
Figure 12.Three extreme drought events as reconstructed by GRUN.(a) European summer drought in 1976.(b) US Dust Bowl in 1934.(c) Ethiopian famine in 1984.

Table 1 .
Median values of skill scores for CV-SREX and CV-SPACE.

Table 2 .
Median values of skill scores computed for the large GRDC river basins.

Table 3 .
Comparing global long-term mean runoff from GRUN against values reported in the literature.GRUN estimates are obtained by considering the same time span and spatial coverage of the reported studies.Values in parentheses denote the uncertainty range reported in some studies.