A 500-year annual runoff reconstruction for 14 selected European catchments



Introduction
Global warming has impacted numerous land surface processes over the last few decades (Reinecke et al., 2021), resulting in more severe droughts, heatwaves, floods and other extreme events. Droughts, in particular, pose a serious threat to Europe's water resources. The flow of many rivers is greatly reduced by prolonged droughts, which restrict the availability of fresh water for agriculture and domestic use. For example, the 2003 drought significantly reduced European river flows, by approximately 60 % to 80 % relative to the average (Zappa and Kan, 2007). Likewise, annual flow levels at several river gauges have decreased over recent decades. Numerous hydroclimatic reconstructions have been developed based on various proxy data, such as tree rings (Nicault et al., 2008; Kress et al., 2010; Cook et al., 2015; Tejedor et al., 2016; Casas-Gómez et al., 2020), speleothems (Vansteenberge et al., 2016), ice cores, sediments (Luoto and Nevalainen, 2017), and documentary and instrumental evidence (Pfister et al., 1999; Brázdil and Dobrovolný, 2009; Dobrovolný et al., 2010; Wetter et al., 2011).
The available reconstructed precipitation and temperature series (or fields) can be used to reconstruct runoff either with hydrological (process-based) models (Tshimanga et al., 2011; Armstrong et al., 2020), which respect general physical laws such as mass balance (e.g. MIKE SHE, Im et al., 2009, or VELMA, Laaha et al., 2017), or with data-driven methods able to capture complex non-linear relationships (for instance, support vector machines, Zuo et al., 2020, Ji et al., 2021; artificial neural networks (ANNs), Senthil Kumar et al., 2005, Hu et al., 2018, Kwak et al., 2020; random forests, Ghiggi et al., 2019, Li et al., 2021, Contreras et al., 2021). While the lack of physical constraints in the data-driven models limits their application under changing boundary conditions (relative to those of the model training period), their advantage is that they can often directly use biased reconstructed data as an input series.
The objective of the present study is to provide a multi-century annual runoff reconstruction for 14 European catchments, utilizing the available precipitation (P; Pauling et al., 2006) and temperature (T; Luterbacher et al., 2004) reconstructions and the Old World Drought Atlas self-calibrated Palmer Drought Severity Index (scPDSI) reconstruction (Cook et al., 2015). Specifically, we assessed a conceptual lumped hydrological model (GR1A; Mouelhi et al., 2006) and two data-driven models, a long short-term memory neural network (LSTM; Chen et al., 2020) and a Bayesian regularized neural network (BRNN; Okut, 2016), for annual runoff simulation over the period 1500-2000.
Section 2 introduces the P and T hydroclimatic reconstructions and the scPDSI drought indicator as well as the precipitation, temperature and runoff observations. In Sect. 3, we describe the data preprocessing, the models, the drought identification methodology and the goodness-of-fit assessment. The accuracy of the employed P and T reconstructions, as well as of the derived runoff simulations, is evaluated in Sect. 4. Finally, we summarize the advantages and limitations of the reconstructed datasets in the Conclusions in Sect. 6.

Data
This section presents the data used in this study. To force the models, we investigated the use of precipitation (Pauling et al., 2006) and temperature (Luterbacher et al., 2004) reconstructions for the past half-millennium and the scPDSI drought indicator data from the Old World Drought Atlas (Cook et al., 2015). For validating the runoff reconstructions, we used runoff from the Global Runoff Data Center (GRDC; Fekete et al., 1999). The accuracy of the atmospheric forcing reconstructions used as model input was assessed using observational P and T records from the Global Historical Climatology Network (GHCN; Menne et al., 2018). The datasets are summarized in Table 1 and are described in more detail below.

Temperature
Reconstructed temperature (T) was obtained from Luterbacher et al. (2004); the reconstruction relies on historical records and seasonal natural proxies (e.g. ice cores from Greenland and tree rings from Scandinavia and Siberia). The reconstructed temperature data are available at the same spatial and temporal resolution as the precipitation (see Table 1). We refer to both of these datasets as reconstructed forcings or reconstructed precipitation/temperature fields.

Self-calibrating Palmer Drought Severity Index (scPDSI)
In addition, we used data from the Old World Drought Atlas (OWDA; Cook et al., 2015), which contains information on moisture conditions across Europe: specifically, the self-calibrated Palmer Drought Severity Index (scPDSI), reconstructed from summer-sensitive tree-ring proxies over the period 0 to 2012 CE.

The Global Historical Climatology Network (GHCN)
The GHCN dataset (GHCN; Peterson and Vose, 1997) is one of the largest observational databases, collated by the National Oceanic and Atmospheric Administration (NOAA; Quayle et al., 1999). The GHCN-m dataset contains observed temperature, rainfall and pressure data from 1701 to 2010; for the majority of stations, however, data are only available after 1900. GHCN-m precipitation and temperature from GHCN V2, as well as from the GHCN V4 version (Menne et al., 2012), were used to assess the reconstruction accuracy of the P and T fields used as input into the considered models. We selected 113 precipitation and 144 temperature stations within the European domain (see Fig. 1) with records dating back earlier than 1875. Most stations are geographically concentrated in central Europe, and only a few stations are located in the eastern and northern areas of Europe (see Table 2). These data are hereafter referred to as the GHCN data.

Observed runoff
The Global Runoff Data Center (GRDC; http://www.bafg.de/GRDC/EN/Home/homepage_node.html, last access: 24 November 2016) provides data for more than 2780 gauging stations in Europe, with the oldest records starting in 1806. Only GRDC runoff time series with at least 25 years of data prior to 1900 were selected. In total, there were 21 such stations, predominantly in central Europe: 11 in Germany, 2 in France, 2 in Switzerland, 1 in the Czech Republic, 1 in Sweden, 1 in Finland, 1 in Lithuania and 1 in Romania (see Fig. 1). These stations cover 12 European river basins (Rhine, Loire, Elbe, Danube, Wesser, Main, Glama, Salzach, Nemunas, Gota Alv, Inn and Kokemaenjoki), with areas ranging from nearly 6100 km² (Kokemaenjoki, Muroleenkoski, Finland) to 576 000 km² (Danube, Orsova, Romania). The mean annual discharge (Qmean) varies from 50 to 5600 m³ s⁻¹ and spans different time periods for each catchment. The most extensive records were available in Sweden (Vargoens KRV) and Germany (Dresden), containing the longest discharge series of 212 and 208 years, respectively. The gauging station in Köln also provided 195 years of data for the Rhine River. Note that some of the gauging stations are located near each other and therefore show a greater degree of similarity in their runoff time series (e.g. the two stations in Basel, Rhine). Detailed information on all selected stations is provided in Table 2.

Methods
This section is divided into three parts. The first part describes the preprocessing of the reconstructed forcings (i.e. precipitation and temperature) for validation across Europe and the preparation of data for runoff simulation in 21 catchments (Sect. 3.1). The hydrologic and data-driven models used to generate the runoff reconstructions are presented in Sect. 3.2 and 3.3, respectively. Finally, Sect. 3.4 describes the methods for the evaluation of simulated runoff and reconstructed forcings, and Sect. 3.5 presents the methods to identify annual runoff droughts.

Data preprocessing
Two databases were considered for the analysis and development of the annual runoff reconstruction. The first one was used for evaluating the accuracy of the meteorological forcing reconstructions used for the hydrological simulations; it consists of observed GHCN data for all available European stations with long records (see Sect. 2.4) and the values of the corresponding grid cells from the reconstructed forcings dataset.
The second database was created as the basis for the runoff reconstruction, containing the observed runoff data for the 21 selected catchments (Table 2) and the corresponding input variables of the models used to generate the multi-century runoff reconstructions. Several input variables were considered for inclusion in the models, such as reconstructed precipitation and temperature and the Old World Drought Atlas scPDSI. The catchment average precipitation, temperature and scPDSI were estimated from the corresponding (gridded) datasets by averaging the relevant grid cells over the catchments. This database was further divided into two parts, calibration (after 1900) and validation (before 1900), to assess the models' accuracy and to select an appropriate model. The data preprocessing, model selection and evaluation of the models are depicted in Fig. 2.
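The catchment averaging step can be sketched as follows (a minimal Python/NumPy illustration; the study itself was implemented in R, and the array names here are hypothetical):

```python
import numpy as np

def catchment_average(field, mask):
    """Average a gridded annual field (years x lat x lon) over the
    grid cells covered by a boolean catchment mask (lat x lon)."""
    # Select the cells belonging to the catchment and average per year.
    return field[:, mask].mean(axis=1)

# Toy example: 3 years on a 2 x 2 grid; the catchment covers two cells.
field = np.array([[[1.0, 3.0], [5.0, 7.0]],
                  [[2.0, 4.0], [6.0, 8.0]],
                  [[0.0, 2.0], [4.0, 6.0]]])
mask = np.array([[True, True], [False, False]])
print(catchment_average(field, mask))  # [2. 3. 1.]
```

In practice the mask would be derived from the catchment boundary, possibly with area weights for partially covered cells.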

Hydrologic model (GR1A)
We applied the annual timescale hydrologic model GR1A (Mouelhi et al., 2006) to simulate annual runoff in each catchment. GR1A is a conceptual lumped hydrologic model (Manabe, 1969), considering dynamic storage and antecedent precipitation conditions. The model consists of a simple mathematical equation with a single (optimized) parameter:

Q_i = P_i {1 - 1 / [1 + ((0.7 P_i + 0.3 P_(i-1)) / (X E_i))^2]^0.5},

where Q, E and P represent annual runoff, basin average potential evapotranspiration and basin average precipitation, respectively, and i denotes the year. The parameter X is optimized individually for each catchment by maximizing the Nash-Sutcliffe efficiency (NSE) between observed and simulated runoff. The default gradient-based optimization from the R package airGR (Coron et al., 2017) was used. The potential evapotranspiration was calculated using the temperature-based formula of Oudin et al. (2005). Compared to other conceptual models from the GR family (GR4J, GR5J), GR1A is simple to use, allows for analysing many variants (particularly defining the best antecedent rainfall weighting) and is potentially useful for predicting wet and dry hydrologic conditions (Mouelhi et al., 2006).
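The GR1A relation can be coded in a few lines (a minimal Python sketch assuming the standard GR1A formulation; the study used the R package airGR, and the series and parameter value below are purely illustrative):

```python
import math

def gr1a(precip, pet, x):
    """One-parameter annual GR1A model (after Mouelhi et al., 2006):
    runoff depends on current and previous-year precipitation and on
    potential evapotranspiration, scaled by the single parameter X."""
    runoff = []
    for i in range(1, len(precip)):
        # Antecedent precipitation: weighted mix of current and past year.
        antecedent = 0.7 * precip[i] + 0.3 * precip[i - 1]
        runoff.append(precip[i] * (1.0 - 1.0 /
                      math.sqrt(1.0 + (antecedent / (x * pet[i])) ** 2)))
    return runoff

# Toy annual series (mm/yr); in the study X is calibrated by maximizing NSE.
p = [800.0, 900.0, 700.0, 850.0]
e = [600.0, 650.0, 620.0, 640.0]
q = gr1a(p, e, x=1.0)
```

The first year is lost because the model needs the antecedent precipitation, so simulations effectively start one year after the forcing series begins.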

Data-driven models
Artificial neural networks (ANNs; Senthil Kumar et al., 2005; Kwak et al., 2020) can describe non-linear relationships and are widely used for rainfall-runoff prediction. ANNs consist of artificial neurons organized in layers and of connections that route the signal through the network. Each connection has an associated weight that is optimized during calibration (in the context of ANNs known as training).
There are many types of ANNs, which differ in structure and type of connections, as well as in the direction of signal flow and the functional forms used for neuron activation or training. In the present study, we considered two approaches: long short-term memory (LSTM) neural networks and Bayesian regularized neural networks (BRNNs). These approaches have been commonly used in past rainfall-runoff modelling studies (Hu et al., 2018; Kratzert et al., 2018; Xiang et al., 2020; Ye et al., 2021). We considered combinations of reconstructed forcing, OWDA-based scPDSI and lagged forcing as inputs into the network for both model types. Specifically, the network using only reconstructed precipitation and temperature fields is referred to as [P, T]; the network with reconstructed forcing and OWDA scPDSI is termed [P, T, PDSI]; and finally the network which includes 1-year lagged P and T forcing in addition to the actual P and T is referred to as [P, T, lag]. We also explored lag times longer than 1 year. However, the correlation between precipitation and runoff drops significantly at lag times longer than 1 year, and such lags were therefore not included in the presented analysis.

Figure A1 shows the architecture of the LSTM, which is a modified version of the recurrent neural network (Hochreiter and Schmidhuber, 1997), trained using backpropagation through time (Werbos, 1990). LSTM is known for the efficient simulation of time series with long-term memory (Van Houdt et al., 2020). It generally consists of two unit states (hidden and cell states) and three distinct gates (forget, input and output). In this process, the cell state saves the long-term memory from the previous unit, while the hidden state acts as a working memory to process information inside the gates. These gates determine which information needs to be processed, remembered and transferred to the next state. Within the LSTM, different activation functions, such as the hyperbolic tangent and sigmoid, can be used to update the unit states.
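Constructing the three input combinations amounts to assembling per-year feature vectors with optional scPDSI and lagged columns; a minimal sketch (pure Python, hypothetical variable names):

```python
def build_inputs(p, t, pdsi=None, lag=False):
    """Assemble per-year feature vectors: [P, T], [P, T, PDSI],
    or [P, T, lag] with the 1-year lagged forcing appended.
    The first year is dropped so lagged values always exist."""
    rows = []
    for i in range(1, len(p)):
        row = [p[i], t[i]]
        if pdsi is not None:          # [P, T, PDSI] variant
            row.append(pdsi[i])
        if lag:                       # [P, T, lag] variant
            row += [p[i - 1], t[i - 1]]
        rows.append(row)
    return rows

p, t = [700.0, 800.0, 750.0], [8.1, 8.5, 8.3]
print(build_inputs(p, t, lag=True))
# [[800.0, 8.5, 700.0, 8.1], [750.0, 8.3, 800.0, 8.5]]
```

Each row then serves as one training sample for the BRNN, or as one step of an input sequence for the LSTM.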
The LSTM was implemented using the R packages "keras" (Arnold, 2017) and "tensorflow" (Abadi et al., 2016).
The training process of the LSTM is time-consuming due to its inherent complexity. Therefore, we also considered BRNN models, which provide fast learning and convergence and have already been used to tackle the complex relationship between rainfall and runoff (Ye et al., 2021). BRNNs are based on recurrent neural networks, which are often used to model time-series data (Wang et al., 2007), and the method is extended with Bayesian regularization (Okut, 2016) to account for uncertainty related to the network parameters and input data (Zhang et al., 2011). We trained this model in R using the "brnn" function of the "caret" package (Kuhn, 2015). More details are available in Appendix A3.
To set the optimal hyperparameters of the models (such as the number of neurons and the activation functions) and to reduce the likelihood of overfitting during calibration/training, the model performance was cross-checked against an independent (so-called "testing") set. The testing set was, for each learning exercise, extracted from the calibration data as a random fraction (25 %). This model development process was repeated several times, minimizing the root mean square error (for BRNN) and the mean square error (for LSTM) for each catchment individually. The model with the best performance was then chosen for further evaluation.
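The random hold-out of a 25 % testing fraction from the calibration data can be sketched as follows (an illustrative Python sketch; the seed and sizes are arbitrary):

```python
import random

def split_train_test(data, frac=0.25, seed=42):
    """Randomly hold out a fraction of the calibration years as an
    independent testing set (here 25 %), used for hyperparameter
    tuning and overfitting control."""
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    n_test = int(len(data) * frac)
    # Return sorted year indices for the training and testing subsets.
    return sorted(idx[n_test:]), sorted(idx[:n_test])

train, test = split_train_test(list(range(100)))
print(len(train), len(test))  # 75 25
```

Repeating this split with different seeds gives the several learning exercises mentioned above, from which the best-performing model is retained.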

Goodness-of-fit assessment
We used a set of eight statistical metrics to assess the performance of the simulated runoff, namely the Nash-Sutcliffe efficiency (NSE), Pearson correlation (R), standard deviation ratio (rSD), Kling-Gupta efficiency (KGE), root mean square error (RMSE), mean absolute error (MAE), bias (BIAS) and relative bias (relBIAS). The mathematical formulations of these metrics are provided in Appendix A1.
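Two of these metrics can be sketched directly from their standard definitions (a minimal Python illustration; the full formulations are in Appendix A1):

```python
import math

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 minus the ratio of the squared
    error to the variance of the observations (1 is a perfect fit)."""
    mean_obs = sum(obs) / len(obs)
    num = sum((o - s) ** 2 for o, s in zip(obs, sim))
    den = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - num / den

def rmse(obs, sim):
    """Root mean square error between observed and simulated runoff."""
    return math.sqrt(sum((o - s) ** 2 for o, s in zip(obs, sim)) / len(obs))

obs = [1.0, 2.0, 3.0, 4.0]
print(nse(obs, obs))                    # 1.0 for a perfect simulation
print(rmse(obs, [1.0, 2.0, 3.0, 5.0]))  # 0.5
```

NSE is scale-free and penalizes variance errors, while RMSE keeps the units of the runoff series, which is why both are reported side by side.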

Runoff drought identification
To check the utility of our reconstruction, we finally explore how well annual runoff droughts are represented in the simulations. Our study considers annual hydrological droughts, defined as a streamflow/runoff deficit, following the threshold level approach (Yevjevich, 1967; Sung and Chung, 2014; Rivera et al., 2017). This approach is typically used at daily or monthly timescales, considering 0.1 or 0.2 quantile threshold levels. To accommodate the annual scale used here, we defined the start of a drought as the year when the annual runoff falls below the 0.33 quantile (regular drought) or the 0.05 quantile (extreme drought). The drought persists until the runoff rises above the threshold again. The difference between the threshold and the runoff, determined for each identified drought year, is called the runoff deficit. The hydrological drought series can be further assessed to understand critical aspects of the (temporal) runoff dynamics and to classify past droughts in Europe (Wetter and Pfister, 2013; Cook et al., 2015).
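The threshold level approach described above can be sketched as follows (a minimal Python illustration; in the study the threshold would be the 0.33 or 0.05 quantile of the annual runoff series):

```python
def runoff_droughts(runoff, threshold):
    """Threshold level approach: a drought starts when annual runoff
    falls below the threshold and persists until it rises above it
    again; the deficit is the distance below the threshold per year."""
    events, current = [], []
    for year, q in enumerate(runoff):
        if q < threshold:
            current.append((year, threshold - q))  # (year, deficit)
        elif current:
            events.append(current)                 # drought ends
            current = []
    if current:                                    # drought at series end
        events.append(current)
    return events

q = [5.0, 2.0, 1.0, 6.0, 3.0, 7.0]
print(runoff_droughts(q, threshold=4.0))
# [[(1, 2.0), (2, 3.0)], [(4, 1.0)]]
```

Each inner list is one drought event; summing its deficits gives the cumulative runoff deficit used to rank historical droughts.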

Results and discussion
In this section, we analyse the 500-year annual reconstruction over space and time across Europe. Firstly, we provide a comparison between the GHCN-observed precipitation and temperature and the corresponding grid cells from Pauling et al. (2006) and Luterbacher et al. (2004) reconstructions. Next, the reconstructed annual runoff series for the selected catchments are evaluated against the corresponding observed GRDC runoff data.
Two distinct model types were investigated, i.e. a process-based conceptual lumped hydrological model (GR1A) and two data-driven models (BRNN and LSTM). While the former takes the reconstructed precipitation and temperature forcing as input, in the case of the latter we also considered the scPDSI and lagged reconstructed precipitation and temperature fields, as shown in Table 3. Statistical metrics, such as NSE, KGE, RMSE, MAE, R, BIAS and relBIAS (Appendix A1), are used to quantify the predictive skills of the models examined.

Evaluation of reconstructed precipitation and temperature fields
The 500-year annual paleoclimate reconstructions of precipitation (P) and temperature (T) were validated against the GHCN observational data. Maps showing the comparison are given in Figs. 3 and 4. The reconstructed data are evaluated against observational P and T at 99 and 94 European sites, respectively. Figure 3 shows that for most of the sites the correlation coefficient (R) of the P reconstruction is above 0.5; the relative bias (relBIAS) is between −0.1 and 0.1; KGE and NSE show values below 0.5 and 0.6, respectively; the rSD is between 0.7 and 0.9; and the RMSE varies between 0 and 150. The performance of the temperature reconstruction was relatively better, as depicted in Fig. 4. In this case, the RMSE between reconstructed and observational T is around 0.2 °C, and the rSD fluctuates between 0.95 and 1.05, while R is higher than 0.84, and the BIAS is less than 0.5 °C, except for stations located in the Alps. The NSE and KGE values were above 0.5 at the majority of the stations. The low skill observed at some locations can be explained by the unresolved variability of the grid-cell average temperature, especially in regions with complex terrain.
It is worth noting that the large spread of the goodness-of-fit (GOF) statistics is mainly due to outlying values at grid cells located along the boundary of the domain (i.e. the interface between land and sea/ocean) and at high elevations (see also Figs. 3 and 4). In general, the reconstructed precipitation exhibits greater differences from observations than temperature. This may be because the proxies considered in the reconstruction rely on different seasons and climate conditions. Additionally, the short instrumental records available before the 20th century could contain certain technical errors, such as problems with instrumental tools, station relocation and dating issues (Dobrovolný et al., 2010). Moreover, other studies (e.g. Ljungqvist et al., 2020) stated that the precipitation series employed for the reconstructions were relatively shorter and more erroneous than the temperature series before the 20th century (Pauling et al., 2006; Harris et al., 2014). Finally, the chosen statistical technique (principal component regression) could also contribute to variance inflation at longer timescales (Pauling et al., 2006).

Assessment of the reconstructed runoff simulations
The GR1A conceptual hydrological model was driven by catchment average P and T and calibrated using observed annual runoff for each catchment separately. The simulated annual runoff series were then compared to the corresponding GRDC observations (for the calibration and validation periods), and the results were summarized by means of GOF statistics. As can be seen in Fig. 5, the correlation and NSE statistics for calibration achieve reasonable results at most of the catchments, with a few exceptions (i.e. Kokemenjoki, Goeta, Nemunas and Inn). The catchments with relatively poor skill are located in northern Europe, which is in line with the previous findings of Seiller et al. (2012), who noted that lumped hydrological models often exhibit larger uncertainties and fail to capture extreme catchment values (both high and low) in those regions. The low skill for some of the catchments cannot easily be attributed only to bias in the reconstructed precipitation and temperature (described in Sect. 4.1) but rather to low station and proxy coverage in some (especially northern) parts of Europe, leading to biased basin-average precipitation and temperature estimates. Another study, by Fathi et al. (2019), suggested that the GR1A model is less efficient than the newer Budyko-framework-based SARIMA model in simulating annual runoff across the Blue Nile and Danube catchments. This may be due to the simplified nature of the model, which does not easily capture the complex relationship between rainfall and runoff variability.
In general, the statistical values presented in the heat maps (Fig. 5) indicate that the neural network algorithms are more skilled at runoff prediction than the GR1A model. The NSE and R statistics for the BRNN and LSTM models indicate a significant improvement in runoff prediction compared to the results obtained with the GR1A model. For instance, for Basel Rheinhalle the NSE increases from 0.27 to 0.73 (BRNN) and 0.75 (LSTM) for calibration and from 0.2 to 0.54 (BRNN) and 0.52 (LSTM) for validation. Moreover, including the scPDSI from OWDA with the reconstructed forcing [P, T, PDSI] increases the performance slightly more (NSE 0.76 for calibration and 0.57/0.59 for validation, for BRNN/LSTM, respectively), and considering the lagged forcing results in the best performance (NSE 0.75/0.8 for calibration and 0.6/0.54 for validation, for BRNN/LSTM).
Similarly, for all sites, the data-driven methods exhibited a strong correlation with the observed runoff, with the GR1A simulations most frequently resulting in lower correlation values. Other metrics (RMSE, MAE, KGE, rSD and relBIAS) are shown in Figs. S1-S5 in the Supplement. Across many study locations, the combination of reconstructed forcings and their 1-year lag performed best of all input combinations for both data-driven models (BRNN, LSTM), in terms of both rapid convergence (the number of iterations needed) and high accuracy.

To further demonstrate the differences between the individual models, we show the simulated runoff series of all models for the catchments with the highest (Blois-Loire) and lowest (Smalininkai-Nemunas) performance in Fig. 6. The performance of the models is comparable during the calibration period for the Loire River. Clearly, all data-driven models are capable of mimicking the observed runoff, while the GR1A model exhibited certain minor deviations, primarily until 1930. In the validation period, the differences between the models are more visible, in particular for above-average flows. This can be attributed to the different generalization skills of the individual models. At the beginning of the validation period (1870-1880), all models failed to simulate the high annual flows.
In the case of the Nemunas catchment, the GR1A simulation deviates strongly from the observed data and cannot capture the mean flow level. However, the calibration is poor even for the data-driven models, which do not simulate the year-to-year variability appropriately. Interestingly, for the validation period, the error of the GR1A model decreases. The performance of the data-driven models is similar in the validation and calibration periods. Looking at the GOF statistics, the models considering OWDA-based scPDSI or lagged forcings (e.g. P_t−1) perform slightly better in terms of KGE than the other model configurations.

The annual runoff reconstruction datasets
As a first step, we excluded the catchments that exhibited poor performance in validation (see Fig. 5). As a threshold, we considered a validation NSE greater than 0.5 for at least one model, following the approach used by Ayzel et al. (2020). In this step, we excluded 7 catchments (Vlotho-Wesser, Decin-Elbe, Burghausen-Salzach, Smalininkai-Nemunas, Vargoens KRV-Goeta, Elverum-Glama and Muroleekoski-Kokemenjoki) out of 21, ending up with a set of simulations for 14 catchments (highlighted by the rectangular box in Fig. 5).

Figure 5. The correlation coefficient (top) and NSE (bottom) for calibration (left) and validation (right) of the considered models for the 21 study catchments. The vertical axis represents the catchments (station name and river) and the horizontal axis the considered models. The rectangular black frames represent the catchments with NSE > 0.5 over the validation period.
Secondly, we identified the best candidate models for each of the 14 selected catchments, considering GOFs based on a validation NSE and R greater than 0.5 and 0.7, respectively. The best model for each catchment was then selected from these candidates considering the remaining validation measures (relBIAS, rSD, KGE, RMSE and MAE) as well. Specifically, we picked the models with consistently good validation measures. This choice is partly subjective, and a more formal selection should be explored further. On the other hand, the candidate models all performed comparably in most cases (see Table 3). The combination of reconstructed forcing with 1-year time lags results in the best performance over nine catchments, of which seven employed the BRNN and the remainder the LSTM model. The LSTM with reconstructed forcing and OWDA scPDSI was the best in just one case, and the remaining time-series reconstructions were most appropriately simulated with the BRNN [P, T] and BRNN [P, T, PDSI]. It should be noted that the differences between the well-performing models are small, as noted in Fig. 6 and further demonstrated in Fig. 7. The latter figure compares the cumulative distribution functions of annual runoff for the periods 1500-1800, 1800-1900 and 1900-2000, as simulated by the BRNN [P, T, lag] and LSTM [P, T, PDSI] (the two best-performing models) and the GR1A (the simulation deviating most from the best model), with the distribution of the observed annual runoff for the Basel Rheinhalle-Rhine catchment. For the calibration period in Fig. 7, the models perform well except for the GR1A, which generally overestimated the observed maxima. The cumulative distributions of the BRNN- and LSTM-simulated runoff are very similar for the validation period (1800-1900), except for the top and bottom 5 % in 1500-1800. The GR1A simulation showed significant differences over the entire distribution, overestimating the maxima and underestimating the minima.
Our findings show that GR1A simulates a minimum Rhine runoff of 279 mm yr⁻¹ at Basel, whereas the observed minimum in the past century is greater than 532.6 mm yr⁻¹; the cumulative distribution function (CDF) thus reaches significantly lower/higher runoff values between 1500 and 1800 for the BRNN and GR1A, whereas the LSTM appears to extrapolate less. The difference from the best model can be expressed in terms of KGE: even here, the GR1A model deviated considerably (KGE 0.6-0.7), while the LSTM is very similar to the BRNN (KGE 0.92-0.96). The most severe drought year identified by the models in the period 1500-1800 appears to be 1669, and 1921 in the past century (Fig. 7, left and right panels), while for 1800-1900 the models identified either 1865 (GR1A, LSTM) or 1858 (BRNN). Note that the 1858 low-water mark is documented at Laufenburg near Basel (Pfister et al., 2006) and was regarded as one of the worst winter droughts in the last 200 years.
The resulting 14 annual runoff reconstructions are available at https://doi.org/10.6084/m9.figshare.15178107 and are shown in Figs. S6-S8 in the Supplement. As an example, we present only two runoff reconstructions here (Fig. 8).
As an additional validation for the reconstructed series, we inspected the scatter plots of the observed and reconstructed runoff (Fig. 9). The simulated series are generally consistent with the observed runoff, especially for the Montjean-Loire, Köln-Rhine and Basel Schifflaende-Rhine catchments, which exhibit the best relationship between the observed and the simulated runoff.
Finally, to check the consistency of our reconstructed dataset, we compared the skill of our simulation with respect to the GRDC runoff observation and the GSWP3-forced GRUN monthly runoff (Ghiggi et al., 2019) datasets. The gridded GRUN datasets were averaged over the respective catchments to enable comparison (Figs. S9 and S10 in the Supplement). Our reconstruction outperforms GRUN data in terms of RMSE, MAE, relBIAS and NSE across the majority of the catchments, while the correlation (reproduction of interannual dynamics) to GRDC runoff is slightly higher for GRUN compared to our reconstruction. The variability, which our data-driven models underestimate (on average by 16.5 %), is overestimated by GRUN (on average by 17.2 %). Since the correlation compensates for the relBIAS, the KGE for our reconstruction and GRUN is comparable. This suggests that GRUN could be used for data-driven model training, provided at least some information on flow characteristics is available in the catchment.

Identification of low flows, significant hydrological drought events and trends
In the final step of the analysis, we compared the droughts identified in the reconstructions with those in the GRDC-observed series (Fig. 10). The agreement between the simulated and observed runoff deficits is lower than for the annual runoff time series. For most of the stations, the simulated deficit is lower than the corresponding observed estimate. This suggests that the reconstructed precipitation and temperature fields do not represent the inter-annual variability correctly. Despite this widespread issue with the representation of inter-annual persistence, Fig. 10 shows that the runoff deficits are simulated reasonably well for the Rees-Rhine and Köln-Rhine catchments. Furthermore, we contrasted the reconstructed drought patterns over the last 500 years with data available from documentary evidence and other sources. In the case of extreme droughts, we considered the q0.05 threshold before 2000 CE. The low-flow analysis since 1500 and the largest deficit values for the catchments (below the 5th percentile) are shown in Table 4. In the 16th century, the years 1536, 1540 and 1590 are associated with significant runoff deficits. The event of 1540 has already been reported (Brázdil et al., 2013; Cook et al., 2015; Brázdil et al., 2019) as the worst event of the 16th century and the most severe in terms of changing hydrologic conditions. In 1540, almost 90 % of the Rhine and Elbe River catchments (Basel and Cologne) experienced low yearly discharge, which ranked among the lowest flows in the last 5 centuries (Leggewie and Mauelshagen, 2018). The seasonal precipitation was also deficient, primarily in central Europe and England (Dobrovolný et al., 2010). Wetter and Pfister (2013) stated that the spring and summer of 1540 were likely warmer than the comparable period during the 2003 drought.
The simulation shows that the drought during 1540 was evident in most study catchments. In the 17th century, the years 1603, 1616, 1631, 1666, 1669, 1676, 1681, 1684 and 1686 were simulated as exceptionally low-flow years. Furthermore, two events (1669 and 1686) were associated with the largest water deficits across several study catchments. The Basel Schifflaende-Rhine catchment is a good example, as it appears to have experienced an extreme runoff deficit during 1669. In the Köln-Rhine catchment, 26 remarkable droughts were captured over the past 500 years, with the year 1686 reaching the largest runoff deficit (156 mm yr⁻¹). The year 1616 is considered the driest year of the 17th century, the so-called "drought of the century" (Brázdil et al., 2013), which significantly impacted the major rivers in Europe (e.g. Rhine, Main and Wesser). Brázdil et al. (2018) identified three unusual drought periods (1540, 1616 and 1718-19) over the Czech lands, highlighting the 1616 drought, which caused widespread famine, dried up the Elbe River watershed and also affected the neighbouring nations (Switzerland and Germany). The hunger stone of the Elbe River also records the exceptionally dry year of 1616 (Brázdil et al., 2013). During the 18th century, a similar level of runoff deficit was simulated in the years 1706 and 1719.
During the 19th century, the years 1863, 1864, 1874, 1893 and 1899 were recognized as drought years in all catchments, while in the 20th century, the driest periods occurred in 1921, 1934, 1949 and 1976. The 1921 drought in the Blois-Loire, Rees-Rhine, Köln-Rhine, Orsova-Danube, Basel Rheinhalle-Rhine and Basel Schifflaende-Rhine catchments was ranked as the most exceptional drought of the 20th century. Three catchments (Basel Rheinhalle-Rhine, Basel Schifflaende-Rhine and Blois-Loire) exhibited a large runoff deficit during the year 1921. A noticeable increase in temperature was experienced across Europe, and certain areas were notably affected by a heatwave in July of that year. The majority of central Europe, southern England and Italy were affected by this drought, where rainfall decreased by around 50 % to 60 % relative to the average (Bonacina, 1923; Cook et al., 2015). The precipitation totals were the lowest recorded since 1774, and the year also ranked first (in terms of rainfall deficit) in the Great Alpine region (Haslinger and Blöschl, 2017), where the rainfall deficit began in winter 1920/21 and lasted until autumn 1921. Newspaper reports on the Rhine River show that August 1976 was the fifth driest month between 1900 and 2014, in agreement with some of our catchment reconstructions, which signal 1976 as a drought year in the Köln-Rhine, Hann. Münden-Wesser and Bodenwerder-Wesser catchments.
In summary, the reconstructed annual runoff corresponded well to the majority of extreme drought years (e.g. 1540, 1616, 1669, 1710, 1724 and 1921, as highlighted in Table 4), as previously demonstrated in the OWDA-based PDSI tree-ring reconstructions and other works (Dobrovolný et al., 2010; Brázdil et al., 2013; Wetter and Pfister, 2013; Cook et al., 2015; Markonis et al., 2018). It is important to note that the presented runoff reconstructions may have missed some notable documented dry events, e.g. 1894, which was associated with unprecedented low levels of rainfall and excessive temperature rises in the south of England, the British Isles and other European regions (Brodie, 1894; Cook et al., 2015; Hanel et al., 2018).

Data availability
The annual runoff reconstructions were prepared using the datasets described above and can be accessed in the public repository Figshare (https://doi.org/10.6084/m9.figshare.15178107, Sadaf et al., 2021). The reconstructed precipitation and temperature data can be downloaded at https://www.ncdc.noaa.gov/data-access/paleoclimatology-data (last access: 20 February 2020). The monthly Global Historical Climatology Network (GHCN) data can be accessed via https://www1.ncdc.noaa.gov/pub/data/ghcn/ (last access: 12 May 2019). The GRDC runoff data repositories are publicly accessible at https://www.bafg.de/GRDC/EN/Home/homepage_node.html (last access: 24 November 2016). All analyses and visualizations were done using R.

Conclusions
In this study, a hydrological (GR1A) and two data-driven (BRNN and LSTM) models were used to reconstruct the annual runoff during the period 1500-2000, considering various input fields. After comprehensive validation of the simulated series, this work provides annual runoff time series for 14 catchments across Europe. The presented dataset can be used to investigate annual drought duration and severity. The main findings can be summarized as follows:
1. Data-driven methods have proven to be helpful for annual runoff simulations, even when there is high uncertainty in the forcing meteorological data. This contrasts with a conceptual lumped hydrological model, which would require bias correction before hydrological simulation.
2. There is no significant difference between the BRNN- and LSTM-simulated annual runoff, neither in terms of the individual values, nor in relation to the validation metrics.
3. Validation skill metrics suggest that for annual runoff prediction, it is beneficial to consider data-driven models that explicitly account for serial dependence, either through input data (e.g. time-lagged input fields) or directly in the model structure (e.g. LSTM networks).
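The time-lagged input fields mentioned in point 3 can be illustrated with a simple sliding-window construction. This helper is an illustrative sketch, not the study's code; the function name and toy series are assumptions.

```python
import numpy as np

def add_lags(x, n_lags):
    """Stack a series with its n_lags previous values as model inputs.

    Returns an array of shape (len(x) - n_lags, n_lags + 1) in which
    column 0 is x_t, column 1 is x_{t-1}, and so on.
    """
    cols = [x[n_lags - k : len(x) - k] for k in range(n_lags + 1)]
    return np.stack(cols, axis=1)

precip = np.arange(10.0)            # toy annual precipitation series
X = add_lags(precip, n_lags=2)      # each row: [P_t, P_{t-1}, P_{t-2}]
```

Feeding such a matrix to a data-driven model gives it explicit access to the previous years' forcing, which is one way to account for serial dependence without changing the model structure itself.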
The reconstructed annual runoff relies heavily on the consistency of the underlying reconstructed precipitation (Pauling et al., 2006) and temperature (Luterbacher et al., 2004) forcing fields. Unfortunately, those cannot be fully verified directly, due to the lack of sufficiently long observational datasets. With the limited information provided by the GHCN stations, we identified several notable deficiencies in the reconstructed forcings, in particular an underestimation of the variance in the precipitation reconstruction. Moreover, the proxy records that were used for the derivation of the precipitation and temperature input fields are spatially heterogeneous, with some regions being better represented than others; this inevitably leads to poorer performance over the less represented regions. Nevertheless, the skill of the precipitation and temperature reconstructions for deriving annual runoff across the selected catchments is still fairly good. In addition, the data-driven methods used in this paper were capable of removing systematic bias. We cannot be sure, though, that the link between reconstructed forcing and annual runoff is stationary when going back in time. Moreover, when the number of natural proxies included in the derivation of the forcing dataset decreases, the uncertainty increases. The reconstructed data should, therefore, always be considered with caution. Finally, since the runoff reconstruction is annual, dry summers can be compensated for by wet winters, masking years with sub-annual dry periods; this should, however, be regarded as a problem of resolution rather than of methodology. Future research could consider further improvements of the simulations, e.g. by training a meta-model combining the runoff simulations from several fitted models.
In addition, since interest is not often focused on the runoff series but on some other indicator (such as PDSI or deficit volume in the case of drought), it is also possible to simulate the drought indices directly, considering either the precipitation and temperature input fields or the simulated runoff. Finally, discrete classifiers (Kolachian and Saghafian, 2021) could also be used to simulate the drought (or water level) classes directly.

A1 Goodness-of-fit assessment
We used several statistical indicators to assess the skill of the annual runoff reconstruction. In the following definitions, p and o refer to the predicted and observed series, respectively, and i to the year index.
The standard deviation (SD) ratio (rSD; Ghiggi et al., 2021) is defined as

rSD = \frac{\mathrm{SD}(p)}{\mathrm{SD}(o)}.

The variability is underestimated when rSD is less than 1 and overestimated when it is greater than 1.
The root mean square error (RMSE; see, for example, Legates and McCabe, 1999) and mean absolute error (MAE; Legates and McCabe, 1999) measure how well the predictions fit the observations:

\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(p_i - o_i)^2}, \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}|p_i - o_i|.

MAE and RMSE values can range from zero to infinity, with zero indicating a perfect fit. Pearson's correlation coefficient (R) is defined as

R = \frac{\sum_{i=1}^{n}(p_i - \bar{p})(o_i - \bar{o})}{\sqrt{\sum_{i=1}^{n}(p_i - \bar{p})^2 \sum_{i=1}^{n}(o_i - \bar{o})^2}}.

The Nash-Sutcliffe efficiency (NSE; Nash and Sutcliffe, 1970), alternatively referred to as model efficiency, is

\mathrm{NSE} = 1 - \frac{\sum_{i=1}^{n}(p_i - o_i)^2}{\sum_{i=1}^{n}(o_i - \bar{o})^2}.

NSE = 1 corresponds to a perfect match between predicted and observed data, while a value less than 0 indicates that the model predictions are on average less accurate than using the long-term mean of the observed time series \bar{o}.
Systematic errors can be detected using the absolute bias (BIAS) or relative bias (relBIAS),

\mathrm{BIAS} = \bar{p} - \bar{o}, \qquad \mathrm{relBIAS} = \frac{\bar{p} - \bar{o}}{\bar{o}}.

The Kling-Gupta efficiency (KGE; Gupta et al., 2009) is calculated using three primary components, R, rSD and relBIAS, as defined above:

\mathrm{KGE} = 1 - \sqrt{(R - 1)^2 + (\mathrm{rSD} - 1)^2 + \mathrm{relBIAS}^2}.

relBIAS has a zero ideal value, while rSD and R have an ideal value of 1.
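The goodness-of-fit indicators above can be computed directly from the two series. A minimal sketch follows (the study's analyses were done in R; this NumPy version is an illustrative stand-in, and the combination of R, rSD and relBIAS into a KGE-style score is an assumption based on the components listed):

```python
import numpy as np

def skill_metrics(p, o):
    """Goodness-of-fit metrics for predicted (p) vs. observed (o) series."""
    p, o = np.asarray(p, float), np.asarray(o, float)
    rsd = p.std() / o.std()                      # SD ratio (ideal: 1)
    rmse = np.sqrt(np.mean((p - o) ** 2))        # root mean square error
    mae = np.mean(np.abs(p - o))                 # mean absolute error
    r = np.corrcoef(p, o)[0, 1]                  # Pearson correlation (ideal: 1)
    nse = 1 - np.sum((p - o) ** 2) / np.sum((o - o.mean()) ** 2)
    relbias = (p.mean() - o.mean()) / o.mean()   # relative bias (ideal: 0)
    kge = 1 - np.sqrt((r - 1) ** 2 + (rsd - 1) ** 2 + relbias ** 2)
    return {"rSD": rsd, "RMSE": rmse, "MAE": mae, "R": r,
            "NSE": nse, "relBIAS": relbias, "KGE": kge}

m = skill_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])  # perfect fit
```

For a perfect fit, RMSE, MAE and relBIAS are 0, while rSD, R, NSE and KGE equal 1, matching the ideal values stated above.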

A2 Long short-term memory (LSTM)
To build the LSTM model, we use the Keras environment (Arnold, 2017), with its high-level application programming interface (API) for neural networks, and TensorFlow (Abadi et al., 2016). Fig. A1 represents the structure of the LSTM neural model for the rainfall-runoff relationship in several catchments. We design our network by stacking one LSTM and two dense layers on top of one another. As shown in Fig. A1, the model was configured with four distinct input combinations, each of which was normalized to [0, 1] in the training and testing phases. Different batch shapes, numbers of units (i.e. neurons) and epochs were chosen for the model, as described in Table A1. The model uses the rectified linear unit (ReLU) activation with component-wise multiplication, and the dropout parameter is set to 0.1. According to Kingma and Ba (2014), the optimization algorithm plays a significant role in convergence and optimization; for this reason, the Adam optimizer is used, as it performs stochastic gradient descent (SGD) more efficiently with the backpropagation algorithm. During compilation, the learning rate is set to 0.001 or 0.002, and the mean square error (MSE) is used to measure model accuracy. In addition, the mean absolute error (MAE) is used as an objective to minimize residuals. Model checkpoints are used to save the model with the minimum loss and best accuracy during training.
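The gating mechanism that lets an LSTM layer carry information across time steps can be sketched in plain NumPy. This is an illustrative single cell with random toy weights, not the Keras implementation used in the study; all names and sizes here are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x, h_prev, c_prev, W, U, b):
    """One LSTM step: x (input), h_prev/c_prev (previous hidden/cell state).
    W, U, b hold the stacked weights of the forget, input, output and
    candidate transformations (each of size n_units)."""
    n = h_prev.size
    z = W @ x + U @ h_prev + b                   # shape (4*n,)
    f = sigmoid(z[0 * n:1 * n])                  # forget gate
    i = sigmoid(z[1 * n:2 * n])                  # input gate
    o = sigmoid(z[2 * n:3 * n])                  # output gate
    g = np.tanh(z[3 * n:4 * n])                  # candidate cell state
    c = f * c_prev + i * g                       # new cell state
    h = o * np.tanh(c)                           # new hidden state
    return h, c

rng = np.random.default_rng(1)
n_in, n_units = 3, 4
W = rng.normal(size=(4 * n_units, n_in))
U = rng.normal(size=(4 * n_units, n_units))
b = np.zeros(4 * n_units)
h, c = lstm_cell(rng.normal(size=n_in), np.zeros(n_units), np.zeros(n_units), W, U, b)
```

Because the cell state c is updated additively (f * c_prev + i * g), gradients can propagate over many time steps, which is why the LSTM structure accounts for serial dependence directly, as noted in the conclusions.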

A3 Bayesian regularized neural network (BRNN)
BRNN is a probabilistic technique for handling non-linear problems. Using the caret package, the model "brnn" was designed to work with a two-layer network as described by MacKay (1992) and Foresee and Hagan (1997). BRNN uses the Nguyen-Widrow algorithm to assign initial weights and the Gauss-Newton algorithm for optimization. The model is first trained on the training dataset, and its performance is checked by making predictions on the testing dataset. While selecting a model for train control, a simple boot resampling strategy was applied to evaluate performance. We tested the proposed model's predictive ability using a random bootstrap generator, with 75 % of the observations in the training set and 25 % in the testing set. RMSE was utilized as a loss function to compile and verify the model's accuracy. The model was fitted with 20 neurons in one hidden layer and the activation function g_k(x) = (exp(2x) − 1)/(exp(2x) + 1). After compilation, the train function automatically selected the model with the smallest RMSE as the final model.
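The activation used by brnn is the hyperbolic tangent written in exponential form. A short check of this identity, together with a simple 75/25 random split like the one described above, can be sketched as follows (a NumPy stand-in for the R caret workflow, illustrative only; the split here is a plain permutation, not caret's bootstrap resampling):

```python
import numpy as np

def g_k(x):
    """brnn activation: (exp(2x) - 1) / (exp(2x) + 1), which equals tanh(x)."""
    e = np.exp(2.0 * x)
    return (e - 1.0) / (e + 1.0)

def train_test_split(n, train_frac=0.75, seed=42):
    """Random index split: 75 % of observations for training, 25 % for testing."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    cut = int(round(train_frac * n))
    return idx[:cut], idx[cut:]

x = np.linspace(-3, 3, 7)
train_idx, test_idx = train_test_split(100)
```

Writing g_k(x) as tanh(x) makes clear that the hidden layer squashes its inputs to (−1, 1), a standard choice for regularized two-layer networks.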
Author contributions. The study was initially designed by RK, MH and YM. Algorithms were coded with the assistance of YM, US and MH. Datasets were collected by MRVG and SN. The research was carried out by SN, MS and MH, who also wrote the paper. OR and RK both helped to revise the manuscript.
Competing interests. The contact author has declared that none of the authors has any competing interests.
Disclaimer. Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.