GRQA: Global River Water Quality Archive

. Large-scale hydrological studies are often limited by the lack of available observation data with a good spatiotemporal coverage. This has affected the reproducibility of previous studies and the potential improvement of existing hydrological models. In addition to the observation data themselves, insufﬁcient or poor-quality metadata have also discouraged researchers from integrating the already-available datasets. Therefore, improving both the availability and quality of open water quality data would increase the potential to implement predictive modeling on a global scale.


Introduction
Human-driven loads of nutrients to aquatic ecosystems have become the main driver of eutrophication in waterways and coastal zones (Desmit et al., 2018;Sinha et al., 2019). Agricultural production is already one of the major forces behind environmental degradation (Foley et al., 2011), and population growth is increasing that pressure (Mueller et al., 2012). The use of nitrogen (N) and phosphorus (P) fertilizers to increase agricultural productivity is predicted to increase 3fold by 2050 unless more efficient fertilizer use can be im-plemented (Tilman et al., 2001). At the same time, it has been estimated that "globally, over 3 billion people are at risk of disease because the water quality of their water source is unknown due to a lack of data" (UN-Water, 2021). In order to achieve the UN Sustainable Development Goal (SDG) 6, we need better understanding of our water resources and water quality. Monitoring and modeling the hydrochemical properties of rivers is essential for understanding and mitigating water quality deterioration due to agricultural and industrial non-point-source pollution (Krysanova et al., 1998;5484 H. Virro et al.: Global River Water Quality Archive water quality indicators such as nutrients (Caraco and Cole, 1999;He et al., 2011), carbon compounds (Evans et al., 2005;Hope et al., 1994), sediments (Choubin et al., 2018;Ouyang et al., 2018) and oxygen (Radwan et al., 2003;Singh et al., 2009) gives valuable understanding of hydrochemical cycles and enables us to estimate the effect of human influence on them.
Traditional approaches to water quality modeling consist of applying bottom-up, physically based models on the catchment level (Wellen et al., 2015). Calibration and validation data in the form of water quality observations used when developing the model and verifying its performance are usually gathered through in situ observations and, more recently, automated sensor networks. Although airborne remote-sensing-based data acquisition methods have been successfully used to supplement field data for lakes (Chen and Quan, 2011;Toming et al., 2016), applying those methods is only viable in the case of rivers with a large enough surface area (Olmanson et al., 2013). Therefore, improving the river water quality data's spatial and temporal coverage with remote sensing is limited. Significant progress has been made in improving the technical capabilities and lowering the installation and maintenance costs of the field sensors, but the spatial and temporal coverage of observation sites remains an issue (Pellerin et al., 2016).
In order to improve the spatial coverage of water quality and hydrological data, different solutions have been used in predictive hydrological mapping. Until recently, a common approach for predicting water quality and hydrological phenomena in ungauged catchments has been the application of already-existing process-based models to catchments with similar characteristics (Hrachowitz et al., 2013;Strömqvist et al., 2012;Wood et al., 2011). These physical models usually require extensive calibration along with location-specific knowledge, which limits the wider applicability and spatial upscaling that can be done (Abbaspour et al., 2015;McMillan et al., 2012).
Recently, advances in implementing machine learning (ML) methods in hydrology have given rise to a new, datadriven approach to hydrological modeling (Mount et al., 2016). A comparison of physically based and ML approaches has shown that ML methods can achieve a similar accuracy to the physically based ones and outperform them when describing nonlinear relationships (Chau, 2006;Ouali et al., 2017;Papacharalampous et al., 2019). The recent advance of so-called physics-guided ML, which entails combining process-based models with ML methods, is likely to become more applicable in the near future as well (Kratzert et al., 2019;Shen et al., 2018;Marzadri et al., 2021).
Nevertheless, a major problem related to large-scale predictive hydrological modeling has been the lack of available observation data with a good spatiotemporal coverage (Bierkens, 2015). This has affected the reproducibility of previous studies and the potential improvement of existing models (Blöschl et al., 2019;Meals et al., 2010;Stagge et al., 2019). In addition to the observation data themselves, insufficient or poor-quality metadata have also discouraged researchers from integrating the already-available datasets. Here, ambiguities in supplementary metadata such as parameter names, units and methods of measurement have limited the use of open data for large-scale water quality modeling purposes (Archfield et al., 2015;Hutton et al., 2016;Sprague et al., 2017). Therefore, improving both the availability and quality of open water quality data would increase the potential to implement predictive modeling on a global scale. Global ML models have been already successfully used for discharge modeling (Beck et al., 2015;Gudmundsson and Seneviratne, 2015), and recent years have seen the publication of global discharge datasets (Do et al., 2018;Harrigan et al., 2020). The publication of global and continental datasets (Hartmann et al., 2014;Read et al., 2017) could make ML methods applicable for large-scale water quality modeling as well . However, issues related to a lack of training and validation data due to general data scarcity affect model accuracy and, therefore, limit the further adoption of ML for global water quality predictions (Chen et al., 2020).
We aim to address the aforementioned issues by presenting the novel Global River Water Quality Archive (GRQA) by integrating and harmonizing five different global and regional datasets. The resulting dataset has combined observation data for 42 different forms of some of the most important water quality parameters relevant for nutrients (e.g., water temperature and oxygen, phosphorus, nitrogen and carbon compounds). Supplementary metadata and statistics are provided with the observation time series to improve the usability of the dataset. An extensive data catalog with maps showing the spatiotemporal coverage and graphs describing the distribution of all 42 parameters as the supplementary material of the study are presented in the Supplement. We report on developing a harmonized schema and reproducible workflow that can be adapted to integrate and harmonize further data sources. In addition, we provide recommendations for improving the multi-source water quality data compilation, especially focusing on the metadata quality and adhering to the FAIR data principles (Wilkinson et al., 2016). We conclude our study with a call for action to extend this dataset and hope that the provided reproducible method of data integration and metadata provenance shall lead as an example.

Data
A total of five data sources were used to compile the GRQA, with two being at the global, one at the regional and two at the national level (Table 1). All datasets with the exception of GEMStat are publicly available to download online as CSV or Excel file packages. GEMStat data can be requested via email. The number of available observation sites was highly dependent on the source, with the Water Quality Portal (WQP) maintained by the United States Geological Survey (USGS) having the most sites. Files used during the creation of GRQA are listed in Table 2.

CESI
The first dataset included in GRQA originated from the Canadian Environmental Sustainability Indicators (CESI) program operated by Environment and Climate Change Canada (ECCC), which is a Canadian governmental department responsible for coordinating environmental policies and programs. CESI consists of water quality measurements collected by federal, provincial and territorial monitoring programs from Canadian rivers from the 2002-2018 time period (Environment and Climate Change Canada, 2020). CESI data are mainly focused on heavy metals, so out of the 42 parameters included in GRQA, only 8 were available in CESI (Table 1). It is the smallest of the five source datasets with the site count ranging from 2 to 77 per parameter. Mean time series length per site is approximately 13 years, and the average number of observations per site is 145.

GEMStat
The Global Freshwater Quality Database GEMStat is hosted by the International Centre for Water Resources and Global Change (ICWRGC) and provides inland water quality data within the framework of the GEMS/Water program of the United Nations Environment Programme (UNEP). GEMStat contains over 7 million samples from approximately 5700 sites in 75 countries. The data were obtained through a custom request to their data portal (United Nations Environment Programme, 2020).
Approximately 500 water quality parameters were available in the GEMStat database, out of which 32 were used when compiling GRQA (Table 1). Observations cover the period 1950-2020, and the mean observation count per parameter is approximately 41. Mean time series length per site is 9 years. The site count per parameter ranges from less than 10 (dissolved and total carbon) to 4274 (total phosphorus).

GLORICH
The GLObal RIver CHemistry (GLORICH) database (Hartmann et al., 2014) is a collection of hydrochemical data from more than 1.27 million observations and more than 18 000 sampling locations across the globe. The samples originate from various environmental monitoring programs and scientific literature.
Out of 47 water quality parameters available in the raw data, 26 were chosen to be included in the GRQA (Table 1). The samples cover the time period of 1942-2011, but the length of the time series is dependent on the parameter. Mean time series length per site is less than a decade for all parameters. The number of available sites per parameter ranges from just 4 (particulate organic nitrogen) to 9728 (dissolved inorganic phosphorous). The dataset can be downloaded from Pangaea (Hartmann et al., 2019).

Waterbase
Waterbase is the generic name given to the European Environment Agency's (EEA) databases on the status and quality of Europe's rivers, lakes, groundwater bodies, and transitional, coastal and marine waters (European Environment Agency, 2020). The database is compiled from data sent by the national European water agencies involved in the Water Framework Directive (WFD).
Over 600 water quality parameters are included in the full dataset, out of which 15 matched those of GRQA (Table 1). Out of all source datasets, Waterbase had the shortest time series with observations covering only the period 2008-2018. The maximum site count per parameter is 1976, while there were on average only around 19 observations per site.
In May 2020, the ICWRGC announced that parts of Waterbase had also been added to the GEMStat database (United Nations Environment Programme, 2020). However, only sites with more than 3 years of data were included in this update. As mean time series length per site was only 1.4 years in Waterbase, a significant number of sites were left out, which is why we decided to include Waterbase separately in GRQA. Although it is likely that there were many observations which appeared both in GEMStat and Waterbase, the duplication detection procedure discussed in Sect. 3.3 should have identified them.

WQP
USGS, the US Environmental Protection Agency (EPA) and the National Water Quality Monitoring Council developed the Water Quality Portal (WQP), which is so far the largest standardized water quality database (Read et al., 2017;United States Geological Survey, 2020). Although the portal also includes data from a few other countries (e.g., Mexico, Pacific Islands) associated with the National Water Information System (NWIS) network, only a very limited amount of non-US samples were available. For this reason, only US national data were selected to be added to GRQA.
Due to the size of the source dataset, the full set of parameters could not be downloaded at once. Therefore, a scripted download procedure was used to retrieve water quality samples and their corresponding sampling sites separately per parameter. In the case of temperature, the data had to be further divided by state. Unlike other source datasets used in the study, the WQP often had multiple versions of the same parameter available under separate codes in case the parameter had been measured in different units, e.g., using different methods. The final count of parameters used for GRQA was 37 (Table 1). The longest time series of source datasets is present in the WQP, with some dating back to 1898. However, the average time series length per station is just over 3 years. Like GEM-Stat, WQP is still being updated, so most parameters have their latest observations in 2020. The site count ranges from a single station (dissolved inorganic nitrogen) to 59 000 per parameter (total suspended solids).

Methodology
The GRQA compilation workflow was divided into three parts: (1) the preprocessing stage involved converting observation data from the five sources into a common format and harmonizing the corresponding metadata; (2) preprocessed data were merged by parameter, after which outliers and time series characteristics were detected; (3) duplicate detection was conducted in the last processing step. The Pandas (McK-inney, 2010), GeoPandas (Jordahl et al., 2020) and NumPy (Harris et al., 2020) Python libraries were used throughout all data processing stages.

Parameter selection
The parameters included in GRQA cover the four groups of water quality indicators outlined in the introduction: nutrients, carbon, sediments and oxygen (Table 7). GLORICH was used as a reference for parameter selection due to its being one of the two global source datasets and having the least amount of discrepancies within source data, i.e., each GLORICH parameter had a single matching code, unit, etc.

Parameter harmonization
Preliminary analysis showed that there were ambiguities in the parameter names, codes, units and chemical forms in the different source datasets, which has been identified as a recurring issue when dealing with multi-source water quality data (McMillan et al., 2012;Sprague et al., 2017). For this reason, lookup tables were created for each of the source datasets (*_code_map.csv) to use as guides in the following processing stages (Table 3). The purpose of the schemas was to match parameter codes and other metadata with the versions used later in the GRQA. For most parameters, this could be done based on the literal names, remarks and descriptions in the metadata. Relevant literature and online resources were consulted for more ambiguous scenarios. One such example was total suspended solids (TSSs), which can also be reported as suspended particulate matter (SPM) (Neukermans et al., 2012). When a reliable decision could not be made (e.g., biological oxygen demand as BOD vs. BOD5), the parameters were kept separate.

Unit conversion
Units of measurement were harmonized along with other metadata. All parameters except temperature ( • C), pH and dissolved oxygen (%) were converted into milligrams per liter (mg L −1 ), which was the most prevalent unit in source data. When units were converted, observation values had to be changed as well. This was done by calculating conversion constants which were based on both the magnitude of the source unit (e.g., µg L −1 vs. mg L −1 ) and the reported chemical form of the parameter. The latter affected nitrite (NO 2 ), nitrate (NO 3 ) and ammonium (NH 4 ) the most as these parameters had a variety of forms in the source data that were all converted into corresponding nitrogen versions . In some cases, the chemical form could be identified from the source unit (e.g., mg N L −1 or mg NO 3 L −1 ), while others were detected by examining parameter names and method descriptions (e.g., "Nitrate, reported as nitrogen"). Where possible, additional informa-tion about these missing forms was collected from proxy sources, such as other similar datasets (e.g., Börker et al., 2020 in the case of GLORICH). These references have been included in the form_ref column in corresponding lookup tables (*_code_map.csv). For other nitrogen (TKN, TN, etc.), all carbon (DOC, TC, etc.) and phosphorus (TP, TIP, etc.) parameters (see Table 7 for abbreviations), the chemicals were assumed to be either N, C or P even if not reported because there is only one common element in the molecule (Sprague et al., 2017). GLORICH was the only source dataset which also needed conversion constants for carbon and phosphorus parameters as they had been reported as micromoles per liter (µmol L −1 ). All WQP units matched those intended to be used for GRQA, so no conversion was needed. The formula for conversion constants was where x 1 and x 2 are observation values before and after conversion, M is the corresponding molar mass, and n is the magnitude difference between source and converted unit. Some examples of unit conversion are given in Table 4. The full list of all unit conversion procedures is given in Appendix A (Table A1).

Site ID duplication
There were some instances of duplicated site IDs in GLO-RICH (2 site pairs) and Waterbase (101 pairs) source data, which meant that joining observations with sites would have created duplicate time series as well. Site ID duplicates could indicate that there had been small shifts in the site location or that the site had been closed and reinstated at some point. If the distance between the duplicate pairs was less than a kilometer, only the first instance was retained in the output table. When distance was greater than a kilometer, both instances were removed since metadata that could be used to make a decision (e.g., when the site first opened) were not available. Finally, all duplicate pairs were exported as separate files (e.g., GLORICH_dup_sites).

Coordinate conversion
CESI and WQP originally had the site coordinates in the North American Datum of 1983 (NAD83). The pyproj  Python library was used for converting the North American site coordinates into World Geodetic System 1984 (WGS84), which was the coordinate system chosen for the GRQA.

Observation data filtering
Preliminary cleaning included the removal of observations of negative, missing or low-quality values. In this case, low quality refers to measurements that were flagged as either coming from unreliable sources or having any kind of literal quality assessment flag in the source data (e.g., "poor quality"). Additionally, observations marked as below (<) or above (>) detection limit in source data were flagged as such in GRQA as well (column detection_limit_flag). Observations originating from unreliable sources or being otherwise suspect (e.g., unvalidated) were omitted. Three source datasets (GEMStat, GLORICH and Waterbase) had this type of a quality evaluation included in the metadata. Observations from sites marked as "not for publication" due to national legislation in Waterbase were also not included in GRQA.

Filtration information
When possible, supplementary information about whether a sample was filtered or unfiltered was retained as filtration can affect the sample values (Sprague et al., 2017). This information was usually available in a separate metadata column. Both "filtered" and "dissolved" were used depending on the source. GRQA includes the dissolved versions of certain parameters (total nitrogen, total phosphorus and Kjeldahl nitrogen) which originally did not exist as separate parameters in Waterbase and WQP. In those cases, the filtered and dissolved observations of TN, TP and TKN in the two datasets were treated as the corresponding dissolved forms (TDN, TDP, DKN) in GRQA.

Time and date processing
Observations could have invalid timestamps due to formatting or entry errors, so a validity check was included in the preprocessing scripts. Dates were tested against the presumed source format, and observations with incorrectly formatted or implausible dates were removed. The source datasets used different date formats, which were all converted into a common one (%Y-%m-%d). When possible, observation time was extracted as well. A default value (00:00:00) was used to fill missing information. Time zone information was only able to be extracted from the WQP. Other sources lacked time zone information, so it was not possible to determine whether the recorded timestamp was in local or coordinated universal time (UTC), and the time given is up to the user to interpret.

Other metadata
If available, metadata about the upstream basin area, its unit and the name of the greater drainage region of the site were included in GRQA. Additional information about methods used or other available observation remarks in the source data was also retained. The metadata depended on the source and were available only sporadically and could not be concatenated in a reasonable way between the datasets, so the information is given in the GRQA for each source separately in the format of source_meta_sourcecolumnname (e.g., GEM-STAT_meta_Analysis Method Code). Here, the source column names were kept as they appear in raw data, e.g., spaces were not replaced with underscores.
3.2 Outlier treatment, time series availability and continuity

Time series availability and continuity
The analysis of the statistics generated during preprocessing showed that most of the time series extracted from the source datasets are very discontinuous. For example, the mean time series length per site for total phosphorus (TP) in GEMStat was 6.6 and in GLORICH 4.9 years, while the mean observation count per site was only 57.7 and 52.4, respectively. This means that many sites have observations at a monthly time step at best. Similar findings have been previously reported for WQP time series (Read et al., 2017;Shen et al., 2020). In order to illustrate the suspected temporal fragmentation in observation data, monthly availability and monthly continuity statistics appropriated from the strategy used by Crochemore et al. (2019) were calculated for each site in each of the merged parameter time series. Both characteristics can give insight into the granularity of the time series and can affect the applicability of different modeling methods. Monthly availability of observation data was defined as the ratio between number of months with at least one observation and the total number of months a particular site had any observations. A ratio of 1.0 would mean that there was at least one observation in every month of the time series. Monthly continuity was calculated as the ratio between the longest period of consecutive months with any measurements and the length of time series in months. Here, a ratio of 1.0 would mean that there were no months without observations, and the time series is continuous on a monthly level. The resulting characteristics were added as columns in the output files.

Outlier flagging
Water quality modeling often involves dealing with numerous outliers and uncertainties in observation data, particularly when integrating time series from multiple sources (McMillan et al., 2012;Sprague et al., 2017). Due to the differences in environmental conditions and water regimes, the potential range of observation values can vary a lot between catchments. Although extreme outliers caused by faulty equipment or data entry errors can sometimes be detectable by examining distribution plots, it is often difficult to decide whether an outlier is an error or not. For example, sudden spikes in observation time series can be caused by events such as accidental fertilizer spills in the waterway or a cow getting entrapped in an in-stream wetland (Hughes et al., 2016), which can have short-term effects on water quality and, therefore, should not be removed from data. However, flagging outliers can still help researchers troubleshoot potential issues at the modeling stage.
For this reason, no observations were omitted from the time series, and two flags associated with outliers were added to the output tables instead. First flag (obs_iqr_outlier) shows whether an observation was deemed to be an outlier by the interquartile range (IQR) test. IQR is defined as the difference between the third (Q3) and first (Q1) quartile. All values greater than Q3 + 1.5 × IQR or less than Q1 − 1.5 × IQR are considered outliers. The second flag (obs_percentile) was an indicator (0.0-1.0) showing which percentile a particular observation belongs to. Histograms along with box and whisker plots were used to visually show the range and distribution of the parameter observations. The plots were produced for every parameter and are included in the GRQA data repository.

Duplicate observation detection
The global datasets (GEMStat and GLORICH) used in this study had at least partial spatial overlap with the other three sources, which means that merging could have created duplicate sites in the GRQA. Contrary to site ID duplicates within the same dataset discussed in Sect. 3.1, site duplicates from different sources would likely also have different IDs. Therefore, rather than comparing ID information, the duplicates had to be identified by spatial proximity and time series similarity. Similar to procedures described in Sect. 3.2, duplicate detection was done separately for each parameter.
The first stage of duplicate detection was clustering sites based on their geographic location. The DBSCAN (densitybased spatial clustering of applications with noise) algorithm (Xu et al., 1998) from the scikit-learn Python library (Pedregosa et al., 2011) was used to create clusters of sites within a 1 km radius of each other, which is the approximate accuracy of around two decimal points in latitude and longitude degrees. There does not seem to be a consensus for assigning this search radius for duplicate detection, and the assessment of spatial proximity depends on the subjective threshold set by authors. For example, the Global Streamflow Indices and Metadata Archive (GSIM) streamflow dataset (Do et al., 2018) used a radius of 5 km for selecting potential duplicate gauging stations. The 1 km radius was chosen to avoid having too many false positives (e.g., in the case of small headwater catchments) to evaluate in the second stage of deduplication (RMSE calculation). A major advantage of DBSCAN compared to similar density-based clustering methods is that the algorithm can be run without determining a priori the number of output clusters (Birant and Kut, 2007). In addition, DBSCAN has been shown to be more applicable than others when dealing with large-scale datasets (Khan et al., 2014;Parimala et al., 2011).
Although there are time series similarity detection methods that can be applied to irregular time series and can handle some degree of discontinuity, the focus of those methods is on misalignment of the time of observations rather than differences in the pattern of time series gaps (Berndt and Clifford, 1994). Therefore, it is likely that GRQA time series are too fragmented for these advanced methods to yield reliable results. A conservative approach based on root-mean-square error (RMSE) was chosen here instead. Output site clusters were converted into unique site pairs so that all sites within a cluster could be compared to one another (e.g., a cluster of four would yield six unique ID pairs). Site ID pairs were then used to extract corresponding time series from observation data. Only observations made on matching dates were used for calculating the RMSE, and only pairs in which RMSE was equal to zero were considered as potential duplicates. Finally, the duplicates were exported into separate CSV files (e.g., TP_dup_obs.csv) along with relevant metadata to help the user decide whether the sites can be considered duplicates (Table 5). A high number of matching dates with the same observation value (column date_match_count) would indicate a higher likelihood of duplication.

GRQA data model and descriptive overview
The GRQA dataset consists of observation time series for 42 different water quality parameters provided in tabular form as CSV files. Each of the observation files is accompanied by corresponding metadata files (tables and images) describing the spatial and temporal characteristics of the time series.
GRQA is made up of the following files ( Fig. 1): a data catalog (GRQA_data_catalog.pdf) with maps showing the spatiotemporal coverage and graphs describing the distribution of all 42 parameters along with a README file describing the dataset structure water quality observation time series files (named paramcode_GRQA.csv) -GRQA metadata (folder meta) containing descriptive statistics (GRQA_param_stats.csv) and duplicate observation files (source_dup_obs.csv) when relevant the set of overview figures (folder figures) containing The five source datasets are also included in the GRQA data package. Folder source_data includes the raw folder with downloaded source files and harmonization schemas used in the preprocessing stage (source_code_map.csv) for each source dataset along with the original units (source_units.csv) the sourcename/processed folder containing summary statistics of observation values by parameter for each source dataset before (param-code_source_raw_stats.csv) and after (param-code_source_processed_stats.csv) processing along with information about the number of missing values (source_missing_values.csv) and source file size (source_file_info.csv) when relevant, processed/meta also including duplicate site ID files (source_dup_sites.csv).
The structure of GRQA observation files is given in Table 6. In addition to the attributes outlined in Sect. 3, the extracted metadata also include information about the upstream basin and drainage region of the observation site. It has to be noted that the availability of this information was dependent on both the source (i.e., not present in CESI and Waterbase) and the observation site itself and is therefore available only sporadically in GRQA as well (Table 6). Parameter codes, names, forms and observation values in GRQA are given as they appeared in source data alongside their harmonized and processed GRQA versions so that end users could assess the validity of conversion and make corrections if needed.
The statistical overview of the parameters included in GRQA is shown in Table 7. The number of sites per parameter ranges from only 7 (DC) up to 68 592 (TSSs). Parameters having more sites generally also have more observations. Parameters with a small number of sites and observations were usually present in only one or two source datasets. For example, dissolved organic phosphorus (DOP) only existed in WQP. Different versions of biochemical and chemical oxygen demand that could not be harmonized based on source metadata were kept separate, although the median value for BOD and BOD5 ended up being equal.
The spatial distribution of water quality observation sites depended on the parameter and is illustrated in Fig. 2 using dissolved oxygen (DO), dissolved organic carbon (DOC), TP and TSSs. These parameters were the largest in terms of number of sites and observations in their corresponding groups (oxygen, carbon, nutrients and sediments). They are also used in the following figures. Some observations that could be made when examining site maps were the following:   -Europe and North America are the best represented in the case of all parameters.
-Coverage is also good in Australia, New Zealand, parts of East Asia and Brazil in the case of some of the key parameters (e.g., TP, TN).
-The rest of the world (Africa, most of Asia) only has sporadic coverage.
The temporal distribution of the four parameters is given in Fig. 3. Similar to the spatial distribution, the temporal coverage of observations depended on both source data and parameter, with WQP having the longest and Waterbase the shortest time series. Most of the data from GEMStat are from the past decade, while GLORICH has a more even observation distribution throughout the time series.

Statistical characteristics of GRQA observation time series
As mentioned in the previous section, each of the observation files was accompanied by a set of images and tables, giving insight into the characteristics of the observation time series. The structure of tabular summary statistics is shown in Table 8. These files contain some basic statistics (e.g., standard deviation) about observation values per parameter and source. In addition, information about the temporal characteristics of time series (e.g., mean length per site) is given; this can also be important when assessing the suitability of the data for modeling purposes. The applicability of water quality modeling is greatly affected by the distribution of observation values as a majority of modeling methods require a near-normal distribution. The skewness caused by extreme outliers is a common problem in hydrological modeling. The observations often follow a lognormal distribution, which means that the data often need to be transformed and normalized in order to be usable (Helsel, 1987;Hirsch et al., 1982;Parmar and Bhardwaj, 2014). Similar behavior was also examined in GRQA, in which values of most parameters showed a strong positive skew. This can be seen in histograms (Fig. 4) and box plots (Fig. A1). For illustrative purposes, values determined as outliers by the IQR test have been omitted from the figures. In the case of parameters such as TP and TSSs, the skewness remains even after outlier omission. This is confirmed by the box plots, in which the total range of the values greatly exceeds the median. Availability (Fig. 5) and continuity (Fig. 6) plots were used to examine the temporal fragmentation of the time series. In general, observations from national sources (CESI and WQP) exhibited slightly higher availability and continuity than others, likely caused by more consistent data acquisition frameworks. No clear spatial pattern emerged from the analysis, meaning that differences in both indicators exist at the site level even within the same country. Due to how the metrics were calculated, shorter time series outperformed longer ones. An example of this is TP in Brazil, where the examined high continuity correlated with very short mean time series length (less than a year). Parameters with very fragmented time series (e.g., TSSs) had only a limited number of sites where observations had been collected consistently throughout the whole time frame.
The GRQA also includes plots of median observation values, which were calculated over the whole time series for each site. Seasonal fluctuations cannot be identified on this aggregation level, so the maps are meant to be only indicative. An example of median plots can be seen in the Appendix (Fig. A2).

Limitations and considerations regarding the use of GRQA
Taking into account the aforementioned issues encountered during the compilation of GRQA, certain limitations and potential remaining errors have to be considered when using the dataset for water quality modeling.

Potential errors in unit conversion
As described in Sect. 3, several assumptions had to be made when creating harmonization schemas about the chemical form of certain nitrogen parameters (NO 2 , NO 3 and NH 4 ). However, if the assumption made based on this limited ancillary information was incorrect, then using the conversion would have been affected as well. For this reason, the source observation values along with source units were retained, and the users can retrace the conversion steps using the harmonization schemas.

Skewness of observation values
The outlier treatment strategy used for GRQA involved only flagging the values based on the IQR test, which means that the skewness illustrated in Sect. 4 still remains. Although the described strong positive skew existed also in source data, potential unit conversion errors could have exaggerated it. As shown by histograms, omitting flagged outliers is not enough to eliminate the skewness in some cases (TP and TSSs), so additional processing could be needed to transform the data into a normal shape. Power transformation methods like the Box-Cox transformation (Box and Cox, 1964) could be used to further minimize skewness. It is likely that some of the most extreme outliers are caused by data entry errors or equipment malfunction rather than events such as agricultural spills. For setting thresholds to determine whether a value is illogical or not, more sophisticated outlier detection methods based on some general freshwater quality guidelines (Enderlein et al., 1996) could perhaps be used to further filter the observation values.
5.2 Suggestions for improving multi-source water quality data compilation

Metadata quality
When merging datasets from different sources, most of the complications stemmed from inadequate metadata of water quality observations, such as ambiguous parameter names and codes, and missing details on the chemical forms of parameters. This information would be integral for harmonizing units and observation values. The terms used for indicating the filtration status of samples are often dependent on the interpretation of the authors (total vs. unfiltered, dissolved vs.   Sprague et al., 2017). The annotation of suspect or incomplete data is another aspect of good-quality metadata (Gudivada et al., 2017). Internal quality control measures such as the ones in GEMStat and Waterbase would help the end user in the data cleaning stage and eliminate some of the outliers. The following aspects should be considered to make multisource data harmonization more efficient in the future: -Parameter forms should be reported with the units. -The filtration status of the samples should be reported, and the terms filtered and unfiltered should be preferred as opposed to the more ambiguous dissolved and total.
-Machine-readable quality flags as found in GEM-Stat (columns Value Flags and Data Quality) or Waterbase (columns resultObservationStatus, meta-data_statusCode and metadata_observationStatus) should be added.
-Whether observations are daily or monthly at the source level should be clearly defined.
-Area units (m 2 , km 2 , etc.) should be included when the upstream catchment area of the site is reported.
-Other information about potential errors in the data should be included (potential duplicates, typographical errors, etc.).
-When certain assumptions or decisions are made when harmonizing data from different sources, they should be reported when the data are published.

Spatial and temporal discontinuity
Although spatial coverage of water quality observations in GRQA exceeds that of the existing global datasets (GEMStat and GLORICH), large areas of Africa and Asia are empty. A major reason might be a lack of knowledge and funding to update and extend site networks, particularly in hard-to-reach areas. In addition, not all governments adhere to an open-data policy. Therefore, improving the spatial coverage of water quality data still relies mostly on implementing additional measures to encourage countries to share it in accordance with open-data principles. The availability and continuity analysis showed that the GRQA time series are fragmented and that significant gaps remain in the data which will negatively affect large-scale modeling performance. These gaps could be caused by both issues with sensor maintenance or technical limitations under certain conditions (e.g., weather) and inconsistencies in the data acquisition practices on the local level. Recently, MLbased solutions for time series augmentation have been used to fill in gaps in historical monitoring data Ren et al., 2019). However, this kind of gap filling still requires enough good-quality training data in the existing time series fragments to be effective and can potentially only be  (Table 7) are not shown on the plot. of help when improving the temporal rather than spatial coverage.
Another option for improving continuity is using data from one time series to fill in gaps in another. For example, turbidity has been successfully translated into TP and TSS content (Castrillo and García, 2020;Jones et al., 2011). As turbidity data can be acquired at a higher frequency than TP and TSSs, the use of such surrogate parameters can be helpful in data-scarce regions for certain parameters.

General remarks
An important part in improving the spatiotemporal coverage of water quality is raising awareness about the existing datasets (e.g., GEMStat), so that new institutions could join the contributor network and submit their own site data. The continued growth of international collaboration will be vital in improving open global water quality data (Blöschl et al., 2019;Tang et al., 2019). Most of the data collected locally are intended only for regional or national use. Thus, the data are not compatible with those from other countries due to lack of common metadata management practices, with the problems discussed above being a major bottleneck (Hutton et al., 2016;Sprague et al., 2017;Stagge et al., 2019).
Providing those institutions with an example workflow when designing water quality data pipelines, such as the schema recently proposed by Plana et al. (2019), would help them develop their own data management strategy. The workflow used to compile GRQA along with the issues raised in this study will hopefully also help to draw attention to this topic.

Code and data availability
The GRQA dataset, supplementary metadata and figures are available for download on the DataCiteand OpenAIRE-enabled Zenodo repository at https://doi.org/10.5281/zenodo.5097436 .
The data processing scripts used for the compilation of GRQA are available on Zenodo at https://doi.org/10.5281/zenodo.5082147 .

Conclusions
The GRQA dataset was created with the intention to improve the spatiotemporal coverage of previously available open water quality data and provide an example workflow for multi-  source data compilation that can be customized for other data sources as well. The current version of GRQA is mainly focused on different forms of the main nutrients (N and P) and carbon compounds, although GEMStat, Waterbase and WQP also have many other types of parameters that are used as water quality indicators (heavy metals, pesticides, etc.). Other researchers are able to make additions and customize the dataset to their needs for parameter-specific studies using the scripts published with GRQA.
Updates and additions by the hydrological community are encouraged to further develop GRQA. As it stands, GRQA is a set of well-structured CSV files rather than a database able to be queried. We intend to add a Jupyter Notebook example of loading and processing the CSV files to the GRQA GitHub repository. We include an extensive data catalog with graphs and maps for temporal and spatial coverage of every variable in the Supplement. This should help potential users to get a better overview of the data before downloading it. Converting the files into a database would also greatly improve data management and make extending GRQA easier in the future. In the case of a relational database, the schema recommended by Plana et al. (2019) could be followed. We are also considering the addition of an online dashboard for data visualization and download, similar to that of GEMStat or WQP. A versioning system along with a metadata validation strategy similar to Welty et al. (2020) could be implemented to ensure metadata quality.
Future work could also include the development of a dataset for catchment characteristics in order to better study how water quality in rivers and streams is affected by land use changes in their catchments. The CAMELS dataset (Addor et al., 2017) and its regional implementations (Chagas et al., 2020;Coxon et al., 2020) can be used as an example. In addition, interactions between water quality and streamflow can be further studied by linking water quality observations to streamflow data from the Global Streamflow Indices and Metadata Archive (GSIM) (Do et al., 2018).  (Table 7) are not shown on the plot.      Author contributions. HV, AK and EU conceived the idea of the study. GA and EU aided with establishing a theoretical background for the study. HV designed the code and carried out the data processing with contributions from AK. HV, GA and LS developed the observation deduplication and outlier detection strategy. HV prepared the manuscript with contributions from all co-authors.
Competing interests. The authors declare that they have no conflict of interest.
Disclaimer. Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.