The Surface Water Chemistry (SWatCh) database: a standardized global database of water chemistry to facilitate large-sample hydrological research

Openly accessible global-scale surface water chemistry datasets are urgently needed to detect widespread trends and problems, to help identify their possible solutions, and to determine critical spatial data gaps where more monitoring is required. Existing datasets are limited with respect to availability, sample size and/or sampling frequency, and geographic scope. These limitations inhibit researchers from tackling emerging transboundary water chemistry issues – for example, the detection and understanding of delayed recovery from freshwater acidification. Here, we begin to address these limitations by compiling the global Surface Water Chemistry (SWatCh) database, available on Zenodo (https://doi.org/10.5281/zenodo.6484939; Rotteveel and Heubach, 2021). We collect, clean, standardize, and aggregate open-access data provided by six national and international programs and research groups (United Nations Environment Programme; Hartmann et al., 2019; Environment and Climate Change Canada; the United States of America National Water Quality Monitoring Council; the European Environment Agency; and the United States National Science Foundation McMurdo Dry Valleys Long-Term Ecological Research Network) in order to compile a database containing information on sites, methods, and samples, and a geospatial information system (GIS) shapefile of site locations. We remove poor-quality data (e.g., values flagged as “suspect” or “rejected”), standardize variable naming conventions and units, and perform other data cleaning steps required for statistical analysis. The database contains water chemistry data for streams, rivers, canals, ponds, lakes, and reservoirs across seven continents, 24 variables, 33 722 sites, and over 5 million samples collected between 1960 and 2022.
Similar to prior research, we identify critical spatial data gaps on the African and Asian continents, highlighting the need for more data collection and sharing initiatives in these areas, especially considering that freshwater ecosystems in these environs are predicted to be among the most heavily impacted by climate change. We identify the main challenges associated with compiling global databases – limited data availability, dissimilar sample collection and analysis methodology, and reporting ambiguity – and provide recommended solutions. By addressing these challenges and consolidating data from various sources into one standardized, openly available, high-quality, and transboundary database, SWatCh allows users to conduct powerful and robust statistical analyses of global surface water chemistry.


Introduction
Globally, 159 million people are reliant on untreated surface water, with only 1 in 3 people having access to safely managed drinking water services (World Health Organization and United Nations Children's Fund, 2017). With two-thirds of the global population (4.0 billion people) already experiencing water shortages at least 1 month per year (Mekonnen and Hoekstra, 2016) and 4.8–5.7 billion people projected to experience water shortages by 2050 (Burek et al., 2016), maintaining the quality of drinking water sources is paramount to human health and society. One of the main obstacles to achieving this goal is a lack of openly available, high-quality, transboundary data (World Health Organization and United Nations Children's Fund, 2017). Existing large-sample water quality datasets have (1) limited availability – for example, raw data may not be published with journal articles (Alsheikh-Ali et al., 2011); (2) limited sample size – for example, datasets may only include one waterbody type (Hartmann et al., 2014); or (3) limited geographic scope – for example, national datasets only include data for one country.
Delayed acidification recovery is an example of a transboundary problem which would benefit from a large-sample dataset. Here, we define a "transboundary problem" to be a water quality issue, or cause of a water quality issue, which crosses international borders. For example, a main driver of freshwater acidification in Atlantic Canada is acid deposition originating from all of the major production regions in North America, including those in the USA (Shaw, 1979). A similar definition of transboundary problem is often used when discussing water availability issues that cross international borders (e.g., Thu and Wehn, 2016). Ecosystem acidification and associated elevated aluminum (Al) concentrations are responsible for the loss of economically significant fish species (Committee on the Status of Endangered Wildlife in Canada, 2011; Dennis and Clair, 2012); reductions in crop success (Collignon et al., 2012); reduced forest health (Collignon et al., 2012; DeHayes et al., 1999; de Wit et al., 2010) and, therefore, carbon sequestration; and the increased cost of water treatment (Letterman and Driscoll, 1988). Furthermore, high Al in drinking water resources may contribute to human osteological and neurological diseases (World Health Organization, 2010). Prior large-sample (Björnerås et al., 2017; Monteith et al., 2007) and global-scale (Weyhenmeyer et al., 2019) studies on freshwater acidification have indicated that recovery is delayed in some regions. However, to date, there is no openly available global-scale database of acidification-related water chemistry that includes Al, the increased concentrations of which are one of the most biotically toxic effects of acidification (Gensemer and Playle, 1999).
There is a need for harmonized large-sample hydrological research (Blöschl et al., 2019), and global datasets are required to develop global water chemistry models (Harrison et al., 2005a, b). The majority of water quality research has focused on catchment-scale datasets, which narrows our understanding of hydrochemical processes to catchments that have historically been studied. Catchment-scale analyses make valuable contributions to our understanding of hydrochemical processes; however, variability in catchment response to perturbation, which is potentially indicative of variability in hydrochemical processes, is difficult to evaluate in a robust manner without an approach that assesses multiple catchments/regions in a harmonized way. In the case of freshwater acidification, the water chemistry response to acid deposition may be altered by geology and land use/land cover; thus, observations made in one watershed/region may not generalize to others (e.g., Clair et al., 2011; Rotteveel and Sterling, 2020). For example, the watershed response to acid deposition is influenced by weak acids in regions with slow-weathering, base cation (C_B) poor bedrock but not in regions with higher C_B geology (Clair et al., 2011; Stoddard et al., 1999), and watersheds with high-intensity forest harvesting may be more strongly affected by acid deposition than those with less disturbance (Aherne et al., 2008; Feller, 2005).
Obtaining and consolidating water chemistry datasets for transboundary hydrological research is challenging due to limited data access as well as disparate (i.e., dissimilar) data collection programs and data reporting formats. Access may be limited because data are not published and/or kept confidential, as is the case for some sites within the United Nations International Centre for Water Resources and Global Change's Global Water Quality Database and Information System (GEMStat). Data collection programs are dissimilar largely due to a lack of international variable and analysis method definitions (World Health Organization and United Nations Children's Fund, 2017). For example, Al measurements may not be comparable across different functional, operational, and classical species definitions (Namieśnik and Rabajczyk, 2010;Ščančar and Milačič, 2006). Lastly, disparate variable naming conventions, units, and censored data notation complicate the consolidation of datasets from different sources, as these notations must first be standardized.
Here, we aim to address the above limitations by contributing an openly available, standardized, easy-to-use, global water chemistry database. We focus on providing data to address the problem of delayed freshwater acidification recovery by collecting, cleaning, standardizing, and compiling datasets of acidification-related water chemistry variables. Specifically, our research goals are (1) to develop a global database of acidification-related surface water chemistry, (2) to identify the main limitations associated with compiling this database, (3) to identify and characterize critical spatial data gaps within existing datasets, and (4) to provide recommendations for data reporting and storage to facilitate its easy access and use by other researchers.

Data sources
We obtained input data for SWatCh from openly available datasets published by national and international agencies and from datasets available on open-access servers (Table 1). Our search terms were "water chemistry data" or "water quality data" and "global" or a country name, as listed in the member states of the United Nations (United Nations, 2009). Our data search did not have a geographic focus, although our sources were limited to datasets available in English. Datasets likely missed by this approach include those hosted on servers or websites without (English) search engine optimization (SEO) – that is, those not optimized with keywords identifiable by search engines to provide results (Google, 2002). All datasets were originally downloaded in September 2019 and updated during the manuscript review process in March and April 2022 so that newly published data could be included. The GloRiCh dataset was not re-downloaded because it had not been updated since the prior download, and the National Water Quality Monitoring Council Water Quality Portal dataset was not re-downloaded due to unresolved internal server errors.

Data inclusion
SWatCh includes 24 water chemistry variables collected in untreated surface waterbodies. We define "untreated" as water that is neither wastewater nor receiving treatment plant effluent near the sample collection site (e.g., we excluded sites described as "wastewater" or "effluent"). The included waterbody types are streams, rivers, canals, ponds, lakes, and reservoirs. The included water chemistry variables are metals – Al and iron (Fe); C_B species – calcium (Ca), magnesium (Mg), potassium (K), and sodium (Na); other measures of buffering capacity – the acid neutralization capacity (ANC), alkalinity, carbonate (CO3), and bicarbonate (HCO3); acid anions – sulfate (SO4), nitrate (NO3), and nitrite (NO2); other anions – fluoride (F) and chloride (Cl); nutrients – phosphorus (P), phosphate (PO4), and ammonium (NH4); physical parameters – pH and temperature; and carbon – carbon dioxide (CO2), total inorganic carbon (TIC), dissolved inorganic carbon (DIC), total organic carbon (TOC), and dissolved organic carbon (DOC). The included sample fractions are unfiltered, filtered, and extracted (i.e., acid digested). We screened out sites identified as confidential or with other publication restrictions. A visual representation of the data processing completed during the preparation of SWatCh is presented in Fig. 1.

Figure 1. Workflow for creating SWatCh. Below detection limit is abbreviated as BDL, and coordinate reference system is abbreviated as CRS.

Flagging of low-quality data
We identified low-quality data using the flag "Rejected"; this included samples flagged as "unreliable", "suspect", or "poor quality" in the source databases. Additionally, we flagged values below zero for all variables except temperature, alkalinity, and ANC; these values are assumed to have been entered incorrectly. A total of 79 910 data points were considered to have low data quality, representing 1.48 % of the data in SWatCh.
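The flagging rules described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the SWatCh processing script: the function name, the variable labels, and the record layout are assumptions made for the example.

```python
# Variables for which negative values are retained, as stated in the text.
# The exact labels used here are assumed for illustration.
NEGATIVE_ALLOWED = {"Temperature", "Alkalinity", "ANC"}

# Source-database flags that the text maps onto the "Rejected" flag.
SOURCE_REJECT_FLAGS = {"unreliable", "suspect", "poor quality"}


def quality_flag(variable, value, source_flag=None):
    """Return "Rejected" for a low-quality record, otherwise None."""
    if source_flag is not None and source_flag.strip().lower() in SOURCE_REJECT_FLAGS:
        return "Rejected"
    # Negative values are treated as entry errors except for the listed variables.
    if value < 0 and variable not in NEGATIVE_ALLOWED:
        return "Rejected"
    return None


# Hypothetical records: (variable, value, flag in the source database).
records = [
    ("pH", 6.8, None),
    ("Calcium", -1.2, None),
    ("Temperature", -0.5, None),
    ("Aluminum", 35.0, "Suspect"),
]
flags = [quality_flag(*record) for record in records]
```

In this sketch the negative calcium value and the "Suspect" aluminum value are flagged, whereas the negative temperature is kept.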

Removal of duplicates
We removed duplicate site and sample data. Three of our source databases, GEMStat, the Global River Chemistry Database (GloRiCh), and Waterbase, are compilations of water chemistry data from several sources and, thus, repeat some measurements. We removed duplicated sites based on the unique site identification code. We removed duplicated samples based on the site identification code, date, variable name, variable fraction, variable speciation, and sample value. We define "variable fraction" as the component part of a water sample, such as filtered or unfiltered. "Variable speciation" is defined as the speciation of a reported parameter; for example, NH4 may be reported as nitrogen (N) or NH4.
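As a minimal sketch of the sample de-duplication rule, assuming each sample is held as a dictionary, the six fields named above can be combined into a key and only the first occurrence kept. The field names and the choice to keep the first occurrence are assumptions for illustration.

```python
# The six fields named in the text; the dictionary keys are hypothetical names.
KEY_FIELDS = ("site_id", "date", "variable", "fraction", "speciation", "value")


def drop_duplicate_samples(samples):
    """Keep the first occurrence of each sample, comparing all six key fields."""
    seen = set()
    unique = []
    for sample in samples:
        key = tuple(sample[field] for field in KEY_FIELDS)
        if key not in seen:
            seen.add(key)
            unique.append(sample)
    return unique


# Invented example: the second record repeats the first via another compilation.
samples = [
    {"site_id": "S1", "date": "2001-06-01", "variable": "Ammonium",
     "fraction": "Filtered", "speciation": "as N", "value": 0.04, "source": "A"},
    {"site_id": "S1", "date": "2001-06-01", "variable": "Ammonium",
     "fraction": "Filtered", "speciation": "as N", "value": 0.04, "source": "B"},
    {"site_id": "S1", "date": "2001-06-01", "variable": "Ammonium",
     "fraction": "Filtered", "speciation": "as NH4", "value": 0.05, "source": "B"},
]
unique_samples = drop_duplicate_samples(samples)
```

The third record survives because its speciation and value differ, which is why speciation is part of the key.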

Flagging of potential outliers
We identify potential outliers for each time series in SWatCh using a cut-off value of 4 times the median absolute deviation (MAD). The MAD is preferred to other methods of outlier removal when the data have a skewed distribution or large outliers are present (Leys et al., 2013; Rousseeuw and Hubert, 2011), as is common in water chemistry data (e.g., Rotteveel and Sterling, 2020). Water chemistry data in SWatCh have a skewed distribution for most variables; thus, the MAD is a suitable outlier screening approach. The equation for the MAD is presented in Eq. (1).
$$\mathrm{MAD} = \operatorname{median}\left(\left| x_i - \tilde{x} \right|\right) \qquad (1)$$
Here, $x_i$ is an observation, and $\tilde{x}$ is the median (Rousseeuw and Hubert, 2011). A total of 0.27 % (13 309) of values were flagged as potential outliers.
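The screening step can be illustrated with the Python standard library. The sketch below assumes that the MAD is the median of the absolute deviations from the series median, that no consistency constant is applied, and that a value is flagged when its absolute deviation from the median exceeds 4 times the MAD; the text does not spell out these details, so they should be read as assumptions.

```python
from statistics import median


def flag_mad_outliers(values, cutoff=4.0):
    """Flag values whose distance from the median exceeds cutoff * MAD."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # Degenerate series (over half the values identical): flag nothing.
        return [False] * len(values)
    return [abs(v - med) > cutoff * mad for v in values]


# Invented time series with one suspiciously high value.
series = [1.0, 1.2, 0.9, 1.1, 1.0, 9.0]
outlier_flags = flag_mad_outliers(series)
```

Values are flagged rather than removed, which matches the section's intent of leaving the final decision to the user.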

Data standardization

Database format
The SWatCh database conforms to the DataStream Water Quality (DS-WQX) schema, a standardized data format that specifies the allowable elements and dataset structure. The DS-WQX schema is a simplified adaptation of the United States Environmental Protection Agency (US EPA) WQX schema. The US EPA WQX schema is an implementation of the Environmental Sampling, Analysis, and Results data standard which conforms to the National Environmental Information Exchange Network (EN) standards (US EPA, 2021). The complexity of the US EPA WQX schema presents a barrier to entry (DataStream Initiative, 2022) that is addressed by the simplified DS-WQX schema. The DS-WQX schema is simplified compared with the US EPA WQX schema in the following ways: most optional columns are removed, data are stored as one database as opposed to multiple relational datasets, column names are simplified, and date and time information is conformed to the ISO 8601 format to simplify parsing and ensure universal readability (DataStream Initiative, 2022). For further details, the reader is referred to https://github.com/datastreamapp/schema (last access: 9 April 2022). We made two changes to the DS-WQX format: the minimum value for the "MonitoringLocationLatitude" field (i.e., the minimum allowable latitude value) was updated from 0 to −90 so that sites located in the Southern Hemisphere could be included, and "OTHER" was added as an allowable value to the "ResultAnalyticalMethodContext" field (i.e., the context associated with the analysis identifier code – for example, the agency that published the analysis method specifications) so that method information which was undefined in the DS-WQX schema could be included. For samples where the ResultAnalyticalMethodContext was specified as OTHER, information on the analysis identifier code context is provided in the "ResultComment" column.

Variable naming and measurement units
We standardized variable naming conventions in accordance with DS-WQX. Variable names are indicated separately from variable fractions and speciation to facilitate the analysis of different fractions simultaneously. In the input datasets, the fractions are not specified for all variables; for these, we denote the fraction as "Unspecified".
We harmonized the measurement units and variable speciation for each parameter to simplify data analysis. Measurements were reported in different units in the input datasets; we standardized them to the most common SI (International System of Units) unit that we observed for each variable. For example, Ca was reported in micrograms per litre (µg L⁻¹), milligrams per litre (mg L⁻¹), equivalents per litre (eq L⁻¹), and moles (mol), but it was most commonly reported in milligrams per litre; thus, we standardized the measurement unit to milligrams per litre. Concentrations are provided in milligrams per litre except for concentrations of Al and Fe (µg L⁻¹), the ANC (mmol L⁻¹), pH (unitless, denoted as "None"), and temperature (°C).
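To make the Ca example concrete, the following sketch converts calcium concentrations to milligrams per litre. It assumes a molar mass of 40.078 g mol⁻¹ and a charge of 2 for Ca, and it assumes that values reported in moles are molar concentrations (mol per litre); the unit labels and the function name are hypothetical.

```python
CA_MOLAR_MASS_G_PER_MOL = 40.078  # standard atomic weight of calcium
CA_CHARGE = 2                     # Ca2+ carries two equivalents per mole

# Multiplicative factors that bring each reported unit to mg/L.
CA_TO_MG_PER_L = {
    "mg/L": 1.0,
    "ug/L": 1e-3,
    "mol/L": CA_MOLAR_MASS_G_PER_MOL * 1e3,
    "eq/L": CA_MOLAR_MASS_G_PER_MOL * 1e3 / CA_CHARGE,
}


def calcium_to_mg_per_l(value, unit):
    """Convert a calcium concentration to mg/L; reject unknown units."""
    if unit not in CA_TO_MG_PER_L:
        raise ValueError(f"Unsupported unit for calcium: {unit!r}")
    return value * CA_TO_MG_PER_L[unit]


from_micrograms = calcium_to_mg_per_l(2500.0, "ug/L")  # 2500 ug/L
from_equivalents = calcium_to_mg_per_l(0.001, "eq/L")  # 1 meq/L
```

Raising an error on unknown units, rather than passing values through, is a deliberate choice in this sketch so that unconverted values cannot silently enter a standardized column.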
Several input datasets did not include their encoding type, causing corrupted characters and measurement unit ambiguity. To prevent these errors, we omit non-ASCII (American Standard Code for Information Interchange) characters; for example, micrograms (µg) are denoted as "ug". Measurement units in SWatCh conform to the DS-WQX standard.
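A small helper along the following lines could enforce the ASCII-only unit convention. It is a sketch under the assumption that only the micro prefix needs replacing; the function name is hypothetical, and other non-ASCII symbols would need their own agreed replacements.

```python
def to_ascii_unit(unit):
    """Replace the micro prefix with "u" and insist on an ASCII-only result."""
    # Both the micro sign (U+00B5) and the Greek small letter mu (U+03BC) occur
    # in practice, so both are mapped to "u".
    cleaned = unit.replace("\u00b5", "u").replace("\u03bc", "u")
    if not cleaned.isascii():
        raise ValueError(f"Unit still contains non-ASCII characters: {unit!r}")
    return cleaned


ascii_unit = to_ascii_unit("\u00b5g/L")
```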

Censored data notation
We standardized censored data notation to facilitate easier handling of these values. Censored data notation varied across the input datasets and included abbreviations such as "BDL", "<", or the numeric value of the detection limit. The input datasets did not distinguish between samples measured at or below the detection limit. Detection limits differed across and within datasets; thus, we standardized values below detection limit by flagging them and providing the detection limit in separate columns, allowing for various approaches to handle these results.
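The standardization of censored values can be sketched as a parser that turns each raw entry into a value, a flag, and a detection limit. The output keys, the function name, and the handling of each notation are assumptions for illustration; because the inputs do not distinguish "at" from "below" the detection limit, the sketch treats both as censored.

```python
def parse_result(raw, detection_limit=None):
    """Split a raw result into value, below-detection flag, and detection limit.

    `detection_limit` is the limit reported alongside the sample, if any.
    """
    text = str(raw).strip()
    if text.upper() == "BDL":
        return {"value": None, "below_detection_limit": True,
                "detection_limit": detection_limit}
    if text.startswith("<"):
        # Notation such as "<0.5" carries the detection limit itself.
        return {"value": None, "below_detection_limit": True,
                "detection_limit": float(text[1:])}
    value = float(text)
    if detection_limit is not None and value <= detection_limit:
        # A numeric entry at (or below) the limit is treated as censored.
        return {"value": None, "below_detection_limit": True,
                "detection_limit": detection_limit}
    return {"value": value, "below_detection_limit": False,
            "detection_limit": detection_limit}


parsed = [parse_result("<0.5"), parse_result("BDL", 2.0),
          parse_result("2.0", 2.0), parse_result("3.2", 2.0)]
```

Keeping the flag and the limit in separate fields leaves users free to substitute, impute, or apply censored-data statistics as they see fit.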

Mapping
We harmonized the coordinate reference systems (CRSs) of the sample site locations to simplify geographic analysis. Site location coordinates are provided in various CRSs in the input datasets; thus, we re-projected them to the World Geodetic System 1984 (WGS 84) geographic CRS. We selected WGS 84 because it provides a good mean solution across the globe and can easily be projected to local datums (Bajjali, 2018).

Results
The SWatCh database contains water chemistry data across 24 variables, four fractions, 33 722 sites, and 5 062 980 samples collected between 1960 and 2022 (Table 2). SWatCh is available on Zenodo (https://doi.org/10.5281/zenodo.6484939; Rotteveel and Heubach, 2021). The sample collection frequency ranges from approximately twice a day to one-off samples, depending on the parameter and waterbody type. The parameters with the highest average annual sampling frequency are temperature (791) and pH (359). The average annual sampling frequency across all parameters is similar between waterbody types: four samples per year for lacustrine and riverine systems, and eight samples per year for reservoirs. Not all samples included collection and analysis methodologies; for the samples where this information was available, there are 565 different methods.
Sites in SWatCh are located across the globe, but they are concentrated in North America, South America, and Europe (Fig. 2) and encompass a variety of bedrock types (United States Geological Survey, USGS, 2022), land use types (Goldewijk et al., 2011), and climate zones (Kottek et al., 2006). The spatial distribution of sampling locations varies by waterbody type; notably, only riverine sites are available in northern North America and several island nations, such as Guam and New Zealand, and reservoir sites are concentrated in the equatorial and arid climate zones, such as Central America (Fig. 2).
The number of sites available to study freshwater acidification, as well as the spatial extent of the sites, decreases with the number of included parameters and time series length (Table 3, Fig. 3). We allocated the available parameters in SWatCh into six groupings, each of which allows freshwater acidification to be studied with increasing detail and certitude. The parameter groupings, in order of decreasing importance, are as follows:
1. Acidity (pH). pH is assigned to the first grouping because it is the primary indicator of freshwater acidification.
2. Basicity (alkalinity, hardness, ANC, CO3, HCO3, Ca, and Mg). The primary measures of basicity are assigned to the second grouping because they are also used as primary indicators of freshwater acidification and can be used to determine the stage of freshwater acidification. For example, Stage 2 is characterized by an increase in freshwater C_B concentrations as cation exchange from cation exchange sites in soils buffers acid anion deposition (Galloway et al., 1983). K and Na are not included in the basicity grouping because they are usually present in minor concentrations compared with Ca and Mg (Meybeck, 2004).
3. Acid anions (SO4, NO3, and NO2). Acid anions are assigned to the third grouping because acid deposition is usually the primary driver of freshwater acidification (Galloway et al., 1983), but acid anions may not be a good indicator of freshwater acidification in low-C_B waters or in waters with high DOC concentrations (Rotteveel and Sterling, 2020).

4. Metallic cations (Al and Fe). Metallic cations are assigned to the fourth group because soils undergo Al or Fe buffering in response to acid deposition once soil C_B species are depleted (Björnerås et al., 2017; Galloway et al., 1983); thus, they are a secondary indicator of freshwater acidification.
5. Weak acids (CO2, TOC/DOC, and NH4). Although weak acids are not a primary driver of freshwater acidification in most catchments, they are an important driver of the freshwater response to acid deposition in some catchments with low buffering capacity (Clair et al., 2011; Rotteveel and Sterling, 2020); thus, they are assigned to the fifth grouping.
6. Other (temperature, K, Na, Cl, F, P, PO4, and DIC). The remaining parameters are assigned to the last group because they characterize catchment-scale processes that may have secondary effects on the freshwater response to acid deposition (e.g., Berger et al., 2015; Harriman et al., 1995; Kopáček et al., 2001). DIC is included in this group because the speciation (i.e., CO2, CO3, or HCO3) is unknown, although it can be calculated using pH.
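The groupings listed above lend themselves to a simple lookup that reports which groupings a site can support. In this sketch a grouping counts as covered when at least one of its parameters is available at the site; that criterion and the short variable labels are assumptions for illustration, not the rule used to build Table 3.

```python
# Parameter groupings in order of decreasing importance, as listed in the text.
GROUPINGS = [
    ("Acidity", {"pH"}),
    ("Basicity", {"Alkalinity", "Hardness", "ANC", "CO3", "HCO3", "Ca", "Mg"}),
    ("Acid anions", {"SO4", "NO3", "NO2"}),
    ("Metallic cations", {"Al", "Fe"}),
    ("Weak acids", {"CO2", "TOC", "DOC", "NH4"}),
    ("Other", {"Temperature", "K", "Na", "Cl", "F", "P", "PO4", "DIC"}),
]


def groupings_covered(site_variables):
    """Return the names of groupings with at least one variable at the site."""
    available = set(site_variables)
    return [name for name, members in GROUPINGS if available & members]


covered = groupings_covered(["pH", "Ca", "SO4", "Temperature"])
```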

Discussion
Here, we discuss the main limitations that we encounter when compiling and analyzing datasets and provide recommendations for data sharing to facilitate more large-sample and global-scale water chemistry research.

Data availability and spatial gaps
Some variables have smaller sample sizes. The number of reported measurements differs greatly among variables: metals (Fe and Al) and F have the smallest sample sizes and lowest sampling frequencies, whereas pH and temperature have the largest sample sizes and highest sampling frequencies. This discrepancy is possibly due to pH and temperature being relevant to a wider range of research topics or to the cost of measurement: pH and temperature can be measured with a variety of field- or laboratory-based multiparameter probes, whereas metals and anions require laboratory analysis. However, it is currently unknown whether analysis results are under-reported for some variables – that is, whether all laboratory analysis results are reported for each sample included in the input databases. Prior research on one of the main variables with a low sample size (Fe) includes an openly available research dataset of 340 waterbodies in Europe and eastern North America (Björnerås et al., 2017). Despite the geographical coverage and size of this dataset, it is not included in SWatCh because the data do not adhere to the DS-WQX data schema due to missing variable fractionation information. These types of published research datasets are uncommon (Alsheikh-Ali et al., 2011) and highlight the potential contribution of unpublished raw research data.
Critical data gaps exist across large areas on the African, Asian, Australian, and Antarctic continents, mainly representing the equatorial, arid, snowy, and polar climate zones (Kottek et al., 2006). The zones of missing data represent regions where freshwater acidification is an emerging issue (e.g., in China; Li et al., 2019) and regions where the climate-change-induced alteration of freshwater discharge regimes is projected to be the greatest by 2050 (Döll and Zhang, 2010).
The lower data coverage in some of these regions represents a limitation in the development of global water chemistry models (Harrison et al., 2005a, b), and it may inhibit the detection -and therefore treatment -of emerging water quality problems related to the climate-change-induced perturbation of freshwater discharge regimes. The observed lower data availability may be because of our reliance on English datasets, less data sharing in these regions due to concerns about "parachute research" (where researchers abscond with local data to their home countries) (Serwadda et al., 2018), a lack of funding for scientific research (Serwadda et al., 2018), a lack of national data-sharing regulations (Serwadda et al., 2018;Thu and Wehn, 2016), or outdated information management systems (Thu and Wehn, 2016).
Despite the aforementioned data gaps, some of the most acidified regions of the world can be studied with a high degree of detail and certitude using SWatCh. Sites with sufficient available parameters (as defined in Sect. 3) and time series length (i.e., 10–15 years) to study drivers and trends are concentrated in the Northern Hemisphere, and they encompass some of the most acidified regions of North America and Europe (Fig. 3; Björnerås et al., 2017; Clair, 2012; Clair et al., 2011; Driscoll et al., 2016). We chose a time series length of 10–15 years because this is the minimum duration required to distinguish between short-term hydrological variability and underlying system behaviour (Howden et al., 2011), and it is also comparable to time series lengths commonly used to study freshwater acidification (e.g., Burns et al., 2008; Clair et al., 2011; Driscoll et al., 2016). Based on the available parameters for locations with a 10- or 15-year time series, the following aspects of freshwater acidification can be studied using SWatCh: acidification stage, extent of base cation depletion, catchment buffering processes, the importance of natural and/or weak acids, and other influential catchment-scale processes. The lack of water chemistry data relevant to freshwater acidification in some regions may be related to historical preferential research focus: freshwater acidification research has historically been focused predominantly on Europe and North America (e.g., Björnerås et al., 2017; Holland et al., 2005; Stoddard et al., 1999), where this is an established environmental issue, and less focused on other regions such as China, where this is an emerging concern (e.g., Li et al., 2019).
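A record-length screen of the kind described above might look as follows. The sketch measures length as the span between the first and last sample dates and ignores gaps within the record; both choices, along with the function names, are assumptions for illustration.

```python
from datetime import date


def series_length_years(sample_dates):
    """Span between the earliest and latest sample, in years."""
    return (max(sample_dates) - min(sample_dates)).days / 365.25


def has_minimum_record(sample_dates, min_years=10):
    """True when a series has at least two samples spanning `min_years`."""
    return len(sample_dates) >= 2 and series_length_years(sample_dates) >= min_years


# Invented sampling dates for one site and one parameter.
dates = [date(1995, 5, 1), date(2001, 7, 12), date(2008, 9, 15)]
meets_10_years = has_minimum_record(dates, min_years=10)
meets_15_years = has_minimum_record(dates, min_years=15)
```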
Alleviating the issue of data availability is complex (Serwadda et al., 2018) but can be facilitated through journals more consistently implementing and enforcing data-sharing policies (Alsheikh-Ali et al., 2011), ensuring coherence with and balance between data-sharing policies and protecting national interests (Thu and Wehn, 2016), and engaging and crediting the people and organizations collecting the data (Serwadda et al., 2018).

Methodology changes and dissimilarity
The analysis of time series and the intercomparison of data collected at different sites is challenging due to the dissimilarity of sample collection programs and methodology changes. Methodology changes throughout a time series may result in spurious trend test results. For example, at site AL05BE0013, located in the Bow River approximately 4.5 km upstream of Canmore, Alberta, Canada, dissolved Al was analyzed using Value Method Variable (VMV) 100195 prior to 2003 and VMV 107941 and 97963 after 2003 (Fig. 4). VMV 107941 and 97963 both use inductively coupled plasma mass spectrometry (ICP-MS) and have comparably low detection limits, whereas VMV 100195 uses inductively coupled argon plasma emission spectroscopy (ICAP) and has a higher detection limit. Because most values in this time series are lower than the detection limit for VMV 100195 (i.e., 20 µg L⁻¹), analysis of the time series without removing samples analyzed via VMV 100195 would result in the detection of a spurious negative trend. Similarly, disparate analysis methods across geographic regions may hinder the comparability and consolidation of data collected by different sources (World Health Organization and United Nations Children's Fund, 2017). For example, in the USA, Al samples may be analyzed by US EPA method 200.7, with an estimated detection limit of 45 µg L⁻¹ (US EPA, 2015), whereas Al samples in Europe may be analyzed by ISO method 15586:2003, with an estimated detection limit of 1 µg L⁻¹ (ISO/TC 147/SC2, 2003); samples analyzed using these two methods cannot be compared if Al concentrations are below 45 µg L⁻¹. Trend analysis also cannot be performed robustly if different sample fractions are present throughout the time series. For example, Environment and Climate Change Canada (ECCC) analyzed the unfiltered Al fraction as extractable Al (Al_ext; comprising the dissolved fraction and weakly bound or sorbed molecules) prior to 2011 in Atlantic Canada and as total Al (Al_t; comprising dissolved, weakly bound or sorbed, and particulate molecules) after 2011 (Rotteveel and Sterling, 2020). To facilitate intercomparison of data and trend analysis, the creation of internationally standardized variable definitions and cross-boundary analysis methodology is needed (World Health Organization and United Nations Children's Fund, 2017).
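One way a user might guard against the detection-limit problem described above is to re-censor all values at the highest detection limit among the methods being combined. This is a commonly used precaution rather than a step taken in SWatCh, and the function name and data layout below are hypothetical; the detection limits are the estimates quoted in the text, and the concentrations are invented.

```python
def censor_to_common_limit(samples):
    """Re-censor (value, detection_limit) pairs at the highest detection limit.

    Values below the common limit are returned as None (censored), because
    they cannot be compared with results from the less sensitive method.
    """
    common_limit = max(limit for _, limit in samples)
    harmonized = [value if value >= common_limit else None for value, _ in samples]
    return harmonized, common_limit


# Al concentrations in ug/L: one result with a 45 ug/L limit (as quoted for
# US EPA method 200.7) and three with a 1 ug/L limit (as quoted for ISO 15586:2003).
samples = [(60.0, 45.0), (12.0, 1.0), (50.0, 1.0), (44.0, 1.0)]
harmonized, common_limit = censor_to_common_limit(samples)
```

The cost of this approach is a loss of information from the more sensitive method, which is one reason the text recommends standardized analysis methodology instead.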

Ambiguity and inconsistency
We encounter ambiguity and inconsistency in variable and fraction naming conventions, reporting units, analysis methodology, and dataset encoding. Firstly, we find variable and fraction definitions and consistency to be lacking in most input datasets. For example, a dissolved Al (Al_d) sample may be filtered through a 0.45 or 0.10 µm filter; both samples are considered Al_d but represent a different set of Al molecules. As naming conventions are variable and there are no internationally standardized variable definitions (World Health Organization and United Nations Children's Fund, 2017), defining variables and their fractions is required to prevent confusion regarding comparability. Similarly, reporting units and censored data notation should be defined and consistent throughout the dataset; this includes spelling, abbreviations, and capitalization. We also observe ambiguity regarding analysis methodology, where analysis methods are inadequately described or are missing entirely. Ideally, analysis method reporting includes all of the following which are applicable: filter size and type; analysis instrument; acid preservative type; location of acid preservation (in field or laboratory); and the analysis/speciation method, method code, its publishing agency, and a link to a reference document. Lastly, we encounter corrupted characters due to unknown dataset encoding. To prevent this ambiguity, the encoding of the dataset should be known and published; this is especially important for datasets not encoded in 8-bit Unicode (UTF-8), which is preferred for data exchange (ISO/IEC JTC 1/SC 2, 2017).
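The encoding problem can be reproduced in a few lines. The sketch below assumes a unit string that was written in Latin-1 (ISO 8859-1), which is only one of many possible legacy encodings; it shows that decoding with the published encoding recovers the text, whereas assuming UTF-8 either fails or yields a replacement character.

```python
# A unit string as it might be stored in a Latin-1 encoded file.
raw_bytes = "\u00b5g/L".encode("latin-1")  # the micro sign becomes the byte 0xB5

# Decoding with the correct (published) encoding recovers the original text.
decoded_correctly = raw_bytes.decode("latin-1")

# Assuming UTF-8 and replacing undecodable bytes corrupts the unit.
decoded_wrongly = raw_bytes.decode("utf-8", errors="replace")

# Strict decoding at least makes the problem visible instead of hiding it.
try:
    raw_bytes.decode("utf-8")
    strict_utf8_failed = False
except UnicodeDecodeError:
    strict_utf8_failed = True
```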

Limitations and future work
In addition to the challenges noted above, the main limitations of SWatCh are a lack of discharge data and a lack of information on watershed land use and land cover. We did not include discharge information, as there are numerous openly available global-scale river discharge datasets that cover some of the sites available in the SWatCh database – for example, those available via the European Environment Agency's Waterbase or the Global Runoff Data Centre. Further development is needed to integrate existing discharge datasets into SWatCh, allowing discharge-weighted water chemistry concentrations to be computed. The DS-WQX schema does not allow for the inclusion of watershed information such as land use and land cover; thus, we do not include these data in SWatCh. Some of this information is available in the input datasets – for example, the GloRiCh database (Hartmann et al., 2014). Catchment characteristics can be identified for sites by using existing global datasets such as HydroATLAS, which provides information on hydrologic, physiographic, climate, land use and land cover, soils and geology, and anthropogenic influences for catchments at a resolution of up to 15 arcsec (approximately 463 m at the Equator) (Linke et al., 2019).

Data availability
The SWatCh database is available on Zenodo (https://doi.org/10.5281/zenodo.6484939; Rotteveel and Heubach, 2021). No account or sign-up is required to download the data. SWatCh is composed of third-party data, as listed in Table 1. The SWatCh database is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, the reader should visit http://creativecommons.org/licenses/by-nc-sa/4.0/ (last access: 1 May 2022) or send a letter to Creative Commons, P.O. Box 1866, Mountain View, CA 94042, USA. GEMStat data, 7401 sites (21.95 % of sites), are not available in SWatCh due to a publication ban. Users may add these data by requesting the GEMStat dataset from the United Nations Environment Programme and running the SWatCh data processing scripts available from the GitHub repository indicated below (Sect. 6).

Conclusion
Prior research has demonstrated that, despite variability in sample size, geographic coverage, and analysis methodology, large-sample datasets facilitate the understanding of global water chemistry processes and the identification of transboundary problems (e.g., Björnerås et al., 2017;Monteith et al., 2007;Weyhenmeyer et al., 2019). Despite these clear benefits, there are few global-scale water chemistry datasets. We created SWatCh to begin to fill this gap; it is a global database of surface water chemistry focused on freshwater acidification-related variables. This database contains water chemistry data across 24 variables, four variable fractions, 33 722 sites, and 5 062 980 unique samples collected between 1960 and 2022. The numerous available variables and large sample sizes in SWatCh allow users to conduct powerful and robust statistical analyses to answer emerging global surface water chemistry questions. To facilitate data use in databases like SWatCh and by other researchers, we recommend making research data openly available; standardizing analysis methodology; and avoiding ambiguity/inconsistency in variable and fraction names, reporting units, censored data notation, analysis method descriptions, and dataset encoding. Future work should focus on filling the spatial data gaps identified in Asia, Africa, and Australia as well as on adding discharge data. With more people experiencing decreased water quantity (Burek et al., 2016;Mekonnen and Hoekstra, 2016), maintaining water quality is paramount. By facilitating the global exchange of their data, researchers can contribute toward this goal.

Author contributions. LR conceived the original idea, compiled and prepared the data, co-developed the data processing scripts, conducted the geospatial information system (GIS) procedures, conceptualized and prepared the figures and tables, and was the principal author. FH wrote the data validation scripts and co-developed the data processing scripts. SMS provided supervision and co-edited the manuscript.