The Surface Water Chemistry (SWatCh) database: A standardized global database of water chemistry to facilitate large-sample hydrological research
- Sterling Hydrology Research Group, Dalhousie University, Halifax, B3H 4R2, Canada
- Sterling Hydrology Research Group, Dalhousie University, Halifax, B3H 4R2, Canada
Abstract. Openly accessible global scale surface water chemistry datasets are urgently needed to detect widespread trends and problems, to help identify their possible solutions, and identify critical spatial data gaps where more monitoring is required. Existing datasets are limited in availability, sample size/sampling frequency, and geographic scope. These limitations inhibit the answering of emerging transboundary water chemistry questions, for example, the detection and understanding of delayed recovery from freshwater acidification. Here, we begin to address these limitations by compiling the global surface water chemistry (SWatCh) database, available on Zenodo (DOI: https://doi.org/10.5281/zenodo.4559696). We collect, clean, standardize, and aggregate open access data provided by six national and international agencies to compile a database consisting of three relational datasets: sites, methods, and samples, and one GIS shapefile of site locations. We remove poor quality data (for example, values flagged as suspect), standardize variable naming conventions and units, and perform other data cleaning steps required for statistical analysis. The database contains water chemistry data across seven continents, 17 variables, 38,598 sites, and over 9 million samples collected between 1960 and 2019. We identify critical spatial data gaps in the equatorial and arid climate regions, highlighting the need for more data collection and sharing initiatives in these areas, especially considering freshwater ecosystems in these environs are predicted to be among the most heavily impacted by climate change. We identify the main challenges associated with compiling global databases – limited data availability, dissimilar sample collection and analysis methodology, and reporting ambiguity – and provide recommendations to address them. By addressing these challenges and consolidating data from various sources into one standardized, openly available, high quality, and trans-boundary database, SWatCh allows users to conduct powerful and robust statistical analyses of global surface water chemistry.
Lobke Rotteveel and Shannon M. Sterling
Status: final response (author comments only)
- CC1: 'Comment on essd-2021-43', Mary Kruk, 13 Sep 2021
Hello,
With your manuscript currently under review I thought this would be a good opportunity to highlight a Canadian water quality database that currently addresses some of the challenges outlined in your paper.
I work on DataStream, an open access platform for sharing water quality data. It allows users to access, visualize, and download water quality datasets collected by monitoring programs across regional hubs in Canada (the Mackenzie Basin, Lake Winnipeg Basin, Atlantic Canada, and a Great Lakes hub coming in October). A main focus of DataStream is to help community monitoring groups, citizen scientists, researchers, and governments share their data at a regional scale by adopting the US EPA/USGS WQX data standard to promote data (re)use and interoperability in transboundary watersheds.
We thought it would be relevant to reach out because DataStream has faced many of the same challenges you address in your paper, such as differing sample collection/analytical methods, reporting ambiguity, and spatial data gaps across Canada. We have found that the adoption of the WQX schema, used in the US Water Quality Portal, has helped us to align data collected by a wide range of monitoring initiatives. DataStream requires metadata on sample collection and analytical methods with each data point and reduces variable naming ambiguity by using the WQX list of allowed values for water chemistry parameters. We are constantly trying to evolve and improve the DataStream data standard and platform to better address these issues; as I’m sure you are aware, it is a large undertaking.
Given the alignment between your area of research and our work with DataStream I would encourage you to review the DataStream schema (https://github.com/gordonfn/schema) for consideration in your manuscript.
Sincerely,
Mary Kruk
Water Data Specialist
The Gordon Foundation
- RC1: 'Comment on essd-2021-43', Anonymous Referee #1, 11 Dec 2021
The manuscript by Rotteveel and Sterling presents the global surface water chemistry (SWatCh) database, which contains data for 17 variables (Al, Fe, major ions, nutrients, organic C, pH, etc.) from 9 million samples collected between 1960 and 2019. This database has the specific purpose of supporting research on surface water acidification. To create this database, the authors used data from 6 existing hydrochemical databases/datasets, which they put in a uniform format, and then removed samples that were flagged as problematic, as well as duplicates that exist because some of the databases used have culled data from the other databases.
I was able to download and use SWatCh without any problem. The download process is straightforward, and the database is easy to use.
While it is a very important task to assemble available data into such a large, publicly available database, I feel that the authors have done a very poor job with regard to quality checks. They only discarded data that was already flagged as problematic, or which had very clearly unrealistic values, which was limited to negative values for concentrations. I think a much more robust quality check would be required to publish this dataset with an article in ESSD. Further, I feel that the authors did a rather poor job at analysing and presenting the data. For these two points, please see my major comments further below. Finally, I would like to highlight that this database does not contain any data on alkalinity, acid neutralisation capacity (ANC), DIC or HCO3- concentrations. This information can be found in at least a few of the databases from which data was taken for SWatCh. More importantly, these parameters represent the buffering capacity of a surface water body against acidification, and would thus be of huge importance for the study of surface water acidification and recovery. It is completely incomprehensible to me why these parameters were not included in SWatCh.
I suggest that major revisions are necessary before this study can be considered for publication in ESSD. Please see my major and general comments below:
Major comment #1: Quality checks of database
You should check whether all parameter values are reasonable, even if they are not flagged. You should check, for instance, for unreasonably high values, which can be due to mistakes made with the units (in particular for a database like GloRiCh, where data was assembled from many different datasets). If, for instance, mg and ug (or mM and uM) have been mixed up at some point (that could already be a mistake in the dataset you are taking data from), this might lead to errors of three orders of magnitude. I would suggest first defining a realistic value range for each parameter. Then, for all values lying outside of that range, you should first check whether that concerns only one value in a time series, the whole time series of a sampling site, or all values of a certain data source (note that, for instance, GloRiCh gives references for the data sources it used, and such mistakes might already be present in GloRiCh). If extreme values concern one specific sampling location, it might be worth investigating whether that is due to an exceptional site. For instance, extremely high F- concentrations might be due to hydrothermal influence. Extreme PO4- concentrations can be due to phosphate deposits, like in the Peace River catchment, Florida. Sediments from dried out lakes might yield high concentrations in NaSO4. Etc.
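To illustrate the kind of range check suggested here, a minimal Python sketch follows. The record layout (`site`, `parameter`, `value`), the plausible ranges, and the example values are all hypothetical and are not taken from the manuscript or from SWatCh; realistic ranges would have to be defined per parameter by the authors.

```python
from collections import defaultdict

# Hypothetical plausible ranges per parameter (mg/L, pH unitless).
# Illustrative values only; not taken from the manuscript or from SWatCh.
PLAUSIBLE_RANGE = {"Ca": (0.0, 1000.0), "pH": (0.0, 14.0)}


def out_of_range(records, ranges=PLAUSIBLE_RANGE):
    """Return the records whose value lies outside the range of its parameter."""
    flagged = []
    for rec in records:
        bounds = ranges.get(rec["parameter"])
        if bounds is None:
            continue  # no range defined for this parameter
        low, high = bounds
        if not (low <= rec["value"] <= high):
            flagged.append(rec)
    return flagged


def scope_of_problem(records, flagged):
    """For each (site, parameter) with flagged values, report whether the
    whole series or only isolated values are affected."""
    total = defaultdict(int)
    bad = defaultdict(int)
    for rec in records:
        total[(rec["site"], rec["parameter"])] += 1
    for rec in flagged:
        bad[(rec["site"], rec["parameter"])] += 1
    return {
        key: "whole series" if bad[key] == total[key] else "isolated values"
        for key in bad
    }


# Made-up example: site B looks like a unit mix-up (ug/L stored as mg/L).
records = [
    {"site": "A", "parameter": "Ca", "value": 12.0},
    {"site": "A", "parameter": "Ca", "value": 15000.0},
    {"site": "A", "parameter": "Ca", "value": 9.5},
    {"site": "B", "parameter": "Ca", "value": 22000.0},
    {"site": "B", "parameter": "Ca", "value": 18000.0},
    {"site": "B", "parameter": "pH", "value": 7.1},
]
flagged = out_of_range(records)
scope = scope_of_problem(records, flagged)
```

A series that is flagged as a whole points towards a unit problem or an exceptional site, whereas isolated flagged values are more likely individual errors; the same grouping could be repeated per data source.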
For each sampling location, you should look at the time series and try to identify potential outliers within it. For each outlier, you might want to check whether other parameters are also affected, which could mean that either something exceptional has happened or that data from another sampling location has been wrongly attributed. In any case, you should flag those values. You cannot assume that all suspicious data has already been flagged accordingly, in particular as the data comes from very different sources, and some of them, like GloRiCh, are themselves assembled from different sources with different degrees of quality checks.
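One possible way to flag outliers within a single time series is sketched below in Python. It uses a modified z-score based on the median and the median absolute deviation (MAD), which is only one of several reasonable choices and is not a method described in the manuscript; the threshold of 3.5 and the example pH values are assumptions made for illustration.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Flag values whose modified z-score (median/MAD based) exceeds threshold.

    Returns a list of booleans aligned with `values`.
    """
    med = median(values)
    deviations = [abs(v - med) for v in values]
    mad = median(deviations)
    if mad == 0:
        # No spread around the median: the score is undefined, flag nothing.
        return [False] * len(values)
    return [0.6745 * d / mad > threshold for d in deviations]


# Made-up pH series for one hypothetical site; the fifth value is suspicious.
series = [6.8, 7.0, 6.9, 7.1, 2.1, 7.0, 6.9]
flags = flag_outliers(series)
```

Flagged values would be marked rather than deleted, so that a user can still check whether other parameters of the same sample are affected.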
Major comment #2: Presentation of database
Your results section is very short, and your discussion section doesn’t make many links to your own results. Figure 2 is a good beginning to represent the available data, but it would be more interesting if the spatial coverage was represented separately for different types of inland water bodies. It is not clear at all from your manuscript how well lakes vs. reservoirs vs. rivers are represented in that database.
It would also be interesting to know the numbers of samples per water body type that have measurements for a specific combination of parameters that are interesting with regard to acidification, like: How many samples are there with all major ions and pH? Here you should maybe start with an overview of which combinations of parameters are usually used to study acidification. I guess samples where only sodium or phosphate was measured are not that interesting. Maybe you can make a ranking of parameter combinations that allow you to study acidification with a different degree of conclusiveness and certitude. And then list the number of samples that have measurements for these parameter combinations, and do that separately for different kinds of water bodies (lakes, reservoirs, canals, ditches, rivers, etc.) and different world regions (at least continents, or major biomes/climate zones). You should also give an overview of which time periods are covered in different parts of the world. That would be very important if you want to investigate temporal trends in acidification recovery.
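The counting suggested here could be sketched as follows in Python. The parameter set `REQUIRED`, the record layout (`water_body`, `measured`), and the example samples are hypothetical; which combinations are actually relevant for acidification studies is for the authors to define.

```python
from collections import Counter

# Hypothetical combination: pH plus a set of major ions.
REQUIRED = {"pH", "Ca", "Mg", "Na", "K", "Cl", "SO4"}


def count_complete_samples(samples, required=REQUIRED):
    """Count, per water body type, the samples that have a measurement
    for every parameter in `required`."""
    counts = Counter()
    for sample in samples:
        if required <= set(sample["measured"]):
            counts[sample["water_body"]] += 1
    return counts


# Made-up samples for illustration.
samples = [
    {"water_body": "river", "measured": ["pH", "Ca", "Mg", "Na", "K", "Cl", "SO4"]},
    {"water_body": "river", "measured": ["Na"]},
    {"water_body": "lake", "measured": ["pH", "Ca", "Mg", "Na", "K", "Cl", "SO4", "Al"]},
    {"water_body": "lake", "measured": ["pH", "Ca"]},
]
complete = count_complete_samples(samples)
```

The same function can be called with several parameter sets, and the grouping key can be swapped for continent or climate zone to produce the overview per region.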
You should also think about presenting data density (number of sites, number of samples per site, average length and frequency of time series, etc.) for different types of inland waters as a map. You could take inspiration from figure 2a in Regnier et al. 2013 (Nature Geoscience, DOI: 10.1038/ngeo1830), who created a density index for pCO2 values.
You should also take into account global geodatasets that allow for regional classification of water bodies, like for instance the HydroAtlas (Linke et al. 2019, https://doi.org/10.1038/s41597-019-0300-6). In this way, you could make more qualified statements about which kinds of rivers or lakes are underrepresented in your dataset. You state you cannot link your chemistry data to catchment properties, but with HydroAtlas you could get a good idea of what kinds of river-catchment systems are well represented and what kinds are underrepresented.
General comments:
L74-76: How did you perform those checks? Where can I see the results? I think a quality assessment of this kind is very important.
L85-86: I wonder how you identified these “untreated” water bodies. I know that in GloRiCh this information is not given. Here, some analysis of the water chemistry data itself could have been useful to spot suspicious cases, for which some investigation could have been performed based on the location information.
L88: By “phosphorus”, do you mean “total phosphorus”?
L91-92: Error message instead of reference.
Section 4.2: When discussing the effects of methodological changes on time-series data, you should combine that with an analysis of at least the longest time series you have in your database.
Section 4.4: Here you mention that you often do not have the discharge data associated with the water chemistry data. Did you try to match the river water sampling locations with stream gauges from the Global Runoff Data Centre (GRDC, https://www.bafg.de/GRDC/EN/Home/homepage_node.html)?
- RC2: 'Comment on essd-2021-43', Anonymous Referee #2, 03 Feb 2022
The authors present a newly created database on the chemical composition of surface waters. The database comprises several source databases from which specific parameters/variables are extracted and unified for the specific purpose of providing a database for surface water acidification research.
The collection and harmonization of data on water chemistry is very important to the research community, as it enables more refined global analyses of matter fluxes, temporal developments, climate change impacts, and many more.
The manuscript addresses an important data topic, which makes it worth publishing. However, due to the points stated below, I recommend a major revision.
Data quality
I would argue, from a personal viewpoint, that if the goal is to provide global coverage of data to enable global cross-boundary evaluation of surface waters, it may not be very important to have a high data quality, as the available amount of data will level out “outliers” or differences in the data analyses from a statistical viewpoint.
Data harmonisation
The calls for a unified approach in all future data collections are very noble, but I doubt that they will be heard. Data producing authorities very often have their own, historically grown structures and formats, which are so convoluted and unpredictable that it would be hopeless to expect a globally unified data structure.
Data selection
The authors state that the parameters were specifically selected to evaluate surface water acidification, however I would argue that the most important parameter in this regard is missing: total alkalinity (TA). This is reported in some of the used sources, even if it may be in awkward units sometimes. The TA is fundamental for the understanding of the carbonate system and the interaction of CO2 and natural waters. Alternatively, dissolved inorganic carbon could be included, or both parameters, where available, to be able to calculate the missing parts of the carbonate system (TA and pH or DIC and pH enable the calculation of DIC or TA, respectively). Furthermore, the inclusion of TA would enable the calculation of a charge balance, which could provide an indicator for the data quality.
Database structure and presentation
I really appreciate the approach of publishing the scripts for the database on GitHub. This makes the work very transparent and should be an example for all scientists working with complex data processing.
The chosen format of the data is slim and straightforward; however, for the average end user, the relational style of the files may present a potential problem, as the data cannot be filtered and used as is, but has to be transformed. It may be an advantage (not a requirement) to provide a Python script that converts the data into the “classical” column-row format. It may, however, increase the file size to an extent that makes it hard to handle.
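Such a conversion script could look roughly like the following Python sketch. The field names (`site_id`, `date`, `parameter`, `value`) are hypothetical and may differ from the actual SWatCh column names; the sketch only shows the pivot from one row per measurement to one row per site and date.

```python
def long_to_wide(rows):
    """Pivot long-format rows (one measurement per row) into one record per
    (site_id, date), with one key per measured parameter."""
    wide = {}
    for row in rows:
        key = (row["site_id"], row["date"])
        record = wide.setdefault(
            key, {"site_id": row["site_id"], "date": row["date"]}
        )
        # If a parameter occurs twice for the same site and date,
        # the later row overwrites the earlier one.
        record[row["parameter"]] = row["value"]
    return list(wide.values())


# Made-up rows for illustration.
rows = [
    {"site_id": "S1", "date": "2001-05-03", "parameter": "pH", "value": 6.4},
    {"site_id": "S1", "date": "2001-05-03", "parameter": "Ca", "value": 2.1},
    {"site_id": "S2", "date": "2001-05-03", "parameter": "pH", "value": 7.2},
]
table = long_to_wide(rows)
```

Because every parameter becomes a column, sparse data produce many empty cells, which is the file size concern raised above.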
Regarding the units, the choice of weight units is okay but may lead to the need to recalculate to molar units as this is needed in geochemical calculations (e.g., charge balance, ratios, chemical formulas)
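As an illustration of that recalculation, the Python sketch below converts mass concentrations (mg/L) to mmol/L and meq/L and computes a charge balance error in percent, using the common definition 100 × (cations − anions) / (cations + anions) in meq/L. The ion list and the example sample are assumptions; in particular, the bicarbonate entry presumes an alkalinity-type measurement is available, which the reviews note is currently not part of SWatCh.

```python
# Molar mass (g/mol) and ionic charge for a few major ions.
IONS = {
    "Ca": (40.078, 2),
    "Mg": (24.305, 2),
    "Na": (22.990, 1),
    "K": (39.098, 1),
    "Cl": (35.453, -1),
    "SO4": (96.06, -2),
    "HCO3": (61.017, -1),  # assumes bicarbonate/alkalinity data are available
}


def to_mmol(ion, mg_per_l):
    """Convert a mass concentration in mg/L to mmol/L."""
    molar_mass, _ = IONS[ion]
    return mg_per_l / molar_mass


def to_meq(ion, mg_per_l):
    """Convert a mass concentration in mg/L to meq/L."""
    _, charge = IONS[ion]
    return to_mmol(ion, mg_per_l) * abs(charge)


def charge_balance_error(sample):
    """Charge balance error in percent for a dict of ion -> mg/L."""
    cations = sum(to_meq(i, c) for i, c in sample.items() if IONS[i][1] > 0)
    anions = sum(to_meq(i, c) for i, c in sample.items() if IONS[i][1] < 0)
    return 100.0 * (cations - anions) / (cations + anions)


# Made-up, perfectly balanced sample: 2 meq/L Ca against 2 meq/L Cl.
sample = {"Ca": 40.078, "Cl": 70.906}
cbe = charge_balance_error(sample)
```

A charge balance error far from zero would indicate either a missing major ion or a measurement or unit problem, which is why it can serve as a quality indicator.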
Text quality
There are several typos, duplications and wording issues in the text. I mention some of them below. Overall, the text could benefit from a revision, which clears out the errors but more specifically narrows focus on the specific arguments for the need of a new and harmonized database.
Specific comments
L8 2x “identify”
L18 Define the need for more data collection – how would that improve global models? Little data from arid regions may also be due to the fact that there are fewer surface waters
L19 “Environs”
L21/22 2x “address”
L29 “a number projected…” is meant to refer to the 4 bln people, but as it stands in the text it rather refers to “at least one month”
L30 “these resources” – which?
L36 Define “transboundary problem”
L47 When it comes to the fate and behavior of compounds in natural water, I would argue, the catchment scale is a good and proven approach. I may not understand the term “transboundary” in your sense, but why should we look transboundary if fluxes are “confined” in catchments anyway? Isn’t this the very idea of catchments, to have all waters included in one larger scale area?
L49ff Yes, catchment waters will be influenced by land cover and geology, but so are observations on larger scales.
L51 “affected”
L79 I understand the point that the authors want to make here, however, the example may be a bit too tightly defined. Looking for “water chemistry database sweden” yields the website of the water information system VISS (https://viss.lansstyrelsen.se), I don’t know if data is extractable there but it seems that it is a good starting point. With this approach and a slight variation in search terms, more data should be discoverable.
L94f Can you state how much data was discarded, in %? Maybe leave the data in the dataset but provide a flag so that users can decide based on their needs?
L107 “simplied”
L107 2x “reduce storage requirements”
L129 Replace “standardized” with “harmonized” as probably most coordinates adhere to some kind of standard.
L146 Cost may be one reason but also, these are the most relevant parameters for many fields of research.
L148 What do you mean with under-reported results? Unclear.
L150f If no location data, I can understand the point. But w/o method information, it could still be interesting data in a global context (see argument above).
L156 Unclear logical connection between data gaps and discharge dependency.
L168 “people”
L168 “who collected” -> “collecting”
- AC1: 'Responses to Community and Reviewer Comments', Lobke Rotteveel, 04 Mar 2022
Dear reviewers,
Thank you very much for your thoughtful and detailed review of our manuscript. You have provided some excellent suggestions which we believe will increase the quality of our manuscript and associated dataset. We have responded to each of your comments in the attached PDF file.
Best regards,
Lobke and Shannon
Data sets
The Surface Water Chemistry (SWatCh) database Lobke Rotteveel https://doi.org/10.5281/zenodo.4559696
Model code and software
The Surface Water Chemistry Database (SWatCh) Lobke Rotteveel https://github.com/LobkeRotteveel/SWatCh