the Creative Commons Attribution 4.0 License.
QUADICA: water QUAlity, DIscharge and Catchment Attributes for large-sample studies in Germany
Rohini Kumar
Stefanie R. Lutz
Tam Nguyen
Fanny Sarrazin
Michael Weber
Olaf Büttner
Sabine Attinger
Andreas Musolff
- Final revised paper (published on 17 Aug 2022)
- Preprint (discussion started on 01 Mar 2022)
Interactive discussion
Status: closed
- RC1: 'Comment on essd-2022-6', Anonymous Referee #1, 01 Apr 2022
Overall, I find this manuscript and dataset to be valuable and accessible. However, I suggest some revisions and changes that I believe will improve the clarity and usability of this dataset. I present my suggestions for revisions to the text of the manuscript and to the content and presentation of the dataset below:
Manuscript Comments
Line 44 While I absolutely agree that the compilation and dissemination of high quality, comprehensive datasets is valuable, data-driven science has always been and will always be constrained by data availability. Therefore, I find statements such as “… harmonized and quality controlled large-sample water quality and quantity data are still not widely available” to be subjective, difficult to evaluate, and unnecessary. I suggest instead emphasizing a more specific description of the significance of this dataset in the context of other large hydrologic data sources, which the authors do elsewhere in the introduction.
Line 57 Awkward language, consider rephrasing. I suggest “… recent large-sample water quality studies have provided a basis for increasing our understanding of catchment functioning...”
Line 70 In addition to their utility in addressing the questions raised here, large-sample, high quality, accessible datasets can also support uses that are unanticipated by their authors. I think that this benefit of large-sample datasets is worth mentioning in this paragraph, and I have some suggestions below for ways to achieve this (in general, to provide a curated dataset while preserving all information, even information that may not seem useful today).
Line 137 I appreciate the clear description of the inclusion criteria used, yet I would appreciate a more detailed description of the criteria for outlier removal.
Line 145 This river network would be a valuable inclusion in the dataset. While the end user can create an approximation by using similar parameters (100m DEM, D8, and a 10m burn in), the quality control and manual adaptations described here make a product that is unique to this analysis. Having access to this river network could support additional analyses that are currently not possible, and that might depend on the exact alignment between the river segments, sampling stations, and catchments.
Line 186 To my understanding, there is no ‘confidence interval’ associated with this method for excluding outliers. If the distribution of the data is correctly represented by the log-normal model, then 1 of 10000 values would be expected to exceed the specified threshold. Given a large enough dataset (which we have here!), the presence of such values would be expected, even in the absence of any errors that would warrant the exclusion of such data points.
Further, because extreme concentrations of a solute are likely to result from uncommon mechanisms which are not likely to be accounted for in any general distribution model that describes the ‘normal’ behavior, I am skeptical of the use of such distribution models to identify outliers. For example, in my region, there is a small lake which hosts an enormous and unusual population of migratory geese for a few days a year. Whether examined across space or time, macronutrient concentration values from this circumstance appear as outliers, yet in fact they may describe this rare event accurately. The exclusion of this extreme data would appear reasonable to anyone not familiar with this particular circumstance, yet would do a disservice to future users of the dataset.
Unfortunately, I have no perfect method for separating unusual ‘outliers’ from erroneous ‘outliers’. Instead, I suggest that the complete raw dataset should be provided, to allow future users the freedom to develop their own approach to this issue, or to specifically examine the characteristics of this extreme data. If possible, this raw data could be accompanied by a ‘QC’ column indicating the result of the authors’ entirely reasonable but necessarily imperfect inclusion criteria.
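The flag-instead-of-remove idea can be sketched as follows. This is a minimal illustration, not the authors' procedure: it assumes a one-sided threshold at the upper quantile of a log-normal fit with an exceedance probability of 1 in 10000, as described in the comment above, and the function names `flag_lognormal_outliers` and `expected_exceedances` are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def flag_lognormal_outliers(values, exceed_prob=1e-4):
    """Flag, rather than drop, values above a log-normal quantile threshold.

    Assumption: positive concentrations and a one-sided upper threshold at
    the (1 - exceed_prob) quantile of a log-normal fit. Returns the
    threshold and one boolean flag per input value; no value is removed.
    """
    logs = [math.log(v) for v in values]
    fit = NormalDist(mean(logs), stdev(logs))  # normal fit in log space
    threshold = math.exp(fit.inv_cdf(1.0 - exceed_prob))
    return threshold, [v > threshold for v in values]


def expected_exceedances(n_values, exceed_prob=1e-4):
    """Number of values expected above the threshold if the model is correct."""
    return n_values * exceed_prob
```

Under these assumptions, a dataset of one million correctly modelled values would be expected to contain about 100 values above the threshold without any of them being erroneous, which is why a flag column preserves more information than exclusion.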
Line 214 I recognize the value of the WRTDS analysis, but I think that the data underlying this analysis is more valuable than the analysis itself in this context. Is all the data that underlies this analysis present in this dataset? I believe it is, but I would like to see a clear statement to this effect.
Line 277 The relationship between these gap-filling and bias-correcting methods and the dataset is unclear. Are data from these methods included in the dataset, or only used in fitting the WRTDS models? If the ‘corrected’ data are included in the data tables, I think they should be identified as such.
Line 307 I remain unsure of the N sinks included in this calculation. Crop harvest is mentioned as an N ‘output’, and I see no other sinks mentioned. This should be clarified.
Line 345 N deposition on impervious urban surfaces is not counted as a diffuse N source, but I do not see where it is accounted for.
Line 372 Although they may be beyond the scope of this dataset, I suggest that attributes of the rivers may also provide valuable information. Relevant attributes include riparian or floodplain development (urban or agricultural), geomorphic context (e.g., valley confinement), and the presence or absence of impoundments.
Line 524 When allowable, I suggest that the inclusion of disaggregated (raw) data is worthwhile. However, I recognize that the limitations mentioned here may describe much of the source data for this dataset product. I suggest that the authors make an effort to include as much raw data as possible.
Dataset Comments
I am able to load, combine, and manipulate the two spatial datasets and all of the .csv data without issue. I appreciate that the catchment polygons overlap, with each polygon representing the complete catchment associated with a station. I also appreciate the clear OBJECTID field, usable to join attributes among the catchments, stations, and tabular data. I also appreciate the description of the various columns in the metadata document, and the consistent units used between fields.
I may be out of touch with GIS data norms, but I consider the shapefile format to be antiquated and limiting. If the spatial data were instead presented as a geopackage, any limitation on column names (and the number of columns) would be removed, which would aid in the analysis of this comprehensive dataset. A geopackage is also an open, non-proprietary format.
The naming convention is generally consistent between files; however, the concentration tables describe the number of observations with a ‘n_’ prefix, while the source table describes N concentrations with a ‘N_’ prefix. These are easily confused. I suggest renaming the columns in the source table to use a ‘_N’ suffix instead (N_total → total_N).
I also suggest renaming the ‘attributes’ csv file to ‘catchment_attributes’ to clarify its affiliation.
If possible, I suggest that the individual monthly concentrations (in addition to the included median concentrations for each month across all years included in the dataset) would add value to this dataset.
Upon examination, I found one oddity in the c_annual table. I calculated the fraction of total N present as NO3-N, and found some values exceeding 1 (indicating more NO3 than total N). Most of these values were barely greater than 1 and likely due to normal measurement error, yet two had much higher values (one of 2.1 and one of 7.8). I suggest examining these values in the context of the scheme for identifying outliers, and considering a refined approach which flags suspected erroneous records but avoids the challenges associated with the absolute exclusion of outliers. A similar test with P values found evidence of normal measurement errors, but little cause for concern. All other data that I examined appeared reasonable (and interesting!).
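The consistency check described above can be sketched as follows. This is only an illustration of the idea of flagging rather than excluding records: the column names `NO3N` and `TN`, the flag labels, and the 10 % tolerance for ordinary measurement error are assumptions made for this example, not names or thresholds taken from the dataset.

```python
def flag_no3_exceeding_tn(rows, tolerance=0.1):
    """Attach the NO3-N / total N ratio and a QC flag to each record.

    Assumptions: each row is a dict with hypothetical keys "NO3N" and "TN"
    in the same units; `tolerance` is an arbitrary allowance for normal
    measurement error. Records are flagged, never removed.
    """
    flagged = []
    for row in rows:
        no3, tn = row.get("NO3N"), row.get("TN")
        if no3 is None or tn is None or tn <= 0:
            ratio, flag = None, "not_evaluated"
        else:
            ratio = no3 / tn
            if ratio <= 1.0:
                flag = "ok"
            elif ratio <= 1.0 + tolerance:
                flag = "within_tolerance"
            else:
                flag = "suspect"
        # Copy the record so the input rows stay unchanged
        flagged.append({**row, "no3_tn_ratio": ratio, "qc_flag": flag})
    return flagged
```

Applied to made-up records such as `{"NO3N": 7.8, "TN": 1.0}`, the function would mark the record as `suspect` while leaving it in the table for later inspection.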
Citation: https://doi.org/10.5194/essd-2022-6-RC1
- RC2: 'Comment on essd-2022-6', Anonymous Referee #2, 14 Apr 2022
With this manuscript, the authors describe in great detail a dataset that combines several water quality and quantity related variables and covers the extent of Germany.
In my opinion, the manuscript is very well written, follows a clear and understandable structure and contains all the necessary information for someone to comprehend and use the described dataset.
I think that the dataset has a high scientific value and, as the authors state, can have multiple applications in environmental sciences. In fact, I believe that the authors could emphasize the main advantages of the dataset, namely the large temporal and spatial coverage of actual measurements and the inclusion of both water quality and quantity data along with drivers, which facilitates hypothesis testing and finding environmental relationships. Overall, I think that the manuscript is worth publishing as it will help promote the dataset. Perhaps it will inspire other researchers to compile actual field data into large datasets and make them accessible too. Below are a few minor comments and suggestions for improving some parts of the manuscript.
L40-43: It is not very clear to me why machine learning techniques are highlighted here as a tool for finding relationships between environmental variables or defining patterns. I am also not sure that machine learning is the best option for hypothesis testing. My point is that there are many options for data analysis. Perhaps this relates to a previous use of the presented dataset?
L180: It would benefit the manuscript if a few details about the methods used for the quantification of water quality parameters are included, at least in the Appendix. It could be useful for the user of the data to be able to know this information.
L185: I think it would be optimal to not exclude outlier values from the dataset. Provided that these values are not errors and fall within a possible range of values, excluding them could miss extreme events like very large floods. In fact, some researchers might be interested in identifying anomalies in time series and how these change across temporal or spatial scales.
L245: There are two different sources of water quantity data: gauges and, I assume, field measurements taken in parallel with the water samples. Is there any statistically significant difference between the medians of the two types of measurement?
L271: ‘provide’ is repeated in the same sentence.
Citation: https://doi.org/10.5194/essd-2022-6-RC2
- AC1: 'Responses to Reviewer 1 and 2', Pia Ebeling, 23 Jun 2022