This work is distributed under the Creative Commons Attribution 4.0 License.
OceanSODA-MDB: a standardised surface ocean carbonate system dataset for model–data intercomparisons
Helen S. Findlay
Jamie D. Shutler
Jean-Francois Piolle
Richard Sims
Hannah Green
Vassilis Kitidis
Alexander Polukhin
Irina I. Pipko
Download
- Final revised paper (published on 27 Feb 2023)
- Preprint (discussion started on 27 Jul 2022)
Interactive discussion
Status: closed
CC1: 'Citation of R packages', Jean-Pierre Gattuso, 12 Aug 2022
While I am reading your interesting contribution, I note that you have used an R package without a proper citation. It is important to give credit to authors of packages who make their work openly accessible to the community but few people know how to do it.
R has a function to find out how a specific package should be cited: citation(package = "package name"). For example, for the seacarb package, citation(package = "seacarb") returns:
To cite package ‘seacarb’ in publications use:
Gattuso J, Epitalon J, Lavigne H, Orr J (2021). _seacarb: Seawater Carbonate Chemistry_. R package version 3.3.0, <https://CRAN.R-project.org/package=seacarb>.
A BibTeX entry for LaTeX users is
@Manual{,
  title = {seacarb: Seawater Carbonate Chemistry},
  author = {Jean-Pierre Gattuso and Jean-Marie Epitalon and Heloise Lavigne and James Orr},
  year = {2021},
  note = {R package version 3.3.0},
  url = {https://CRAN.R-project.org/package=seacarb},
}
That being said, I will now continue reading your paper!
Citation: https://doi.org/10.5194/essd-2022-129-CC1
RC1: 'Comment on essd-2022-129', Anonymous Referee #1, 20 Sep 2022
The authors present a matchup database combining various in situ observations with associated satellite remote sensing and reanalysis products. The challenge of combining these datasets, which are collected over vastly different spatiotemporal scales, is handled by the novel development of spatiotemporal regions of varying size.
This technique can clearly be applied to other parameters, and the dataset has the potential for reuse.
The dataset is a novel combination of other datasets, some of which have never previously been made publicly available.
Data are accessible as described and easily downloaded without the need for registration. A subset of these data was downloaded for inspection. These netCDF-4 compatible files were inspected using the ncdf4 R library.
The "_FillValue" attribute for the variables is stored as an 8-byte integer, which is not a native type in R, so these values are coerced to double precision; otherwise no issues were found. The files are well described in typical netCDF format and adhere to CF conventions.
The files were also tested in a python 3.8 environment using the netCDF4 package.
The dataset is missing earlier years: 1959, 1964-1967, and 1971. I presume this is due to a lack of associated in-situ data, but please confirm.
No erroneous data were found, and the carbonate system and associated parameters (e.g. nutrients, temperature) are shown to span appropriate ranges and to be consistent, at least to the extent of my knowledge of the global distribution of these parameters.
The abstract is a little long. It includes reference to an example application, which is not appropriate for an abstract and is also "not shown" (line 30). This appears to have been partially copied from the Introduction section. The article otherwise appears well written.
Each sub-source of data is well described, with collection method, calibration information and uncertainties. However, in the author contributions it is mentioned that A. Polukhin assessed the possibility of using certain data due to differences in methods. These differences are not well discussed in the input datasets section, so I suggest the inclusion of a short paragraph summarising the rationale for this work.
We note that the AMT and Kara Sea datasets start over a decade before the publication of the Dickson et al. (2007) best practice document, which is given as a reference. I presume the authors are simply saying that the methods are essentially the same, but it would be good to clarify.
Line 486 - when and by whom will the appropriateness of the maximum ROI scales be reassessed? Via community feedback?
The figures provide a good overview of the nature of the dataset and are of generally good quality.
Figures 1 and 2 could be arranged better; the axis labels are very small considering the amount of whitespace between panels.
Citation: https://doi.org/10.5194/essd-2022-129-RC1
RC2: 'Comment on essd-2022-129', Anonymous Referee #2, 27 Sep 2022
General Comments
Over the past couple of decades, large datasets of in situ carbonate system measurements (e.g., SOCAT, Bakker et al., 2016) have been compiled and quality controlled, and are frequently used to train models for prediction and/or spatiotemporal gap filling of carbonate chemistry using proxy variables. However, models trained on sparse and irregularly spaced data can suffer from biases that favor more highly sampled regions or time periods. These biases can be partially alleviated by aggregating data points into bins of constant spatial and temporal extent, but binning in this way can still result in sub-optimal data division due to elongation of bins near the poles, unintentional splitting of clustered data, and unaccounted-for interactions with coastlines.
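To put a number on the elongation issue: the east-west extent of a fixed longitude bin shrinks with the cosine of latitude, so equal-angle bins cover very unequal ground. A minimal illustration (spherical-Earth approximation, my own sketch, not from the manuscript):

```python
import math

def lon_bin_width_km(lat_deg, dlon_deg=1.0, earth_radius_km=6371.0):
    """East-west extent (km) of a longitude bin dlon_deg wide at a given latitude."""
    return math.radians(dlon_deg) * earth_radius_km * math.cos(math.radians(lat_deg))

for lat in (0, 30, 60, 80):
    print(f"{lat:2d} deg N: 1 deg of longitude spans {lon_bin_width_km(lat):6.1f} km")
```

At 60 degrees N a 1x1 degree bin is already twice as tall (north-south) as it is wide, which is exactly the elongation described above.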
To address these shortcomings, Land and coauthors compile and describe a database of in situ surface carbonate chemistry measurements built around regions of interest (ROIs), which are constructed so as to include as many in situ measurements as possible within a maximum timespan of 10 days and a maximum diameter of 100 km. This database — OceanSODA-MDB — also includes spatiotemporal matchups with satellite, model, and reanalysis datasets corresponding to each ROI. Land et al. demonstrate the utility of this newly compiled database by re-training a global algorithm from Takahashi et al. (2014) to predict potential alkalinity (PA) from sea surface salinity, showing a global reduction in the root mean squared error between measured and predicted PA: from 15 to 12 µmol/kg in marine waters and 32 to 23 µmol/kg in coastal waters.
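As a concrete reading of the two ROI limits (at most 10 days and at most 100 km diameter), the membership test for a candidate group of samples might look like the following sketch (hypothetical helper names and record layout, pure Python, not the authors' code):

```python
import math
from datetime import datetime, timedelta

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin((p2 - p1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def within_roi_limits(samples, max_diameter_km=100.0, max_span_days=10.0):
    """True if (time, lat, lon) samples satisfy both ROI limits: timespan and diameter."""
    times = [t for t, _, _ in samples]
    if max(times) - min(times) > timedelta(days=max_span_days):
        return False
    # Diameter = largest pairwise great-circle distance within the group
    return all(
        haversine_km(la1, lo1, la2, lo2) <= max_diameter_km
        for i, (_, la1, lo1) in enumerate(samples)
        for _, la2, lo2 in samples[i + 1:]
    )

# Two stations ~56 km apart, sampled 3 days apart, satisfy both limits
group = [(datetime(2010, 6, 1), 50.0, -20.0),
         (datetime(2010, 6, 4), 50.5, -20.0)]
print(within_roi_limits(group))  # True
```

The actual ROI construction is of course more involved (greedily maximising occupancy), but this is the constraint each final region must satisfy.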
The ROI strategy is a compelling and creative way to group in situ data and match the grouped data with other datasets for algorithm training. The strategy is explained well in this manuscript, and certainly has potential advantages over fixed spatiotemporal binning. OceanSODA-MDB is easily accessible as a series of NetCDF files. It provides co-located satellite, model, reanalysis, and in situ data that can surely be used to re-train various algorithms that are in use today. This manuscript is a valuable contribution to the literature in its current form, but I’ve added a few comments and suggested corrections in the following sections.
Specific Comments
CO2 system calculations only performed from CT-AT:
In lines 112–113 it is stated that carbonate chemistry calculations were only performed when CT and AT were available. Are there a significant number of cases for which other carbonate chemistry pairs were available (e.g., CT-pH, AT-pCO2, etc.)? I’d be interested in the reasoning for only making carbonate chemistry calculations with the CT-AT pair. Similar accuracy in calculated parameters should be obtainable by pairing either CT or AT with either pH or pCO2 when the measurements are of sufficient quality (e.g., Orr et al., 2018).
Creating the radial in situ data:
I think the explanation of how the data grouping is performed is great. It is clearly a complex process, but breaking it down step-by-step in section 3.2 is very helpful. I am curious about the prospect of expanding this methodology to subsurface data and also how easily the methodology might be adapted to form finer resolution ROIs. I hope the authors might consider making their code publicly available at some point in the future to facilitate adaptations of their methodology to other datasets and/or spatiotemporal resolutions.
Updates to OceanSODA-MDB:
It is indicated in section 6 how the MDB could be updated. Are there any concrete plans to update OceanSODA-MDB at regular intervals in the future?
Minor Comments and Technical Corrections
Line 130: The references in this line are repeated.
Line 135: Should this be dataset no. 4? Additionally, I’d be interested in knowing how the Argo data were acquired. A monthly snapshot should have an associated DOI (e.g., doi.org/10.17882/42182#95967 for Sep. 2022). If files were individually downloaded, were individual profile data files used, or interpolated Sprof files? I think this would be very helpful information for anyone who wants to replicate your methodology.
Line 141: “Dickson et al.” should be outside the parentheses here.
Lines 157–161: Is it safe to say that this paragraph and the entries for these datasets in Table 1 should be eliminated since they are added to OceanSODA-MDB along with CODAP-NA?
Lines 275–277: Are discrete pCO2 measurements treated any differently than underway pCO2 measurements? I'd imagine that in many cases discrete pCO2 data points would spatiotemporally match with those in the SOCAT database, but may be discarded based on the criteria noted here.
Lines 381–382: I don’t think it’s obvious why the advent of Bio-Argo floats would cause the mean number of pH measurements per ROI to decrease rather than increase. I think I can infer: the 10-day sampling cycle of the floats and the fact that only one pH measurement is obtained in the upper 10 meters means one individual ROI is generally created for each float profile? Regardless, since this result seems counterintuitive on first glance, I think it should be explained here.
Figure 16: Should the green line in the legend be T14 fit, rather than TS13?
References: Lauvset et al., 2018 is not in the reference list.
References
Bakker, D.C.E., et al. A multi-decade record of high-quality fCO2 data in version 3 of the Surface Ocean CO2 Atlas (SOCAT), Earth System Science Data, 8, 383–413, 2016.
Orr, J.C., Epitalon, J.-M., Dickson, A.G., and Gattuso, J.-P. Routine uncertainty propagation for the marine carbon dioxide system, Marine Chemistry, 207, 84–107, 2018.
Takahashi, T., et al. Climatological distributions of pH, pCO2, total CO2, alkalinity, and CaCO3 saturation in the global surface ocean, and temporal changes at selected locations, Marine Chemistry, 164, 95–125, 2014.
Citation: https://doi.org/10.5194/essd-2022-129-RC2
RC3: 'Reply on RC2', Anonymous Referee #3, 05 Oct 2022
The authors have presented a method by which data can be placed into spatio-temporal bins (which they call regions) and the data within each bin can be averaged and characterized. They have used this method with several existing data sets to produce a hybrid data set that has had some spatial averaging applied. Even though the manuscript is well written and tackles an important issue, I believe the manuscript should be returned to the authors because it is not well motivated and doesn’t demonstrate the efficacy of their proposed approach. The authors should have an opportunity to rewrite the manuscript to address several key issues.
The problem that the authors are attempting to address is a complicated one involving biases in relationships between predictor data and model/regression/algorithm output that can arise when the training data are not homogeneously spaced in space and time. While the authors have described this (difficult-to-describe) problem well, their solution to it has several flaws:
First is that it is complicated and computationally intensive. Most studies that tackle this problem do so in a paragraph, whereas these authors have dedicated an entire paper to the issue. This wouldn’t be a problem if the solution were meaningfully better than most solutions, but I don’t think the authors have demonstrated that yet (see below).
Second is that their solution doesn't seem distinct from the approach that other studies take, which is to focus on the critical predictor/target variable with the least data density and either construct bins that are focused around retaining that information or avoid binning altogether and remap the more highly resolved data onto the spatiotemporal locations of these data. Note, for example, that the average number of AT measurements per bin is around 2 after WOCE, suggesting that binning accomplished little for this variable. Given the nominally 30 nm spacing (i.e., <100 km) for many open-ocean cruises, this likely just means that adjacent stations along a cruise were averaged in most cases. This approach will have done little or nothing to address the massive variability in data density regionally and seasonally, which can still be seen clearly in Figures 8 through 14. It is perhaps useful for pCO2 despite these limitations, due to the extreme disparity in pCO2 data density on the spatial scales proposed by the authors... yet this also has not yet been demonstrated by the AT regression analysis.
Third, having been created without a specific application in mind (or at least without one stated), it seems unlikely that this approach would be optimal for many studies. Binning data always involves some loss of information, so the binning strategy must always be chosen to match the intended application. I doubt that the decisions made in this study would be widely applicable and, when the approach is applicable, it might not be necessary for the reason noted above. I was hoping the binning would have a way of sizing the bins according to the unique information content within a region, as perhaps represented by the heterogeneity of physical measurements within the bin (perhaps allowing smaller bins near fronts or coasts).
Fourth (and this is my strongest objection and the one I would most hope the authors would address if they do have a chance for edits): the utility of the method has not been demonstrated. The authors provide an "apples to oranges" comparison refit of an alkalinity regression, but too many things are changed between the original publication and this analysis, making the comparison unhelpful. Instead, the authors should compare fits of the T14 regressions made before and after binning the data in the manner that they suggest, ideally also comparing to regressions made using alternative binning strategies that would, hopefully, show the wisdom of using the spatio-temporal binning strategy that they employ herein.
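To make the requested before/after comparison concrete: fit the same linear model once to raw training data and once to bin-averaged training data, then score both fits against a single held-out validation set. A toy sketch with synthetic data (invented coefficients and noise levels, not the manuscript's data or code):

```python
import math
import random
import statistics

def fit_line(x, y):
    """Ordinary least-squares slope and intercept."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return slope, my - slope * mx

def rmse(x, y, slope, intercept):
    return math.sqrt(statistics.fmean((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y)))

random.seed(0)
# Synthetic "potential alkalinity" vs "salinity": PA = 60*S + 180 + noise (invented numbers)
sal = [random.uniform(31.0, 37.0) for _ in range(500)]
pa = [60.0 * s + 180.0 + random.gauss(0.0, 15.0) for s in sal]
train_x, train_y = sal[:400], pa[:400]
valid_x, valid_y = sal[400:], pa[400:]

# Fit 1: on the raw training points
m_raw, b_raw = fit_line(train_x, train_y)

# Fit 2: on bin means (0.5-unit salinity bins as a crude stand-in for ROI averaging)
bins = {}
for s, p in zip(train_x, train_y):
    bins.setdefault(round(s * 2) / 2, []).append((s, p))
bin_x = [statistics.fmean(s for s, _ in v) for v in bins.values()]
bin_y = [statistics.fmean(p for _, p in v) for v in bins.values()]
m_bin, b_bin = fit_line(bin_x, bin_y)

# Scoring both fits on the same validation subset makes the comparison like-for-like
print(f"raw-fit RMSE:    {rmse(valid_x, valid_y, m_raw, b_raw):.1f}")
print(f"binned-fit RMSE: {rmse(valid_x, valid_y, m_bin, b_bin):.1f}")
```

Holding everything fixed except the binning strategy is the key point; changing depth ranges or predictor data at the same time confounds the comparison.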
I wonder if a schematic figure around line 300 could help the presentation of ROI creation methodology (I’m not sure what this would look like, but I found myself wishing for one here).
The authors should be commended for securing data from some rare data sources, but the fact that these original data are not included in the data availability section precludes this manuscript from publication in ESSD, if my understanding is correct.
Line by line comments:
113: information required to constrain the carbonate system is missing from the list of provided constraints (e.g., nutrients and carbonate constant sets).
120: how are these uncertainties expressed? Standard uncertainties? Confidence intervals?
121: I’m curious about the number 0.005 for pH. It seems quite low. Was this meant to have been taken from Table 3 in the listed publication? That table indicates 0.01. Also, that table has the explicit caution that: “Note that these limits are not uncertainties but rather a priori estimates of global inter-cruise consistency….”
135: what uncertainties are assumed for these data?
140: same question after noting that CRMs are meant to be checks on calibration and not themselves calibration materials
145: were unpurified dyes used for these measurements, and, if so, what attempt is made to correct for pH biases from dye impurities? Alternatively, what uncertainty was assumed that accounts for these impurities?
151: same question
159: how were these uncertainties assessed?
163: If these data are not publicly available then they cannot be used in an ESSD publication, correct?
165: what uncertainties are assumed for these methods?
171: see earlier comment on accessibility
181: what was the accuracy?
227: extra space before comma
266: how do you merge two datasets if they have different subdivisions?
286: There is now CO2SYS code that allows the direct propagation of uncertainties through the carbonate system calculations. This code also has the advantage that it includes uncertainties in the carbonate constants which seem to be omitted from the present analysis aside from the random selection of constant sets.
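Absent that CO2SYS variant, the same propagation can be approximated by brute force: perturb the inputs with their assumed uncertainties and take the spread of the output. A generic Monte Carlo sketch with an invented linear stand-in function in place of a real carbonate-system solver:

```python
import random
import statistics

def derived_quantity(ct, at):
    """Stand-in for a carbonate-system calculation (e.g. pCO2 from CT and AT).
    A real application would call a solver such as CO2SYS here; these
    coefficients are an invented linearisation for illustration only."""
    return 400.0 + 2.5 * (ct - 2000.0) - 1.8 * (at - 2300.0)

def monte_carlo_sd(ct, at, u_ct, u_at, n=20000, seed=1):
    """Propagated standard uncertainty of the derived quantity, by resampling."""
    rng = random.Random(seed)
    draws = [derived_quantity(ct + rng.gauss(0.0, u_ct), at + rng.gauss(0.0, u_at))
             for _ in range(n)]
    return statistics.stdev(draws)

# With u(CT)=4 and u(AT)=6 umol/kg, the propagated sd should approach
# sqrt((2.5*4)**2 + (1.8*6)**2) ~ 14.7 for this linear stand-in.
print(round(monte_carlo_sd(2000.0, 2300.0, 4.0, 6.0), 1))
```

Uncertainty in the equilibrium constants could be included the same way, by perturbing them alongside CT and AT, which is what the random selection of constant sets in the manuscript only partially captures.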
304: "RsOI" appears to be a typo for "ROIs".
310: recommended “expand the radius”
367: period used in AT number whereas comma used in other numbers
370: Bio ARGO is a program rather than a sensor
372: there is also a delay before the data are processed through annual GLODAP releases. You can find more recent cruise data at CCHDO.UCSD.EDU and other data repositories.
420: explain this more… why are these time series labeled as 0 and 1?
415: I had difficulty following the logic of this paragraph. Mostly, it seemed to me that the work that had been done to create the combined data product is undermined by the complexity of this analysis, which seemed to be indicating that a lot of additional work was required to further divide the data product into subsets.
430: how large are the validation subsets on average? Have you done anything to avoid including measurements from a single cruise in both the training and the validation data? This is usually considered an important practice because of the strong temporal correlations between measurements on a single cruise.
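One way to implement the cruise-aware split suggested here is to assign whole cruises, rather than individual measurements, to the validation subset. A minimal sketch (hypothetical record layout, not the authors' code):

```python
import random

def split_by_cruise(records, valid_fraction=0.25, seed=42):
    """Split records into training and validation sets so that no cruise
    contributes to both (records are (cruise_id, value) pairs here)."""
    cruises = sorted({cid for cid, _ in records})
    rng = random.Random(seed)
    rng.shuffle(cruises)
    n_valid = max(1, round(valid_fraction * len(cruises)))
    valid_ids = set(cruises[:n_valid])
    train = [r for r in records if r[0] not in valid_ids]
    valid = [r for r in records if r[0] in valid_ids]
    return train, valid

data = [("A", 1), ("A", 2), ("B", 3), ("B", 4), ("C", 5), ("D", 6)]
train, valid = split_by_cruise(data)
# No cruise ID leaks across the split
assert not ({c for c, _ in train} & {c for c, _ in valid})
```

Splitting at the measurement level instead would leak the strong within-cruise correlations into the validation score, making it look better than it is.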
435: I also had some difficulty understanding what was being done in this paragraph, or why exactly. Why are the authors so motivated to include nearshore data that it would justify this extra text and complexity? This is just a proof-of-concept demonstration, so does it need to be comprehensive in terms of including coastal data? Have the authors shown the consequences of simply including all coastal data?
447: this is not a straight comparison, since T14 use data down to 50 m in their training product; the algorithms are therefore producing different estimates (and from different predictor data as well, I think, but I'm not sure). A true proof of concept would require creating another version of the algorithm using the same training and validation dataset, but using the binning approach of Takahashi et al. (2014).
Figure 2: should we be concerned by the large number of TA measurements per bin in ~1978? Why are there so many more measurements per bin in the earlier portions?
Citation: https://doi.org/10.5194/essd-2022-129-RC3
RC4: 'Comment on essd-2022-129', Anonymous Referee #3, 05 Oct 2022
Please see the RC3 review, which was erroneously posted as a reply to Reviewer 2. Apologies for any confusion!
Citation: https://doi.org/10.5194/essd-2022-129-RC4
AC1: 'Comment on essd-2022-129', Peter Land, 23 Nov 2022
AC2: 'Comment on essd-2022-129', Peter Land, 19 Dec 2022
I just rechecked the numbers of measurements per region, and I now realise I was looking in the wrong year when I said there was something wrong with the calculation. The spike of 16 TA measurements per region appears to be real, but it's in 1977, not 1978. There seems to have been an intensive field campaign in Galician waters in October 1977, resulting in 154 TA measurements over quite a small geographic area and time period. The MDB divided these into six regions with TA occupancies between 20 and 35, and these, together with the other, more normal TA measurements in 1977, resulted in an average TA region occupancy of 16. I also checked the code that generated the figures and could see nothing wrong with the calculation, hence the figures will be changed cosmetically but the data will remain unchanged.
Citation: https://doi.org/10.5194/essd-2022-129-AC2