Reconstruction of δ13CDIC in the Atlantic Ocean: a probabilistic machine learning approach for filling historical data gaps

Gao, Hui; Wu, Zelun; Sun, Zhentao; Cai, Diana; Jin, Meibing; Cai, Wei-Jun

doi:10.5194/essd-18-2443-2026

Articles | Volume 18, issue 3

https://doi.org/10.5194/essd-18-2443-2026

© Author(s) 2026. This work is distributed under
the Creative Commons Attribution 4.0 License.

https://doi.org/10.5194/essd-18-2443-2026

© Author(s) 2026. This work is distributed under
the Creative Commons Attribution 4.0 License.

Articles | Volume 18, issue 3

Data description article

|

02 Apr 2026

Data description article |

| 02 Apr 2026

Reconstruction of δ¹³C_DIC in the Atlantic Ocean: a probabilistic machine learning approach for filling historical data gaps

Hui Gao, Zelun Wu, Zhentao Sun, Diana Cai, Meibing Jin, and Wei-Jun Cai

Download

Final revised paper (published on 02 Apr 2026)
Preprint (discussion started on 01 Sep 2025)

Interactive discussion

Status: closed

RC1:
'Comment on essd-2025-517', Anonymous Referee #1, 21 Oct 2025

This is an interesting and generally well-written study addressing a worthy topic. The paper has good fundamentals and should be able to made into a solid contribution to the scientific literature. However, I believe it requires iteration, and likely additional analysis, before it will be suitable for publication at this journal.
I have three areas of criticism and one note of caution. The note of caution is just that I’m skeptical of the uinput calculation, see the line by line comments below.
My first criticism is that that validation was not handled as well as it should have been. See line by line comments below for an easy-to-implement and necessary improvement for the validation section. Separately, a suggestion that would further reinforce the validity of the method would be to implement the method in a model environment. This is now common practice for validation of machine learning refits of sparse observations, and is likely necessary for a first attempt with carbon isotopes, particularly one with such unusually sparse observations. There are numerous model simulations available that have explicitly simulated carbon isotopes (e.g., https://doi.org/10.5194/gmd-17-1709-2024 though there are many others). It should be workable to obtain one or more such set of outputs, subsample the distribution(s) across both time and space, apply random and cruise-wide systematic perturbations to the extracted output to represent measurement uncertainties, fit a ML model to the output, reconstruct the full distribution, and then evaluate the strengths and weaknesses of the full 4D reconstruction. This reveals critical information that is not provided by a reconstruction of a sparse data product with uneven and imperfect measurements of an unknown true distribution.
The second criticism is that the paper is not very well motivated at present. The authors state repeatedly that the upsampled distribution can be used for many new analyses, but the new product still has almost all of the limitations that the previous product… it is still sparse and uneven in space in time, just less so, and it now has the added complications from layers of machine learning smoothing. While I admit that the new data product is smoother spatially and less biased temporally, I don’t see that the authors have fully solved any problem with their current presentation. To that point, the authors mostly suggest ways that this might now be used, but do not go so far as to demonstrate any such analysis that would be quantitatively improved with the new product. I would like to see either more concrete examples of new analyses shown (not just listed), or, as such an example, a reorientation of the work toward estimating the full Atlantic distribution of the isotopes across space and time. For a spatially complete record they might apply the ML model to the GLODAPv2 gridded product. For a spatially and temporally complete product they might consider either using a time varying TS product and/or GOBAI-O2 (with estimates of the other predictors from other such ML refits in literature as necessary). In both cases, there would be some meaningful errors in the predictors, but, at least currently, the authors are suggesting that their estimates are completely insensitive to any plausible error in the predictors, so that may or may not be a concern (I suspect it will be after the uinput is re-evaluated).
Finally, the presentation of the dataset is a bit confusing (I only checked the .mat, but I'm assuming this applies to all files at Zenodo). The file contains essentially all of the fields from GLODAPv2 with their adjusted DI13C, which is called adjusted_C13, capitalizing "C" contrary to the GLODAP convention. If the goal is to make the file supplemental to and interoperable with GLODAPv2, then it would be better to release a file that has the full >10^6 rows, but only contains c13 data and has -999 except for the appropriate Atlantic subset. This way, someone could load GLODAPv2 and then load this file and have them both available and ready to access in identical formats. They could also easily sub in data from, for example, other other basins where this data product is missing observations but the GLODAPv2 product has them. This will also remind users to cite both products, rather than just grabbing all of the data from this new product and incorrectly attributing, for example, aou and cfcs to a data product that is only updating C13 and repackaging everything else. Finally, I think the Zenodo link would benefit from more descriptive text or a readme explaining what subset of data is presented, which fields are the new fields, how they are labeled, and how to make the data interoperable with, for example, measurements of DI13C in other ocean basins.

A minor criticism is that the paper is repetitive in places, repeatedly restating key claims throughout the manuscript.

To reiterate, I generally feel this paper can become a worthwhile contribution and should not be rejected unless these elements cannot be addressed. The text above is focused on constructive criticism, but the fundamentals of the paper remain strong.

Line by line comments:

42: lacked

94: this assertion needs further quantification in the North Atlantic, where there are routinely measurable decadal increases in Canth

97: along A61N, no “the” is needed

123: which standard depths?

125: how are adjustments proposed precisely?

125: how are adjustments validated precisely?

133: please explain this metric. How is consistency at 10^-5 level when the measurement uncertainty is orders of magnitude larger?

150: typically in oceanography, the k fold cross validation is separated by cruise rather than by randomly selecting measurements. This is because cruises are synoptic records of the state of the ocean, and having many other measurements at similar times and locations and measured by the same instruments and the same operators, as are provided by other measurements along a cruise, provides an overly-rosy set of validation statistics. It is therefore important to only use other cruises to construct the validation models for measurements along any given cruise. This validation exercise needs to be redone to follow this practice, or re-written to better convey that this practice was already adopted (if it was).

215: following this procedure, I would expect the uinpts to be larger than it was found to be. To be clear, I’m not surprised that it is small, but I am surprised that it is more than 10 orders of magnitude smaller than other sources of error. Surely a temperature input error of 20,000,000 degrees C would be expected to yield a bad estimate, yet this does not currently appear to be the case by that estimate of uinput. Does that suggest that the model is mostly a fit to the coordinate predictors that are assumed to have no uncertainty? If so, would it make sense to include some uncertainty in these predictors, given that CTD rosettes are not always directly below the ship and the ships don’t always stay exactly on station for a profile? Please also check that the uncertainty reported in the abstract isn’t the MBE of the Monte Carlo analysis. If unchanged, please explain this counter intuitive finding.

234: repeating comments from line 150

245: what is normalized sample density?

375: This is hinting at an application, but is not itself an application. We’ve only learned about KDEs here, and not about the ocean.

Figure 8b: the darkness of the borders on the mean values make this plot hard to parse. Consider lightening the width of those black lines, somewhat.

8c: consider changing axis limits from 0 to 3, even if this cuts off a miniscule portion of the sample distribution

395: couldn’t you now further parse this information by holding every predictor except xCO2 constant and varying that to estimate the change in the delta that would be expected had all physical and biogeochemical processes been held constant for a decade?

448: This is a seriously dense sentence. Please break it into two or more sentences and revise them both to employ plain language (limiting jargon and buzzwords) wherever possible.

451: I don’t think a good predictor of local flux is going to lead to a good prediction of local inventory. Consider deleting this sentence.

Citation: https://doi.org/10.5194/essd-2025-517-RC1
- AC3:
  'Reply on RC1', Hui Gao, 21 Dec 2025
  
  The comment was uploaded in the form of a supplement: https://essd.copernicus.org/preprints/essd-2025-517/essd-2025-517-AC3-supplement.pdf
  
  Citation: https://doi.org/10.5194/essd-2025-517-AC3
  - AC4: 'Reply on AC3', Hui Gao, 21 Dec 2025
    
    Note on a minor correction and supplementary explanation to our response regarding Comment 395
    
    We wish to clarify a minor typographical error in our previous response and supplement an explanation for the smaller δ¹³C_DIC shifts in older water masses.
    
    First, the correction: In the section describing Figure R2’s consistency with main manuscript results, we incorrectly referenced “main manuscript Figures 8h and 8d” – the correct references are main manuscript Figures 9h and 9d (reconstructed 2023 δ¹³C_DIC and observed 2023 δ¹³C_DIC, respectively). This typo does not alter the core conclusion of our response.
    
    Second, we supplement the explanation for smaller δ¹³C_DIC shifts in older water masses: These minor variations are mostly within the uncertainties, though they may also reflect hydrological changes over the decade.
    
    We apologize for the earlier typo and kindly ask the reviewers to refer to this attached response as the definitive version.
    
    Citation: https://doi.org/10.5194/essd-2025-517-AC4
RC2:
'Comment on essd-2025-517', Patrick Rafter, 22 Oct 2025

The manuscript “Reconstruction of d13CDIC in the Atlantic Ocean…” as reviewed by Patrick Rafter
First, I’d like to thank the other (anonymous) reviewer for their careful and useful review of this manuscript. If I were the author of this manuscript, I would greatly appreciate the many meaningful and well-informed comments. I don’t fully agree with all their suggestions, but it is undeniably a high-quality review.
For example, I think—for the most part—this study needs less additional work than the other reviewer. The suggestion to implement the ML method in a model environment would be a very interesting and valuable addition to this work, but I predict the authors’ response will be “outside the scope of the current study”. It sounds to me like a huge amount of new work, but I may be incorrect in this (or it may just be a huge amount of work for *me* and not someone else (it almost surely is)). Note that I do not have the experience in this space to comment on whether this model environment application is “now common practice”, but I will say that this would have been a novel (to me), interesting, and seemingly robust application of the methods developed here. But I would like to note that if this manuscript / dataset were to follow the reviewer’s advice, it would boost my score for the “significance” and “data quality” categories into and above the ‘Excellent’ category. As of now, I have scored these as ‘good’.
I also think the motivation is appropriate for this specific study and that the decadal trends in the Kernel Density Estimates (see Fig. 8) are an interesting outcome from this study (as it exists now).
Where I agree with the anonymous reviewer is that I think the new “reconstructed” dataset could be (I think): (1) expanded spatially using the GLODAP gridded product and (2) that this would be a very useful addition to our community. I am assuming these are “minor revisions” as the ML model is already built and I assume the application to the gridded product will be straightforward (and worth the time for the community to use!). I would also urge the authors to consider the other options listed by the anonymous reviewer to expand the ML methods temporally, although I am unfamiliar with the reviewer’s specific suggestions and cannot comment on the time requirements for such new applications.
Likewise, the other reviewer makes strong comments about the dataset itself. I agree that adding the reconstructed dataset as its own column (with -999 for other basins) to the existing GLODAP data would be very useful for the community. Even better would be for the community to have a gridded product!
Below I have listed notes I made on the manuscript as I read through it.

Line by line notes

27: need to define delta notation

79+: I don’t see a need to shorten “Section” here

100: I like the previous paragraph

132: what exactly does “exhibit high internal consistency” mean? Are there statistics to support this statement?

139: Is GPR an acronym? Perhaps not relevant, but I wanted to know

161: Repeated text

Fig. 2: I like the figure, but as the other reviewer noted, it would be better to use completely independent cruise datasets for the validation as well as the “independent” tests

192: I wonder if other Earth scientists would be as surprised to learn of Mean Absolute Error and Mean Bias Error. I think they might and it might therefore be useful to use a sentence or two describing why these additional metrics are useful to the study

202: Propagated error?

212: perturbed not perturbs

230: I’m unsure where the 10-fold cross-validation comes from

249: This text is also somewhat a repetition of earlier text

259: larger?

272: Incredibly / unbelievably low input variable uncertainty (Uinputs). I wonder if this is a propagation of the input variable uncertainties or an error has been made along the way.

295: Maybe this is not important, but lower case “n” is typically used to describe the sample size

302: Is it expected that there would be a model smoothing tendency?

397: Is there an expectation that the model output would closely align with the observed data? Wasn’t the 2023 data used to predict the “reconstructed data”? I’m not diminishing the work—I honestly think this is an expected outcome of using machine learning.

485: quality-controlled (?)

Citation: https://doi.org/10.5194/essd-2025-517-RC2
- AC1: 'Reply on RC2', Hui Gao, 21 Dec 2025
  
  The comment was uploaded in the form of a supplement: https://essd.copernicus.org/preprints/essd-2025-517/essd-2025-517-AC1-supplement.pdf
  
  Citation: https://doi.org/10.5194/essd-2025-517-AC1
RC3:
'Comment on essd-2025-517', Bin Lu, 15 Nov 2025

General comments:

This manuscript presents a valuable contribution by not only reconstructing δ¹³C_DIC fields in the Atlantic Ocean but also by compiling and providing a high-quality observational dataset that can serve as a fundamental resource for future studies. The authors conduct thorough quality control (QC) and crossover adjustments, and the methodology is overall well organized. The work demonstrates the potential of probabilistic machine learning—particularly Gaussian Process Regression (GPR)—for filling historical data gaps and quantifying uncertainty. These strengths make the dataset a meaningful addition to the ESSD data collection.
However, as a reviewer from an AI background, I am particularly sensitive to the modeling and evaluation methodology. I find that some aspects of the test strategy and dataset partitioning could lead to overestimated performance metrics. In addition, although the study substantially expands the number of δ¹³C_DIC samples (by 7.65 times compared with previous compilations), the reconstructed data remain spatially discontinuous and unevenly distributed. With appropriate revisions to clarify model evaluation, ensure test dataset independence, and temper claims about spatial continuity, the paper will be suitable for publication.
Specific comments:
L140–143: Please elaborate more clearly on the specific advantages of the Gaussian Process Regression (GPR) model in this context. In addition, it would strengthen the methodology section if you could provide a quantitative or qualitative comparison with other commonly used machine learning methods such as XGBoost and Random Forest, which are often applied to similar regression problems.
L219–221: The statement that “u_map may alternatively be estimated as the RMSE between reconstructed and observed δ¹³C_DIC on the training dataset” raises some concern. Because the model is already optimized to fit the training data, such an estimate cannot reliably represent its true mapping uncertainty or generalization ability across the Atlantic. Using the training RMSE in this way likely underestimates u_map, leading to an unrealistically low total uncertainty.
L234–236 (Sect. 3.1) and L249–252 (Sect. 3.2): The two test cruises selected (33MW19930704 and 33RO20050111) have been repeatedly sampled across multiple years, and some data from these cruise lines or neighboring years may have been used in training or validation. Moreover, many of the 2023 observations were collected along the same A16 section, meaning the model likely already learned features specific to this transect. Using these data as the test set could therefore inflate the evaluation metrics. Please clarify how you ensured true independence between training and test datasets.
L286–287 and Fig. 5(a): The paper states that 5,997 samples were used for evaluation, which represent the intersection between reconstructed and observed values from 8,941 acceptable δ¹³C_DIC samples in GLODAPv2.2023. I am concerned about potential overlap between these evaluation samples from GLODAP and the training data. Please clarify how independence was maintained and whether any duplicate or overlapping data points were excluded.
L288–289: When calculating the coefficient of determination (R²) between observed and reconstructed values, you might consider using anomaly-based R² (i.e., R² computed from anomalies relative to a local mean or climatology) rather than raw values. This approach could reduce the influence of large-scale offsets and provide a more realistic assessment of the model’s ability to reproduce spatial–temporal variations.

Citation: https://doi.org/10.5194/essd-2025-517-RC3
- AC2: 'Reply on RC3', Hui Gao, 21 Dec 2025
  
  The comment was uploaded in the form of a supplement: https://essd.copernicus.org/preprints/essd-2025-517/essd-2025-517-AC2-supplement.pdf
  
  Citation: https://doi.org/10.5194/essd-2025-517-AC2

Peer review completion

AR – Author's response | RR – Referee report | ED – Editor decision | EF – Editorial file upload

AR by Hui Gao on behalf of the Authors (21 Dec 2025) Author's response Author's tracked changes Manuscript

ED: Referee Nomination & Report Request started (07 Jan 2026) by Xingchen (Tony) Wang

RR by Anonymous Referee #1 (21 Jan 2026)

Suggestions for revision or reasons for rejection

I commend the authors on their efforts in this revised draft. Generally, the paper has been improved. I previously raised several major objections, and now some of them have been addressed, though others have not:

Improper Validation: I am very pleased to see the authors have added a model-based OSSE. However, the implementation was unexpected and likely not as helpful as I hoped. The point of such a model reconstruction is usually to assess how well the approach works when applied in the same way it is being applied in the real world. Put another way, this should be a test of how well sparse data and machine learning can be used to estimate values where they are desired. Therefore, for this test, I would have expected the analysis to subsample the model at the locations and times where measurements are available (interpolating between data points and/or times as necessary), train an algorithm, and then use that algorithm to estimate the model values at the locations and times where the new dataset is reporting estimated values. Alternatively (or supplementally), a more general test could be done using the same first two steps but then reproducing the full (Atlantic) model distribution over time. Instead, this analysis seems to sub-sample the model in an idealized and homogeneous grid. This is really only a test of what could have been done had we implemented an entirely different, and much more expensive, data collection effort over the last few decades. I'm therefore not sure much is learned.

The k-fold validation experiment is also not attempted. However, the authors do point out something I'd missed previously, which is that the test data set is separated by cruise number. This is good news, though the statistics would still be still more trustworthy if they had used a k-fold validation for this step (note, there need not be the same number of cruises in each set... just an approximately comparable amount of training data with some number of whole cruises).

Lack of Motivation: I'll begin here with an acknowledgement that at least one other reviewer did not seem to share my concern here.

That said, I don't believe the issues I raised about the motivation of the paper were well addressed. The authors provide some evidence from ongoing work, but their analysis actually demonstrates my reasons for concern better than I did. The ongoing work shows that eMLR faces significant problems when it is presented with reduced data density. However, eMLR is actually also an algorithm-based approach that exploits covariance between various measurements (much like the algorithm that they are employing in this manuscript). Therefore, it facing major problems when given fewer data is actually evidence that an algorithm cannot always effectively hide the limitations from sparse sampling. To make the point that the authors were trying to make, they would need to have a third set of panels where they have used an algorithm trained on the sub-sampled data to up-sample the cruise back to its original data distribution (this step is effectively what their paper is attempting to do)... and then use eMLR on that up-sampled data set to test whether the up-sampled data set can faithfully reproduce the canth estimates produced from the original data. If they were to do this, then it should go in the paper. However, this analysis leaves me worried that people using the up-sampled carbon-13 distribution will reach erroneous conclusions.

Errors in the measurement uncertainty sensitivity: my concerns in this area were addressed, thank you.

Re: GLODAP gridded product. I was suggesting using the authors' algorithm with the gridded T and S from GLODAP to produce a gridded field of d13C, not that GLODAP had themselves produced a gridded d13C product. I still think that this would be an easy and useful application for their approach.

In summary, I'm not sure I would trust the new product enough to rely on it. Currently, if I were to want to create a 4D product, I would first go back to the original data and train an algorithm from those measurements rather than trying to chain algorithm estimates together. Generally, sparse data are not useful until they are brought into some kind of analysis, and I cannot think of an analysis that would benefit from starting out with sparse, fixed empirically estimated values. This would only be a way to give users a false sense of confidence when the estimated data confirm the patterns found in the real data. This therefore leaves me still doubtful about the practical utility of this work.

(Separately, I'd urge the authors to write shorter and more focused responses to reviewer comments.)

Hide

ED: Reconsider after major revisions (26 Jan 2026) by Xingchen (Tony) Wang

AR by Hui Gao on behalf of the Authors (05 Feb 2026) Author's response Author's tracked changes Manuscript

ED: Referee Nomination & Report Request started (07 Feb 2026) by Xingchen (Tony) Wang

RR by Patrick Rafter (10 Feb 2026)

RR by Anonymous Referee #1 (03 Mar 2026)

ED: Publish subject to minor revisions (review by editor) (03 Mar 2026) by Xingchen (Tony) Wang

AR by Hui Gao on behalf of the Authors (05 Mar 2026) Author's response Author's tracked changes Manuscript

ED: Publish as is (09 Mar 2026) by Xingchen (Tony) Wang

AR by Hui Gao on behalf of the Authors (09 Mar 2026)

Post-review adjustments

AA – Author's adjustment | EA – Editor approval

AA by Hui Gao on behalf of the Authors (30 Mar 2026) Author's adjustment Manuscript

EA: Adjustments approved (30 Mar 2026) by Xingchen (Tony) Wang

Short summary

Observations of stable carbon isotopes in dissolved inorganic carbon are sparse, limiting their potential in carbon cycle studies. We compiled 51 cruises and used a machine learning method trained on 37 cruises that passed secondary quality control to reconstruct isotope values in the Atlantic. The reconstruction expands usable samples from 8,941 to 68,435, reducing noise, filling gaps, preserving decadal trend, and strengthening studies of carbon variability and model validation.