the Creative Commons Attribution 4.0 License.
FDU-BTR: a physics-guided ensemble learning reconstruction of global surface-ocean pCO2 (1982–2024) with uncertainty diagnostics
Abstract. The ocean takes up roughly 25% of anthropogenic CO2 emissions, yet quantifying the magnitude and variability of this sink is limited by the uneven, sparse sampling of surface-ocean partial pressure of CO₂ (pCO2). Here we present FDU-BTR, a global monthly 1° × 1° reconstruction of surface-ocean pCO2 for 1982–2024, produced with a background–thermal residual (BTR) ensemble learning framework that embeds first-order physical structure in a machine-learning workflow (Wang and Fu, 2026, https://doi.org/10.5281/zenodo.20152530). Observed pCO2 is decomposed into a multi-product background climatology, an explicit thermal-anomaly term, and a residual field; region-specific CatBoost ensembles then reconstruct the residual, with boundary blending ensuring spatial continuity. This decomposition simplifies the learning target while preserving physically meaningful constraints. Validated against the independent Hawaii Ocean Time-series (HOT) and Bermuda Atlantic Time-series Study (BATS) observations, FDU-BTR achieves a correlation of 0.93 and a root-mean-square error of 8.34 µatm, comparable to leading products, with a mean total uncertainty of 12.90 µatm. Cross-product comparisons and coverage–entropy diagnostics localize structural disagreement to coastal, marginal, and high-latitude regions where observations are sparse and processes are complex. Controlled thinning experiments further reveal a strong asymmetry in the observational error budget: reducing spatial coverage degrades reconstruction skill approximately twice as much as equivalent reductions in temporal coverage. FDU-BTR therefore provides a physically constrained, uncertainty-quantified pCO2 product for air–sea CO₂ flux assessment and identifies spatial observational sparsity – not algorithm choice – as the dominant remaining limit on reconstructing the global ocean carbon sink, with direct implications for the design of future ocean carbon observing systems.
Status: open (until 04 Aug 2026)
- CC1: 'Comment on essd-2026-375', Galen McKinley, 26 Jun 2026
- RC1: 'Comment on essd-2026-375', Anonymous Referee #1, 01 Jul 2026
This manuscript presents FDU-BTR, a global monthly 1° × 1° reconstruction of surface-ocean pCO2 for 1982–2024. The proposed Background–Thermal Residual framework decomposes observed pCO2 into a multi-product background climatology, an explicit thermal-anomaly term, and a learned residual, with basin-specific CatBoost models and boundary blending. The paper addresses an important problem for ocean carbon-cycle research: how to reconstruct surface-ocean pCO2 in the presence of strong spatial and temporal sampling biases, while accounting for the physics. The manuscript is timely, potentially useful for the community, and the attempt to combine physical structure, machine learning, product intercomparison, and coverage diagnostics is valuable.
However, I think the manuscript requires substantial work before publication. My main concern is that several claims are stronger than what the validation and uncertainty analysis currently support, especially regarding global spatial skill, dominant sources of product uncertainty, and readiness for air–sea CO2 flux assessment. The method is promising, but the validation framework needs to be strengthened, and the claims should be more carefully qualified. Some of the “independent” validation is not actually independent, because the data are included in SOCAT; this raises concerns about the stated independence of the validation dataset and should be clarified carefully.
Major comments
Independent validation does not fully test the central gap-filling problem
The paper correctly emphasizes that global pCO2 reconstruction is fundamentally a gap-filling problem, especially in regions with sparse SOCAT coverage. However, much of the independent validation focuses on time-series performance at well-observed locations, especially HOT and BATS. These are useful benchmarks, but they mainly test temporal variability at long-term subtropical stations. They do not provide a strong independent test of spatial extrapolation into poorly sampled regions, which is where the product is most needed and where uncertainty is expected to be largest.
The NCEI buoy data are part of SOCAT, so they do not constitute an independent validation. They are also not necessarily a useful addition, because they include coastal and coral-reef environments, where a 1° monthly product is not expected to excel. This validation is still based on point time series matched to a 1° monthly product, and the results show clear degradation in coastal and reef environments. The paper should therefore be more cautious in presenting the product as globally validated. Good agreement at HOT/BATS and open-ocean buoys does not necessarily demonstrate skill in the Southern Ocean, high-latitude winters, the South Pacific, the South Atlantic, marginal seas, coastal upwelling zones, or other regions where the product performs the most gap filling.
The authors should either temper claims about global spatial skill or add a stronger spatially independent validation. Possible options include spatial block cross-validation, leave-one-region-out or leave-one-basin-out tests, withholding data from under-sampled regions, validation stratified by SOCAT coverage density, and/or comparison with independent station compilations across broader regions. The SPOTS compilation from Lange et al. (2024), for example, could provide a more convincing independent evaluation across different regions and observing regimes, although this would still be only a temporally independent validation.
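To make the spatial block cross-validation suggestion concrete, the following is a minimal sketch, not the authors' method. It assumes observations are available as parallel lists of latitude and longitude; the 10° block size, the number of folds, and the function names `block_id` and `spatial_block_folds` are illustrative assumptions.

```python
import random
from collections import defaultdict


def block_id(lat, lon, block_deg=10.0):
    """Index of the block_deg x block_deg box containing a position."""
    row = int((lat + 90.0) // block_deg)
    col = int((lon % 360.0) // block_deg)  # wraps negative longitudes to 0-360
    return row, col


def spatial_block_folds(lats, lons, n_folds=5, block_deg=10.0, seed=0):
    """Return (train_idx, test_idx) pairs in which whole blocks are held out."""
    blocks = defaultdict(list)
    for i, (lat, lon) in enumerate(zip(lats, lons)):
        blocks[block_id(lat, lon, block_deg)].append(i)

    # Shuffle the blocks, not the observations, so neighbours stay together.
    keys = sorted(blocks)
    random.Random(seed).shuffle(keys)

    folds = []
    for k in range(n_folds):
        test_keys = set(keys[k::n_folds])
        test = [i for key in test_keys for i in blocks[key]]
        train = [i for key in keys if key not in test_keys for i in blocks[key]]
        folds.append((sorted(train), sorted(test)))
    return folds
```

Because every observation in a block is withheld together, the test score reflects prediction into an area with no nearby training data. Blocks would need to be larger than the spatial decorrelation length of the residual field for the test to be meaningful.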
Validation of air–sea CO2 flux applications
The manuscript repeatedly motivates the product as a basis for air–sea CO2 flux assessment. However, the validation focuses primarily on pCO2 RMSE, correlation, and bias. These are useful but not sufficient for the stated flux application. Air–sea CO2 flux depends on the air–sea pCO2 difference, not only on ocean pCO2 in isolation. A relatively small pCO2 bias can still affect flux estimates if it changes the sign, seasonal phasing, or regional magnitude of ΔpCO2.
The authors should evaluate whether the product preserves the correct ocean–atmosphere pCO2 gradient. This could include validation of ΔpCO2, regional and seasonal bias in ΔpCO2, the frequency of incorrect source/sink classification, and ideally implied air–sea CO2 fluxes using a consistent atmospheric CO2 field and gas-transfer formulation. If the product is intended to support flux assessment, this should be part of the validation, or the flux-related claims should be softened.
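As an illustration of the source/sink check, here is a minimal sketch under stated assumptions: observed and reconstructed ocean pCO2 are matched point by point, a single atmospheric pCO2 series is used for both, and ΔpCO2 is defined as ocean minus atmosphere so that positive values mark an ocean source. The function names are hypothetical.

```python
def delta_pco2(pco2_sea, pco2_air):
    """Air-sea difference, ocean minus atmosphere (positive = ocean source)."""
    return [s - a for s, a in zip(pco2_sea, pco2_air)]


def delta_pco2_diagnostics(pco2_obs, pco2_rec, pco2_air):
    """Mean DeltapCO2 bias and fraction of points with the wrong source/sink sign."""
    d_obs = delta_pco2(pco2_obs, pco2_air)
    d_rec = delta_pco2(pco2_rec, pco2_air)
    n = len(d_obs)
    bias = sum(r - o for r, o in zip(d_rec, d_obs)) / n
    # A point is misclassified when reconstruction and observation disagree
    # on whether the ocean is a source (DeltapCO2 > 0) or not.
    wrong = sum(1 for r, o in zip(d_rec, d_obs) if (r > 0) != (o > 0))
    return {"delta_bias": bias, "misclassified_fraction": wrong / n}
```

With a shared atmospheric field the mean ΔpCO2 bias equals the pCO2 bias, so the additional information comes from the misclassified fraction: a small bias matters most where ΔpCO2 is close to zero. Computing these numbers per region and per season would address the phasing and magnitude questions raised above.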
Decomposition framework
The BTR method decomposes pCO2 into a multi-product background climatology, a thermal-anomaly term, and a residual term. This is a central methodological contribution. However, the validation mainly evaluates the final reconstructed pCO2 field. It would be useful to assess the behavior of each component separately.
For example, how biased is the multi-product background climatology relative to independent observations? Does the thermal-anomaly term improve or degrade performance in different regions or seasons? Is the learned residual correcting real non-thermal variability, or is it compensating for errors in the prior climatology and thermal correction? This is especially important because the authors acknowledge that biases in the background climatology or thermal term can propagate into the residual. A component-level evaluation would make the physical interpretation of the BTR framework much more convincing.
Uncertainty framework
The uncertainty diagnostics are a valuable part of the manuscript, but the current framework appears closer to an empirical diagnostic than a rigorous grid-cell-level uncertainty estimate. Mapping uncertainty is estimated from residuals at validation observations, but observations are not uniformly distributed in space or process regime. As a result, the inferred uncertainty may still be biased toward better-observed regions. This is important because the paper makes strong claims about characterizing spatial patterns and dominant sources of product uncertainty.
The authors should clarify that the uncertainty product is diagnostic and partly observation-location dependent. They should also test whether the reported uncertainties are calibrated. For example, do regions with higher predicted uncertainty actually show larger independent residuals? Are uncertainty estimates reliable when stratified by basin, latitude, season, coastal/open-ocean class, and SOCAT coverage density? The phrase “dominant sources of product uncertainty” should be softened unless the authors can more clearly separate model structural uncertainty, prior-field uncertainty, observational uncertainty, and sampling/extrapolation uncertainty.
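One simple form of the calibration test is sketched below. It assumes each independent validation point has a residual (reconstruction minus observation) and a strictly positive predicted uncertainty interpreted as one standard deviation; the bin count and the function names are illustrative assumptions.

```python
import math


def calibration_table(residuals, sigmas, n_bins=4):
    """Per-bin (mean predicted sigma, observed RMSE, ratio), binned by sigma."""
    pairs = sorted(zip(sigmas, residuals))
    n = len(pairs)
    rows = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue
        mean_sigma = sum(s for s, _ in chunk) / len(chunk)
        rmse = math.sqrt(sum(r * r for _, r in chunk) / len(chunk))
        rows.append((mean_sigma, rmse, rmse / mean_sigma))
    return rows


def coverage_within(residuals, sigmas, k=1.0):
    """Fraction of residuals whose magnitude is within k predicted sigmas."""
    hits = sum(1 for r, s in zip(residuals, sigmas) if abs(r) <= k * s)
    return hits / len(residuals)
```

If the uncertainty is calibrated, the observed RMSE should rise with the predicted sigma and the ratio should stay near one in every bin; for Gaussian errors roughly 68% of residuals would fall within one sigma. Repeating the table within each basin, season, and coverage class would show where calibration breaks down.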
Use of other reconstructed carbonate-system products as predictors
The model uses pH and DIC from CMEMS/LSCE products as predictors. These variables may improve predictive skill, but they also raise questions about independence and circularity. If pH and DIC are themselves reconstructed fields that use related observations, assumptions, or machine-learning methods, then FDU-BTR is not purely constrained by independent environmental predictors. It may partly inherit structure from existing carbon products.
The authors should discuss whether using reconstructed pH and DIC introduces dependence on another pCO2/carbon-system product, whether it affects the independence of validation and product intercomparison, and how sensitive FDU-BTR is to these predictors. A useful sensitivity test would be to train and validate a version without pH and DIC predictors, or at least quantify how much they affect performance, trends, and uncertainty patterns (that may propagate through the full prediction).
The held-out-year validation still does not test spatial extrapolation
The fixed-year validation design is useful for testing temporal generalization, but it does not prevent spatial information from nearby or repeatedly sampled locations from entering the training set. Since SOCAT contains repeat cruise tracks and long-term sampling locations, holding out complete years may still leave the model well constrained spatially. This design therefore does not fully test the model’s ability to extrapolate into regions with little or no data.
The authors should add spatially blocked validation, preferably with blocks large enough to avoid local spatial autocorrelation. They could also report performance separately for high-coverage and low-coverage regions. This would align the validation more closely with the paper’s central argument that spatial coverage is the dominant limitation.
The claim that spatial sparsity, not algorithm choice, is the dominant limitation is too strong
The thinning experiments are interesting and support the importance of spatial coverage. However, they are conducted within a fixed model framework. They show that, for this model, reducing spatial coverage produces a larger error penalty than reducing temporal coverage under comparable sample-size conditions. They do not prove that algorithm choice is secondary in general, nor that spatial sparsity is more important than all methodological choices.
The authors should rephrase this conclusion more carefully. A more defensible statement would be that, within the FDU-BTR framework and the thinning experiments performed here, spatial representativeness has a stronger effect on independent-year RMSE than temporal continuity. Broader claims about algorithm choice would require controlled comparisons across substantially different algorithms under the same coverage perturbations.
Boundary blending may improve numerical continuity but could also smooth real gradients
The boundary blending scheme is a reasonable solution to discontinuities between regional models. However, the evaluation focuses on reduced differences across regional boundaries. Reduced discontinuity is not necessarily equivalent to improved physical realism, because real gradients may occur across fronts, water-mass boundaries, and biogeochemical regimes. The authors acknowledge this point, but the analysis would be stronger if they evaluated whether boundary blending improves agreement with observations near boundaries, rather than only reducing cross-boundary contrast. It would also be useful to assess whether boundary blending damps real variability or trends in frontal regions.
Claims about coastal and marginal regions should be more cautious
The paper identifies coastal, shelf, marginal-sea, coral-reef, and high-latitude regions as areas of elevated uncertainty, and the buoy validation indicates poorer performance in coastal and reef environments. This is not surprising for a 1° monthly product, but it should be reflected consistently in the abstract, conclusions, and data-use recommendations. The product may be appropriate for basin-scale and open-ocean analyses, but users should be warned more explicitly against interpreting local coastal or reef-scale variability.
Minor comments
L64: Please define SOCOM at first use, and other acronyms too.
L80: Since BTR decomposes observed pCO2 into a background climatology, thermal-anomaly term, and residual term, the authors should validate or at least diagnose each of these components. It is not sufficient to evaluate only the final reconstructed pCO2 field if the decomposition is presented as a key physical advance.
L91: The statement that the study “characterizes the spatial patterns and dominant sources of product uncertainty” is too strong given the current validation. The uncertainty analysis is useful, but the independent validation is still limited, especially for spatial extrapolation. This sentence should be softened unless additional independent spatial validation is added.
L91–93: The statement that the product provides a foundation for air–sea CO2 flux estimation should be supported by validation of ΔpCO2 or air–sea CO2 fluxes. Otherwise, the authors should say that the product is potentially useful for flux estimation, pending flux-specific evaluation.
Specific to methods:
Please clarify how missing predictors are handled in different periods. Several predictors do not cover the full 1982–2024 period: Chl-a starts in 1997, SLA in 1993, MLD ends in 2022, pH/DIC start in 1985, etc. CatBoost can handle missing values, but long periods of missingness may still encode time or product-availability information.
Because the product is intended for long-term variability and trend analyses, the authors should evaluate trend robustness explicitly. Are regional pCO2 trends sensitive to the background climatology, xCO2, pH/DIC predictors, and the choice of training years? Please compare trends against independent time series where possible and report trend uncertainty, not only pointwise RMSE.
Please justify the pCO2 > 700 µatm filtering threshold. This may remove extreme but real coastal, upwelling, reef, or marginal-sea values. If the product excludes or smooths such environments, this should be stated clearly. Also, please quantify sensitivity to the fCO2 to pCO2 conversion, the 0.0423 K-1 temperature correction, and the 40 µatm correction filter. These choices may disproportionately affect high-latitude, coastal, upwelling, or reef observations and could influence both the training target and validation metrics.
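The sensitivity to the temperature coefficient can be explored with a short sketch. It assumes the correction takes the common exponential form pCO2(T2) = pCO2(T1) · exp(γ (T2 − T1)) with γ = 0.0423 K-1; whether the manuscript applies it exactly this way is an assumption, and the perturbed coefficients in `correction_sensitivity` are hypothetical values chosen only for illustration.

```python
import math

GAMMA = 0.0423  # K^-1, the coefficient quoted in the comment above


def temperature_correct(pco2, t_from, t_to, gamma=GAMMA):
    """Shift pCO2 from temperature t_from to t_to (same units, K or deg C)."""
    return pco2 * math.exp(gamma * (t_to - t_from))


def correction_sensitivity(pco2, t_from, t_to, gammas=(0.0403, 0.0423, 0.0443)):
    """Size of the correction (corrected minus original) for each coefficient."""
    return {g: temperature_correct(pco2, t_from, t_to, g) - pco2 for g in gammas}
```

For example, `correction_sensitivity(400.0, 20.0, 22.0)` returns the correction in µatm for a 2 K warming under each coefficient. Because the correction scales with both pCO2 and the temperature offset, the spread across coefficients grows where temperature offsets are large, which is why the effect may be uneven across regimes.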
Because the background climatology is constructed from existing pCO2/fCO2 products, the comparison of FDU-BTR with those products is not fully independent. The authors should discuss this dependence and avoid presenting inter-product agreement as completely independent validation.
Please provide validation metrics stratified by basin, latitude band, season, open-ocean/coastal class, and observational coverage density. This would make the validation much more relevant to the spatial gap-filling problem.
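A stratified summary of this kind needs little code. The sketch below assumes each matched validation point carries a stratum label (for example a basin name, a latitude band, or a coverage class); the labels and the function name `stratified_metrics` are illustrative.

```python
import math
from collections import defaultdict


def stratified_metrics(obs, rec, strata):
    """Sample count, mean bias, and RMSE of (rec - obs) within each stratum."""
    groups = defaultdict(list)
    for o, r, s in zip(obs, rec, strata):
        groups[s].append(r - o)

    out = {}
    for s, errs in groups.items():
        n = len(errs)
        out[s] = {
            "n": n,
            "bias": sum(errs) / n,
            "rmse": math.sqrt(sum(e * e for e in errs) / n),
        }
    return out
```

Reporting the sample count alongside each metric shows how much of a global score is driven by well-sampled strata.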
Please include a spatial-block or leave-region-out validation experiment. The current held-out-year validation mainly tests temporal generalization.
Please test uncertainty calibration against independent residuals. For example, do high-uncertainty grid cells actually have larger validation errors?
The conclusion that spatial observational sparsity, rather than algorithm choice, is the dominant remaining limit should be softened. The thinning experiments support the importance of spatial coverage within this framework, but do not isolate algorithm choice across reconstruction methods.
Overall, I find the dataset and framework potentially useful, but the manuscript should better align its claims with the validation performed. The strongest contribution is the physically guided residual-learning framework combined with a careful diagnosis of coverage limitations.
Also, please ensure that the archived dataset contains not only the final pCO2 field but also the uncertainty, and preferably the background, thermal, and residual components.
Citation: https://doi.org/10.5194/essd-2026-375-RC1
Data sets
FDU-BTR: A global monthly gridded pCO2 product (1982-2024) Wang, Z., Fu, W. https://doi.org/10.5281/zenodo.20152530
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 71 | 14 | 6 | 91 | 6 | 4 |
The authors use NCEI buoy data (Sutton et al. 2018) to validate their method (Figure 6 in section "5.1 Independent time-series validation"). There is an error here because these data are not actually independent; they are included in SOCAT. The authors need to change this section to reflect the fact that the buoy data are not independent of SOCAT, their training set.
Galen McKinley, Columbia University and Lamont-Doherty Earth Observatory