A Novel Global Gridded Ocean Oxygen Product Derived from a Neural Network Emulator and In Situ Observations
Abstract. Ocean deoxygenation, driven by climate change, poses significant challenges to marine ecosystems and can profoundly alter nutrient and carbon cycling. Quantifying the rate and regional patterns of deoxygenation relies on spatio-temporal interpolation tools to fill gaps in observational coverage of dissolved oxygen. However, this task is challenging due to the sparsity of observations, and classical interpolation methods often lead to high uncertainty and biases, typically underestimating long-term deoxygenation trends. In this work, we develop a novel gridded dissolved oxygen product by integrating direct oxygen observations with machine-learning-based emulated oxygen estimates derived from temperature and salinity profiles. The gridded product is then generated through optimal interpolation of both the observed and emulated data. The resulting product shows strong agreement with baseline climatology and captures well-known patterns of seasonal variability and long-term deoxygenation trends. It also outperforms current state-of-the-art products by more accurately capturing dissolved oxygen variability at synoptic and decadal scales, and by reducing uncertainty around long-term changes. This study highlights the potential of combining machine learning with classical interpolation methods to generate improved gridded biogeochemical products, enhancing our ability to study and understand ocean biogeochemical processes and their variability under a changing climate.
Status: final response (author comments only)
- RC1: 'Comment on essd-2025-288', Anonymous Referee #1, 14 Jul 2025
- AC1: 'Reply on RC1', Said Ouala, 12 Aug 2025
- RC2: 'Comment on essd-2025-288', Anonymous Referee #2, 26 Sep 2025
General comments
I thank the authors for this interesting work. I think adding the emulated data in combination with Optimal Interpolation is a useful approach to further developing and improving observation-based 4D reconstructions of oxygen. Making these reconstructions and using them to identify and explain variability on different scales is a very important goal, and the link with the PDO looks promising.
The Optimal Interpolation method seems sound, even though the manuscript would benefit from additional details on why these parameters were chosen, especially regarding the differences between the two separate products.
The machine learning part, and therefore the quality of the emulated data, is not tested robustly enough. There is only the comparison with “test regions” taken from the validation dataset; more detail is needed on how these were chosen and how independent they actually are. The world map in Figure 8 shows good coverage of ocean data, but in Figures C3 and C4 you still have gaps, and you have not analysed any seasonal bias. The validation data could also have been partly seen by the machine, since it is often used to control learning and prevent overfitting. There are also no other measures of machine learning performance. It would have been good to have an n-fold machine learning ensemble – is there a large spread in the predictions? Do values for some areas differ substantially from one ensemble member to another? (A sketch of such a check follows below.)
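A minimal sketch of what such an n-fold ensemble check could look like, assuming scikit-learn and placeholder arrays `X` (predictors) and `y` (oxygen); the fold count and architecture are illustrative, not the authors' setup:

```python
# Minimal sketch of an n-fold ensemble spread check; X and y are
# placeholder arrays, not the authors' code.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPRegressor

def ensemble_spread(X, y, n_folds=5, seed=0):
    """Train one MLP per fold (each 1/n_folds share held out once) and
    return the per-sample mean and standard deviation of the predictions."""
    preds = []
    for train_idx, _ in KFold(n_folds, shuffle=True, random_state=seed).split(X):
        model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500,
                             random_state=seed)
        model.fit(X[train_idx], y[train_idx])
        preds.append(model.predict(X))      # predict on all samples
    preds = np.stack(preds)                 # shape (n_folds, n_samples)
    return preds.mean(axis=0), preds.std(axis=0)
```

A large per-sample standard deviation in some regions would indicate exactly the member-to-member disagreement asked about above.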
I am also not fully convinced that the decadal and synoptic variability you’ve seen, or any additional features you observed, is definitely real. That said, I do not exclude the possibility that it is real. I think it is important to explain further why you think it is real – because that’s the main issue faced by anyone using interpolation techniques. Currently, it is not clear to the reader why these results couldn’t still be a product of the sparsity of ocean observations. After all, even if you added many datapoints based on temperature and salinity, this new dataset is still sparse given how vast the ocean is. In some parts of the text it sounds as if simply observing decadal variability can be declared an improvement. It could indeed represent an improvement, but the manuscript needs to provide stronger justification. One way of doing this could be a model validation, for example. You could also look at which basins are driving this decadal variation and discuss the processes in these basins that could drive it (see the sketch below). You do this partly in the text with the PDO, but the manuscript could benefit from a more detailed analysis.
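A rough sketch of the kind of basin decomposition meant here, assuming a (time, lat, lon) anomaly array and a 0/1 basin mask on the same grid; all names are placeholders:

```python
# Sketch of an area-weighted basin-mean time series; o2_anom has shape
# (time, lat, lon) and basin_mask has shape (lat, lon). Placeholder names.
import numpy as np

def basin_mean_series(o2_anom, lats, basin_mask):
    """Area-weighted mean oxygen anomaly over one basin per time step."""
    w = np.cos(np.deg2rad(lats))[None, :, None] * basin_mask[None, :, :]
    num = np.nansum(o2_anom * w, axis=(1, 2))
    den = np.nansum(np.isfinite(o2_anom) * w, axis=(1, 2))
    return num / den
```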
The citation commands need to be checked, especially the difference between \cite and \citeA if you’re writing in LaTeX: Author et al. (2025) vs. (Author et al., 2025).
Specific comments
L27: You should also mention the seasonal bias. Especially in polar regions there are still few datapoints in winter. There is also different data availability in different decades – and different data quality.
L39: They are not just a reference for model calibration; they provide important observation-based estimates of the oxygen budget.
L48: Regarding marginal seas not present in other studies: that's true, but judging from your validation dataset in Figure 2, you do not seem to focus on them either.
L48: Good point, I think it is important to address that.
L50: Perhaps this will be a subject later - but why did you choose to start from 1965? Isn't that optimistic given we have very few datapoints during that time? What about data/measurement quality?
L62: How do you test that your product really fares better in regions where there is no data?
L86: Do you only train once? Or do you use a machine learning ensemble where the 20% are different for each ensemble member, so that eventually every datapoint has been in training at least once before calculating an ensemble mean?
L88: How did you choose the test regions? Did you use an algorithm like the SOM method of Landschützer et al. (2016)?
L95: Do you mean Multilayer Perceptron? Perhaps you should also mention that this is a feedforward neural network, which may be more familiar to many readers. Also, there are many different neural network architectures.
L101: Perhaps it would be good to give a general summary in the text. You don't need to provide the number of layers here, but at least tell the reader how the hyperparameters were chosen. It would also be good to explain why you chose this architecture (a sketch of one common search procedure follows below).
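For illustration, a hedged sketch of one standard way such a search could be documented (cross-validated grid search); the grid values are invented and the manuscript's actual procedure may well differ:

```python
# Sketch of a cross-validated grid search over MLP hyperparameters;
# the grid and scoring choices are illustrative, not the authors' setup.
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor

param_grid = {
    "hidden_layer_sizes": [(32,), (64, 64), (128, 64, 32)],
    "alpha": [1e-5, 1e-4, 1e-3],            # L2 regularisation strength
    "learning_rate_init": [1e-4, 1e-3],
}
search = GridSearchCV(MLPRegressor(max_iter=500), param_grid, cv=3,
                      scoring="neg_root_mean_squared_error")
# search.fit(X_train, y_train)              # placeholder training data
# print(search.best_params_)
```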
L114: The upper ocean is also characterised by a higher variability (i.e. more difficult to predict).
L115: Regarding errors in earlier years: this is actually one of the main issues we face – perhaps a few more words need to be added in the introduction instead of doing this here (in addition to the sparse regions and seasonal bias I mentioned).
Figure 1: This looks like a good match between predicted and true values, but the scatter markers mask each other where the plot gets more crowded. It would be clearer to make a density plot (where colours indicate the number of datapoints at that location in the plot), similar to Figure 3; see the sketch below. That way you can also see where most of the datapoints are and where the outliers are.
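A minimal sketch of such a density plot, using synthetic stand-in values for the matched observed and predicted oxygen:

```python
# Hexbin density plot of predicted vs. observed values; the data here
# are synthetic stand-ins, not the manuscript's results.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
y_true = rng.uniform(0, 350, 50_000)                 # synthetic O2 values
y_pred = y_true + rng.normal(0, 10, y_true.size)     # synthetic predictions

fig, ax = plt.subplots()
hb = ax.hexbin(y_true, y_pred, gridsize=60, bins="log", mincnt=1)
ax.plot([0, 350], [0, 350], "k--", lw=1)             # 1:1 reference line
ax.set_xlabel("Observed O2 (umol/kg)")
ax.set_ylabel("Predicted O2 (umol/kg)")
fig.colorbar(hb, label="log10(count)")
plt.show()
```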
L124-126: Perhaps I'm misunderstanding something, but it is not clear how you deal with data in sparse regions (i.e. regions where you don't have any or much historical data). I know this is not easy to do, but it is important to address.
L139: 0 m to 2000 m: it would be good to say why you chose these limits.
Figure 3: Instead of longitude and latitude I think it would be more informative to use ocean basins. Depth and Year look good - although I would also be interested in data shallower than 300 m. Minor point: perhaps reverse the y-axis here, so that deeper levels are down, like in the real ocean.
L156: It wasn't clear before that you planned to make two separate products: one yearly product and one monthly product, each with a different time range. Why did you choose these years and parameters?
L166: Regarding emulation uncertainty: in other machine learning work this is done via the standard deviation of the machine learning ensemble (e.g. MOBO-DIC in Keppler et al. 2023). Perhaps I’m misunderstanding this, but I'm not sure how you justify the slight increase to 4%. Perhaps it would be good to make that clearer.
L194: You say your product better resolves synoptic scale structures. Why could this be? And how confident are you that this is real? Also, if you use “resolve” it sounds like you could be referring to resolution, but both GOBAI-O2 and Ito et al. 2024 use the same resolution of 1 degree.
Figure 5: I like that you examined the wavelengths in this way. Could you perhaps add one more wavelength label on the x-axis, so that it's clear at what wavelengths the other differences are?
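For context, a rough sketch of how a zonal wavenumber spectrum of this kind can be computed on a regular 1-degree grid; the 111 km per degree conversion holds at the equator and would be scaled by cos(latitude) elsewhere, and all names are placeholders:

```python
# Sketch of a zonal power spectrum along one latitude circle; o2_line
# is a placeholder 1-D oxygen anomaly on a regular grid.
import numpy as np

def zonal_power_spectrum(o2_line, dx_km=111.0):
    """Return wavelengths (km) and spectral power, mean removed."""
    anom = o2_line - o2_line.mean()
    power = np.abs(np.fft.rfft(anom)) ** 2
    freqs = np.fft.rfftfreq(anom.size, d=dx_km)    # cycles per km
    return 1.0 / freqs[1:], power[1:]              # skip zero frequency
```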
L196: It is not clear that you are now talking about your monthly product.
L197: Minor point: You can just say "compare with GOBAI-O2 and Ito et al. 2024" instead of "previous ML-based products, including GOBAI-O2 and Ito et al. 2024". Otherwise, it sounds like you have more products to compare with.
L201: Same here, fewer words might be easier on the reader. Just say GOBAI-O2 and Ito et al. 2024.
Table 1: For a reader not familiar with this correlation, it would be good to explain why it is desired and important.
L209: Is there any ocean further south than -80 degrees?
L217: Minor point: there are a varying number of blank spaces around the plus-minus signs.
L219: The reference to the “full 1965–2022 period” again highlights that the manuscript does not clearly distinguish between the different uses of the yearly and monthly products across their respective time ranges. In particular, the monthly product covering 2005–2022 has so far been rarely discussed, and its role relative to the yearly product remains unclear.
L233: Language like "it is well established" sounds a bit too grand here, especially if both citations in that sentence are from Ito et al.
L236: You're right that your data density is higher - but it is still a sparse dataset, even with emulated data added. It would be good to acknowledge that there may still be some bias (or argue why you think there isn't).
L245: You should refer to Figure 7 here again, as it matches what you say about the PDO phases. Based on that and visual analysis of the figure, a reader could agree. To make it clearer you could also compare your results with the PDO index, for example (see the sketch below). If there is a strong correlation, this would improve confidence in these results, as there is always the possibility that part of it is noise (or comes from other processes).
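A minimal sketch of what such a PDO comparison could look like; both input series are placeholders, and a real check would use the published PDO index at matching temporal resolution:

```python
# Sketch of a detrended correlation between a basin-mean oxygen anomaly
# series and the PDO index; both series are placeholders.
import numpy as np

def detrended_correlation(o2_series, pdo_index):
    """Pearson correlation after removing a linear trend from each series."""
    t = np.arange(o2_series.size)
    o2 = o2_series - np.polyval(np.polyfit(t, o2_series, 1), t)
    pdo = pdo_index - np.polyval(np.polyfit(t, pdo_index, 1), t)
    return np.corrcoef(o2, pdo)[0, 1]
```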
L251: I know you already talked about some of this in section 2.4 with the covariance matrix, but a dedicated section on uncertainty estimates should also explicitly address the prediction/emulation error of the machine learning model, measurement uncertainty, and other relevant components. At present, these elements are just contained in the covariance matrix. At minimum, it would be helpful to restate which uncertainty components are included when you write, “As described in Section 2.4, Σa is a diagonal matrix, representing the variance at each grid point.”
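For illustration, a sketch of the variance bookkeeping this comment asks to be restated, under the assumption that the error components are treated as independent; the component names are invented, not taken from the manuscript:

```python
# If measurement, emulation, and representativeness errors are
# independent, the diagonal of the covariance matrix is their sum.
# All names are illustrative.
import numpy as np

def total_variance(sigma_meas, sigma_emul, sigma_repr):
    """Per-grid-point variance from independent error components."""
    return sigma_meas**2 + sigma_emul**2 + sigma_repr**2

# e.g. Sigma_a = np.diag(total_variance(s_meas, s_emul, s_repr).ravel())
```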
Figure 8: These maps, along with Figures C3 and C4, are useful, but it would be good to see time and space together, i.e. is there a seasonal bias in some regions (see the sketch below)? You should also address that, even with the emulated data added, there are spatial gaps in certain decades (Figures C3 and C4).
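A simple sketch of such a month-by-region coverage check, assuming pandas and an observation table with latitude and time columns (placeholder names); zero cells would flag a seasonal sampling bias:

```python
# Count observations per calendar month and 30-degree latitude band;
# df, "lat", and "time" are placeholder names.
import pandas as pd

def monthly_counts_by_band(df, lat_col="lat", time_col="time"):
    """Observation counts per latitude band and calendar month."""
    bands = pd.cut(df[lat_col], bins=range(-90, 91, 30))
    months = pd.to_datetime(df[time_col]).dt.month
    return df.groupby([bands, months]).size().unstack(fill_value=0)
```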
L270: I’m not convinced that you are actually avoiding these additional sources of error. You are still interpolating; it’s just in a different order. Don’t get me wrong, I think this approach is useful, but I wouldn’t go as far as suggested here. There does not seem to be enough testing of remaining biases, as even with the emulated data added the dataset is still sparse. If you think there has been enough testing, it would be good to make that clearer.
L273: Regarding the two products: You talked very little about the differences between the two. It is a bit unclear to the reader what the purpose of creating two different ones is. Are there any things one is not representing as well as the other - for example, is the only purpose of the monthly product to look at the seasonality? Or are there any other advantages? Did the smaller e-folding length scale make a difference? It would be great if you could provide more detail regarding your choices and why these had to be separate products. It would also be good to think about why 1 degree was appropriate.
L281: I can see from the plots why you are excited and use the words “outperforms current state-of-the-art products” but I think you haven’t shown enough why you are confident in your results.
Figure C2: The upper left figure, not right.
Figures C3 and C4: I think these are important figures that further show how you add emulated data over different 5-year windows. However, they don’t seem to be referenced in the text.
Citation: https://doi.org/10.5194/essd-2025-288-RC2
I would like to thank the authors for the interesting and timely work. The paper presents a novel approach to generating a gridded dissolved oxygen product by integrating direct observations with ML-based emulations derived from temperature and salinity profiles, followed by optimal interpolation. The methodology is simple to follow and technically sound, the results are compelling, and the product demonstrates clear improvements over existing datasets, especially in capturing long-term trends and reducing uncertainties. I recommend acceptance with minor revisions, but I would like to note that my review is primarily focused on the ML aspect.
Comments:
* The training/test split was done randomly; how do the authors ensure there is no data leakage? It would have been more interesting if the split was done in a temporal way (see the first sketch after this list).
* It would also have been more robust to use a separate validation dataset, instead of only a train/test split.
* Any reason why the test locations do not include any points near Europe?
* Any reason for using Month of the year + Day of the month in the MLP inputs instead of just using Day of the year (see the second sketch after this list)?
* Can the authors describe the hyperparameter search procedure used to tune the MLP?
* Figure 1 would have been more informative if the plots were done per test region.
* Any explanation of what's happening at 500 m depth in test region J (Figure 2)?
* It would be interesting to use an XAI method to study feature importance for the MLP (see the third sketch after this list).
* Any plans to share the code used and not only the dataset?
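On the temporal split point above, a minimal sketch of a time-based hold-out, which limits leakage between profiles that are close in time; the cutoff year is illustrative:

```python
# Time-based train/test split: hold out the most recent years rather
# than random samples. The cutoff year is a placeholder.
import numpy as np

def temporal_split(years, holdout_from=2015):
    """Boolean train/test masks from a time cutoff."""
    years = np.asarray(years)
    test = years >= holdout_from
    return ~test, test
```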
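On the day-of-year point, a sketch of a cyclical encoding, so that 31 December and 1 January end up close in feature space:

```python
# Encode day of year as (sin, cos) features on the unit circle.
import numpy as np

def encode_day_of_year(doy, period=365.25):
    """Map day of year to a cyclical (sin, cos) pair."""
    angle = 2.0 * np.pi * np.asarray(doy) / period
    return np.sin(angle), np.cos(angle)
```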
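On the XAI point, a minimal runnable sketch of permutation importance with scikit-learn; the data are synthetic stand-ins for inputs such as temperature, salinity, location, and depth:

```python
# Permutation feature importance on a toy regression; the feature
# names and data are invented stand-ins, not the manuscript's inputs.
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                        # fake T, S, lat, depth
y = 2 * X[:, 0] - X[:, 1] + rng.normal(0, 0.1, 500)  # depends on T and S only

model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
for name, imp in zip(["T", "S", "lat", "depth"], result.importances_mean):
    print(f"{name}: {imp:.3f}")
```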
Typos:
* Line 35: "weather forecasting" instead of "forecasting"
* Many citations are badly formatted, \citet vs \citep.