the Creative Commons Attribution 4.0 License.
OceanTACO: A Multi-Sensor Global Ocean Sea Surface State Dataset
Abstract. We present OceanTACO, a harmonised global collection of sea surface state datasets designed to support reproducible Earth system research. The collection integrates satellite altimetry, sea surface temperature, salinity, surface winds, reanalysis fields, and Argo in situ observations within a unified cloud-optimised specification based on Transparent Access to Cloud-optimised datasets (TACO). It includes Level-3 observations, Level-4 gap-filled products, and reanalysis outputs while preserving native spatial and temporal resolution. The core dataset spans 29 March 2023 to 1 August 2025, covering the Surface Water and Ocean Topography (SWOT) mission, with an extended record from 1 January 2015 until 29 March 2023 for non-SWOT sources.
Datasets are harmonised through standardised metadata, spatial referencing, and temporal indexing, enabling consistent spatiotemporal queries across sensors and processing levels. A uniform internal structure reduces product-specific preprocessing and allows the same data-access routines to be applied across regions, sensors, and studies. This supports Earth system analysis workflows such as validation against in situ observations, comparisons between observations and mapped products, observation system experiments, and multivariate sensor analyses.
Example applications demonstrate cross-product collocation with Argo, analysis of sea surface height variability during extreme events, and relationships between surface variables relevant for data-driven reconstruction. OceanTACO improves accessibility to coordinated multi-source analyses while preserving data provenance and native observation characteristics, and can be extended with new missions without restructuring the dataset. The core and extended datasets are available at https://doi.org/10.57967/hf/8171 (Lehmann and Aybar, 2026a) and https://doi.org/10.57967/hf/8172 (Lehmann and Aybar, 2026b), respectively.
Status: open (until 25 Jul 2026)
- CC1: 'Comment on essd-2026-232', Giuseppe M.R. Manzella, 01 Jun 2026
- RC1: 'Comment on essd-2026-232', Anonymous Referee #1, 15 Jun 2026
The manuscript describes a collection of ocean surface fields from observations and from one ocean and one atmospheric reanalysis, which already existed on, or were brought to, regular grids close to their original resolution. Apart from the simple interpolation/aggregation, the product compresses the data by using an int16 representation together with a standard compression routine. As regridding and lossy compression (Reichelt et al. 2026) are widely available standard procedures, the main merit of the product is the provision of a unique STAC catalog that unifies all products. This lowers the barrier to data use, particularly for disciplines unfamiliar with ocean and atmosphere data.
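As a rough illustration of the int16 representation mentioned above, the following sketch packs floating-point values into 16-bit integers with a linear scale and offset and measures the round-off error. It assumes a CF-style `scale_factor`/`add_offset` packing; the manuscript's actual scheme may differ, and the value range and sample values are hypothetical.

```python
# Minimal sketch of linear int16 packing (assumed scheme, not taken from the manuscript).
INT16_MIN, INT16_MAX = -32768, 32767


def pack_params(vmin, vmax):
    """Return (scale, offset) that map [vmin, vmax] onto the full int16 range."""
    scale = (vmax - vmin) / (INT16_MAX - INT16_MIN)
    offset = vmin - INT16_MIN * scale
    return scale, offset


def pack(value, scale, offset):
    """Quantise a float to the nearest int16 code, clipping at the range limits."""
    q = round((value - offset) / scale)
    return max(INT16_MIN, min(INT16_MAX, q))


def unpack(q, scale, offset):
    """Recover the approximate float from an int16 code."""
    return q * scale + offset


# Hypothetical sea level anomaly range of -2 m to 2 m.
scale, offset = pack_params(-2.0, 2.0)
values = [-1.5, -0.0123, 0.0, 0.0471, 1.25]
errors = [abs(unpack(pack(v, scale, offset), scale, offset) - v) for v in values]
max_error = max(errors)  # bounded by half a quantisation step, scale / 2
```

With this assumed 4 m range, one quantisation step is about 0.06 mm, so the round-off error stays below roughly 0.03 mm regardless of the local signal amplitude.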
The manuscript could be better organized. The most important information, namely the temporal and spatial resolution of the source data and of the final product, is not available for all products. It is also hard to find because it is scattered across several places. Also, information about depth for GLORYS-12 and Argo is missing, suggesting that some data include vertical information.
Details:
What is the reason to limit the range to 2015-2025? Most data are available for decades before. Altimetry and GLORYS-12 start in 1993 and Argo in the early 2000s. I think the short range limits the applications because the collection will be most useful for statistical analyses.
Section 2.1 Information about temporal and spatial resolution should be provided along with the data information.
L 150-151 It would be useful to know the temporal and spatial resolution in the OceanTACO product for the other data as well.
L158-159 Are the original time and location stamps preserved, or is the data interpolated in any way?
L 163-164 Why are products listed that do not fall within the OceanTACO period?
L189-190 Are the complete profiles provided or only surface data. If only near surface data, which depth?
L221 Not up to 4? For instance, if the region covers a corner where 4 regions meet.
2.3.2/2.3.3 Strange way of numbering the sections. 2.3.2 seems to contain 2.3.3; otherwise it is an empty section. Consequently, 2.3.3 should be named 2.3.2.1.
Equation 2: My understanding is that the along-track data has a resolution of about 7 km, such that the number of SLA observations per cell is about one. Do these metrics make sense with such a low number of samples, or is the input data of much higher resolution?
Compression: Is it possible to compare your compression with other methods discussed in (https://doi.org/10.5194/egusphere-2026-60) and maybe evaluate by ClimateBenchPress?
Fig.4 What is shown is obviously a rather trivial consequence of how many significant digits int16 can represent. At least add a word about that.
The scatter plot is not useful since only pretty large errors will be visible there. I would be interested in a relative error normalized by the std of the data. For regions with low variability of a few cm, the error may reach 1% or so.
L324-325 Maybe mention other eddy-rich regions as well, e.g. the Gulf Stream or the Southern Ocean.
L 332-336 Not so clear what the advantage is here. It seems much harder to do along-track spectra with the gridded fields, since it is not so easy anymore to find the points belonging to one track. Also, it seems that the analysis of along-track SSH and SWOT is performed over different points. I guess the whole regions shown in the Figure were considered. That would explain the different PSD levels for the longer wavelengths, where the difference in resolution should not be an issue. This example seems to be more a demonstration of pitfalls: if analysis becomes too easy, people will not care about important details anymore.
Table A1 It would be good to add the temporal resolution and for GLORYS-12 that this is only surface data.
Table B1 It would be useful to report the error as a noise-to-signal ratio to give a better idea of whether the resulting product is still useful, or for which applications it may still be usable. For instance, the salinity signal is often less than 0.1. The RMSE value suggests that in some regions the error reaches 1%, leaving only 2 significant digits. For SSS, better use g/kg, if this is meant.
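The noise-to-signal metric requested here, like the std-normalized error suggested for Fig. 4, could be computed along the following lines. This is a sketch only: the salinity values and the quantisation step of 0.001 are hypothetical and are not taken from the manuscript or its tables.

```python
import math
import statistics


def rmse(original, decoded):
    """Root-mean-square error between original and decoded values."""
    squared = [(a - b) ** 2 for a, b in zip(original, decoded)]
    return math.sqrt(sum(squared) / len(squared))


def noise_to_signal(original, decoded):
    """RMSE of the compression error divided by the std of the original signal."""
    signal_std = statistics.pstdev(original)
    if signal_std == 0:
        raise ValueError("signal has zero variance; ratio is undefined")
    return rmse(original, decoded) / signal_std


# Hypothetical sea surface salinity samples from a low-variability region.
sss = [35.0213, 35.0548, 34.9791, 35.0106, 35.0437]

# Hypothetical lossy round trip: quantise to a fixed step of 0.001.
step = 0.001
sss_decoded = [round(v / step) * step for v in sss]

ratio = noise_to_signal(sss, sss_decoded)  # fraction of the signal std lost to round-off
```

Reporting this ratio per region rather than a single global RMSE would show directly where the compression error becomes comparable to the signal.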
Citation: https://doi.org/10.5194/essd-2026-232-RC1
Data sets
OceanTACO Core Dataset Nils Lehmann https://doi.org/10.57967/hf/8171
Model code and software
Data Generation Code Nils Lehmann and Cesar Aybar https://github.com/nilsleh/oceanTACO
Interactive computing environment
ReadTheDocs Documentation Page Nils Lehmann https://oceantaco.readthedocs.io/en/latest/index.html
Viewed
- HTML: 397
- PDF: 149
- XML: 28
- Total: 574
- BibTeX: 24
- EndNote: 25
The paper presents a "data system" composed of data from various sources and therefore subject to potentially different processing methods and/or algorithms. Highlighting these possibilities and providing information on the challenges encountered when using data from various sources is a plus.
The data system can certainly be useful for many purposes, and therefore the paper, while not original, is valuable.
Some information should be added for further useful references. Any data set contains noise and spurious data. Have these been eliminated from the data residing in the various repositories? Was a check performed within the "OceanTACO" system?
A weakness of the article is the usage examples, which are rather superficial and do not provide sufficient detail, as in 3.3 and 3.5. In the case of Observation System Experiments and Mission Impact Studies, how much of an impact might the definition of very large regions (perhaps unable to highlight subregional phenomena) have on, for example, impact studies? Even more vague is the paragraph regarding machine learning, which could also be considered data mining. There are many types of machine learning, including supervised, unsupervised, semi-supervised, and reinforcement learning. Some details on what has been used by the cited authors could be very useful.