the Creative Commons Attribution 4.0 License.
Attention Enhanced 3D-U-Net++ Ocean Temperature and Salinity Reconstruction in the Northwestern Pacific based on Transfer Learning
Abstract. Real-time and accurate three-dimensional ocean temperature-salinity (T-S) fields are of great significance for a deeper understanding of ocean dynamics and for improving the prediction skill of numerical models. However, current ocean observations, especially those below the sea surface, still suffer from significant limitations in temporal and spatial resolution. Several neural network methods using multi-source satellite data for underwater temperature and salinity reconstruction have been proposed, achieving real-time temperature and salinity reconstruction, but their biases relative to in-situ observations are still significant. This study focuses on the northwestern Pacific region (0–40° N, 120–160° E) and proposes an attention-enhanced three-dimensional U-Net++ model, which reconstructs daily T-S fields (26 layers, 1/4° resolution, 5–2000 m depth) using sea surface temperature (SST) and sea surface height (SSH) data that are available in real time. The model introduces cross-scale feature aggregation and selective information gating, allowing it to emphasize temporally coherent surface features most relevant to subsurface variability, while suppressing noise propagation and over-smoothing. By integrating 26 consecutive days of SST and SSH as inputs, the model effectively alleviates the underdetermined problem of mapping limited surface observations to full-depth structures. In addition, a two-stage transfer learning strategy is employed: the model is first pretrained using monthly SST/SSH data and the gridded Argo data to learn observation-dominated low-frequency spatiotemporal patterns, and then fine-tuned using daily SST/SSH data and the high-resolution reanalysis to capture mesoscale dynamic processes.
Evaluation results demonstrate that the reconstructed T-S fields exhibit better agreement with in-situ T-S profiles from the World Ocean Database than those of previous studies, both during the validation period and in long-term statistical analyses, indicating the reliability and accuracy of the proposed approach for subsurface ocean field reconstruction. The reconstructed T-S field is available at https://doi.org/10.57760/sciencedb.31950 (Wang et al., 2025).
Status: final response (author comments only)
- RC1: 'Comment on essd-2025-742', Anonymous Referee #1, 26 Jan 2026
- RC2: 'Comment on essd-2025-742', Anonymous Referee #2, 03 Feb 2026
The manuscript “Attention Enhanced 3D-U-Net++ Ocean Temperature and Salinity Reconstruction in the Northwestern Pacific Based on Transfer Learning” investigates the problem of subsurface ocean temperature and salinity reconstruction in the northwestern Pacific region (0–40°N, 120–160°E). The study proposes an attention-enhanced three-dimensional U-Net++–based framework to reconstruct daily three-dimensional temperature and salinity fields from surface satellite observations. The model produces temperature–salinity estimates on 26 vertical layers at 1/4° horizontal resolution over depths ranging from 5 to 2000 m, using sequences of sea surface temperature (SST) and sea surface height (SSH) as inputs.
The proposed approach combines cross-scale feature aggregation with attention-based gating mechanisms intended to emphasize surface features most relevant for subsurface variability. By integrating 26 consecutive days of SST and SSH, the method aims to alleviate the inherently underdetermined nature of mapping limited surface observations to full-depth ocean structures. In addition, the authors employ a transfer-learning strategy in which the model is pretrained using monthly SST and SSH data and subsequently fine-tuned for daily reconstruction. Model performance is evaluated against in situ temperature and salinity profiles from the World Ocean Database, and the reported results indicate generally good agreement with observations and modest improvements relative to baseline datasets.
While the overall results appear reasonable and the proposed ideas are physically well motivated, the manuscript suffers from a lack of methodological clarity that makes it difficult to fully assess, reproduce, and interpret the approach. In particular, the network architecture is not specified in sufficient detail: the manuscript does not clearly define the input and output tensor dimensions, the dimensionality of the convolutional operations, or what precisely constitutes the “three-dimensional” aspect of the U-Net++ architecture in practice. It remains unclear how spatial, temporal, and vertical dimensions are represented within the network, and whether time and depth are treated as explicit dimensions or implicitly as stacked channels.
Similarly, the practical implementation of the transfer-learning strategy is valuable but insufficiently described. The pretraining and fine-tuning stages involve substantial changes in spatial resolution (1° to 0.25°), temporal resolution (monthly to daily), and target datasets (IPRC-Argo to GLORYS2V4), yet the manuscript does not clearly explain how these transitions are handled in practice. Key details regarding input normalization, adaptation of temporal windows, and the aspects of the learned representations expected to transfer between stages are either missing or only briefly mentioned in the results section. As a result, the reader is left with an incomplete understanding of how the proposed training strategy operates beyond a high-level conceptual description.
Finally, although the Results section is extensive and includes a wide range of diagnostic figures, it often reiterates similar validation setups and comparisons across multiple subsections. This repetition tends to obscure the main findings rather than sharpen them, and clearer structuring or consolidation of overlapping analyses would improve readability and focus.
Overall, the study addresses an important problem and presents promising results, but substantial improvements in methodological transparency and presentation are required before the contribution can be fully evaluated and appreciated.
Major Comments
Comment 1: The manuscript refers throughout to an “attention-enhanced 3D U-Net++” architecture, yet the network design is not described with sufficient precision to assess what makes it three-dimensional in an architectural sense. While Section 2.2.1 provides a general description of U-Net++ and the integration of CBAM attention modules, it remains unclear how this design is extended beyond a conventional two-dimensional framework. In particular, the manuscript does not clearly state whether the third dimension corresponds to depth or time, nor whether either is treated as an explicit spatial dimension within the network or implicitly as stacked input or output channels. It is also unclear whether a single network input consists of full regional SST–SSH maps or of point-wise surface time series. The absence of explicit input and output tensor definitions, together with a clear description of the dimensionality of the convolutional, pooling, and upsampling operations, makes it difficult to evaluate reproducibility, interpretability, and the validity of the architectural claims.
Comment 2: Related to this, the treatment of temporal information remains insufficiently specified. Section 2.2.3 provides a conceptual motivation for incorporating multi-day SST and SSH inputs to alleviate the underdetermined nature of subsurface reconstruction; however, it remains unclear how temporal information is handled within the network in practice. In particular, the manuscript does not clarify whether time is modeled explicitly (e.g., via temporal convolutions or sequence-aware components) or simply treated as an expansion of the input feature space. The choice to use a 26-day surface input window is reasonable and physically plausible, but the justification for this specific value is limited to its correspondence with the number of reconstructed depth levels. A brief explanation of whether this choice is motivated by physical timescales, empirical tuning, or practical considerations would improve clarity and help readers assess the generality of the approach.
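The ambiguity raised in Comments 1 and 2 can be made concrete with a short sketch. It is illustrative only and is not taken from the manuscript: the 160 × 160 grid (40° at 1/4°), the axis order, and the helper names `as_stacked_channels` and `as_explicit_time` are assumptions. It shows the two input layouts between which the manuscript would need to choose explicitly.

```python
import numpy as np

# Assumed sizes (not stated in the manuscript): 0-40 N, 120-160 E at 1/4 degree
# gives roughly 160 x 160 cells; 26 input days; 2 surface variables (SST, SSH).
N_VARS, N_DAYS, NY, NX = 2, 26, 160, 160


def as_stacked_channels(x):
    """(batch, var, time, y, x) -> (batch, var * time, y, x).

    Time is folded into the channel axis, so a 2D convolution sees each
    day/variable pair as one more feature channel.
    """
    b, v, t, ny, nx = x.shape
    return x.reshape(b, v * t, ny, nx)


def as_explicit_time(x):
    """Keep (batch, var, time, y, x) so a 3D kernel can slide along time."""
    return x


# One hypothetical sample: 26 consecutive days of SST and SSH maps.
surface = np.zeros((1, N_VARS, N_DAYS, NY, NX), dtype=np.float32)

print(as_stacked_channels(surface).shape)  # (1, 52, 160, 160)
print(as_explicit_time(surface).shape)     # (1, 2, 26, 160, 160)
```

In the first layout, the first 2D convolution mixes all 52 channels at once and treats each day as a separate feature; in the second, a 3D kernel shares its weights along the time axis. Stating which of these shapes enters the network, together with the corresponding output shape for the 26 depth levels, would address most of the concern.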
Comment 3: The implementation and interpretation of the transfer-learning strategy are beneficial for the manuscript but also require clarification. Although the manuscript evaluates different layer-freezing strategies and ultimately adopts full fine-tuning, the practical mechanics of transferring between pretraining and fine-tuning remain unclear. In particular, the transition from monthly, coarse-resolution SST/SSH and IPRC-Argo targets to daily, higher-resolution SST/SSH and GLORYS2V4 targets involves substantial changes in temporal resolution, spatial resolution, and target data characteristics. The manuscript would benefit from explicitly describing how input structures, normalization, and temporal windows differ between stages, and what aspects of the learned representation are expected to transfer. Since the final model is fully fine-tuned on GLORYS2V4, a clearer discussion is also needed on whether the resulting product should be interpreted as approximating observational variability or as producing GLORYS-consistent reconstructions with improved alignment to WOD profiles.
Comment 4: More generally, the training and evaluation strategy relies heavily on model-assisted and empirically reconstructed datasets. This is a reasonable and widely used approach for large-scale subsurface reconstruction; however, its implications for the interpretation of the resulting data product are not discussed in sufficient detail. In the fine-tuning stage, GLORYS2V4 reanalysis fields are used as training labels, and the reconstructed outputs are subsequently shown to closely resemble GLORYS in both spatial structure and error characteristics. While validation against independent WOD profiles is included, additional comparisons are largely performed against other reconstructed or fusion-based products (e.g., HGEM-derived regional datasets and CGOF1.0). Without a more explicit discussion of these dataset dependencies, it remains unclear to what extent the proposed method reconstructs independent oceanic variability versus reproducing the statistical structure of specific reanalysis products. Clarifying this distinction, rather than introducing additional comparative datasets, would help users interpret the dataset appropriately and assess its generalizability to other regions or observational systems.
Comment 5: The Results section is extensive but often repetitive in its presentation of the validation setup. In particular, multiple subsections repeatedly restate the same input data, reference datasets, and evaluation procedures. While this information is important, reiterating it throughout the Results disrupts the narrative flow and obscures the main findings. A clearer separation between a concise description of the evaluation framework (stated once) and the subsequent presentation of results would improve readability and focus without requiring additional analyses.
Comment 6: The manuscript also makes relatively strong claims regarding the capture of underlying physical laws and the ability to deliver real-time three-dimensional reconstructions. These claims are not fully supported by the methodological description. The surface inputs are derived from mapped and interpolated satellite products, and the subsurface targets are drawn from reanalysis and empirically reconstructed datasets. In this context, it would be more appropriate to frame the method as producing reanalysis-consistent subsurface fields conditioned on surface information. In addition, the concept of “real-time” reconstruction is not clearly defined in terms of data latency, update cycles, or computational requirements.
Comment 7: Finally, the Discussion and Conclusion section largely reiterates the methodological design and reported performance improvements, but offers limited reflection on limitations, uncertainties, and appropriate use cases. Given the dependence on specific training products and the use of spatially sparse in situ profiles for validation, a brief discussion of known constraints (e.g., sampling density, depth-dependent performance, and dataset dependence) would improve transparency and help users interpret and apply the dataset appropriately.
Minor Comments:
- The references cited in the Introduction are generally relevant to the topic. However, in several places the way individual studies are positioned in the narrative does not fully reflect the specific emphasis of those works. In some instances, the surrounding text highlights particular methodological aspects, while the cited studies focus on different elements of the approach. This can make it harder for readers to clearly understand how individual contributions relate to the conceptual structure presented. A clearer alignment between the narrative and the cited literature would improve clarity.
- Several in-text citations appear in non-chronological order (e.g., Xie et al., 2025; Wu et al., 2012). Reordering references chronologically within sentences would improve clarity and consistency.
- In line 93, the phrase “A transfer learning strategy” needs a reference.
- WOD is referenced before being defined (e.g., “WOD in-situ T–S profiles” in line 104); the acronym should be introduced at first use.
- Figure 3 appears to be cropped on the right-hand side; please verify whether this is intentional.
- In Section 3.1 (Transfer Learning), it is unclear whether the reported results correspond to the single-day or multi-day input configuration. The specific setup used for the presented results should be stated explicitly.
- In Figures 5 and 6, monthly correlation profiles are shown using very similar color maps, making it difficult to distinguish individual months (e.g., December versus February). Using more distinct colors or line styles would improve readability.
- Figures 7 and 8 effectively illustrate spatial reconstruction examples at selected depths for a single day. However, the analysis would benefit from accompanying spatial RMSE maps aggregated over the full test period at the same depths, to provide a more representative assessment of performance.
- For the spatial RMSE maps in Figure 10 derived from WOD profiles, it would be helpful to include a map of WOD profile density per grid cell (e.g., in the Appendix or Supplementary Material) to clarify observational support. In addition, visually distinguishing grid cells with missing or insufficient data from genuinely low-RMSE regions (e.g., via masking or a distinct color) would reduce ambiguity.
- Density scatter plots are presented both for a single validation year (Figures 11–12) and again for a longer period (1993–2023; Figures 16–17) using similar diagnostics. While both analyses are informative, their purposes partially overlap. Clarifying the distinct intent of each or streamlining one of them (e.g., moving it to supplementary material) would improve focus.
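The profile-density map and the masking of poorly sampled cells suggested above for Figure 10 could be produced along the following lines. This is a minimal sketch under stated assumptions: the 1° evaluation cells, the `min_profiles` threshold, and the function names are hypothetical, and the inputs are assumed to be one reconstruction error per profile at a fixed depth.

```python
import numpy as np


def profile_density(lats, lons, lat_edges, lon_edges):
    """Count profiles falling in each (lat, lon) grid cell."""
    counts, _, _ = np.histogram2d(lats, lons, bins=[lat_edges, lon_edges])
    return counts.astype(int)


def gridded_rmse(lats, lons, errors, lat_edges, lon_edges, min_profiles=5):
    """Per-cell RMSE; cells with fewer than `min_profiles` profiles become NaN.

    NaN cells can then be drawn in a distinct color so that "no data" is not
    confused with "low RMSE".
    """
    errors = np.asarray(errors, dtype=float)
    counts = profile_density(lats, lons, lat_edges, lon_edges)
    sq_sum, _, _ = np.histogram2d(
        lats, lons, bins=[lat_edges, lon_edges], weights=errors**2
    )
    rmse = np.full(counts.shape, np.nan)
    enough = counts >= min_profiles
    rmse[enough] = np.sqrt(sq_sum[enough] / counts[enough])
    return rmse, counts


# Hypothetical 1-degree evaluation cells over the study region.
lat_edges = np.arange(0, 41, 1)
lon_edges = np.arange(120, 161, 1)

# Three made-up profiles: two in one cell, one alone in another.
lats = np.array([0.5, 0.5, 10.5])
lons = np.array([120.5, 120.5, 130.5])
errors = np.array([3.0, 4.0, 1.0])

rmse, counts = gridded_rmse(lats, lons, errors, lat_edges, lon_edges, min_profiles=2)
print(counts[0, 0], rmse[0, 0])    # cell with two profiles
print(counts[10, 10], rmse[10, 10])  # cell with one profile is masked (NaN)
```

Plotting `counts` alongside `rmse` shows at a glance which parts of the RMSE map are supported by observations.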
Citation: https://doi.org/10.5194/essd-2025-742-RC2
Data sets
Attention Enhanced 3D-U-Net++ Ocean Temperature and Salinity Reconstruction in the Northwestern Pacific based on Transfer Learning Hao Wang et al. https://doi.org/10.57760/sciencedb.31950
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 313 | 380 | 26 | 719 | 14 | 22 |
This paper presents a 3D reconstruction method for ocean temperature and salinity in the northwestern Pacific, based on an attention-enhanced 3D U-Net++ and transfer learning. It uses real-time sea surface temperature (SST) and sea surface height (SSH) data to generate daily, high-resolution (1/4°, 5–2000 m depth) temperature-salinity fields. Although the method incorporates designs such as attention and transfer learning, these have been widely used in previous work, and the method lacks substantial innovation. Overall, this study requires significant refinement in terms of methodological detail, generalizability, and interpretability. I recommend a Major Revision.
1. Line 60 of the article states that, for data assimilation, “there remain significant challenges in accurately reproducing the vertical structures of mesoscale eddies”. However, this study uses reanalysis products based on a numerical model and data assimilation as training labels in the fine-tuning stage. Could the same inaccuracy in the vertical structure of mesoscale eddies therefore also exist in the present dataset? Please add extensive experimental analyses to explain how a model trained on “inaccurate” reanalysis labels can produce “accurate” 3D thermohaline fields.
2. Table 1 appears to contain a non-English punctuation mark, “、”.
3. The combination of U-Net and CBAM is not novel; it has been used in many previous studies [1], [2], [3]. This paper does not offer a substantial improvement and is not sufficiently innovative.
4. The two-stage transfer learning is equally uninspiring. According to Fig. 4, transfer learning improves the reconstruction accuracy by less than 10%, and the result is not listed in a table with detailed values; was this intentionally avoided? Meanwhile, as shown in Figure 9, the reconstruction results are almost no different from GLORYS, especially for salinity, and the improvement in reconstruction accuracy is extremely limited. Given this, is such a complex reconstruction process necessary? Could more detailed model tuning achieve better results? Alternatively, could a higher accuracy be reached by replacing the training labels with a reanalysis product that is more accurate than GLORYS2V4?
5. Why is a 26-day input stated to be optimal without any ablation experiments for other window lengths, such as 2, 4, 6, or 15 days, up to 100 days? Should the temporal correlation of temperature and salinity not be taken into account? Please analyze the temporal correlation of temperature and salinity in this region using historical data, add ablation experiments for multiple window lengths, and interpret the experimental results against the temporal correlation, so as to make a strong case that 26 days is the optimal choice.
6. What quality control method was applied to the profiles described in line 330? The number of selected profiles, 7833, is much smaller than the number of original profiles. Were profiles intentionally selected to favor this study? In addition, the profiles are not uniformly distributed in space, so reconstruction accuracy cannot be assessed in the large areas without profiles, and the dataset is therefore not entirely credible. How can the reconstruction accuracy of the model be verified in regions with no or sparse profiles?
7. This paper repeatedly emphasizes that the dataset has the advantage of being “real-time”. Please add detailed information on the update cycles of the various products, the hardware environment for model training and inference, and the time required. Please also list the update cycles of several mainstream ocean reanalysis products and real-time objective analysis products, and use this comparison to illustrate the “real-time” advantage of this dataset.
8. This study lacks comparisons with other mainstream ocean reanalysis products, such as HYCOM, ECCO2, ORA5, CORA2, and SODA3.
9. Some sentences are too long and could be split to improve readability.
10. As shown in Figures 7 and 8, the reconstruction results are almost identical to the GLORYS reanalysis. However, the model inputs are only sea surface temperature and sea surface height, without even sea surface salinity. How can so many small- and medium-scale details of the reanalysis product be recovered from so little information? Please give a more detailed description of the training process.
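One simple way to carry out the temporal-correlation analysis requested in point 5 is to estimate the lagged autocorrelation of a temperature or salinity anomaly series and read off its e-folding lag. The sketch below is an illustration under stated assumptions, not the authors' method: the function names are hypothetical, and the synthetic AR(1) series merely stands in for a historical daily series at a single grid point and depth.

```python
import numpy as np


def lagged_autocorrelation(series, max_lag):
    """Sample autocorrelation of a 1D series for lags 0..max_lag."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    variance_sum = np.dot(x, x)
    return np.array(
        [np.dot(x[: len(x) - k], x[k:]) / variance_sum for k in range(max_lag + 1)]
    )


def efolding_lag(acf):
    """First lag at which the autocorrelation drops below 1/e, or None."""
    below = np.nonzero(acf < 1.0 / np.e)[0]
    return int(below[0]) if below.size else None


# Synthetic stand-in for a daily anomaly series: AR(1) with assumed memory 0.9.
rng = np.random.default_rng(0)
phi = 0.9
noise = rng.standard_normal(5000)
series = np.empty_like(noise)
series[0] = noise[0]
for i in range(1, len(noise)):
    series[i] = phi * series[i - 1] + noise[i]

acf = lagged_autocorrelation(series, max_lag=100)
print("e-folding lag (days):", efolding_lag(acf))
```

Comparing such decorrelation lags, computed per depth level from historical data, with the reconstruction skill obtained for different input window lengths would show whether 26 days is physically motivated or simply a convenient choice.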
References
[1] H. Xie, Q. Xu, Y. Cheng, X. Yin, and K. Fan, “Reconstructing three-dimensional salinity field of the South China Sea from satellite observations,” Front. Mar. Sci., vol. 10, p. 1168486, May 2023, doi: 10.3389/fmars.2023.1168486.
[2] H. Xie, Q. Xu, Y. Cheng, X. Yin, and Y. Jia, “Reconstruction of Subsurface Temperature Field in the South China Sea From Satellite Observations Based on an Attention U-Net Model,” IEEE Trans. Geosci. Remote Sens., vol. 60, pp. 1–19, 2022, doi: 10.1109/TGRS.2022.3200545.
[3] J. Qi, B. Xie, D. Li, J. Chi, B. Yin, and G. Sun, “Estimating thermohaline structures in the tropical Indian Ocean from surface parameters using an improved CNN model,” Front. Mar. Sci., vol. 10, Apr. 2023, doi: 10.3389/fmars.2023.1181182.