the Creative Commons Attribution 4.0 License.
Optimal feature selection for improved ML based reconstruction of Global Terrestrial Water Storage Anomalies
Abstract. Understanding long-term terrestrial water storage (TWS) variations is vital for investigating hydrological extreme events, managing water resources, and assessing climate change impacts. However, the limited data duration from the Gravity Recovery and Climate Experiment (GRACE) and its follow-on mission (GRACE-FO) poses challenges for comprehensive long-term analysis. In this study, we reconstruct TWS anomalies (TWSA) for the period January 1960 to December 2022, thereby filling data gaps between the GRACE and GRACE-FO missions as well as generating a complete dataset for the pre-GRACE era. The workflow involves identifying optimal predictors from land surface model (LSM) outputs, meteorological variables, and climatic indices using a novel Bayesian Network (BN) technique for grid-based TWSA simulations. Climate indices, like the Oceanic Niño Index and Dipole Mode Index, are selected as optimal predictors for a large number of grids globally, along with TWSA from LSM outputs. The most effective machine learning (ML) algorithms among Convolutional Neural Network (CNN), Support Vector Regression (SVR), Extra Trees Regressor (ETR), and Stacking Ensemble Regression (SER) models are evaluated at each grid location to achieve optimal reproducibility. Globally, ETR performs best for most grids, which is also observed at the river-basin scale, particularly for the Ganga-Brahmaputra-Meghna, Godavari, Krishna, Limpopo, and Nile river basins. The simulated TWSA (BNML_TWSA) outperformed the TWSA from LSM outputs when evaluated against GRACE datasets. Improvements are particularly noted in river basins such as the Godavari, Krishna, Danube, and Amazon, with median values of the correlation coefficient, Nash-Sutcliffe efficiency, and RMSE for all grids in the Godavari basin, India, being 0.927, 0.839, and 63.7 mm, respectively.
A comparison with TWSA reconstructed in recent studies indicates that the proposed BNML_TWSA outperforms them globally as well as for all the 11 major river basins examined. The presented dataset is published at https://doi.org/10.6084/m9.figshare.25376695 (Mandal et al., 2024) and updates will be published when needed.
Status: open (until 26 Jul 2024)
RC1: 'Comment on essd-2024-109', Anonymous Referee #1, 17 Jun 2024
Review comments for the manuscript "Optimal feature selection for improved ML based reconstruction of Global Terrestrial Water Storage Anomalies" of Mandal et al.:
The manuscript describes the derivation of a global long-term terrestrial water storage anomalies data-set, derived from a blending of GRACE satellite observations and global land surface models based on Bayesian networks and machine learning methods.
Such long-term information about terrestrial water storage variations is valuable and can help to assess long-term trends and to localize extremes. Therefore, I think the manuscript is relevant for publication; however, in its current form it fails to address the uncertainties of the product, and it is structured more like a classic scientific study than a data description paper. The structure of the dataset itself is not suitable for efficient usage in its current form and needs revision. Thus, for publication in ESSD, my concerns are as follows.
1) From the title it does not become clear which dataset you would like to advertise. Should it be the TWSAs or the optimal features? I guess it's the TWSAs that you would finally like to advertise, so you should find a new title in the sense of "ML-based ... long-term terrestrial water storage anomalies from satellite and land-surface model data ...". The term "Optimal feature selection" does not generate an association with a data product, at least for me.
2) In the abstract, you write that you reconstruct TWSA, but you don't specify the grid type and the spatial resolution of the produced dataset. Do you only provide the gridded dataset or also basin aggregates? You should provide this information already in the abstract, although very briefly, so that the reader knows what to expect. Further, I suggest adding a data description section that explains the structure and content of the final data product in the repository.
3) The term "optimal predictors" is mentioned in the title and introduction, but the explanation in the methods section (3, 3.1) is not fully clear. What are the optimal predictors? Are they a subset of your full predictor list? Do you drop training data sets? Are the optimal predictors the ones that have the maximum impact (weight) in the ML algorithms? This should be made clearer, and the benefit of knowing the optimal predictors should be outlined.
4) You are not always consistent with your vocabulary: in 3.1 you introduce the term "features" for what you previously named "predictors". I think you should keep a single term here (and mention the term "feature" only once, maybe in brackets, if this is needed because it is well known in the community).
5) Reproducibility: to make the creation of your dataset reproducible, you should at least mention which software tools you used for the machine learning and possibly publish the configurations alongside your dataset or in another DOI-based repository.
6) The selection of evaluation metrics may not be ideal for the global evaluations. CC will always be high for regions with a clear annual amplitude, whereas for deserts, with less variation and seasonality, it is hard to get a good CC score. NSE is especially designed for assessing peak flows. Maybe KGE would be better suited here. And wouldn't a directed error metric like ME provide additional insight into over- or underestimation tendencies?
7) I think the introduction and the 4.6. Section could be shortened a bit in favor of a data(set) description section.
8) For many of the references, DOIs are missing. Several DOI links are incorrect, with duplicated segments in their URLs.
9) Your results should be evaluated in the light of the uncertainties of GRACE-based water storage anomalies, e.g., https://agupubs.onlinelibrary.wiley.com/doi/10.1029/2021JB022081; there are several different GRACE solutions available which have different levels of uncertainty (https://doi.org/10.1029/2023JB026908), so why did you select exactly one of them, and how would the uncertainties of the GRACE product propagate into your BNML_TWSA product? The characterization of uncertainties of your gridded and aggregated data sets would be important with respect to deriving any long-term trends.
10) The structure of the published dataset does not follow any data standards. Further, the name of the downloadable zip file, Mississipi_Data.zip, does not match the global grids it contains. You should use descriptive filenames, modern standard data formats, e.g., CDF conform (netCDF) self-describing data, GeoTIFF, ..., and a self-describing tree structure. You can get inspiration, for instance, from other publications in ESSD.
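The metrics discussed in comment 6 can be illustrated with a minimal sketch. It is not taken from the manuscript: the function names and sample values are hypothetical, and KGE is written in its original form, 1 - sqrt((r-1)^2 + (alpha-1)^2 + (beta-1)^2), with alpha the ratio of standard deviations and beta the ratio of means. Note that for anomaly series with a mean close to zero the ratio beta becomes unstable, so a modified bias term would probably be needed in practice.

```python
import math


def _mean(xs):
    return sum(xs) / len(xs)


def _std(xs):
    # Population standard deviation.
    m = _mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))


def pearson_cc(obs, sim):
    """Pearson correlation coefficient (CC)."""
    mo, ms = _mean(obs), _mean(sim)
    cov = sum((o - mo) * (s - ms) for o, s in zip(obs, sim)) / len(obs)
    return cov / (_std(obs) * _std(sim))


def nse(obs, sim):
    """Nash-Sutcliffe efficiency."""
    mo = _mean(obs)
    num = sum((o - s) ** 2 for o, s in zip(obs, sim))
    den = sum((o - mo) ** 2 for o in obs)
    return 1.0 - num / den


def kge(obs, sim):
    """Kling-Gupta efficiency, original formulation (assumes mean(obs) != 0)."""
    r = pearson_cc(obs, sim)
    alpha = _std(sim) / _std(obs)    # variability ratio
    beta = _mean(sim) / _mean(obs)   # bias ratio, unstable near zero mean
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)


def mean_error(obs, sim):
    """Directed mean error (ME): positive values indicate overestimation."""
    return _mean([s - o for o, s in zip(obs, sim)])


# Hypothetical values in mm: the simulation has a constant offset of +5.
obs = [10.0, 20.0, 30.0, 40.0]
sim = [o + 5.0 for o in obs]
print(pearson_cc(obs, sim), nse(obs, sim), kge(obs, sim), mean_error(obs, sim))
```

In this toy case CC does not react to the constant offset at all, whereas ME exposes its sign and size, which is the kind of additional insight the comment asks for.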
Minor things:
L86/87: no commas in large numbers
Table 1: Provide not only publications for the datasets but also the DOI references where they can be obtained; add the acronyms / abbreviations that you later use in the analysis and figures (e.g. Fig. 2, NTWSA, CTWSA)
L199: you name three types of ML algorithms, but then four are listed and described
L274: That's the third different usage of P in the manuscript (Probability, Precipitation, and Prediction in the evaluation metrics)
L279: A grid is usually defined as a collection of adjacent pixels. You are using the term grid instead of pixel. I suggest to change it to either pixel or grid cell / cells.
Fig. 2: Expand acronyms in the figure caption, make the caption more explanatory. From the colors it appears that several optimal predictors overlap for the same regions / pixels
Fig. 3: Avoid red and green in the same figure (colorblind check, you can use https://www.color-blindness.com/coblis-color-blindness-simulator/ for checking)
L333: grid-based -> pixel-based
Fig. 6: describe gray bars in figure caption (gaps in GRACE solutions)
Fig. 7: Change the 1:1 line to a non-dashed gray line with thicker linewidth to make it distinguishable from the data. Use colors with better contrast for BNML and CTWSA
Abstract L18: remove "and updates will be published when needed"
Citation: https://doi.org/10.5194/essd-2024-109-RC1