the Creative Commons Attribution 4.0 License.
GSSM: A global seamless soil moisture dataset from 1981 to 2022 matching CCI to SMAP with a novel bias correction method
Abstract. Surface soil moisture is vital for Earth's environmental and energy cycles. However, it is still rare to have remote sensing soil moisture data with a long-term temporal extent, a global seamless spatial coverage, and a near-real-time update frequency. Here, we provided a global seamless soil moisture dataset from July 1981 to December 2022, matching CCI with SMAP through a novel soil moisture data bias correction method (fitting beta CDF matching, BCDF), and filling the gaps of corrected soil moisture through XGBoost Algorithms along with various soil moisture covariates. The new soil moisture dataset was abbreviated as GSSM and it has been validated with in situ observations, original CCI and SMAP data, and simulated gap areas. Results demonstrated that 1) the GSSM has similar accuracy with the SMAP and they are both more accurate than the original CCI data as compared with in situ observations at 399 global sites (averaged R=0.72, averaged ubRMSE<0.05); 2) the GSSM has the global spatial coverage, while filling the gaps of original CCI data through various soil moisture covariates (in artificial gaps verification, averaged R>0.86, averaged ubRMSE<0.04); 3) the GSSM has the same temporal variation characteristics with the original CCI dataset, while it can be combined with SMAP to obtain a long-term and near-real-time soil moisture dataset. Thus, GSSM provides long-term and seamless soil moisture data, paving the way for environmental disaster and water cycle process research.
Status: closed
RC1: 'Comment on essd-2024-200', Anonymous Referee #1, 15 Oct 2024
The study proposes a new dataset, called GSSM ("Gap-filled S, which first scales ESA CCI COMBINED v8.1 soil moisture against 9 km SMAP observations and then fills gaps using an XGBoost machine learning approach with environmental drivers as input. Although gap-filling remotely sensed (soil moisture) data can be useful for several applications, and developing enhanced gap-filling approaches is an important field of research, the proposed methodology and evaluation do not allow any corroborated conclusions to be drawn on the quality of the dataset.
First of all, the authors do not seem to be familiar with the data they are trying to gap-fill and hence the motivation of the study is unsound. As described in the product documentation, ESA CCI v8.1 already assimilates SMAP observations (e.g. https://catalogue.ceda.ac.uk/uuid/ff890589c21f4033803aa550f52c980c/). Instead, the authors write (line 80) that "... SMAP data have the potential to be integrated into existing long-term ESA CCI products to form a more reliable and useful product...".
Most of the paper (the methodology, results and discussion sections alike) is about the bias correction between ESA CCI and SMAP. First of all, the justification of this step is unclear: is it because you want to provide ESA CCI in the climatology of the 9 km NASA SMAP product? As mentioned before, SMAP is already used in ESA CCI. Second, the role of daily versus monthly data is unintelligible: is the bias correction based on monthly resampled data? If so, how was this resampling done? Or was the bias correction based on monthly data and the scaling functions then applied to daily data, as suggested in line 165? Note that monthly data are much smoother and have fewer extreme values than daily data, so one cannot easily transfer bias correction functions from monthly to daily data. This is all very confusing.
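The point about monthly versus daily data can be illustrated with a minimal sketch on synthetic data. Everything here is an assumption for illustration (a 30-day month, a sinusoidal seasonal cycle, Gaussian day-to-day noise); it does not reproduce the manuscript's resampling, which is not described.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: 24 "months" of 30 days each (simplified calendar).
n_months, days_per_month = 24, 30
t = np.arange(n_months * days_per_month)

# Synthetic daily soil moisture: seasonal cycle plus day-to-day noise.
seasonal = 0.25 + 0.05 * np.sin(2 * np.pi * t / 360)
daily = seasonal + 0.04 * rng.standard_normal(t.size)

# Monthly resampling by simple averaging of each 30-day block.
monthly = daily.reshape(n_months, days_per_month).mean(axis=1)

daily_range = daily.max() - daily.min()
monthly_range = monthly.max() - monthly.min()

print(f"daily   std={daily.std():.4f}  range={daily_range:.4f}")
print(f"monthly std={monthly.std():.4f}  range={monthly_range:.4f}")
```

Because averaging narrows the distribution, a CDF-matching function fitted on the monthly values only covers part of the daily value range, so applying it to daily data would require extrapolation in the tails.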
The success of the bias correction is assessed by validating the native and the bias-corrected CCI data against SMAP data, which previously served as the scaling reference. No wonder that scores like the bias and the RMSE improve (e.g. Fig. 5). You basically compare SMAP with SMAP. The fact that the R and ubRMSE do not change with scaling tells you that the scaling itself does not significantly improve the dataset. As a consequence, the entire section can be considered redundant.
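A minimal sketch of why validating against the scaling reference is circular is given here. It uses synthetic data and a simple mean and standard deviation matching as a stand-in for the manuscript's BCDF method (an assumption, not the authors' implementation). The function name `metrics` and all numbers are hypothetical.

```python
import numpy as np

def metrics(est, ref):
    """Return bias, RMSE, ubRMSE and Pearson R of est against ref."""
    est = np.asarray(est, dtype=float)
    ref = np.asarray(ref, dtype=float)
    diff = est - ref
    bias = diff.mean()
    rmse = np.sqrt(np.mean(diff ** 2))
    ubrmse = np.sqrt(np.mean((diff - bias) ** 2))
    r = np.corrcoef(est, ref)[0, 1]
    return {"bias": bias, "rmse": rmse, "ubrmse": ubrmse, "r": r}

rng = np.random.default_rng(0)
# Synthetic stand-ins: "ref" plays the role of the scaling reference,
# "raw" is a biased product with a different dynamic range.
ref = 0.25 + 0.08 * rng.standard_normal(500)
raw = 0.05 + 1.3 * ref + 0.03 * rng.standard_normal(500)

# Linear rescaling: match mean and standard deviation to the reference.
scaled = (raw - raw.mean()) / raw.std() * ref.std() + ref.mean()

before = metrics(raw, ref)
after = metrics(scaled, ref)
print(before)
print(after)
```

In this sketch the bias against the reference vanishes by construction and the RMSE drops, while Pearson R is unchanged because a positive linear transformation cannot alter it. The identity RMSE² = bias² + ubRMSE² holds in both cases, so the improvement shows only that the scaling was applied, not that the data are closer to the truth.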
Why are some results only assessed for selected regions or some hard conclusions (e.g. on the performance of the various CDF-matching implementations) even based on only five points world-wide?
line 165: step 4 in the methodology mentions the application of a post-processing freeze/thaw masking to the gap-filled data. Why? One of the main reasons for the initial gaps in ESA CCI is the flagging of spurious retrievals under frozen conditions. This is why the data are masked for such events, so gaps are there for a good reason. To me, reintroducing gaps after all the gap-filling effort is questionable.
line 201-202: if there is only one observation pair between ESA CCI and SMAP you use a nearest neighbour for scaling. But if you have 2 or 3 or another limited number of observations, how reliable is your standard deviation in those cases?
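How unstable a standard deviation estimated from very few pairs can be is sketched below with a small simulation. It assumes independent, normally distributed samples and a hypothetical true standard deviation of 0.05; real soil moisture series are autocorrelated, which would generally make matters worse.

```python
import numpy as np

rng = np.random.default_rng(3)
true_std = 0.05  # hypothetical "true" standard deviation

def std_spread(n, trials=20000):
    """Relative spread of the sample std estimated from n values."""
    samples = rng.normal(0.25, true_std, size=(trials, n))
    s = samples.std(axis=1, ddof=1)   # unbiased-variance sample std
    return s.std() / true_std         # scatter of the estimate, in units of true_std

spread = {n: std_spread(n) for n in (2, 3, 10, 30)}
for n, value in spread.items():
    print(f"n={n:2d}: relative scatter of sample std = {value:.2f}")
```

With two or three observations the scatter of the estimated standard deviation is a large fraction of the quantity itself, so any scaling parameter derived from it is poorly constrained.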
The data sections are very short, hardly any dataset characteristics are given, e.g. what sensors were used in ESA CCI, what flags were applied, what sort of gaps are found and what are their causes? A more careful examination of the input dataset characteristics could have prevented many of the deficient analyses made and conclusions drawn in this paper.
A proper validation of the gap-filling performance, which should be the core of the paper, is entirely lacking. In section 3.4 you perform a presumed validation of the gap-filling method, but this approach is incorrect: From the scaled and gap-filled data (you call them "Original values") you remove some regions and then use XGBoost again to predict these removed areas ("Predicted value"). Next, you compare the "Original values" with the "Predicted values". Obviously, it's not surprising that these values correspond very well, as you basically assess how well XGBoost repredicts the predicted values. This is not a sound validation of the accuracy of the filled gaps and it's not surprising at all that scores are so high.
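One way to set up an artificial-gap test that avoids this circularity is sketched below: only genuinely observed cells are withheld, the model is fitted without them, and predictions are compared with the withheld observations. The data are synthetic, the covariates are hypothetical, and ordinary least squares is used as a stand-in for XGBoost so the example has no extra dependencies.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000

# Hypothetical covariates and synthetic soil moisture retrievals.
X = rng.standard_normal((n, 3))
truth = 0.25 + X @ np.array([0.04, -0.02, 0.01])
obs = truth + 0.01 * rng.standard_normal(n)

# False marks a genuine gap; those cells are never used as "truth".
observed = rng.random(n) > 0.3

# Withhold 20 % of the genuinely observed cells as artificial gaps.
obs_idx = np.flatnonzero(observed)
held_out = rng.choice(obs_idx, size=obs_idx.size // 5, replace=False)
train = np.setdiff1d(obs_idx, held_out)

# Fit the stand-in model on training cells only.
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A[train], obs[train], rcond=None)
pred = A[held_out] @ coef

# Score predictions against the withheld observations.
err = pred - obs[held_out]
ubrmse = np.sqrt(np.mean((err - err.mean()) ** 2))
r = np.corrcoef(pred, obs[held_out])[0, 1]
print(f"R={r:.3f}  ubRMSE={ubrmse:.4f}")
```

The key design choice is that the reference values in the comparison are retrievals the model has never seen, rather than values produced by an earlier run of the same model.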
The authors spend many words on a very generic introduction, e.g. about microwave sensors that are available for the retrieval of soil moisture. Besides, the introduction contains several false claims, e.g. line 43-44: "there are three methods to obtain high-accuracy soil moisture data with global seamless spatial characteristics... "traditional ground-based measurements ..." -> In situ data are not seamless at all; line 58: it is claimed that remote sensing "...has become the most promising way to obtain data in long-term series, near-real-time, and high spatial coverage", yet most studies still show the superiority of reanalysis data over remote sensing soil moisture.
Line 73: CCI data are updated only once a year through ESA, that's correct. However, C3S is responsible for its operational production and regular update, every 10 days.
Individual networks of the ISMN shall be all properly acknowledged.
The discussion section mostly presents new results, not a discussion.
Citation: https://doi.org/10.5194/essd-2024-200-RC1
RC2: 'Comment on essd-2024-200', Anonymous Referee #2, 11 Nov 2024
This manuscript presents a dataset called 'GSSM: Global Seamless Soil Moisture'. It is derived by applying bias correction to the ESA CCI product using SMAP as a reference, followed by gap-filling with the XGBoost method which utilizes environmental variables as input features. The final product is at a 0.25° spatial and monthly temporal resolution.
While there are multiple global soil moisture fusion products, such as SMOPS, which provide high accuracy and daily resolution, the specific added value of this product remains unclear, particularly with a monthly temporal resolution.
The descriptions of the CCI and SMAP products should be expanded, especially given that the latest CCI version includes SMAP data. Additionally, since SMAP descending data typically shows higher accuracy than ascending, please provide justification for using ascending data. Note that the SPL3SMP_E product exclusively employs radiometer measurements.
Throughout the manuscript, daily and monthly GSSM data are utilized in the results and discussion sections. Section 2.2 and Figure 2 require a more detailed and justified explanation of the procedures involved. The validation of bias-corrected data against SMAP is redundant since SMAP serves as the reference; thus, it is expected that the corrected data will align more closely with it. It is essential to use an independent dataset, such as ISMN, for validation.
For the gap-filling approach, it would be valuable to compare XGBoost with simpler numerical methods to better demonstrate the advantages of XGBoost.
Further detailed comments are provided below.
Lines 43-45, ground-based measurements are not seamless and globally accessible.
Line 48, it should be ‘land surface model’
Line 53, you mean ‘climatology’ here?
Line 66, please describe the latency if you would like to emphasize ‘near-real-time’. Also, it does not make sense for a monthly product to provide near real time information.
Line 76, the missing data are mostly before 2002 and due to quality control.
Lines 141-142, it should be Table 2. What is ‘monthly fusion’? What methods are used for resampling?
Table 2, NDVI data link provided is not working.
Lines 149-151, this sentence is not clear; please rewrite it.
Line 153, ‘Only validation...’, this sentence is unclear.
Figure 1, showing networks is not preferred because many sites across different land conditions and regions are within one network. Please show sites instead. Also, the base map should be land cover or NDVI. DEM does not provide too much helpful information here.
Line 200, ‘LR’ is not defined beforehand.
Lines 201-202, it is unclear how this is done properly. If there is only one value, which typically means there isn’t enough high-quality data here, then gap-filling can just lead to unreliable interpolated values.
Line 239, different terms have been used throughout the manuscript when referring to the generated product, it is unclear how these results are generated at different stages. Here, you are referring to it as ‘CCI/SMAP’.
Lines 258-259, how accurate are these filled values? And are they going to be masked out considering freezing conditions?
Lines 282-283, why are daily values used here again?
Figure 5 caption, do you mean ‘BCDF’?
Lines 296-297, it is unclear how this conclusion was derived based on the results.
Figure 6 caption, this is a scatterplot, not spatiotemporal analysis.
Line 318, it should be BCDF-corrected CCI. Again, another term is used here.
Citation: https://doi.org/10.5194/essd-2024-200-RC2
Data sets
GSSM: A global long term seamless soil moisture dataset (1981-2022), Hao Sun and Yunjia Wang, https://data.tpdc.ac.cn/en/disallow/0f28a9b5-92eb-470a-80fe-472aa50a136f