GSSM: A global seamless soil moisture dataset from 1981 to 2022 matching CCI to SMAP with a novel bias correction method

Wang, Yunjia; Sun, Hao; Xu, Zhenheng; Gao, Jinhua; Xu, Huanyu; Zhang, Tian; Wu, Dan

doi:https://doi.org/10.5194/essd-2024-200

Preprints

15 Jul 2024

Status: this discussion paper is a preprint. It has been under review for the journal Earth System Science Data (ESSD). The manuscript was not accepted for further review after discussion.

Abstract. Surface soil moisture is vital for Earth's environmental and energy cycles. However, it is still rare to have remote sensing soil moisture data with a long-term temporal extent, a global seamless spatial coverage, and a near-real-time update frequency. Here, we provided a global seamless soil moisture dataset from July 1981 to December 2022, matching CCI with SMAP through a novel soil moisture data bias correction method (fitting beta CDF matching, BCDF), and filling the gaps of corrected soil moisture through XGBoost Algorithms along with various soil moisture covariates. The new soil moisture dataset was abbreviated as GSSM and it has been validated with in situ observations, original CCI and SMAP data, and simulated gap areas. Results demonstrated that 1) the GSSM has similar accuracy with the SMAP and they are both more accurate than the original CCI data as compared with in situ observations at 399 global sites (averaged R=0.72, averaged ubRMSE<0.05); 2) the GSSM has the global spatial coverage, while filling the gaps of original CCI data through various soil moisture covariates (in artificial gaps verification, averaged R>0.86, averaged ubRMSE<0.04); 3) the GSSM has the same temporal variation characteristics with the original CCI dataset, while it can be combined with SMAP to obtain a long-term and near-real-time soil moisture dataset. Thus, GSSM provides long-term and seamless soil moisture data, paving the way for environmental disaster and water cycle process research.

Received: 24 May 2024 – Discussion started: 15 Jul 2024

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Status: closed

RC1: 'Comment on essd-2024-200', Anonymous Referee #1, 15 Oct 2024

The study proposes a new dataset, called GSSM ("Gap-filled S, which first scales ESA CCI COMBINED v8.1 soil moisture against 9km SMAP observations and then fills gaps using an XGBoost machine learning approach with environmental drivers as input. Although gap filling remotely-sensed (soil moisture) data can be useful for several applications and developing enhanced gap-filling approaches is an important field of research, the proposed methodology and evaluation do not allow to draw any corroborated conclusions on the quality of the dataset.
First of all, the authors do not seem to be familiar with the data they are trying to gap-fill and hence the motivation of the study is unsound. As described in the product documentation, ESA CCI v8.1 already assimilates SMAP observations (e.g. https://catalogue.ceda.ac.uk/uuid/ff890589c21f4033803aa550f52c980c/). Instead, the authors write (line 80) that "... SMAP data have the potential to be integrated into existing long-term ESA CCI products to form a more reliable and useful product...".
Most of the paper (the methodology, results, and discussion sections alike) is about the bias correction between ESA CCI and SMAP. First, the justification of this step is unclear: is it because you want to provide ESA CCI in the climatology of the 9 km NASA SMAP product? As mentioned before, SMAP is already used in ESA CCI. Second, the role of daily versus monthly data is unintelligible: is the bias correction based on monthly resampled data? If so, how was this resampling done? Or was the bias correction based on monthly data and the scaling functions then applied to daily data, as suggested in line 165? Note that monthly data are much smoother and have fewer extreme values than daily data, so one cannot easily transfer bias correction functions from monthly to daily data. This is all very confusing.
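The point about monthly versus daily data can be illustrated with a minimal sketch. Everything here is an assumption for illustration only: the synthetic series, the noise level, and the use of 30-day blocks in place of calendar months. None of it is taken from the manuscript.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical daily soil moisture (m3/m3): seasonal cycle plus day-to-day noise.
days = np.arange(360)
daily = 0.25 + 0.08 * np.sin(2 * np.pi * days / 360) + rng.normal(0.0, 0.04, days.size)

# Monthly means, using 30-day blocks as a stand-in for calendar months.
monthly = daily.reshape(12, 30).mean(axis=1)

daily_range = daily.max() - daily.min()
monthly_range = monthly.max() - monthly.min()

print(f"daily   std={daily.std():.3f}  range={daily_range:.3f}")
print(f"monthly std={monthly.std():.3f}  range={monthly_range:.3f}")
```

Because the monthly series has a narrower distribution, a distribution-matching function fitted on it would have to extrapolate in the tails when applied to daily values.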
The success of the bias correction is assessed by validating the native and the bias-corrected CCI data against SMAP data, which previously served as the scaling reference. No wonder that scores like the bias and the RMSE improve (e.g. Fig. 5). You basically compare SMAP with SMAP. The fact that the R and ubRMSE do not change with scaling tells you that the scaling itself does not significantly improve the dataset. As a consequence, the entire section can be considered redundant.
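The circularity can be sketched with a simple linear (mean and standard deviation) rescaling, which is a stand-in chosen for brevity and not the BCDF method of the manuscript. The synthetic "reference" and "candidate" series and the `metrics` helper are hypothetical. After rescaling to the reference, the bias against that same reference vanishes by construction, while the Pearson R cannot change under a positive linear transform.

```python
import numpy as np

def metrics(est, ref):
    """Return bias, RMSE, ubRMSE and Pearson R of est against ref."""
    diff = est - ref
    bias = diff.mean()
    rmse = np.sqrt((diff ** 2).mean())
    ubrmse = np.sqrt(((diff - bias) ** 2).mean())
    r = np.corrcoef(est, ref)[0, 1]
    return bias, rmse, ubrmse, r

rng = np.random.default_rng(1)
truth = rng.uniform(0.05, 0.45, 500)
# Stand-in for the scaling reference (assumed noisy but unbiased).
reference = truth + rng.normal(0.0, 0.03, 500)
# Stand-in for the product to be corrected (assumed biased and compressed).
candidate = 0.6 * truth + 0.12 + rng.normal(0.0, 0.03, 500)

# Linear rescaling: match the mean and standard deviation of the reference.
scaled = (candidate - candidate.mean()) / candidate.std() * reference.std() + reference.mean()

before = metrics(candidate, reference)
after = metrics(scaled, reference)
for name, b, a in zip(("bias", "RMSE", "ubRMSE", "R"), before, after):
    print(f"{name:7s} before={b:+.4f} after={a:+.4f}")
```

Scoring the rescaled series against an independent data set, rather than against the scaling reference, is the only way such a comparison says anything about accuracy.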
Why are some results only assessed for selected regions or some hard conclusions (e.g. on the performance of the various CDF-matching implementations) even based on only five points world-wide?
Line 165: step 4 in the methodology mentions the application of a post-processing freeze/thaw masking to the gap-filled data. Why? One of the main reasons for the initial gaps in ESA CCI is the flagging of spurious retrievals under frozen conditions. This is why the data are masked for such events, so the gaps are there for a good reason. To me, reintroducing gaps after all the gap-filling effort is questionable.
Lines 201-202: if there is only one observation pair between ESA CCI and SMAP, you use a nearest neighbour for scaling. But if you have 2 or 3 or another limited number of observations, how reliable is your standard deviation in those cases?
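How unreliable a standard deviation from a handful of pairs can be is easy to see in a small simulation. This is a sketch under assumed conditions (normally distributed values with a true standard deviation of 0.05 m3/m3); the function name `std_interval` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
true_std = 0.05  # assumed true standard deviation (m3/m3)

def std_interval(n, trials=5000):
    """5th and 95th percentile of the sample std estimated from n values."""
    samples = rng.normal(0.25, true_std, size=(trials, n))
    estimates = samples.std(axis=1, ddof=1)
    return np.percentile(estimates, [5, 95])

widths = {}
for n in (2, 3, 10, 30):
    low, high = std_interval(n)
    widths[n] = high - low
    print(f"n={n:2d}: 90% of std estimates fall in [{low:.3f}, {high:.3f}]")
```

With two or three samples the estimate can be off by a large factor in either direction, so any scaling that divides by it inherits that uncertainty.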
The data sections are very short, hardly any dataset characteristics are given, e.g. what sensors were used in ESA CCI, what flags were applied, what sort of gaps are found and what are their causes? A more careful examination of the input dataset characteristics could have prevented many of the deficient analyses made and conclusions drawn in this paper.
A proper validation of the gap-filling performance, which should be the core of the paper, is entirely lacking. In section 3.4 you perform a presumed validation of the gap-filling method, but this approach is incorrect: from the scaled and gap-filled data (you call them "Original values") you remove some regions and then use XGBoost again to predict these removed areas ("Predicted values"). Next, you compare the "Original values" with the "Predicted values". It is not surprising that these values correspond very well, as you basically assess how well XGBoost repredicts the predicted values. This is not a sound validation of the accuracy of the filled gaps, and it is not surprising at all that the scores are so high.
The authors spend many words on a very generic introduction, e.g. about microwave sensors that are available for the retrieval of soil moisture. Besides, the introduction contains several false claims. For example, lines 43-44 state that "there are three methods to obtain high-accuracy soil moisture data with global seamless spatial characteristics...", among them "traditional ground-based measurements ...", yet in situ data are not seamless at all. In line 58 it is claimed that remote sensing "...has become the most promising way to obtain data in long-term series, near-real-time, and high spatial coverage", yet most studies still show the superiority of reanalysis data over remote sensing soil moisture.
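A non-circular variant of the artificial-gap test withholds only cells that carry an original retrieval, so that the scores are computed against observations the model never saw. The sketch below shows just the masking step; `make_artificial_gaps`, the random field, and the 20 % hold-out fraction are hypothetical choices. Scattered random cells are the simplest case, and withholding contiguous regions, as in the manuscript's experiment, would be a harder and more realistic test.

```python
import numpy as np

def make_artificial_gaps(field, frac=0.2, seed=0):
    """Withhold a random fraction of the *observed* cells of a 2-D field.

    field: array with NaN where the original product has no retrieval.
    Returns (training_field, holdout_mask).
    """
    rng = np.random.default_rng(seed)
    observed = np.flatnonzero(~np.isnan(field))
    n_hold = int(round(frac * observed.size))
    chosen = rng.choice(observed, size=n_hold, replace=False)
    holdout = np.zeros(field.shape, dtype=bool)
    holdout[np.unravel_index(chosen, field.shape)] = True
    training = field.copy()
    training[holdout] = np.nan  # hide the withheld retrievals from the model
    return training, holdout

rng = np.random.default_rng(3)
field = rng.uniform(0.05, 0.45, size=(20, 30))
field[rng.random(field.shape) < 0.3] = np.nan  # pre-existing gaps

training, holdout = make_artificial_gaps(field)
print(f"observed cells: {(~np.isnan(field)).sum()}, withheld: {holdout.sum()}")
```

A gap-filling model would then be fitted on `training` and scored only on `field[holdout]`.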
Line 73: CCI data are updated only once a year through ESA, that's correct. However, C3S is responsible for its operational production and regular update, every 10 days.
Individual networks of the ISMN shall be all properly acknowledged.
The discussion section mostly presents new results, not a discussion.

Citation: https://doi.org/10.5194/essd-2024-200-RC1
RC2: 'Comment on essd-2024-200', Anonymous Referee #2, 11 Nov 2024

This manuscript presents a dataset called 'GSSM: Global Seamless Soil Moisture'. It is derived by applying bias correction to the ESA CCI product using SMAP as a reference, followed by gap-filling with the XGBoost method which utilizes environmental variables as input features. The final product is at a 0.25° spatial and monthly temporal resolution.
While there are multiple global soil moisture fusion products, such as SMOPS, which provide high accuracy and daily resolution, the specific added value of this product remains unclear, particularly with a monthly temporal resolution.
The descriptions of the CCI and SMAP products should be expanded, especially given that the latest CCI version includes SMAP data. Additionally, since SMAP descending data typically shows higher accuracy than ascending, please provide justification for using ascending data. Note that the SPL3SMP_E product exclusively employs radiometer measurements.
Throughout the manuscript, daily and monthly GSSM data are utilized in the results and discussion sections. Section 2.2 and Figure 2 require a more detailed and justified explanation of the procedures involved. The validation of bias-corrected data against SMAP is redundant since SMAP serves as the reference; thus, it is expected that the corrected data will align more closely with it. It is essential to use an independent dataset, such as ISMN, for validation.
For the gap-filling approach, it would be valuable to compare XGBoost with simpler numerical methods to better demonstrate the advantages of XGBoost.
Further detailed comments are provided below.
Lines 43-45, ground-based measurements are not seamless and globally accessible.
Line 48, it should be ‘land surface model’
Line 53, you mean ‘climatology’ here?
Line 66, please describe the latency if you would like to emphasize ‘near-real-time’. Also, it does not make sense for a monthly product to provide near real time information.
Line 76, the missing data are mostly before 2002 and due to quality control.
Lines 141-142, it should be Table 2. What is ‘monthly fusion’? What methods are used for resampling?
Table 2, NDVI data link provided is not working.
Lines 149-151, this sentence is not clear; please rewrite it.
Line 153, ‘Only validation...’, this sentence is unclear.
Figure 1, showing networks is not preferred because many sites across different land conditions and regions are within one network. Please show sites instead. Also, the base map should be land cover or NDVI. DEM does not provide too much helpful information here.
Line 200, ‘LR’ is not defined beforehand.
Lines 201-202, It is unclear how this is done properly. If there is only one value, which typically means there isn’t enough high-quality data here, then gap-filling can just lead to unreliable interpolated values.
Line 239, different terms have been used throughout the manuscript when referring to the generated product, it is unclear how these results are generated at different stages. Here, you are referring to it as ‘CCI/SMAP’.
Lines 258-259, how accurate are these filled values? And are they going to be masked out considering freezing conditions?
Lines 282-283, why are daily values used here again?
Figure 5 caption, do you mean ‘BCDF’?
Lines 296-297, it is unclear how this conclusion was derived based on the results.
Figure 6 caption, this is a scatterplot, not spatiotemporal analysis.
Line 318, it should be BCDF-corrected CCI. Again, another term is used here.

Citation: https://doi.org/10.5194/essd-2024-200-RC2

Data sets

GSSM: A global long term seamless soil moisture dataset (1981-2022)
Hao Sun and Yunjia Wang
https://data.tpdc.ac.cn/en/disallow/0f28a9b5-92eb-470a-80fe-472aa50a136f

Viewed

Total article views: 1,600 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	BibTeX	EndNote
1,271	248	81	1,600	53	73

Views and downloads (calculated since 15 Jul 2024)

Month	HTML	PDF	XML	Total
Jul 2024	167	40	10	217
Aug 2024	120	22	7	149
Sep 2024	74	11	2	87
Oct 2024	85	22	3	110
Nov 2024	79	10	1	90
Dec 2024	73	14	0	87
Jan 2025	47	10	42	99
Feb 2025	36	12	6	54
Mar 2025	29	15	1	45
Apr 2025	46	19	0	65
May 2025	33	8	3	44
Jun 2025	39	12	1	52
Jul 2025	43	11	4	58
Aug 2025	69	13	0	82
Sep 2025	301	15	1	317
Oct 2025	30	14	0	44

Viewed (geographical distribution)

Total article views: 1,567 (including HTML, PDF, and XML), thereof 1,567 with geography defined and 0 with unknown origin.

Cited

Latest update: 31 Oct 2025

Short summary

We propose a novel matching method that preserves the characteristics of the soil moisture time series, and we use a machine learning model to fill gaps in the corrected data, addressing the low spatial coverage of soil moisture products. The resulting dataset is the long-term seamless CCI/SMAP monthly soil moisture product (GSSM). With it, researchers can benefit from both a long time range and high spatial coverage.


Total:	0
HTML:	0
PDF:	0
XML:	0

GSSM: A global seamless soil moisture dataset from 1981 to 2022 matching CCI to SMAP with a novel bias correction method

Data sets

Viewed

Viewed (geographical distribution)

Cited

1 citation as recorded by Crossref.