An 8-day composited 36 km SMAP soil moisture dataset from 1979 to 2015 produced using a random forest and historical CCI data

Yang, Haoxuan; Wang, Qunming; Zhao, Wei; Atkinson, Peter

doi:10.5194/essd-2022-426

Preprints

https://doi.org/10.5194/essd-2022-426

14 Feb 2023

Status: this preprint has been withdrawn by the authors.

Abstract. Soil moisture (SM) plays a significant role in many natural and anthropogenic systems which are essential to supporting life on Earth. Thus, accurate measurement and assessment of changes in soil moisture globally is of great value, including long-term historical assessment. Since the on-board cycle and detailed parameters of disparate sensors are different, the European Space Agency established the Climate Change Initiative (CCI) program to harmonize the available multisource SM data, producing long time-series surface SM datasets starting from 1978 to the present. However, the Soil Moisture Active Passive (SMAP) mission, launched in 2015, has shown more satisfactory performance both in spatial accuracy and in capturing the pattern of temporal changes. In this paper, a random forest (RF) model was proposed to extend the superior SMAP dataset historically (named RF_SMAP), using the corresponding CCI data time-series. We assumed that the temporal changes in the SMAP dataset are generally similar to those in the available CCI dataset. Accordingly, the RF model was constructed using the temporal characteristics extracted from the CCI SM v05.2 data (coupled with three terrain characteristics and two location characteristics), and was then transferred to the prediction of the RF_SMAP dataset. The available in-situ SM data and the real SMAP data from April 2015 to April 2016 were used as references to validate the predicted RF_SMAP data. It was shown that, compared with the CCI dataset, the predicted RF_SMAP dataset is closer to the in-situ SM data and the real SMAP data. Moreover, the historical RF_SMAP dataset is more accurate than the widely used Global Land Evaporation Amsterdam Model (GLEAM) dataset in terms of average root mean square error (RMSE), bias (Bias), and Kling-Gupta efficiency (KGE). Thus, the RF_SMAP dataset was shown to be a reliable substitute for the historical CCI dataset, with an unbiased root mean square error (ubRMSE) of 0.035.
The new long time-series RF_SMAP dataset, which will be available to download, will be of great value for a range of research in applications such as climate assessment, agricultural planning, food insecurity monitoring and drought assessment and monitoring.
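
For readers unfamiliar with the accuracy measures named in the abstract, the following is a minimal sketch of how Bias, RMSE, ubRMSE and KGE are commonly computed. It assumes the original 2009 formulation of KGE and a population standard deviation; the preprint may use a different variant, and the function names and sample values below are illustrative only, not taken from the paper.

```python
import math

def _mean(values):
    return sum(values) / len(values)

def _std(values):
    # Population standard deviation (an assumption; the paper may differ).
    m = _mean(values)
    return math.sqrt(_mean([(v - m) ** 2 for v in values]))

def bias(pred, ref):
    """Mean difference, prediction minus reference."""
    return _mean([p - r for p, r in zip(pred, ref)])

def rmse(pred, ref):
    """Root mean square error."""
    return math.sqrt(_mean([(p - r) ** 2 for p, r in zip(pred, ref)]))

def ubrmse(pred, ref):
    """RMSE after removing the mean bias, so RMSE^2 = ubRMSE^2 + Bias^2."""
    b = bias(pred, ref)
    return math.sqrt(_mean([(p - r - b) ** 2 for p, r in zip(pred, ref)]))

def pearson_r(x, y):
    mx, my = _mean(x), _mean(y)
    cov = _mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / (_std(x) * _std(y))

def kge(pred, ref):
    """Kling-Gupta efficiency: 1 - sqrt((r-1)^2 + (alpha-1)^2 + (beta-1)^2)."""
    r = pearson_r(pred, ref)
    alpha = _std(pred) / _std(ref)    # variability ratio
    beta = _mean(pred) / _mean(ref)   # mean ratio
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)

# Invented sample values: a prediction that is uniformly 0.02 too wet.
reference = [0.10, 0.20, 0.30, 0.25]
predicted = [v + 0.02 for v in reference]
print(bias(predicted, reference), rmse(predicted, reference),
      ubrmse(predicted, reference), kge(predicted, reference))
```

In this invented case the whole error is a constant offset, so RMSE equals the bias while ubRMSE is essentially zero, which is why ubRMSE is often reported alongside RMSE for soil moisture products.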

Received: 07 Dec 2022 – Discussion started: 14 Feb 2023

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Download & links

Preprint (PDF, 2452 KB)

Interactive discussion

Status: closed

RC1:
'Comment on essd-2022-426', Anonymous Referee #1, 15 Mar 2023

The comment was uploaded in the form of a supplement: https://essd.copernicus.org/preprints/essd-2022-426/essd-2022-426-RC1-supplement.pdf

Citation: https://doi.org/10.5194/essd-2022-426-RC1
- CC2: 'Reply on RC1', Haoxuan Yang, 22 Mar 2023
  
  This study develops a new global long-term soil moisture dataset by extending the SMAP data back into the ESA CCI era. To do this, the authors train a random forest with historical CCI data and apply the trained model to estimate SMAP soil moisture under the assumption that CCI and SMAP soil moisture have similar temporal variability. Such a dataset is valuable and relevant for a variety of climatological and hydrological applications.
  Reply:
  Dear Referee
  Thank you for your comments. The comments are undoubtedly helpful to improve the quality of the paper. Accordingly, we have analyzed the comments carefully and provided the response below. The complete comments and figures can be seen in the supplement.
  
  However, the main assumption made in this study requires more careful investigation. The authors present the temporal variability of CCI and SMAP over an overlapping period of 4-5 years, but most of the selected sites appear to be in arid regions (Fig. 2); comparison needs to be made in a more comprehensive way, including humid and/or high latitude regions with relatively high variability of soil moisture. Moreover, how can we ensure that this similarity is preserved also during the previous ~30 years of the CCI era?
  Reply:
  In fact, these five pixels in Figure 2 were randomly selected. We consider that this issue can be solved by providing a more complete description in Section 2.2.3. Specifically, we are going to add pixels from regions with different humidity and latitudes to show the temporal pattern of SM in Figure 2.
  As for the preservation of similarity during the previous ~30 years, we can explain this point from three aspects. First, the similarity of the CCI and SMAP data from 2015 to 2019 is shown in Figure 2, and it is already strong. Second, we have found that both the CCI and predicted SMAP data remain consistently similar to the in-situ data over a long period of about 20 years (i.e., from 1996 to 2015, as the earliest in-situ data began in 1996). Third, for the period before 1996, although no in-situ or SMAP data were available for comparison with the CCI data, the experimental results indicated that the general temporal profiles of CCI and predicted RF_SMAP are similar. Thus, we believe that the similarity is also preserved over the 30 years.
  To support the main assumption (similarity of the CCI and SMAP datasets), Figure 2 (see supplement) was modified and is provided here in advance. Based on the original 6 pixels, we supplied 12 additional pixels (16 pixels in total). The pixels are randomly distributed and include arid regions (e.g., Pixel 1 and 9), high latitudes (e.g., Pixel 10), and high altitudes (e.g., Pixel 12).
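
One simple way to put a number on the similarity discussed above is the Pearson correlation between the CCI and SMAP series of a pixel, computed only over time steps where both products have a value. This is a sketch of that idea, not the procedure used in the preprint; the function names and the sample values are invented for illustration.

```python
import math

def paired_valid(series_a, series_b):
    """Keep only time steps where both series have a value (None marks a gap)."""
    return [(a, b) for a, b in zip(series_a, series_b)
            if a is not None and b is not None]

def pearson_r(series_a, series_b):
    """Pearson correlation over the overlapping valid samples."""
    pairs = paired_valid(series_a, series_b)
    if len(pairs) < 2:
        raise ValueError("need at least two overlapping samples")
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_a * var_b)

# Invented 8-day composites for one pixel; None marks a missing composite.
cci = [0.12, 0.15, None, 0.22, 0.18, 0.14]
smap = [0.10, 0.14, 0.20, 0.23, None, 0.11]
r = pearson_r(cci, smap)
print(round(r, 3))
```

A high correlation over the overlap period supports the assumption only for that period; it says nothing by itself about the decades before SMAP, which is the referee's point.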
  
  Second, the method described in Sect. 2.2.2 and 2.2.3 needs more explanation/clarification here and there. For instance, what is the purpose of having two separate experiments? If Experiment 1 was to evaluate the performance of RF_SMAP during the period of SMAP, the SMAP should be included for the model evaluation, e.g. in Fig. 6, Fig. 7, Table 5.
  Reply:
  As for this issue, we are going to supply a more detailed description in Section 2.2.3, where the main purpose of the two experiments will also be clarified. In fact, Experiment 1 aimed to demonstrate the prediction method using both the in-situ and the real SMAP data. Specifically, using the CCI and SMAP data from 2016 to 2019 (2016105 to 2019361), the predicted data (i.e., RF_SMAP) for 2015 (2015105 to 2016097) were generated. In Experiment 1, both the real SMAP and in-situ data for 2015 (2015105 to 2016097) were available for validation. Accordingly, Figure 6, Figure 7 and Table 5 show the model evaluation results with the in-situ data as reference, while Figure 5 and Table 4 show the evaluation results with the SMAP data as reference. The reason for designing the experiment with the real SMAP data as reference is that the reference in this case is known perfectly, avoiding the uncertainty introduced by other factors (e.g., the uncertainty in spatial support and geographical location of the in-situ data).
  As for Experiment 2, it aimed to validate the RF_SMAP dataset from 1979 to 2015 against the in-situ data only, because the real SMAP data are not available in this period.
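
The date codes in this reply appear to follow a year plus day-of-year pattern (2015105 read as year 2015, day 105), with composites starting on days 1, 9, ..., 361 of each year. That reading is an inference from the codes quoted in the discussion, not something the page states. Under that assumption, the sketch below enumerates the composites of the two periods used in Experiment 1; `composite_codes` is a hypothetical helper.

```python
STEP_DAYS = 8
LAST_START_DOY = 361  # 1 + 8 * 45, the last composite start day of a year

def composite_codes(start_code, end_code):
    """List YYYYDDD codes of 8-day composites from start_code to end_code inclusive.

    Assumes every year restarts the 8-day grid on day-of-year 1.
    """
    year, doy = divmod(start_code, 1000)
    if (doy - 1) % STEP_DAYS != 0 or not 1 <= doy <= LAST_START_DOY:
        raise ValueError("start_code is not on the assumed 8-day grid")
    codes = []
    while year * 1000 + doy <= end_code:
        codes.append(year * 1000 + doy)
        doy += STEP_DAYS
        if doy > LAST_START_DOY:  # roll over to the first composite of next year
            year, doy = year + 1, 1
    return codes

validation_times = composite_codes(2015105, 2016097)  # predicted and validated
training_times = composite_codes(2016105, 2019361)    # CCI and SMAP data used
print(len(validation_times), len(training_times))
```

Under this assumption the validation period contains 46 composites, which matches the 46 prediction times mentioned elsewhere in the discussion.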
  
  Why is original soil moisture data (I guess you mean actual soil moisture time-series) unsuitable for model training? The authors stated that this is because 1) SM data has spatial gaps and 2) abundant precipitation can lead to abnormal change in SM (Lines 147-149). However, RF can only be trained with grid pixels where SM data is available, and the diverse relationship between precipitation-soil moisture should be included in the training data.
  Reply:
  It should be noted that the original SMAP time-series were used for model training in the first version (ESSD_2022_137). However, during revision we found shortcomings in this training method. Specifically, in high-latitude regions, the original SMAP time-series data contain unavoidable gaps (i.e., missing data) within a year because of snow cover and other factors. In theory, these spatially missing data cannot be involved in the training process, as you rightly mentioned. If we want to use the SMAP time-series data directly for training, we need to mask the regions with gaps. However, using the mask can significantly harm the reliability of the RF_SMAP dataset in terms of spatial coverage. Also, the number of training data in the RF model can be reduced greatly. Hence, the hctsa characteristics-based training method was adopted in the manuscript. Since the hctsa-extracted temporal characteristics are spatially seamless, the interference of missing data in the SMAP time-series with model training can be eliminated.
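
The idea behind this reply can be illustrated with a small sketch: a pixel whose time series has gaps cannot contribute a value at every time step, but it can still yield summary characteristics computed from its valid samples. The statistics below are a deliberately simple stand-in and are not the hctsa feature set (hctsa is a MATLAB toolbox with a much larger set of characteristics); the function name and the sample series are invented.

```python
import math

def temporal_features(series):
    """Summary statistics computed from the valid samples only (None marks a gap)."""
    valid = [v for v in series if v is not None]
    if not valid:
        return None  # pixel never observed: no characteristics can be derived
    n = len(valid)
    mean = sum(valid) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in valid) / n)
    return {
        "mean": mean,
        "std": std,
        "min": min(valid),
        "max": max(valid),
        "valid_fraction": n / len(series),
    }

# Invented high-latitude pixel: winter composites are missing (snow cover).
high_latitude_pixel = [None, None, 0.21, 0.27, 0.30, 0.24, None, None]
features = temporal_features(high_latitude_pixel)
print(features)
```

A pixel like this one would be masked out if the raw series were used as model input, whereas its characteristics remain defined as long as at least one valid sample exists.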
  
  To predict RF_SMAP, the trained RF model uses characteristics extracted from SMAP as input (Lines 185-194) at each grid pixel, is this correct? Then, how did you generate RF_SMAP for pixels and periods that do not have SMAP data (and thus unable to extract characteristics from SMAP)?
  Reply:
  We need to clarify that the trained RF model did not use the characteristics extracted from SMAP as input.
  In fact, the construction of the model is based on the core assumption that the CCI and SMAP datasets have similar patterns of temporal change. Specifically, the model at a time t was trained using the label (CCI_t) and the characteristics (extracted from the CCI time series by the hctsa method, coupled with the DEM and location data). In the prediction process, the characteristics (extracted from the SMAP time series by the hctsa method, coupled with the DEM and location data) were imported into the trained model, and the SMAP_t data at time t were predicted. As the CCI_t data change from 1979001 to 2015097 (i.e., t, t+1, t+2, t+3, …), different RF models were trained in turn and the corresponding RF_SMAP_t data were predicted. We are going to rewrite this point and add key information in the new version of Figure 3.
  To clearly illustrate the prediction process, Figure 3 is modified in advance (see supplement).
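
The train-on-CCI, predict-with-SMAP flow described in this reply can be sketched in a few lines. The example below swaps in a much simpler learner than a real random forest: an ensemble of one-split regression trees, each fitted to a bootstrap sample and one randomly chosen feature. It is not the authors' MATLAB implementation, and the three features per pixel and all numbers are invented for illustration.

```python
import random

def fit_stump(rows, targets, feature):
    """Best single-threshold split on one feature (least squared error)."""
    best = None
    values = sorted(set(row[feature] for row in rows))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [t for row, t in zip(rows, targets) if row[feature] <= thr]
        right = [t for row, t in zip(rows, targets) if row[feature] > thr]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    if best is None:  # feature is constant in this sample: predict the mean
        m = sum(targets) / len(targets)
        return (feature, float("inf"), m, m)
    return (feature, best[1], best[2], best[3])

def fit_ensemble(rows, targets, n_stumps=50, seed=0):
    """Bagging: each stump sees a bootstrap sample and one random feature."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    stumps = []
    for _ in range(n_stumps):
        idx = [rng.randrange(n) for _ in range(n)]
        feature = rng.randrange(n_features)
        stumps.append(fit_stump([rows[i] for i in idx],
                                [targets[i] for i in idx], feature))
    return stumps

def predict(stumps, row):
    """Average the predictions of all stumps for one pixel."""
    preds = [lm if row[f] <= thr else rm for f, thr, lm, rm in stumps]
    return sum(preds) / len(preds)

# Invented toy data, one row per pixel: [temporal mean, elevation (km), latitude].
cci_features = [[0.10, 0.2, 10.0], [0.20, 0.5, 25.0], [0.30, 1.0, 40.0], [0.35, 1.5, 55.0]]
cci_t = [0.12, 0.19, 0.31, 0.33]  # label: CCI soil moisture at time t
smap_features = [[0.12, 0.2, 10.0], [0.22, 0.5, 25.0], [0.28, 1.0, 40.0], [0.37, 1.5, 55.0]]

model_t = fit_ensemble(cci_features, cci_t)                   # trained on CCI only
rf_smap_t = [predict(model_t, row) for row in smap_features]  # SMAP features in
print(rf_smap_t)
```

The point of the sketch is the data flow: the label and the training characteristics both come from CCI, and SMAP enters only through the characteristics supplied at prediction time. One model is fitted per time step t, so a loop over t would repeat the last two statements with a new `cci_t`.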
  
  Lastly, the validation of RF_SMAP over the CCI era is highly limited due to the lack of ISMN before 2000. The validation of RF_SMAP over diverse climate regimes also seems limited, as most ISMN data are obtained from the US. I also wonder if there are any systematic biases between the RF_SMAP (historical SMAP before 2015) and the actual SMAP data from 2015.
  Reply:
  As you rightly mentioned, the uneven distribution and scarcity of in-situ stations make it difficult to validate the dataset across diverse climate regions. However, a comparison based on different periods (e.g., from 2000 to 2005, and 2010 to 2015) is possible, and we are going to analyze the systematic biases between the RF_SMAP (historical SMAP before 2015) and the actual SMAP data from 2015. We agree that this point is valuable, and we will revise the manuscript accordingly.
  
  It is not clear why e.g. Fig. 6 shows only one time series per network and Fig. 7 shows a very small number of samples (dots) given that each ISMN network has >400 stations according to Table 2. Moreover, the comparison between the gridded datasets could be done from more diverse perspectives, e.g. comparison by season, during extreme (drought) conditions; SoMo is global, long-term data, but the comparison is done only for 4 years at three locations (Sect. 4.4)
  Reply:
  First of all, we need to clarify that Figure 6 aimed to show the change pattern of 11 networks (11 sub-figures) at 46 prediction times (i.e., from 2015105 to 2016097). Figure 7 provided the scatter plot of the corresponding 11 networks at 46 prediction times, based on the results in Figure 6. We need to clarify that the validation was at network level, that is, all stations in a network were averaged. In fact, the number of samples in Figure 7 is 46 (i.e., the prediction times) rather than the number of stations. The authors are going to revise the corresponding description to clarify this confusion.
  Additionally, the comparison of datasets in terms of different seasons is interesting. We will provide the results accordingly in the new version.
  As for the comparison with the SoMo.ml dataset in Section 4.4, we need to clarify the purpose of this section first. That is, Section 4.4 aimed to exhibit the differences between the SoMo.ml and RF_SMAP dataset and provide a potential way to improve the RF_SMAP dataset in future. Specifically, the production of the SoMo.ml dataset used the in-situ data as model inputs to improve the accuracy. However, the in-situ data are always used as the reference for validation, which is undoubtedly beneficial for accuracy evaluation of the SoMo.ml dataset. In Section 4.4, we admitted the difference in accuracy between the SoMo.ml and RF_SMAP dataset, and proposed to use in-situ data to further enhance the predicted RF_SMAP dataset in future research. Thus, we considered that using longer time series of the SoMo.ml data and more in-situ data will not add anything to the current points in Section 4.4 (i.e., the conclusions will also be the same as the current version).
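
The network-level validation described in this reply, where all stations of a network are averaged at each prediction time, can be sketched as follows. The helper name, the handling of missing station readings and the sample values are assumptions made for illustration; the preprint does not specify these details on this page.

```python
def network_mean_series(station_series):
    """Average all stations of one network at each prediction time.

    station_series: list of per-station lists aligned in time; None marks a
    missing observation and is skipped. A time step with no valid station
    yields None.
    """
    n_times = len(station_series[0])
    averaged = []
    for t in range(n_times):
        valid = [s[t] for s in station_series if s[t] is not None]
        averaged.append(sum(valid) / len(valid) if valid else None)
    return averaged

# Invented network with three stations and 46 prediction times.
stations = [
    [0.10 + 0.001 * t for t in range(46)],
    [0.20 + 0.001 * t for t in range(46)],
    [None] * 46,  # a station with no data in this period
]
network_series = network_mean_series(stations)
print(len(network_series))
```

However many stations a network has, the result is one value per prediction time, which is why a scatter plot built this way shows 46 points per network rather than one point per station.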
  
  Citation: https://doi.org/10.5194/essd-2022-426-CC2
- AC1: 'Reply on RC1', Q. Wang, 24 Mar 2023
  
  Please see the uploaded file.
  
  Citation: https://doi.org/10.5194/essd-2022-426-AC1
CC1:
'Comment on essd-2022-426', yuyang Ma, 19 Mar 2023

This study provides an accurate and reliable, global 36 km, 8-day synthetic SMAP SM products from 1979 to 2015. This is a valuable dataset for the evaluation of historical events. I have some questions about the dataset you achieved in your article:
1. How to achieve data synthesis toward those different data sources?
2. It seems that the analysis of data in this volume is time-consuming. Could you please provide more details about the data analysis platform that you used here? Such as software or any other online platform.

In addition, there are some minor issues with the manuscript details:
1. In Figure 1, the site location is not clear.
2. Is the reconstruction of SM data before 2015? Why do you use sites from 2015-2016 to validate pre-2015 data in the abstract?
3. I think the flowchart is kind of too simple to express the details of dataset production.

Citation: https://doi.org/10.5194/essd-2022-426-CC1
- CC3: 'Reply on CC1', Haoxuan Yang, 22 Mar 2023
  
  This study provides an accurate and reliable, global 36 km, 8-day synthetic SMAP SM products from 1979 to 2015. This is a valuable dataset for the evaluation of historical events. I have some questions about the dataset you achieved in your article
  Reply:
  Thank you for the comments. We are going to reply item-by-item to each comment. The comments and figures can be seen in the supplement.
  
  1. How to achieve data synthesis toward those different data sources?
  Reply:
  Considering the spatial gaps in the daily SMAP data, we adopted an 8-day compositing method, which averages the valid SM data to achieve more complete spatial coverage. Thank you for the comment.
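
A minimal sketch of the compositing step described in this reply, for a single pixel: within each 8-day window, the valid daily values are averaged and missing days are ignored. The function names and sample values are invented, and details such as quality filtering or how the final partial window of a year is treated are not specified on this page.

```python
def composite_window(daily_values):
    """Average the valid daily values of one pixel in one window (None = gap)."""
    valid = [v for v in daily_values if v is not None]
    return sum(valid) / len(valid) if valid else None

def composite_series(daily_series, window=8):
    """Split a daily series into consecutive windows and composite each one."""
    return [composite_window(daily_series[i:i + window])
            for i in range(0, len(daily_series), window)]

# Invented 16 days for one pixel: observed only on some days.
daily = [None, 0.20, None, None, 0.30, None, None, None,
         None, None, None, None, None, None, None, None]
composites = composite_series(daily)
print(composites)
```

The first window has two valid days and yields their average; the second has none and stays empty, so compositing reduces gaps but does not guarantee that every pixel is filled.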
  
  2. It seems that the analysis of data in this volume is time-consuming. Could you please provide more details about the data analysis platform that you used here? Such as software or any other online platform.
  Reply:
  Indeed, the time cost of this work is relatively high. We used Matlab to generate the RF_SMAP dataset. Simulating one scene takes about 300 seconds. The processor is an Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz.
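
As a rough back-of-the-envelope check of the total cost, the figure of about 300 seconds per scene can be combined with the prediction period of 1979001 to 2015097 given in the reply to RC1. The sketch assumes 46 composites per year (starting on days 1, 9, ..., 361) and strictly sequential processing on one machine; both are assumptions, not statements from the authors.

```python
SECONDS_PER_SCENE = 300    # "about 300 seconds" per scene, from the reply
COMPOSITES_PER_YEAR = 46   # assumption: day-of-year 1, 9, ..., 361

full_years = 2015 - 1979           # 1979 through 2014
scenes_2015 = (97 - 1) // 8 + 1    # day-of-year 1 through 97 in 2015
n_scenes = full_years * COMPOSITES_PER_YEAR + scenes_2015

total_seconds = n_scenes * SECONDS_PER_SCENE
total_hours = total_seconds / 3600
total_days = total_hours / 24
print(n_scenes, total_seconds, round(total_hours, 1), round(total_days, 1))
```

Under these assumptions the full record comes to roughly 1,669 scenes and a little under six days of compute.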
  
  In addition, there are some minor issues with the manuscript details:
  1. In Figure 1, the site location is not clear.
  Reply:
  Thank you for the comment. We have revised Figure 1 and enlarged the sample points to clearly illustrate the site locations.
  
  2. Is the reconstruction of SM data before 2015? Why do you use sites from 2015-2016 to validate pre-2015 data in the abstract?
  Reply:
  Thank you for the suggestion. We are going to revise the unclear description in the abstract. In fact, the validation of the reconstruction before 2015 used the in-situ data before 2015 as the reference (Experiment 2). The in-situ data from 2015-2016 were used in Experiment 1, which aimed to demonstrate the prediction method (i.e., to evaluate the performance of RF_SMAP during the period of real SMAP).
  
  3. I think the flowchart is kind of too simple to express the details of dataset production.
  Reply:
  Thank you for the suggestion. We have revised this point in advance and increased the readability of Figure 3 (see supplement).
  
  Citation: https://doi.org/10.5194/essd-2022-426-CC3
- AC2: 'Reply on CC1', Q. Wang, 24 Mar 2023
  
  Please see the uploaded file.
  
  Citation: https://doi.org/10.5194/essd-2022-426-AC2
RC2:
'Comment on essd-2022-426', Anonymous Referee #2, 28 Mar 2023

The manuscript presents a composited, SMAP-like soil moisture dataset derived with a random forest approach from historical CCI data. While a global soil moisture product for such a long period (1979 to 2015) has its merits, the dataset stands and falls with the successful evaluation of the derived product. Its spatial resolution (36 km) is interesting for large scale analyses.
It should be noted that the approach a) assumes stationarity of the CCI and SMAP data and b) generalises its global applicability of the model. Both are rather strong assumptions for a “simple” random forest approach. In addition, the evaluation refers to averaged network-level data, which introduces further uncertainty of the scaling. The authors present some limited evaluation, which does not really exemplify the validity of the derived model data.

There are some similar studies and datasets, the authors did not refer to: https://www.nature.com/articles/s41597-023-02053-x , https://www.nature.com/articles/s41597-021-00925-8 In which way does their dataset (and approach) advance these?
Moreover, there is a recent paper in HESS, which uses Sentinel data to estimate soil moisture: https://hess.copernicus.org/articles/27/1221/2023/
Obviously, the temporal extent of this approach has very little overlap with the data presented in this manuscript (since the sentinel satellites have been operational just from 2015 onwards). However, the authors might find inspiration for further evaluation in this?

After all, it is very difficult to evaluate if the dataset and its presentation justify publication in ESSD. If the authors could really corroborate the validity of their data product, this would be a clear yes. However given the open questions and despite the meticulous effort which went into the compilation of this data, it remains too unclear, how the dataset from a rather simple approach can advance already existing SMAP-like soil moisture products.

Citation: https://doi.org/10.5194/essd-2022-426-RC2
- AC3: 'Reply on RC2', Q. Wang, 14 Apr 2023
  
  Please see the uploaded file.
  
  Citation: https://doi.org/10.5194/essd-2022-426-AC3

Interactive discussion

Status: closed

RC1:
'Comment on essd-2022-426', Anonymous Referee #1, 15 Mar 2023

The comment was uploaded in the form of a supplement: https://essd.copernicus.org/preprints/essd-2022-426/essd-2022-426-RC1-supplement.pdf

Citation: https://doi.org/10.5194/essd-2022-426-RC1
- CC2: 'Reply on RC1', Haoxuan Yang, 22 Mar 2023
  
  This study develops a new global long-term soil moisture dataset by extending the SMAP data back into the ESA CCI era. To do this, the authors train a random forest with historical CCI data and apply the trained model to estimate SMAP soil moisture under the assumption that CCI and SMAP soil moisture have similar temporal variability. Such a dataset is valuable and relevant for a variety of climatological and hydrological applications.
  Reply:
  Dear Referee
  Thank you for your comments. The comments are undoubtedly helpful to improve the quality of the paper. Accordingly, we have analyzed the comments carefully and provided the response below. The complete comments and figures can be seen in the supplement.
  
  However, the main assumption made in this study requires more careful investigation. The authors present the temporal variability of CCI and SMAP over an overlapping period of 4-5 years, but most of the selected sites appear to be in arid regions (Fig. 2); comparison needs to be made in a more comprehensive way, including humid and/or high latitude regions with relatively high variability of soil moisture. Moreover, how can we ensure that this similarity is preserved also during the previous ~30 years of the CCI era?
  Reply:
  In fact, these five pixels in Figure2 were randomly selected. We consider that this issue can be solved by providing a more complete description in Section 2.2.3. Specifically, based on different humid regions and latitudes, we are going to supply more pixels to exhibit the change pattern of SM in Figure 2.
  As for the preservation of similarity during the previous ~30 years, we can explain this point from three aspects. First, the similarity of the CCI and SMAP data from 2015 to 2019 has been exhibited in Figure 2, which shows great similarity already. Second, we have found that both the CCI and predicted SMAP data can preserve consistent similarity to the in-situ data from a large period of about 10 years (i.e., from 1996 to 2015, as the earliest in-situ data began in 1996). Third, for the period before 1996, although there were no in-situ and SMAP data available for comparison with CCI data, the experimental results indicated that the general temporal profiles of CCI and predicted RF_SMAP are similar. Thus, we believe that the similarity can also be preserved over the 30 years.
  To support the main assumption (similarity of the CCI and SMAP datasets), Figure 2 (see supplement)was modified and provided here in advance. Based on the original 6 pixels, we supplied additional 12 pixels (16 pixels in total). The pixels have random distribution, which include arid regions (e.g., Pixel 1 and 9), high latitudes (e.g., Pixel 10), and high altitudes (e.g., Pixel 12).
  
  Second, the method described in Sect. 2.2.2 and 2.2.3 needs more explanation/clarification here and there. For instance, what is the purpose of having two separate experiments? If Experiment 1 was to evaluate the performance of RF_SMAP during the period of SMAP, the SMAP should be included for the model evaluation, e.g. in Fig. 6, Fig. 7, Table 5.
  Reply:
  As for this issue, we are going to supply more detailed description in Section 2.2.3. Meanwhile, the main purpose of the two experiments will also be further clarified in Section 2.2.3. In fact, Experiment 1 aimed to demonstrate the predicted method based on the in-situ as well as the real SMAP data. Specifically, by using the CCI and SMAP data from 2016 to 2019 (2016105 to 2019361), the predicted data (i.e., RF_SMAP) in 2015 (2015105 to 2016097) were generated. In Experiment 1, both the real SMAP and in-situ data in 2015 (2015105 to 2016097) were available for validation. Accordingly, Figure 6, Figure 7 and Table 5 were the model evaluation results by referring to the in-situ data. Figure 5 and Table 4 are the evaluation results that used the SMAP data as reference. The reason of designing the experiment using the real SMAP as reference is that the reference in this case is known perfectly, avoiding the uncertainty introduced by other factors (e.g., the uncertainty in spatial support and geographical location in in-situ data).
  As for Experiment 2, it aimed to validated the RF_SMAP dataset from 1979 to 2015 based on the in-situ data, as in this period only the real SMAP data are not available.
  
  Why is original soil moisture data (I guess you mean actual soil moisture time-series) unsuitable for model training? The authors stated that this is because 1) SM data has spatial gaps and 2) abundant precipitation can lead to abnormal change in SM (Lines 147-149). However, RF can only be trained with grid pixels where SM data is available, and the diverse relationship between precipitation-soil moisture should be included in the training data.
  Reply:
  It should be illustrated that the original SMAP time-series were used for model training in the first version (ESSD_2022_137). However, in the process of revision, we found the shortcomings of this training method. Specifically, in high latitude regions, the original SMAP time-series data contain unavoidable gaps (i.e., the missing data) in a year because of the snow cover and other factors. Theoretically, these spatially missing data cannot be involved in the training process, as you mentioned exactly. If we want to directly use the SMAP time-series data for training, we need to mask the regions with gaps. However, the usage of the mask can significantly harm the reliability of the RF_SMAP dataset in terms of spatial coverage. Also, the number of training data in the RF model can be reduced greatly. Hence, the hctsa characteristics-based training method was adopted in the manuscript. Since the hctsa-extracted temporal characteristics are spatial seamless, the interference of missing data in the SMAP time-series on model training can be eliminated.
  
  To predict RF_SMAP, the trained RF model uses characteristics extracted from SMAP as input (Lines 185-194) at each grid pixel, is this correct? Then, how did you generate RF_SMAP for pixels and periods that do not have SMAP data (and thus unable to extract characteristics from SMAP)?
  Reply:
  We need to clarify that the trained RF model did not use the characteristics extracted from SMAP as input.
  In fact, the construction of model is based on the core assumption that the CCI and SMAP datasets have similar pattern of temporal changes. Specifically, the model at a time t was trained by the label (CCI_t) and the characteristics (extracted from the CCI time series by the hctsa method, coupled with the DEM and location data). In the prediction process, the characteristics (extracted from the SMAP time series by the hctsa method, coupled with the DEM and location data) were imported into the trained model, and the SMAP_tdata at time t was predicted. With the continuous change of CCI_t data from 1979001 to 2015097 (i.e., t, t+1, t+2, t+3, …), different RF models were continuously trained and corresponding RF_SMAP_t data were predicted in turn. We are going to rewrite this point and add key information in the new version of Figure 3.
  To clearly illustrate the prediction process, Figure 3 is modified in advance (see supplement).
  
  Lastly, the validation of RF_SMAP over the CCI era is highly limited due to the lack of ISMN before 2000.The validation of RF_SMAP over diverse climate regimes also seems limited, as most ISMN data are obtained from the US. I also wonder if there are any systematic biases between the RF_SMAP (historical SMAP before 2015) and the actual SMAP data from 2015.
  Reply:
  As you mentioned exactly, due to the uneven distribution and lack of in-situ stations, it is difficult validate the dataset based on diverse climate regions. However, the comparison based on different periods (e.g., from 2000 to 2005, and 2010 to 2015) is possible, we are going to analyze the systematic biases between the RF_SMAP (historical SMAP before 2015) and the actual SMAP data from 2015. We agree that this point is valuable, and we will revise.
  
  It is not clear why e.g. Fig. 6 shows only one time series per network and Fig. 7 shows a very small number of samples (dots) given that each ISMN network has >400 stations according to Table 2. Moreover, the comparison between the gridded datasets could be done from more diverse perspectives, e.g. comparison by season, during extreme (drought) conditions; SoMo is global, long-term data, but the comparison is done only for 4 years at three locations (Sect. 4.4)
  Reply:
  First of all, we need to clarify that Figure 6 aimed to show the change pattern of 11 networks (11 sub-figures) at 46 prediction times (i.e., from 2015105 to 2016097). Figure 7 provided the scatter plot of the corresponding 11 networks at 46 prediction times, based on the results in Figure 6. We need to clarify that the validation was at network level, that is, all stations in a network were averaged. In fact, the number of samples in Figure 7 is 46 (i.e., the prediction times) rather than the number of stations. The authors are going to revise the corresponding description to clarify this confusion.
  Additionally, the comparison of datasets in terms of different seasons is interesting. We will provide the results accordingly in the new version.
  As for the comparison with the SoMo.ml dataset in Section 4.4, we need to clarify the purpose of this section first. That is, Section 4.4 aimed to exhibit the differences between the SoMo.ml and RF_SMAP dataset and provide a potential way to improve the RF_SMAP dataset in future. Specifically, the production of the SoMo.ml dataset used the in-situ data as model inputs to improve the accuracy. However, the in-situ data are always used as the reference for validation, which is undoubtedly beneficial for accuracy evaluation of the SoMo.ml dataset. In Section 4.4, we admitted the difference in accuracy between the SoMo.ml and RF_SMAP dataset, and proposed to use in-situ data to further enhance the predicted RF_SMAP dataset in future research. Thus, we considered that using longer time series of the SoMo.ml data and more in-situ data will not add anything to the current points in Section 4.4 (i.e., the conclusions will also be the same as the current version).
  
  Citation: https://doi.org/10.5194/essd-2022-426-CC2
- AC1: 'Reply on RC1', Q. Wang, 24 Mar 2023
  
  Please see the uploaded file.
  
  Citation: https://doi.org/10.5194/essd-2022-426-AC1
CC1:
'Comment on essd-2022-426', yuyang Ma, 19 Mar 2023

This study provides an accurate and reliable, global 36 km, 8-day synthetic SMAP SM products from 1979 to 2015. This is a valuable dataset for the evaluation of historical events. I have some questions about the dataset you achieved in your article:
1.How to achieve data synthesis toward those different data sources?
2.It seems that the analysis of data in this volume is time-consuming. Could you please provide more details about the data analysis platform that you used here? Such as software or any other online platform.

In addition, there are some minor issues with the manuscript details:
1.In Figure 1, the site location is not clear.
2.Is the reconstruction of SM data before 2015? Why do you use sites from 2015-2016 to validate pre-2015 data in the abstract?
3.I think the flowchart is kind of too simple to express the details of dataset production.

Citation: https://doi.org/10.5194/essd-2022-426-CC1
- CC3: 'Reply on CC1', Haoxuan Yang, 22 Mar 2023
  
  This study provides an accurate and reliable global 36 km, 8-day synthetic SMAP SM product from 1979 to 2015. This is a valuable dataset for the evaluation of historical events. I have some questions about the dataset produced in your article:
  Reply:
  Thank you for the comments. We reply to each comment item by item below. The comments and figures can be found in the supplement.
  
  1. How was data synthesis achieved across the different data sources?
  Reply:
  Considering the spatial gaps in the daily SMAP data, we adopted an 8-day compositing method, which averages the valid SM data within each 8-day window to obtain more complete spatial coverage. Thank you for the comment.
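
The 8-day compositing described in this reply can be sketched as follows. This is a minimal illustration in Python with NumPy, not the authors' Matlab implementation; the function name `composite_8day`, the `(days, rows, cols)` array layout, and the use of NaN to mark gaps are assumptions made for the example.

```python
import numpy as np


def composite_8day(daily, window=8):
    """Average the valid (non-NaN) daily values in consecutive windows.

    daily: array of shape (days, rows, cols); NaN marks a spatial gap.
    Returns an array of shape (ceil(days / window), rows, cols). A cell
    stays NaN only if it had no valid observation in the whole window.
    """
    daily = np.asarray(daily, dtype=float)
    n_windows = -(-daily.shape[0] // window)  # ceiling division
    out = np.full((n_windows,) + daily.shape[1:], np.nan)
    for k in range(n_windows):
        chunk = daily[k * window:(k + 1) * window]
        valid = ~np.isnan(chunk)
        count = valid.sum(axis=0)                        # valid days per cell
        total = np.where(valid, chunk, 0.0).sum(axis=0)  # sum of valid values
        # Divide only where at least one valid value exists; other cells keep NaN.
        np.divide(total, count, out=out[k], where=count > 0)
    return out


# Hypothetical toy input: 8 days on a 1 x 2 grid, with two valid
# retrievals in the first cell and none in the second.
toy = np.full((8, 1, 2), np.nan)
toy[0, 0, 0] = 0.2
toy[3, 0, 0] = 0.4
toy_composite = composite_8day(toy)
```

In this sketch, a grid cell with at least one valid retrieval in the window receives the mean of those retrievals, and it remains empty only when every day in the window is missing.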
  
  2. It seems that analyzing this volume of data is time-consuming. Could you please provide more details about the data analysis platform you used, such as the software or any online platform?
  Reply:
  Indeed, the time cost of this work is relatively high. We used Matlab to generate the RF_SMAP dataset. Simulating one scene takes about 300 seconds. The processor used was an Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz.
  
  In addition, there are some minor issues with the manuscript details:
  1. In Figure 1, the site locations are not clear.
  Reply:
  Thank you for the comment. We have revised Figure 1 and enlarged the sample points to show the site locations clearly.
  
  2. Is the SM data reconstructed for the period before 2015? If so, why does the abstract describe using sites from 2015-2016 to validate pre-2015 data?
  Reply:
  Thank you for the suggestion. We will revise the unclear description in the abstract. In fact, the reconstruction before 2015 was validated using in-situ data from before 2015 as the reference (Experiment 2). The in-situ data from 2015-2016 were used in Experiment 1, which aimed to demonstrate the prediction method by evaluating the performance of RF_SMAP during the period for which real SMAP data are available.
  
  3. I think the flowchart is too simple to convey the details of dataset production.
  Reply:
  Thank you for the suggestion. We have already revised this point and improved the readability of Figure 3 (see supplement).
  
  Citation: https://doi.org/10.5194/essd-2022-426-CC3
- AC2: 'Reply on CC1', Q. Wang, 24 Mar 2023
  
  Please see the uploaded file.
  
  Citation: https://doi.org/10.5194/essd-2022-426-AC2
RC2:
'Comment on essd-2022-426', Anonymous Referee #2, 28 Mar 2023

The manuscript presents a composited, SMAP-like soil moisture dataset derived with a random forest approach from historical CCI data. While a global soil moisture product for such a long period (1979 to 2015) has its merits, the dataset stands and falls with the successful evaluation of the derived product. Its spatial resolution (36 km) is interesting for large scale analyses.
It should be noted that the approach a) assumes stationarity of the CCI and SMAP data and b) assumes that the model is globally applicable. Both are rather strong assumptions for a “simple” random forest approach. In addition, the evaluation refers to averaged network-level data, which introduces further uncertainty from scaling. The authors present some limited evaluation, which does not really demonstrate the validity of the derived model data.

There are some similar studies and datasets that the authors did not refer to: https://www.nature.com/articles/s41597-023-02053-x , https://www.nature.com/articles/s41597-021-00925-8 In which way does their dataset (and approach) advance these?
Moreover, there is a recent paper in HESS that uses Sentinel data to estimate soil moisture: https://hess.copernicus.org/articles/27/1221/2023/
Obviously, the temporal extent of that approach has very little overlap with the data presented in this manuscript (since the Sentinel satellites have been operational only from 2015 onwards). However, the authors might find inspiration for further evaluation in it.

Overall, it is very difficult to evaluate whether the dataset and its presentation justify publication in ESSD. If the authors could really corroborate the validity of their data product, this would be a clear yes. However, given the open questions, and despite the meticulous effort that went into compiling these data, it remains unclear how a dataset from a rather simple approach can advance existing SMAP-like soil moisture products.

Citation: https://doi.org/10.5194/essd-2022-426-RC2
- AC3: 'Reply on RC2', Q. Wang, 14 Apr 2023
  
  Please see the uploaded file.
  
  Citation: https://doi.org/10.5194/essd-2022-426-AC3

Data sets

An 8-day composited 36 km SMAP soil moisture dataset (1979-2015) Haoxuan Yang, Qunming Wang, Wei Zhao, and Peter M. Atkinson https://doi.org/10.6084/m9.figshare.17621765

This preprint has been withdrawn.

Short summary

A random forest (RF) model was proposed to extend the superior SMAP dataset (named RF_SMAP) from 1979 to 2015, using the corresponding CCI time-series. The new long time-series RF_SMAP dataset, which will be available to download, will be of great value for a range of research in applications such as climate assessment, agricultural planning, food insecurity monitoring and drought assessment and monitoring.

