the Creative Commons Attribution 4.0 License.
Spatially adaptive estimation of multi-layer soil temperature at a daily time-step across China during 2010–2020
Abstract. Soil temperature (Ts) is critical in regulating agricultural production, ecosystem functions, hydrological cycling and climate dynamics. However, the inherent spatial and temporal heterogeneity of soil thermal regimes makes it persistently difficult to obtain high-resolution, continuous gridded Ts datasets along vertical profiles. To address this issue, we propose a spatially adaptive layer-cascading Extreme Gradient Boosting (XGBoost) algorithm to generate daily multi-layer Ts data (0, 5, 10, 15, 20, and 40 cm) at a spatial resolution of 1 km across China from 2010 to 2020. The methodology dynamically partitions non-uniformly distributed measuring sites (2,093 sites across the country) into quadtrees and incorporates thermal coupling effects propagated between neighboring soil layers. Multi-source data, including satellite retrievals of land surface temperature and vegetation index and ERA5 reanalysis climate variables, were used as inputs. Independent tests demonstrated the high robustness and accuracy of our model, with depth-specific coefficients of determination (R²) of 0.94–0.98 and root mean square errors (RMSE) of 1.75–2.21 K. The model performed less well in summer and winter than in spring and autumn. Compared to existing global or regional Ts products, the dataset developed here is characterized by its fine spatio-temporal patterns and high reliability, enabling it to support precision agriculture, ecosystem modeling and the understanding of climate-land feedback. The dataset is freely available at https://doi.org/10.11888/Terre.tpdc.302333 (Wang et al., 2025).
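The quadtree partitioning of unevenly distributed sites mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the unit-square bounds, the synthetic site coordinates and the depth limit are all hypothetical, the capacity of 30 sites per leaf is taken from the threshold discussed in the referee comment rather than from the abstract, and the rotation of the quadtree is not modeled.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Node:
    bounds: tuple                                  # (x0, y0, x1, y1)
    sites: list = field(default_factory=list)      # filled only for leaves
    children: list = field(default_factory=list)   # four sub-quadrants, or empty


def build_quadtree(sites, bounds, max_sites=30, max_depth=12):
    """Split `bounds` into four quadrants until each leaf holds <= max_sites."""
    node = Node(bounds)
    if len(sites) <= max_sites or max_depth == 0:
        node.sites = list(sites)
        return node
    x0, y0, x1, y1 = bounds
    xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
    quadrants = [(x0, y0, xm, ym), (xm, y0, x1, ym),
                 (x0, ym, xm, y1), (xm, ym, x1, y1)]
    buckets = [[] for _ in quadrants]
    for x, y in sites:
        # Quadrant index: bit 0 = east half, bit 1 = north half.
        idx = (1 if x >= xm else 0) + (2 if y >= ym else 0)
        buckets[idx].append((x, y))
    node.children = [build_quadtree(b, q, max_sites, max_depth - 1)
                     for b, q in zip(buckets, quadrants)]
    return node


def leaves(node):
    """Return all leaf nodes below (and including) `node`."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


# Synthetic, hypothetical layout: a dense cluster plus sparse background sites.
random.seed(0)
dense = [(random.random() * 0.25, random.random() * 0.25) for _ in range(1500)]
sparse = [(random.random(), random.random()) for _ in range(593)]
tree = build_quadtree(dense + sparse, bounds=(0.0, 0.0, 1.0, 1.0))
leaf_sizes = [len(leaf.sites) for leaf in leaves(tree)]
```

In this sketch the capacity is only an upper bound: dense regions are subdivided more deeply, while leaves in sparse regions may hold very few sites or none, so leaf populations are capped rather than equalized.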
Competing interests: At least one of the (co-)authors is a member of the editorial board of Earth System Science Data.
Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this preprint. The responsibility to include appropriate place names lies with the authors.
Status: open (until 23 Jul 2025)
RC1: 'Comment on essd-2025-192', Anonymous Referee #1, 09 Jun 2025
- In the introduction (lines 92–104), the authors clearly outline two key challenges in current research: first, the significant heterogeneity of Ts leads to unclear relationships between variables; second, modeling is hindered by data scarcity and uneven distribution. However, in lines 106–113, when introducing the objectives and scope of this study, the authors do not explain how the study addresses these two challenges. It is also unclear what specific methods are used to overcome them, and why these methods are effective. It is recommended that the authors restructure this section by focusing on the core problems, rather than simply listing the research contents. This would improve the clarity and logical flow of the introduction.
- In Section 2.1, the authors describe the use of CMA Ts observational data. However, it is unclear how these data were processed. Were the observations directly provided as daily averages, or were they aggregated from hourly data? Was any quality control applied? How were missing data handled, both in the vertical profile and in the time series? Were any filtering or screening steps performed, and if so, what were the specific criteria?
- In lines 186–190, as well as in Section 4.3, the authors provide a brief discussion of the study’s limitations. However, it is concerning that the missing land surface temperature (LST) data caused by cloud cover were filled using a simple linear interpolation method. This approach may be questionable, as the interpolated values represent a theoretical cloud-free state, while cloud presence can significantly influence radiative transfer and thus impact LST. There are existing interpolation methods that take into account energy transfer and energy balance. It is recommended that the authors investigate these alternatives and consider adopting a more reliable method.
- In Section 2.3.1, the Variance Inflation Factor (VIF) should be explained further. Specifically, what is its purpose and how is it calculated? If possible, a formula should be included to make the description more complete.
- In Section 2.3.2, a substantial portion is devoted to the spatial partitioning strategy based on a rotated quadtree. I have several questions regarding this part. First, why was the quadtree data structure chosen? The manuscript does not clearly explain this. Is it intended to address the issue of uneven distribution of observation sites? If so, why is the quadtree suitable for this purpose? Second, what was achieved by using the quadtree? Was there an effort to ensure that each node contains a roughly equal number of sites, for example around 30? Why was 30 selected as the threshold, and what is the basis for this value? Lastly, a minor suggestion (optional for consideration): if the goal is to achieve a more balanced spatial distribution of stations, a top-down data structure such as the K-D tree (with K = 2 in this study) may be more effective than the bottom-up quadtree. A K-D tree can ensure the difference in the number of points between leaf nodes does not exceed one, and can also support rotation operations.
- In Section 2.3.3 (lines 285–295), the authors introduce XGBoost as the core machine learning algorithm used in the study. They present its advantages and compare it with other methods such as SVM, RF, and neural networks. However, the stated advantages are not sufficient to demonstrate that XGBoost is superior to the other listed methods. Machine learning models differ in structure, number of parameters, optimization strategy, and suitability for different tasks. Therefore, the current explanation is not enough to justify the model choice. Considering that the algorithm is not the main focus of this paper, it is suggested to either include a brief comparative experiment to support the claimed superiority or rephrase the section to emphasize the strengths of XGBoost without direct comparison to other models.
- In lines 296–297, the validation set is twice the size of the test set. Is this split reasonable, and can it effectively evaluate the generalization performance of the model? Why not adopt more common ratios such as 8:1:1 or 6:2:2? In addition, the manuscript later mentions that five-fold cross-validation was used for evaluation. In this context, what are the roles of the two validation sets? Are they used for model selection, parameter tuning, or testing? It is recommended that the authors provide a clearer explanation. It is also suggested to report the specific sample sizes for each dataset.
- The authors produced data at a 1-kilometer resolution for China. How did the authors account for the spatial scale difference between point observations of soil temperature and the 1-kilometer resolution results? How was it ensured that the dataset constructed through point observation training could represent results at the 1-kilometer spatial scale? Additionally, regarding the dataset production, I am very interested in the subsequent maintenance and updates of the dataset over time. Can the authors' method be extended to produce datasets for subsequent years?
- How did the authors account for the impact of uneven spatial distribution of observation data points on the development of the national soil temperature dataset? How do factors such as topography, landform, and vegetation cover types influence the results and uncertainties of this dataset?
- There are also some minor issues that should be addressed. For example, Figures 6 and 7 should include a color bar legend; as it stands, it is difficult to interpret the exact values represented by the orange points. In Equation (2), the variables x and y lack the subscript i. In Equation (4), the summation index i is not defined. In the references, lines 667 and 763 include “others” among the authors. What does this mean? It is suggested to carefully check the manuscript for such details, including grammar, figures, and reference formatting.
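The K-D tree remark in the comment on Section 2.3.2 can be illustrated with a minimal sketch of recursive median splitting (K = 2). The function name, the synthetic coordinates and the depth of 6 are hypothetical choices for illustration only; only the site count of 2,093 comes from the abstract. It is a sketch of the general technique, not of anything in the manuscript.

```python
import random


def kd_leaves(points, depth, axis=0):
    """Split points at the median, alternating axes, for `depth` levels.

    Returns the list of leaf point lists (2**depth leaves).
    """
    if depth == 0:
        return [points]
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2          # floor/ceil halves differ by at most one
    nxt = 1 - axis                   # alternate between x (0) and y (1)
    return (kd_leaves(ordered[:mid], depth - 1, nxt)
            + kd_leaves(ordered[mid:], depth - 1, nxt))


# Synthetic, hypothetical site coordinates; 2093 matches the abstract's site count.
random.seed(0)
sites = [(random.random(), random.random()) for _ in range(2093)]
kd_sizes = [len(leaf) for leaf in kd_leaves(sites, depth=6)]
```

Because every split divides a node into floor and ceiling halves, all leaves at the same depth hold either ⌊n/2^d⌋ or ⌈n/2^d⌉ points, which is the balance property the comment refers to. The cells, however, become elongated or irregular in dense regions, a trade-off against the regular cells of a quadtree.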
Citation: https://doi.org/10.5194/essd-2025-192-RC1
Data sets
Daily multi-layer soil temperature dataset with 1 km resolution in China from 2010 to 2020 Xuetong Wang et al. https://doi.org/10.11888/Terre.tpdc.302333
Viewed
HTML | PDF | XML | Total | Supplement | BibTeX | EndNote |
---|---|---|---|---|---|---|
291 | 37 | 12 | 340 | 13 | 8 | 10 |