the Creative Commons Attribution 4.0 License.
A 1 km soil organic carbon density dataset with depth of 20cm and 100cm from 1985 to 2020 in China
Abstract. Soil organic carbon (SOC) is an important component of the global carbon cycle and a vital indicator of soil quality and ecosystem health, with significant implications for agricultural production and for climate change adaptation and mitigation strategies. Although some studies have mapped the spatial distribution of soil organic carbon density (SOCD), long time-series SOCD products for China are still lacking. Therefore, this study proposes a new algorithm with climatic zoning, aiming to improve the accuracy of predicting SOCD at depths of 0–20 cm and 0–100 cm from 1985 to 2020. The data sources used in this study include Landsat archives, topographic data, meteorological data, and measured SOCD data. The innovation lies in building separate models for each climate region, using a random forest ensemble learning approach, for SOCD estimation in China. The results show that our zoning model outperformed the global model without climate zoning, with R²=0.55 and RMSE=2.19 for 0–20 cm SOCD estimation and R²=0.52 and RMSE=6.50 for 0–100 cm. By comparison, the global model achieved R²=0.46 and RMSE=2.36 for 0–20 cm SOCD estimation and R²=0.44 and RMSE=8.09 for 0–100 cm. Moreover, our 0–20 cm SOCD predictions align well with independent samples (R²=0.69, RMSE=2.01) and are further validated with Xu's dataset (R²=0.63, RMSE=1.82). Furthermore, comparisons with published SOC content products, including HWSD, SoilGrids250m, and GSOCmap, also show good consistency. Among these, our predicted SOCD fits best with the SoilGrids250m product, with R²=0.72 and RMSE=1.35. Comparisons of model predictions with independent datasets from the 1980s, 2000s, and 2010s in China reveal substantial correlations and a trend of increasing prediction accuracy over time. The predicted SOCD is available via Figshare (https://doi.org/10.6084/m9.figshare.27290310.v1) (Dong et al., 2024).
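The contrast between a zoned model and a single global model can be illustrated with a small sketch. It is not the authors' pipeline: ordinary least-squares lines stand in for the random forest models, and the two zones, the covariate values, and the helper names (`fit_line`, `rmse`, `r2`, `predict_global`, `predict_zonal`) are all hypothetical. It only shows how R² and RMSE are computed and why fitting per zone can help when the relationship between a covariate and SOCD differs between zones.

```python
import math

def fit_line(xs, ys):
    """Least-squares slope and intercept for y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def rmse(obs, pred):
    """Root mean square error, in the units of the observations."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))

def r2(obs, pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot

# Hypothetical samples: (zone, covariate, SOCD). Values are made up.
samples = [
    ("humid", 1.0, 3.0), ("humid", 2.0, 5.0), ("humid", 3.0, 7.0), ("humid", 4.0, 9.0),
    ("arid", 1.0, 4.0), ("arid", 2.0, 3.0), ("arid", 3.0, 2.0), ("arid", 4.0, 1.0),
]

def predict_global(samples):
    """One model fitted to all samples, ignoring the zone label."""
    a, b = fit_line([x for _, x, _ in samples], [y for _, _, y in samples])
    return [a * x + b for _, x, _ in samples]

def predict_zonal(samples):
    """One model per zone; each sample is predicted by its own zone's model."""
    models = {}
    for zone in sorted({z for z, _, _ in samples}):
        xs = [x for z, x, _ in samples if z == zone]
        ys = [y for z, _, y in samples if z == zone]
        models[zone] = fit_line(xs, ys)
    return [models[z][0] * x + models[z][1] for z, x, _ in samples]

observed = [y for _, _, y in samples]
# Note: scores here are computed on the fitting data; a real evaluation
# would use held-out samples.
global_rmse, global_r2 = rmse(observed, predict_global(samples)), r2(observed, predict_global(samples))
zonal_rmse, zonal_r2 = rmse(observed, predict_zonal(samples)), r2(observed, predict_zonal(samples))
print(f"global: RMSE={global_rmse:.2f}, R2={global_r2:.2f}")
print(f"zonal:  RMSE={zonal_rmse:.2f}, R2={zonal_r2:.2f}")
```

In this toy case the two zones have opposite slopes, so the global fit is poor and the zonal fit is exact. Real data would show a smaller gap, and a nonlinear learner given the zone-defining covariates may close part of it on its own.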
Status: open (extended)
RC1: 'Comment on essd-2024-588', Anonymous Referee #1, 16 Mar 2025
This manuscript by Dong et al. provides an intriguing study quantifying soil organic carbon (SOC) at depths of 0-20 cm and 0-100 cm in China from 1985 to 2020. SOC change is highly important for the terrestrial ecosystem carbon cycle. It’s innovative to use climate zones to improve the performance of the model, which achieves accuracy comparable to other published SOC datasets. In addition, this dataset expands the temporal availability of SOCD in China at high spatial resolution and can be widely applied in related studies. Overall, this manuscript has clear scientific innovations in understanding SOC changes. Please find my detailed comments below.
- Section 3.2. Could you explain the principles for selecting variables in more detail? For example, in Fig. 2 the R between AH and SOCD is almost 0, so why was this variable selected? Also, only 18 variables are shown in Fig. 2 and CLCU is not among them, so how was CLCU selected as an input predictor?
- Section 4.2. Fig. 8 and Line 250: The discussion of different features for SOCD estimation is comprehensive and helps us understand the important factors behind SOCD variations. But it’s very interesting to find that the features have different importance values in the two depth models. Please discuss these differences further.
- Section 4.5. Fig. 13 and Line 315: “This may be the result of the topsoil being more susceptible to the direct effects of soil management practices and environmental changes.” Which types of management practices contribute to the changes in topsoil SOCD? Please add more details (policies or references). As shown in Fig. 13(b), the SOCD estimate for 0-100 cm from this study is higher than the others. Please add some validation for SOCD in 0-100 cm as mentioned previously. In addition, the SOCD in deep soil should increase if SOCD in topsoil increases. So, please give possible reasons why SOCD in 0-100 cm is stable from the 1990s to the 2020s. Fig. 14 (d) and Fig. 15 (d): In Xinjiang province, the SOCD in 2000-2005 seems to change a lot compared with other periods. Is this due to the model itself, or did some event during this period cause a significant change in SOCD? Please give reasonable explanations in this part.
- Section 2.1: “brown soil, brown soil” is a duplicate.
- Section 2.2. Line 95: Are the SOCD data from Song or Xu? Please check carefully.
- Line 125: Generally, spatial interpolation results are reliable if stations are evenly distributed. What is the spatial distribution of the meteorological stations used for interpolation?
- Line 130: Please add the production time or effective period of the published soil datasets.
- Section 4.2. Line 225: There is no need to write out the full names of the statistical metrics, which have been given previously. Fig. 6: Could you add the sample number in Fig. 6? Please add units for RMSE in both the figures and the manuscript.
- Section 4.4. Fig. 11: Please add units to the colorbars of (b), (d), and (f), and note the time (which year). Is it the annual average or a specific year? Please add the validation results for 0-100 cm SOCD in the manuscript or Supplementary.
Citation: https://doi.org/10.5194/essd-2024-588-RC1
CC1: 'Comment on essd-2024-588', Tingxuan Zhang, 20 Mar 2025
The manuscript has many serious methodological problems and flaws, undermining the provided dataset’s accuracy. These issues include inadequate input variables for the random forest model, lack of method novelty, and unclear figure illustration. I wonder why this manuscript was sent out for review. Let me offer some significant issues:
(1) Random forest is a nonlinear supervised discrete classification model, while Pearson correlation coefficient is a correlation coefficient that measures linear correlation between two sets of data. The authors know nothing about it. They used the Pearson correlation coefficient to determine the variables for the random forest model inputs.
(2) The authors claimed that they used climate zoning to improve the prediction, but this is absolutely unnecessary because temperature and precipitation are the two most important variables (shown in Figure 8) and are highly correlated to climates. In this case, why take the trouble of building the random forest model for each climate zone?
(3) Issue 2 leads me to my next big concern: the verification part. As the authors insist that climate zoning is the novelty of their methods, why did they verify their results against others across different climate zones? From a scientific view, climate zoning does not mean anything to improve the model's accuracy. For instance, if the authors did not use climate zoning but other geographical partitioning, the smaller the partition area, the more accurate the model would be considering Tobler's first law of geography states that everything is related to everything else, but near things are more related to each other.
(4) Even so, I do not find a significant improvement in the SOCD prediction compared to other published datasets.
(5) The highly skewed SOCD sample input leads to the model's low accuracy (Figure 4). This is probably one of many reasons why the accuracy of 0-20cm SOCD showed higher R2 than that of 0-100cm SOCD.
(6) Another reason is the inadequate model input data. The lack of lidar data for soil depth measurement causes your results to be underestimated compared with other datasets (Figures 11 & 12).
(7) Figures 5(a) and 5(c) are unnecessary as the authors did not conduct any analysis using the biomes.
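Point (1) rests on the fact that the Pearson coefficient measures only linear association. A minimal sketch with toy data (not the manuscript's covariates; `pearson_r` is a hypothetical helper) shows that a covariate can determine the target exactly and still have a coefficient of zero:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
linear = [2.0 * x + 1.0 for x in xs]   # linear dependence on x
quadratic = [x * x for x in xs]        # deterministic but nonlinear dependence

r_linear = pearson_r(xs, linear)
r_quadratic = pearson_r(xs, quadratic)
print(f"linear: r={r_linear:.2f}, quadratic: r={r_quadratic:.2f}")
```

A screening step based on this coefficient alone would discard the quadratic covariate, although a nonlinear learner could use it.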
Citation: https://doi.org/10.5194/essd-2024-588-CC1
CC2: 'Comment on essd-2024-588', jianzhao wu, 22 Mar 2025
Publisher’s note: the content of this comment was removed on 24 March 2025 since the comment function was misused for promotional purposes.
Citation: https://doi.org/10.5194/essd-2024-588-CC2
RC2: 'Comment on essd-2024-588', Anonymous Referee #2, 27 Mar 2025
The manuscript attempts to produce a 1 km SOCD data product for China during 1985-2020. Yet the novelty of the manuscript is very limited, and the reliability of the data product is still difficult to judge, since the manuscript contains many inaccurate descriptions of the data sources and the methodology. My specific comments are given below:
(1) Feature optimization? I think it should be feature selection. Moreover, random forest (RF) can represent extremely complicated nonlinear relationships between SOCD and its drivers (i.e., the covariates that you used), so why did you select features based on Pearson correlation coefficients? Besides, RF is somewhat insensitive to feature selection!
(2) The use of climate zones in this study is really unnecessary, since the temperature and precipitation that you used to define the climate zones are already used as features in your RF model!
(3) How did you separate your training and test samples? How many samples are there for each year? Is the sampling balanced across years? This information is very important, since readers want to know whether you extrapolate beyond the years of observation.
(4) The description of how you built your space-time RF model is very confusing! I think your RF model should be a space-time model; otherwise, you cannot get a time series of SOC from 1985 to 2020. Or did you just model the SOCD during each time period separately? If that were true, this manuscript would have no novelty. If you built the space-time RF model through space-for-time substitution (see Heuvelink et al. 2020. Machine learning in space and time for modelling soil organic carbon change. Eur J Soil Sci.), are covariates like vegetation and land use considered “dynamic covariates”? How did you represent the lagged effects of dynamic covariates (i.e., the effect of temperature on the SOC state is lagged, as are those of vegetation and land use) and the memory effects of SOC (i.e., the state of SOC in this year depends on last year)? This information is essential for modelling changes or dynamics of SOC using a machine learning (ML) method like RF, as ML is a purely data-driven method.
(5) Validation across different time periods is missing; thus, it is difficult to judge the trend in SOC change.
(6) The sources of data are confusing. How many soil profiles are there for each year? We should check the balance of data across time. Was your DEM generated from topographic maps or resampled from the SRTM DEM? Are the weather data monthly or yearly? What is the beginning year of your weather data?
(7) Line 152: “the measured data in the 2000s is SOCD”? I don’t think so, since SOCD was calculated from SOC, bulk density, and coarse fragments, not directly measured.
(8) Line 153: “Second National Soil Census”. “Census” is usually used in economics; should this be “survey”? Many English words in such descriptions (of data sources) are inaccurate or confusing.
(9) Line 158: Since you calculated the SOCDs of the Chinese sampling points using bulk density and volume percentage of coarse fragments from the SoilGrids 2.0 data product, this is the very reason that your products are highly correlated with the SOCD of SoilGrids 2.0!
(10) Line 160: “coarse fractions proportion”. I think “proportion” is not the right word here, since your CF was divided by 100 in Equation 7.
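Points (7), (9), and (10) turn on how SOCD is derived from SOC content, bulk density, and coarse fragments. The sketch below uses the commonly used single-layer formula, which may differ in detail from the manuscript's Equation 7 (not reproduced in this discussion). The function name, the units, and the sample values are assumptions chosen for illustration.

```python
def socd_kg_per_m2(soc_g_per_kg, bulk_density_g_per_cm3, thickness_cm, coarse_fragment_pct):
    """SOC density of one soil layer in kg C per square metre.

    Assumed form: SOCD = SOC * BD * D * (1 - CF/100) / 100, with SOC in g/kg,
    BD in g/cm^3, D in cm, and CF as a volume percentage (0-100). The final
    division by 100 converts the product to kg/m^2.
    """
    if not 0.0 <= coarse_fragment_pct <= 100.0:
        raise ValueError("coarse_fragment_pct must be a percentage between 0 and 100")
    # CF is a percentage, so it is divided by 100 to get the fine-earth fraction.
    fine_earth_fraction = 1.0 - coarse_fragment_pct / 100.0
    return soc_g_per_kg * bulk_density_g_per_cm3 * thickness_cm * fine_earth_fraction / 100.0

# Hypothetical topsoil sample: 10 g/kg SOC, 1.3 g/cm^3, 0-20 cm layer, 10 % coarse fragments.
example = socd_kg_per_m2(10.0, 1.3, 20.0, 10.0)
print(f"SOCD = {example:.2f} kg/m^2")
```

Because bulk density and coarse fragments enter as direct multipliers, taking them from another product carries that product's spatial pattern into the computed SOCD, which is the concern raised in point (9).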
Citation: https://doi.org/10.5194/essd-2024-588-RC2
CC3: 'Comment on essd-2024-588', Bennett Wang, 04 Apr 2025
I identified several significant methodological flaws that undermine the validity of the findings. ESSD is a high-level data journal. For scientific rigour, I do not think the data constructed in this manuscript can be used as the basis for future scientific research. Therefore, I have no choice but to oppose the publication of this article and its data.
(1) The distribution and accumulation of soil carbon result from intricate and dynamic processes shaped by biological, environmental, and human factors. However, the authors only used features that capture the vegetation canopy (vegetation indices) as biotic factors. Other critical factors affecting soil carbon content, such as the chemical and physical properties of the soil itself, are missing. In particular, the authors' targets are carbon storage at various soil depths, but the explanatory variables in the machine learning model are only vegetation indices reflecting canopy growth and some climate variables, which are far from enough to predict carbon storage at depth.
(2) Related to (1), the authors also evidently ignored the effect of land use change on soil carbon storage, e.g., urbanization and the encroachment of agricultural land on forest land. The soil carbon content of agricultural land is definitely different from that of forest land. Fertilization and the distribution of roots in the soil under the two vegetation types also have an effect.
(3) The most important point is that this grid is derived by scaling point data to Landsat's 30 m resolution and then aggregating to a 1 km soil carbon density, which is seriously inaccurate. This approach obviously ignores the heterogeneity of the soil, making the results and models strongly dependent on the geographic distribution of the sample points. However, these points are not uniformly distributed across the grid at either 30 m or 1 km resolution.
(4) In addition, the manuscript lacks an adequate description of the method, making the experiment impossible to replicate and hard to understand.
(5) There is a lot of uncertainty in the data validation of this manuscript. For example, in Figures 10 to 12, what does each point represent? Are all 1 km × 1 km grid cells used for validation? I don't think so! It is obvious that the authors only selected specific pixels, which can be seen from the number of points. Even so, the accuracy of the validation is very low. The methods and features proposed in this study are clearly not enough to provide accurate soil carbon content.
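The heterogeneity concern in point (3) can be made concrete with a small sketch of block averaging. Whether the authors aggregated by simple averaging is an assumption; the grid values, the aggregation factor, and the function name `aggregate_blocks` are hypothetical. The point is only that a coarse-cell mean hides the spread of the fine cells beneath it.

```python
import math

def aggregate_blocks(grid, factor):
    """Average a fine grid into coarse cells of factor x factor fine cells.

    Returns (means, spreads): the mean of each coarse cell and the population
    standard deviation of the fine cells inside it.
    """
    rows, cols = len(grid), len(grid[0])
    if rows % factor or cols % factor:
        raise ValueError("grid dimensions must be multiples of factor")
    means, spreads = [], []
    for r0 in range(0, rows, factor):
        mean_row, spread_row = [], []
        for c0 in range(0, cols, factor):
            cells = [grid[r][c]
                     for r in range(r0, r0 + factor)
                     for c in range(c0, c0 + factor)]
            m = sum(cells) / len(cells)
            sd = math.sqrt(sum((v - m) ** 2 for v in cells) / len(cells))
            mean_row.append(m)
            spread_row.append(sd)
        means.append(mean_row)
        spreads.append(spread_row)
    return means, spreads

# Hypothetical fine-resolution SOCD values (kg/m^2); the top-right block is heterogeneous.
fine = [
    [2.0, 2.0, 1.0, 5.0],
    [2.0, 2.0, 5.0, 1.0],
    [4.0, 4.0, 3.0, 3.0],
    [4.0, 4.0, 3.0, 3.0],
]
means, spreads = aggregate_blocks(fine, 2)
print(means)
print(spreads)
```

Two coarse cells here share the same mean while differing completely in internal variability, so a point sample falling in the heterogeneous cell may be far from the cell value it is compared against.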
Citation: https://doi.org/10.5194/essd-2024-588-CC3
Data sets
A 1 km soil organic carbon density dataset with depth of 20cm and 100cm from 1985 to 2020 in China. Yi Dong, Xinting Wang, and Wei Su. https://doi.org/10.6084/m9.figshare.27290310.v1