the Creative Commons Attribution 4.0 License.
A China dataset of soil properties for land surface modeling (version 2)
Abstract. Accurate and high-resolution spatial soil information is crucial for efficient and sustainable land use, management, and conservation. Since the establishment of digital soil mapping (DSM) and the GlobalSoilMap working group, significant advances have been made in spatial soil information globally. However, accurately predicting soil variation over large and complex areas with limited samples remains a challenge, especially for China, which has diverse soil landscapes. To address this challenge, we utilized 11,209 representative multi-source legacy soil profiles (including the Second National Soil Survey of China, World Soil Information Service, First National Soil Survey of China, and regional databases) and high-resolution soil-forming environment characterization. Using advanced Quantile Regression Forest algorithms and a high-performance parallel computing strategy, we developed comprehensive maps of 23 soil physical, chemical and fertility properties at six standard depth layers from 0 to 2 meters in China with a 90 m spatial resolution (China dataset of soil properties for land surface modeling version 2, CSDLv2). Data-splitting and independent-sample validation strategies were employed to evaluate the accuracy of the predicted maps. The results showed that the predicted maps were significantly more accurate and detailed compared to traditional soil type linkage methods (i.e., CSDLv1, the first version of the dataset), SoilGrids 2.0, and HWSD 2.0 products, effectively representing the spatial variation of soil properties across China. The prediction accuracy of most soil properties at the 0–5 cm depth interval ranged from good to moderate, with Model Efficiency Coefficients for most soil properties ranging from 0.75 to 0.32 during data-splitting validation and from 0.88 to 0.25 during independent sample validation.
The wide range between the 5 % lower and 95 % upper prediction limits may indicate substantial room for improvement in current predictions. The relative importance of environmental covariates in predictions varied with soil properties and depth, indicating the complexity of interactions among multiple factors in the soil formation processes. As the soil profiles used in this study mainly originate from the Second National Soil Survey of China during the 1970s and 1980s, they could provide new perspectives on soil changes together with existing maps based on 2010s soil profiles. The findings make important contributions to the GlobalSoilMap project and can also be used for regional Earth system modeling and land surface modeling to better represent the role of soil in hydrological and biogeochemical cycles in China. This dataset is freely available and can be accessed at https://doi.org/10.11888/Terre.tpdc.301235 (Shi et al., 2024).
Status: final response (author comments only)
- RC1: 'Comment on essd-2024-299', Anonymous Referee #1, 02 Oct 2024
GENERAL COMMENTS:
The manuscript presents the preparation of soil maps for the entire area of China at a 90 m resolution, covering six soil depths down to 2 m. The maps include information on 23 soil physical and chemical properties. The structure of the manuscript is easy to follow, and the statistical analysis is thorough. However, the following points require revisions by the authors:
- It would be informative to include a description of how the sand, silt, and clay content was computed. Did you consider using additive log-ratio transformed clay, silt, and sand content as the dependent variables to ensure that the sum of the three particle size fractions is 100%? Additionally, how was uncertainty computed for sand, silt, and clay content?
- Please clarify in the entire manuscript whether data-splitting refers to cross-validation. If yes, please use the latter terminology, as it is more widely used.
- It is not clear how soil data from different time periods were considered for the mapping. This issue needs to be addressed in the manuscript.
- For the description of the soil maps, it might be useful to add a geographical map of China and refer to physical features rather than only using compass directions. Please provide information on the spatial patterns (paragraph 3.3) of all the derived maps: maps of TN, AN, porosity, gravel, AP, and colours are not described.
- Data accessibility is problematic. I see that the maps could be downloaded through FTP, but it didn’t work for me. The download site needs improvement, or information on how to use it should be provided.
Please find further suggestions under SPECIFIC COMMENTS.
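Editorial illustration of the first general comment (additive log-ratio transformed texture fractions): a minimal sketch, assuming clay is chosen as the denominator and that all three fractions are strictly positive. The function names and the sample values are hypothetical and are not taken from the manuscript or its code. The point it shows is that any pair of predicted log-ratio coordinates back-transforms to sand, silt, and clay percentages that sum to 100 %.

```python
import math


def alr(sand, silt, clay):
    """Additive log-ratio transform with clay as the denominator (an assumption).

    Fractions must be strictly positive; zeros need special handling first.
    """
    return math.log(sand / clay), math.log(silt / clay)


def alr_inverse(z_sand, z_silt):
    """Back-transform two ALR coordinates to percentages that sum to 100."""
    e_sand, e_silt = math.exp(z_sand), math.exp(z_silt)
    total = e_sand + e_silt + 1.0
    return 100.0 * e_sand / total, 100.0 * e_silt / total, 100.0 / total


# Hypothetical observation: 40 % sand, 35 % silt, 25 % clay.
z = alr(40.0, 35.0, 25.0)
sand, silt, clay = alr_inverse(*z)

# Perturbed coordinates (standing in for model predictions) still close to 100 %.
sand_p, silt_p, clay_p = alr_inverse(z[0] + 0.3, z[1] - 0.1)
```

Modelling in log-ratio space moves the sum constraint into the back-transformation, so no post hoc rescaling of the three maps is needed.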
SPECIFIC COMMENTS:
TITLE: you could put into brackets the acronym of the database: (CSDLv2)
L21: please add information on accuracy based on all depth intervals, not only 0-5 cm.
L36: it seems to be Lu et al. (2016) based on reference list. Please check and correct.
L39: please cite other papers as well.
L47: please rephrase the following: “exemplified by Brazil’s”, the sentence is not finished.
L64: … (McBratney et al., 2003) … the mistyping errors of references could be prevented by using a referencing tool. Please recheck in the entire text if reference list is in line with their citations.
L68: please shortly describe in the text the limitations of the existing dataset.
L80: please rephrase the following: “soil specie survey”, it is mistyped.
L83: please add the list of mapped soil properties.
L84-85: please write with lower case letters the words “available” and “alkali”. Is there a more general name for AN? E.g.: potential long-term supply of nitrogen in the soil (alkali-hydrolysable nitrogen, AN), or something similar?
L94-107: please decrease repetition, by mentioning each advancement once in a logical order.
L102: highlight that covariates were considered for the mapping as independent variables/predictors in the ML. Please consider that improvement in resolution is the result of points 1-3, therefore it could be mentioned after the points 1-4.
L104: please rephrase the following, it is difficult to understand: “without explicitly uncertainty estimates in CSDLv1”
L109: … in Fig. 2. … or change the order of Fig. 1 and 2.
L114-115: based on the entire manuscript 1) validation was performed based on data-splitting and independent soil profile dataset with measured soil data, and 2) comparison was done with existing national and global soil maps. Please consider it and revise the text and workflow figure (Fig. 2. left bottom corner) accordingly.
L132: … in Fig. 2 … change order of figures as suggested above.
L150: it might be better to write “location” instead of “’space”.
L153: it is OK to use soil type information from HWSD, but please shortly explain why you used this 1 km resolution map instead of SoilGrids 250 m resolution.
L159: aspect is not included in Table S2, please add it or delete in the text if it was not used.
L160: do you mean “organism related covariate”? Please rephrase.
L174: as above, why were soil factors derived from HWSD at 1 km rather than from SoilGrids at 250 m?
L181-182: please note that Pearson correlation coefficient can detect only linear relationships. Why didn’t you let RFE decrease the number of covariates? Why did you consider first Pearson corr. coeff. to decrease the number of predictors?
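Editorial illustration of the comment on L181-182: a minimal sketch with synthetic data (the covariate `x` and both responses are hypothetical) showing why screening by the Pearson coefficient can discard a covariate that is strongly but non-linearly related to the target.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Synthetic covariate, symmetric around zero.
x = [i / 10 for i in range(-20, 21)]

y_linear = [2 * v + 1 for v in x]   # perfectly linear response
y_quadratic = [v * v for v in x]    # perfectly determined by x, but not linearly

r_linear = pearson(x, y_linear)        # close to 1
r_quadratic = pearson(x, y_quadratic)  # close to 0 despite full dependence
```

A tree-based selection step such as RFE would not be limited to linear dependence in this way, which is the reason for the reviewer's question.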
L201: could you please add in the supplementary material info about the 15 most important variables for all depth and soil properties? Similarly to Fig. S26, which shows it for depth 0-5 cm.
L201: please mention somewhere under “2 Materials and Methods” how the resolution of the derived maps was defined.
L252: do you mean 1°× 1° tiles?
L262: Please specify “four different values”. Do you mean four prediction related values?
L271-276: as mentioned above, please clarify if 10-fold cross-validation was performed. Does it mean that all 1540 Chinese soil profiles of the WoSIS dataset were used only for validation?
L314-315: please consider the following and rephrase if you agree: the goal might be to have training data that is representative for China’s soil types. Do you think that the datasets available to train the model represent well the soil types under different land cover? I ask because in many countries soils from arable land are well represented, but soils from forested areas, organic soils, or less widespread soil types are underrepresented. How is it in your case?
L319-322: please note that vertical change in soil properties depends on soil types. Several soil properties are addressed in this manuscript, therefore be specific and do not state that “the average concentrations of most soil property variables tend to decrease with increasing depth (e.g., OC, TN), showing positive skewness distributions.” The last statement is confusing: “indicating no statistically significant differences between samples from different depths.” Is it the case for OC?
L336: please add what can be the reason for the vertical decline in the predictability of soil texture.
L339: please add information about the deeper layers, as well.
L350: what do you mean by “regional covariates”? Please rephrase.
L356: Fig. 4 is discussed later than Fig. 5, please change order of the figures.
L357-358: please explain in more detail the fact described in the sentence starting with “The gross …”. Please note that values of soil properties do not always increase with depth.
L364: please rephrase the first sentence, it is not complete.
L366: please explain what you mean by “looser soil particles”. What is the reason for the lower bulk density on the Qinghai-Tibet Plateau?
L367: Is land use the only factor that influences BD in the south-eastern coastal areas? Please explain differences in BD in deeper horizons, which are less affected by land use.
L368: please rephrase the first sentence, it is not complete.
L368-373: please be more specific in referencing the specific regions. Present sentences are contradicting, due to specifying the locations based on the points of the compass.
L373: please discuss map of TN, and why it shows similar pattern with OC.
L400: … lists the PICP values …
L410: please discuss how uncertainty changes with soil depth. What can be an explanation for that change?
L413-414: do you mean that organism type variables have the highest variable importance? Please rephrase.
L422: please note that soils developed on shallow bedrock do not always have low OC. Vegetation type on those soils influences the rate of OC accumulation.
L437: what is the source of organic matter content (TERECO) input layer in the case of clay content maps of CSDLv2? Isn’t it terrestrial ecosystems? Please revise the sentence.
L441-443: please rephrase the last two sentences of the paragraph, those are difficult to understand.
L446-447: please provide more information about the results of Shrini et al. (2017). It is not clear how that is related to your results on CEC.
L451: … SoilGrids 2.0 … please correct it here and the entire manuscript.
L452: please add the selection criteria both in the text and caption of Table 3. E.g., soil properties with highest prediction accuracy, or something similar.
L458-462: In the case of MEC calculate a percentage improvement relative to the possible range or describe absolute improvement, e.g. MEC improved from 0.48 to 0.69.
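Editorial illustration of the comment on L458-462: a small worked calculation using the reviewer's own example values (MEC from 0.48 to 0.69). Reading "relative to the possible range" as the remaining distance to a perfect MEC of 1 is an assumption, and the helper name is hypothetical.

```python
def mec_improvement(mec_old, mec_new):
    """Return the absolute MEC gain and that gain as a share of the headroom to MEC = 1."""
    absolute = mec_new - mec_old
    headroom = 1.0 - mec_old  # distance from the old score to a perfect score
    return absolute, absolute / headroom


# Reviewer's example: MEC improved from 0.48 to 0.69.
absolute, share = mec_improvement(0.48, 0.69)
# absolute is about 0.21; share is about 0.40, i.e. roughly 40 % of the remaining gap is closed.
```

A ratio such as 0.69 / 0.48 would overstate the change when the baseline MEC is small, and it is undefined or misleading when the baseline is zero or negative, which is why reporting the absolute change is the safer option.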
L477-479: do you think that CSDLv2 can better capture sites with extremely low or extremely high values? If yes, please add it and discuss why it can describe better the extreme values than the other maps.
L493-494: please add an example for “smoothing the properties of certain regions”. Or rephrase the sentence. Do you mean that extreme values are smoothed due to the type of algorithm used (QRF – provides a mean of several trees, which includes a mean at each node)?
L496: … To show the impact of the … or something similar.
L499: … aspects. … end the sentence, delete “:”.
L499: is the resolution of the derived maps 90 m, because the input layers, which are most important for the predictions, also have this resolution? If yes, please add this shortly.
L505-506: please add the benefit of producing map of soil colours in RGB. What is its practical use?
L523-524: the meaning of the sentence starting with “These soil nutrients …” is not clear, please describe more. Do you mean warm-up period?
L534: do you think that 90 m resolution can meet the needs of precision agriculture? 90 m resolution might support the spatial delineation of management zones. Please consider revising it in the text.
L557-558: “soil management” or “land use”? Please revise if land use is the correct word.
L557-571: this description is very informative. Suggestion for future development: if elevation and slope are highly correlated with temperature and precipitation, it might be possible to derive 90 m resolution climate variables from the original 1 km resolution by downscaling based on topographical variables.
L572-578: Ok, but it is not clear how you handled soil data originating from different time periods in your study. Please explain it shortly in the text.
L583: on the download page why:
- temporal resolution is yearly and
- spatial resolution is 10 m – 100 m?
L584: please shortly add how 1 km and 10 km resolutions were derived.
L589: … soil physical and chemical soil properties, with … Please delete here and in the entire manuscript the word “fertility”. Fertility is a complex soil property defined by many indicators. In this manuscript soil physical and chemical properties were addressed. Of course these influence soil fertility, but that is not the focus of the manuscript.
L594: … gridded soil datasets, …
L594: please rephrase “more reasonalble”, with something more specific.
L596: please shortly indicate that CSDLv2 describes the state of 1980.
L599-601: please complement the last sentence by how the limitations of CSDLv2 could be addressed in future studies – i.e., summarize paragraph 4.3.
DATA AND CODE AVAILABILITY:
The codes are accessible at GitHub.
Data accessibility is not smooth. I see that the data could be downloaded through FTP, but it didn’t work for me. The download possibility needs improvement or information on using the download site is needed.
TABLES:
Table 2: is it possible to give a general variable name for “Sentinel2B2/B3/B4/8/9” under the description column?
Table 3: do you mean that it is the result of cross-validation in the case of CSDLv2 and performance of the other maps (CSDLv1, SoilGrids 2.0, HWSD 2.0) on the dataset used to train and test the CSDLv2 predictions? Please revise the title to increase clarity. Add number of samples considered for the validation in a separate column.
FIGURES:
Fig. 2: revise left bottom corner based on advice for L114-115, and reedit the figure of “Other soil datasets”, its pattern might not be the same as that of the “Variable maps”. Direction of arrow on the left might go from “points of the soil profiles” to “Compare and evaluate”.
Fig. 4: the caption does not include information on DEM and land use map. Please add them.
Fig. 5: the labels are not visible. Please consider showing the maps in two or three figures, to increase visibility and readability. Please find a logic to put the maps into two or three groups, then you do not have to fit all 23 maps on one page (one figure), but on two or three figures. Please add the unit of the soil properties and add “content” where needed, e.g.: sand, silt, and clay content, etc.
Fig. 6: please increase size of the letters on the plot, it is difficult to read.
Fig. 7: please:
- increase size of the letters on the plot, it is difficult to read,
- add R2 – for both maps – and 1:1 line to b), d) and f) plots to better see the comparison,
- use the same min and max values on x and y axis by soil properties, e.g.: 0 and 30 % for OC, 0 and 100 % for sand, 0 and 80 % for clay.
SUPPLEMENTARY MATERIAL:
Fig. S2-S24: please increase size of the letters in the legend. Present version is difficult to read.
Fig. S2, S7, S8-13 : using the word “content” is not appropriate, please revise these captions. Fig. S8-13 needs some further clarification on the meaning of R, G, B, should be easy to understand without reading the manuscript.
Fig. S14-15: I thought there was more variety in soil colour. Do you have only 6 different colours? Or did you decrease/aggregate the possible colours?
Fig. S19-24: write out nitrogen, potassium, phosphorus before the brackets, instead of writing only N, K, and P.
Fig. S2-25: please add unit in the caption of the figure.
Fig. S25: … of the soil organic carbon (OC) and soil pH …
Fig. S26: please increase size of the letters on the plot, it is difficult to read.
Table S4: do you mean that it is the result of cross-validation in the case of CSDLv2 and performance of the other maps (CSDLv1, SoilGrids 2.0, HWSD 2.0) on the dataset used to train and test the CSDLv2 predictions? Please revise the title to increase clarity. Add number of samples considered for the validation in a separate column.
Fig. S26: please add that the top 15 most important variables are shown.
Citation: https://doi.org/10.5194/essd-2024-299-RC1
- AC1: 'Reply on RC1', Gaosong Shi, 23 Oct 2024
- AC2: 'Reply on RC1', Gaosong Shi, 08 Nov 2024
Dear Reviewer,
To facilitate more efficient access to the 90 m resolution maps developed in this study, we have also stored these maps on the Science Data Bank (scienceDB) repository as a backup (https://www.scidb.cn/s/ZZJzAz), which supports both HTML and FTP download options.
We have revised the manuscript to reflect this addition as follows:
"The soil maps in this study for six depth layers (0-5, 5-15, 15-30, 30-60, 60-100, and 100-200 cm) at 90 m spatial resolution across China are openly accessible at https://www.scidb.cn/s/ZZJzAz or https://doi.org/10.11888/Terre.tpdc.301235 (Shi et al., 2024)."
Citation: https://doi.org/10.5194/essd-2024-299-AC2
- RC2: 'Comment on essd-2024-299', Anonymous Referee #2, 01 Nov 2024
General comments
This is a great and very extensive contribution to spatial soil information in China and I enjoyed reading the manuscript. The authors use well-established digital soil mapping methodologies. My main concerns and comments are below:
Soil point data:
Key information is missing about the soil point data (Sect. 2.1.2). Are these observations (you use the term in-situ values) laboratory measurements or pedological field estimates (or perhaps both depending on the dataset and soil property)? If they are laboratory measurements, what methods were used to measure them? Are they data only from soil profiles or also from boreholes / augerings? At what depth was sampled (by fixed/predefined soil layer in cm or by pedological soil horizon)? What is the sampling design of the different datasets?
Data-splitting and model evaluation:
It seems that the authors did not group the data-splitting procedures by location / soil profile. If observations from the same profile but at different depths are used in both training and testing (calibration and validation, or in case of CV, it’s also called hold-in vs. hold-out), then accuracy statistics are overly optimistic. This seems problematic in several steps of the modelling framework: RFE using OOB, 10-fold CV during hyperparameter tuning and most importantly, during model evaluation used for reporting the accuracy metrics. Please adjust methods so that, in all steps, all observations from the same location / profile are either in the hold-in or hold-out.
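Editorial illustration of the grouping the reviewer asks for: a minimal sketch, assuming each sample carries a profile identifier. The identifiers, the fold count, and the helper name are hypothetical and are not taken from the authors' code. Folds are assigned per profile rather than per sample, so all depth layers of a profile fall on the same side of every split.

```python
def grouped_folds(profile_ids, n_folds):
    """Assign a fold index to each sample so that a profile never spans two folds."""
    unique_profiles = sorted(set(profile_ids))
    # Round-robin assignment of whole profiles to folds; shuffle first in practice.
    fold_of_profile = {pid: i % n_folds for i, pid in enumerate(unique_profiles)}
    return [fold_of_profile[pid] for pid in profile_ids]


# Hypothetical samples: each profile contributes one row per sampled depth layer.
profile_ids = ["P1", "P1", "P1", "P2", "P2", "P3", "P3", "P3", "P4", "P5", "P5"]
n_folds = 3
folds = grouped_folds(profile_ids, n_folds)

# Collect, for every fold, the profiles held out and the profiles used for training.
splits = []
for k in range(n_folds):
    held_out = {p for p, f in zip(profile_ids, folds) if f == k}
    held_in = {p for p, f in zip(profile_ids, folds) if f != k}
    splits.append((held_in, held_out))
```

The same fold assignment would need to be reused in feature elimination, hyperparameter tuning, and final evaluation. With scikit-learn, `GroupKFold` provides this behaviour when the profile identifiers are passed as `groups`.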
Discussion on use at various spatial scales:
I miss a discussion and recommendations of when and when not to use these maps. You have generated national maps for China of 20 soil properties, which you can expect will be widely used for science, policy and society. Therefore, it is in your interest to make sure they are not used the wrong way. Resolution is not the same thing as accuracy. While it’s great that the authors have created high-resolution products, this does not mean that they are accurate or should be recommended to use at the local level, e.g. farm or field scale. For local-scale policy and land use decisions, local models with more detailed soil surveys would most likely need to be made. However, surely on a national scale and perhaps also on a large regional scale (provincial level), these maps can be used (given that users also consider the uncertainty that you report, i.e. accuracy metrics and uncertainty maps). Please add a section on this topic in the discussion supported by relevant literature.
Please proofread for English spelling and grammar carefully. Currently there are numerous spelling and grammatical errors, some of which (not all) I have listed in the “technical corrections” below.
Figures should be improved and legends and axes labels are often not readable.
Assets (data and code):
I was not able to access or download the data (90m resolution prediction maps). I recommend changing the data repository site and choosing one recommended by ESSD (https://www.earth-system-science-data.net/submission.html#assets). The model code is not provided and so the manuscript and modelling results are not reproducible (repository only contains 2 small scripts). I was not able to open the IGSN link when clicking on it but it did work when I pasted it into the browser (https://doi.org/10.11888/Terre.tpdc.301235). The “data sets” and “IGSN” assets are the same so one can be deleted. The “interactive computing environment” asset is merely a link to the python website and can be removed.
Specific comments:
L42-46: A more recent national product very similar to your own that is worth listing here is https://doi.org/10.5194/essd-16-2941-2024
L105-106: I would suggest to remove the first aspect: you already mentioned several times that new datasets were incorporated and more data were used than in other DSM studies in China. In addition, given the size of the country, the number of soil profiles is still not very high.
L109-115: Thank you for including Fig. 2, which is very useful (see also my technical recommendations regarding this figure below). However, I think the list 1-4 here in the text does not summarize all the relevant steps completely. What about soil point data and covariate harmonization and preparation (which generally takes the longest!), model evaluation not only using data-splitting but also uncertainty maps?
L263-265: How did you obtain the mean prediction using QRF? Or did you use RF for obtaining the mean prediction? This issue is discussed also in https://doi.org/10.1016/j.geoderma.2021.115659, https://doi.org/10.5194/essd-16-2941-2024 or https://doi.org/10.5194/soil-7-217-2021
L261-265: Did you compare median and mean predictions? You could do so quantitatively by comparing accuracy metrics and qualitatively by comparing the quality of the maps visually. Perhaps for some of the many soil properties that you predicted, median predictions are more accurate or are to be preferred over mean predictions. Median and mean predictions of DSM products using QRF and RF are e.g. compared in https://doi.org/10.5194/essd-16-2941-2024.
L274: Why did the authors choose the WoSIS dataset as the independent dataset for statistical validation (second method)? Looking at Fig. 1 of the soil point data on the map, it’s quite clear to me and it’s a good choice, but it should still be shortly explained as this is an important detail. The choice of dataset used for statistical validation strongly influences accuracy metrics (e.g. https://doi.org/10.1111/j.1365-2389.2011.01364.x)
L276-281 and Eq. 2-4: Consider changing the order to ME followed by RMSE and then MEC since mathematically this makes much more sense (ME is a part of RMSE equation). This would also make more sense for explaining the terms in the text directly afterwards (L282-286).
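Editorial illustration of the three metrics in the order the reviewer proposes: a minimal sketch assuming the usual definitions (error taken as predicted minus observed, MEC as one minus the ratio of the squared-error sum to the total sum of squares around the observed mean). The manuscript's Eq. 2-4 are not reproduced on this page, so the sign convention and the sample values here are assumptions.

```python
import math


def mean_error(obs, pred):
    """ME: average of (predicted - observed); measures bias."""
    return sum(p - o for o, p in zip(obs, pred)) / len(obs)


def rmse(obs, pred):
    """RMSE: square root of the mean squared (predicted - observed) error."""
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(obs, pred)) / len(obs))


def mec(obs, pred):
    """MEC: 1 - SSE / SST, where SST is taken around the mean of the observations."""
    mean_obs = sum(obs) / len(obs)
    sse = sum((o - p) ** 2 for o, p in zip(obs, pred))
    sst = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - sse / sst


# Hypothetical observed and predicted values.
obs = [1.0, 2.0, 3.0, 4.0]
pred = [1.5, 2.0, 2.5, 4.0]

me_value = mean_error(obs, pred)   # 0.0: errors cancel, so no bias
rmse_value = rmse(obs, pred)       # sqrt(0.125)
mec_value = mec(obs, pred)         # 1 - 0.5 / 5.0 = 0.9
```

The example also shows why the metrics are reported together: ME is zero here although individual predictions are off, which only RMSE and MEC reveal.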
L303: I don’t think Yan et al., 2020 is the most appropriate citation here. Better choose a manuscript that is specifically about prediction uncertainty and its error sources in DSM or statistical modelling. Some examples include: https://doi.org/10.1007/978-3-319-63439-5_14 or https://doi.org/10.1016/j.geoderma.2024.117052
L333-334: I suggest referencing the extensive review study of Chen et al. 2022 (https://doi.org/10.1016/j.geoderma.2021.115567) and also comparing with other studies (e.g. https://doi.org/10.5194/essd-16-2941-2024) not only in China to support the statement that pH is usually easiest to predict.
L401: Careful! Confidence intervals are not the same as prediction intervals. Here you should be referring to prediction intervals, just as you do in the methods section.
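Editorial illustration of how a prediction interval is checked: a minimal sketch of the prediction interval coverage probability (PICP), assuming the interval is bounded by the 5 % and 95 % quantile predictions mentioned in the abstract. The observed values and limits below are hypothetical.

```python
def picp(obs, lower, upper):
    """Percentage of observations falling inside their prediction interval."""
    inside = sum(1 for o, lo, hi in zip(obs, lower, upper) if lo <= o <= hi)
    return 100.0 * inside / len(obs)


# Hypothetical validation observations with per-sample 5 % and 95 % prediction limits.
obs = [5.0, 7.2, 3.1, 9.8, 6.4]
lower = [4.0, 6.0, 3.5, 8.0, 5.0]
upper = [6.0, 8.0, 5.0, 9.0, 7.5]

coverage = picp(obs, lower, upper)  # 3 of 5 observations are covered
```

For a 90 % prediction interval, a PICP near 90 % indicates well-calibrated uncertainty; values well below suggest intervals that are too narrow. A confidence interval for the mean prediction would not be expected to cover individual observations at this rate, which is the distinction the reviewer draws.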
L555-556: A more recent approach has also used dynamic covariates that vary not only in two-dimensional space but also over depth (and time), see https://doi.org/10.1038/s43247-024-01293-y
Technical corrections:
L75: remove parentheses around Zhou et al., 2019a
L104: “without explicit uncertainty”
L109: Fig. 1 is the map of soil profiles. Here I assume you refer to Fig 2. Check this and make sure all tables and figures are in the correct chronological order in which they appear in the text.
L131: If I am not mistaken 11,209 should be written as 11 209 and 8,979 as 8979. Also, there should be spaces between units (also percentages) and the number. Please carefully read through https://www.earth-system-science-data.net/submission.html. There is a very detailed and useful section about “mathematical notation and terminology”. Please check this and apply to entire manuscript.
L146-147: include reference to GSM standard depths to make it clear which international standards you are referring to:
Arrouays et al., 2014. GlobalSoilMap: Basis of the global spatial soil information system.
Arrouays et al., 2015. The GlobalSoilMap project specifications, in: Proceedings of the 1st GlobalSoilMap Conference.
L160: Perhaps adjust to “Covariates related to the soil-forming factor ‘organism’”
Tables and Figures:
Figures in the manuscript and supplements are often too small, and axis and legend labels are not readable. It is key that these figures are improved for publication, as maps are key to this study. Some color scales in the figures are not color-blind friendly (red and green colors), e.g. Fig. S26.
In general, I would recommend re-assessing where and how information is presented in figures, which I realize is challenging with so many predicted soil properties at different depths and maps of uncertainty etc. Perhaps see https://doi.org/10.5194/essd-16-2941-2024 and the supplements of that manuscript for ideas (https://doi.org/10.5194/essd-16-2941-2024-supplement) – there they organized the supplements by soil property.
Figure 2: remove “altitude”, shown in parentheses below depth. Altitude usually refers to elevation, whereas here you are referring to depth. According to Meinshausen 2006, QRF should be “quantile regression forest”, not “quantile random forest”. You also refer to it as quantile regression forest elsewhere. Check entire manuscript to make sure it’s the same. “Variables” is misspelled (“varibles maps”). Finally, the caption is grammatically incorrect: either “for national-scale soil properties mapping” or “for developing national-scale soil property maps”. Please check.
Figure 5: Maps are too small. Legends and axis labels cannot be read. Maps need to be enlarged. Consider restructuring figures (see comment above).
Citation: https://doi.org/10.5194/essd-2024-299-RC2
- AC3: 'Reply on RC2', Gaosong Shi, 09 Nov 2024
- AC4: 'Reply on RC2', Gaosong Shi, 15 Nov 2024
Dear Reviewer,
Following your suggestion, we compared the performance of mean predictions using Random Forest (RF) and median predictions using Quantile Regression Forest (QRF) when generating the national-scale dataset. Please refer to the attached PDF for detailed information.
Data sets
A China dataset of soil properties for land surface modeling (version 2) Gaosong Shi and Wei Shangguan https://doi.org/10.11888/Terre.tpdc.301235
Model code and software
A China dataset of soil properties for land surface modeling (version 2) Gaosong Shi https://github.com/shgsong/CSDLv2
Interactive computing environment
A China dataset of soil properties for land surface modeling (version 2) Gaosong Shi https://www.python.org/
IGSN
A China dataset of soil properties for land surface modeling (version 2) Gaosong Shi https://doi.org/10.11888/Terre.tpdc.301235
Viewed
- HTML: 818
- PDF: 205
- XML: 78
- Total: 1,101
- Supplement: 55
- BibTeX: 9
- EndNote: 14