OpenLandMap-soildb: global soil information at 30 m spatial resolution for 2000–2022+ based on spatiotemporal Machine Learning and harmonized legacy soil samples and observations

Hengl, Tomislav; Consoli, Davide; Tian, Xuemeng; Nauman, Travis W.; Nussbaum, Madlene; Isik, Mustafa Serkan; Parente, Leandro; Ho, Yu-Feng; Simoes, Rolf; Gupta, Surya; Samuel-Rosa, Alessandro; Zborowski Horst, Taciara; Safanelli, José Lucas; Harris, Nancy

doi:10.5194/essd-2025-336

Preprints

24 Jun 2025

Status: a revised version of this preprint was accepted for the journal ESSD.

Abstract. There is increasing interest in global dynamic soil information with changes in soil properties mapped over time and at high spatial resolution. Thanks to long-term, multi-temporal, and fine- and medium-resolution satellite missions such as Landsat, MODIS, Copernicus Sentinel and similar, it is possible to produce globally consistent predictions of key soil variables that match other 10–30 m spatial resolution global data sets. This paper describes data preparation, modeling, and production of OpenLandMap-soildb: global dynamic predictions of soil organic carbon content, soil organic carbon density, bulk density, soil pH in H₂O, soil texture fractions (clay, sand and silt) and USDA subgroup soil types (USDA soil taxonomy subgroups) at 30 m spatial resolution based on spatiotemporal Machine Learning (Quantile Regression Random Forest with output predictions showing the mean plus the lower and upper prediction intervals of 68 % probability). To train the models, a large compilation of soil samples imported from legacy soil projects was used: 216,000 soil samples with soil carbon density (kg m^-3), 408,000 soil samples with soil carbon content (g kg^-1), 272,000 samples with soil pH in H₂O, 363,000 samples with clay, silt, and sand (%), and 134,000 samples with bulk density oven dry (t m^-3). Soil carbon and soil pH were mapped at 5-year time intervals; soil texture fractions, bulk density, and soil types were mapped for recent years only. The cross-validation results indicate RMSE of 17.7 (kg m^-3; 0.486 in log-scale) and CCC of 0.88 for SOC density, RMSE of 51.3 (g kg^-1; 0.574 in log-scale) and CCC of 0.87 for SOC content, RMSE of 0.15 (t m^-3) and CCC of 0.92 for bulk density of fine-earth, RMSE of 0.51 and CCC of 0.91 for soil pH, RMSE of 8.4 % and CCC of 0.87 for soil clay content, and RMSE of 12.6 % and CCC of 0.84 for soil sand content, respectively.
The most important variables for predicting soil organic carbon density (kg m^-3) were: soil depth, Landsat-based uncalibrated Gross Primary Productivity (GPP), Normalized Difference Vegetation Index (NDVI) and CHELSA bioclimatic indices. The global distribution of soil pH can be primarily explained by the CHELSA Aridity Index (long-term), annual precipitation, and salinity grade. The global SOC stocks for the 2020–2022+ period and the 0–30 cm depth interval are estimated at 461 Pg (petagrams); the results further indicate that, in the last 25 years, the world has lost at least 11 Pg of SOC in the topsoil. Suggestions are made on how to set up global permanent monitoring stations to accurately track land degradation and enable land restoration projects. The training dataset is available at https://doi.org/10.5281/zenodo.4748499 (Hengl and Gupta, 2025), while the resulting data products can be accessed at https://doi.org/10.5281/zenodo.15470431 (Consoli et al., 2025). Both datasets are released under a CC-BY license.
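
The accuracy figures in the abstract are given as RMSE and CCC. Reading these as root mean square error and Lin's concordance correlation coefficient is an assumption here, since the abstract does not spell out either abbreviation. A minimal Python sketch of both metrics, using hypothetical toy values rather than data from the paper:

```python
import math

def rmse(observed, predicted):
    """Root mean square error between paired observations and predictions."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def ccc(observed, predicted):
    """Lin's concordance correlation coefficient (population moments)."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    var_o = sum((o - mean_o) ** 2 for o in observed) / n
    var_p = sum((p - mean_p) ** 2 for p in predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted)) / n
    # Penalizes both scatter (through cov) and systematic offset (through the mean difference)
    return 2 * cov / (var_o + var_p + (mean_o - mean_p) ** 2)

# Hypothetical toy values, not taken from the paper
obs = [1.0, 2.0, 3.0, 4.0]
pred = [1.5, 2.5, 3.5, 4.5]
print(rmse(obs, pred), ccc(obs, pred))
```

Unlike a plain correlation coefficient, CCC drops below 1 when predictions are perfectly correlated with the observations but shifted or rescaled, which is why it is often reported alongside RMSE.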

How to cite. Hengl, T., Consoli, D., Tian, X., Nauman, T. W., Nussbaum, M., Isik, M. S., Parente, L., Ho, Y.-F., Simoes, R., Gupta, S., Samuel-Rosa, A., Zborowski Horst, T., Safanelli, J. L., and Harris, N.: OpenLandMap-soildb: global soil information at 30 m spatial resolution for 2000–2022+ based on spatiotemporal Machine Learning and harmonized legacy soil samples and observations, Earth Syst. Sci. Data Discuss. [preprint], https://doi.org/10.5194/essd-2025-336, in review, 2025.

Received: 06 Jun 2025 – Discussion started: 24 Jun 2025

Competing interests: Tomislav Hengl, Davide Consoli, Xuemeng Tian, Mustafa Serkan Isik, Leandro Parente, Yu-Feng Ho and Rolf Simoes are employed by OpenGeoHub.

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Status: final response (author comments only)

CC1:
'Comment on essd-2025-336', Surendran Udayar Pillai, 04 Aug 2025

Comments file attached

Citation: https://doi.org/10.5194/essd-2025-336-CC1
- AC3: 'Reply on CC1', Xuemeng Tian, 27 Nov 2025
  
  We have compiled our replies in the attached PDF, where we provide point-by-point responses to all of the comments.
  
  Citation: https://doi.org/10.5194/essd-2025-336-AC3
CC2:
'Comment on essd-2025-336', Mario Guevara Santamaria, 08 Aug 2025

Comment on https://doi.org/10.5194/essd-2025-336 “OpenLandMap-soildb: global soil information at 30 m spatial resolution for 2000–2022+ based on spatiotemporal Machine Learning and harmonized legacy soil samples and observations”. Validation of soil organic carbon and bulk density predictions at the national scale of Mexico.

Carlos Arroyo, Viviana Varon, and Mario Guevara

Geosciences Institute, National Autonomous University of Mexico, Campus Juriquilla, Queretaro, Mexico.

The authors present an interesting spatial and temporal digital soil mapping effort to predict key soil variables at the global scale. Among other variables, soil organic carbon and bulk density are critical to understanding soil responses to environmental change and land use. The authors increase the global availability of these variables at unprecedented spatial resolution, for use by multiple users across a large diversity of applications. There is high scientific merit behind this effort and we hope to see the final version published soon.

However, the misuse of model-derived products has important implications, because such products are not error-free and carry intrinsic, multi-source uncertainty. In a revised version, the narrative could better guard against the misuse of model-derived soil products in areas dominated by high uncertainty. While the authors report relatively high accuracy from cross-validation, we hypothesize that this accuracy will drop significantly when predictions are compared with fully independent datasets (e.g., in a leave-one-dataset-out cross-validation), because each dataset is collected for a different purpose. Our overarching goal is to increase the interoperability of digital soil mapping efforts from the plot to the global scale. The objectives of this comment are therefore (a) to highlight the existence of fully independent national databases in Mexico that can be used to improve the accuracy of global soil predictions or to calibrate country-specific estimates, and (b) to compare country-specific values of soil organic carbon and bulk density from fully independent datasets with values derived from the new global soil variability models on 30 m grids.

We use two fully independent datasets to validate global soil predictions at the national scale in Mexico. The first dataset was collected and analyzed by our National Institute of Statistics and Geography (INEGI) in 2008 to assess soil erosion at the national scale, considering multiple land covers (INEGI, 2014). The second was collected and analyzed by the former Ministry of Agriculture (now SADER) with support from FAO in 2012, considering only agricultural land (Arroyo et al., 2025). While the INEGI dataset is representative of the topsoil, from the mineral surface to a maximum soil depth of 30 to 40 cm, the agricultural soil dataset is representative of the first 30 cm of soil depth. The INEGI dataset is available here: https://www.inegi.org.mx/app/biblioteca/ficha.html?upc=702825004223 and the SADER dataset is described and available here: https://bsssjournals.onlinelibrary.wiley.com/doi/10.1111/ejss.70116. Note that INEGI metadata is available in Spanish only (please let us know of any required assistance). We first download the soil datasets and the model predictions from OpenLandMap-soildb, and then compute the R² between the soil carbon and bulk density values from the global predictions and those in the datasets. Because the agricultural dataset reports organic matter rather than organic carbon, we convert the values using the conventional factor of 0.58 (Van Bemmelen, 1897).
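
The comparison described above can be sketched in Python as follows. This is only an illustration under stated assumptions: R² is taken to be the squared Pearson correlation (the comment does not say how it was computed), the function names are hypothetical, and the sample values are invented for the example rather than drawn from the INEGI or SADER data.

```python
import math

VAN_BEMMELEN_FACTOR = 0.58  # conventional organic carbon fraction of organic matter

def organic_matter_to_carbon(organic_matter):
    """Convert an organic matter value to organic carbon (same units)."""
    return organic_matter * VAN_BEMMELEN_FACTOR

def r_squared(x, y):
    """Squared Pearson correlation between two paired sequences (assumed definition)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

# Hypothetical values in %, invented for illustration only
organic_matter_pct = [1.0, 2.5, 4.0, 8.0]
measured_oc = [organic_matter_to_carbon(v) for v in organic_matter_pct]
predicted_oc = [0.9, 1.2, 2.0, 6.5]

r2_raw = r_squared(measured_oc, predicted_oc)
# Same comparison after a natural log transform of both series
r2_log = r_squared([math.log(v) for v in measured_oc],
                   [math.log(v) for v in predicted_oc])
print(r2_raw, r2_log)
```

Reporting both the raw and the log-scale value makes it visible when a few high-carbon samples dominate the raw correlation.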

As expected, we observe a lower correlation than that reported in the paper when comparing predictions against fully independent datasets (Fig. 1). Comparing global models with fully independent datasets is a useful way to identify the main drivers of soil research across countries and to assess the capacity of a global model to reproduce nationwide information.

Comparing all land uses, the correlation between the OpenLandMap-derived soil carbon values and the INEGI 2008 dataset increases significantly, from R² = 0.06 to R² = 0.34, when the values are transformed to a natural log scale. Across agricultural land only, the OpenLandMap soil carbon predictions and the soil carbon values in the 2012 dataset described in Arroyo et al. (2025) show an R² of 0.23 that, interestingly, was not sensitive to the logarithmic transformation. Bulk density in the Mexican datasets also differs from that reported in the OpenLandMap products (Fig. 1).

[Insert Fig. 1] Fig. 1. Scatterplots of soil organic carbon and bulk density values from the OpenLandMap project compared with independent soil datasets across Mexico. Soil organic carbon across all land uses, from a national dataset representing the year 2008, shows a low correlation against the OpenLandMap product (a). For a national dataset collected in 2012 on agricultural land only, the correlation is slightly higher (b). Bulk density across all land uses (c) and across agricultural land (d) shows even lower correlation values.

It is clear that, based on R² metrics and an independent dataset, the validation at the national scale differs from that reported for the OpenLandMap products. Global soil maps of this kind are commonly used by governmental institutions for decision making, above all in countries that lack soil information. We therefore propose reporting a country-based validation, for example a leave-one-country-out validation. Maps of R² variation across the world would help users (e.g., public institutions, universities) understand the specific limitations of global products in their countries.
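
A leave-one-country-out validation of the kind proposed above could be sketched as follows. The simple least-squares line is a stand-in used only to keep the example self-contained, not the model used for OpenLandMap-soildb, and the country labels, function names, and values are all hypothetical.

```python
import math

def leave_one_group_out(records, fit, predict):
    """Yield (group, observed, predicted) with each group held out in turn.

    records is a list of (group, x, y) tuples; the model is refitted
    on all other groups before predicting the held-out one.
    """
    groups = sorted({group for group, _, _ in records})
    for held_out in groups:
        train = [(x, y) for group, x, y in records if group != held_out]
        test = [(x, y) for group, x, y in records if group == held_out]
        model = fit(train)
        observed = [y for _, y in test]
        predicted = [predict(model, x) for x, _ in test]
        yield held_out, observed, predicted

def fit_line(pairs):
    """Ordinary least squares for y = a + b * x (stand-in model)."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def predict_line(model, x):
    intercept, slope = model
    return intercept + slope * x

def rmse(observed, predicted):
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

# Hypothetical (country, covariate, soil value) records
records = [
    ("A", 1.0, 2.1), ("A", 2.0, 3.9),
    ("B", 3.0, 6.2), ("B", 4.0, 8.1),
    ("C", 5.0, 9.8), ("C", 6.0, 12.1),
]
per_country_rmse = {
    country: rmse(obs, pred)
    for country, obs, pred in leave_one_group_out(records, fit_line, predict_line)
}
print(per_country_rmse)
```

Keeping the per-country scores separate, rather than pooling them, is what would allow the errors to be mapped country by country.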

The authors present an unprecedented opportunity to increase soil data quantity, quality and accessibility by combining local to national datasets into global soil variability models. The synergy between regional to global soil variability models brings positive implications towards more robust soil estimates (Zhang et al., 2025). We hope that the authors find the highlighted datasets useful for their global soil mapping efforts towards an increased interoperability among national to global soil mapping groups. We believe that highlighting all possible sources of uncertainty and clarifying the scope of the new information would help to promote the responsible use of global soil variability models. In conclusion, our comment enriches the ongoing discussion around global soil mapping by grounding it in real-world national data and offering constructive pathways for improvement. It’s the kind of feedback that can elevate both the scientific robustness and practical relevance of large-scale environmental models.

References:

Arroyo-Cruz, C.E., Prado, B., Kolb, M., Mora-Palomino, L.N., Todd-Brown, K. and Guevara, M. (2025), Synthesis of a National Soil Dataset Across Productive Land in Mexico: The Importance of Making Existing Data Accessible. Eur J Soil Sci, 76: e70116. https://doi.org/10.1111/ejss.70116

INEGI (2014), Conjunto de Datos de Erosión del Suelo, Escala 1:250 000, Serie I, Continuo Nacional. https://www.inegi.org.mx/app/biblioteca/ficha.html?upc=702825004223 (last accessed 07/08/2025).

Zhang, L., Yang, L., Ma, Y., Zhu, A.-X., Wei, R., Liu, J., Greve, M.H. and Zhou, C. (2025), Regional-scale soil carbon predictions can be enhanced by transferring global-scale soil–environment relationships. Geoderma, 461: 117466. https://doi.org/10.1016/j.geoderma.2025.117466

Van Bemmelen, J.M. (1897), Die Absorption. Das Wasser in den Kolloiden, besonders in dem Gel der Kieselsäure. Z. Anorg. Chem., 13(1): 233–356.

Note: Fig. 1 is available in the attached PDF.

Citation: https://doi.org/10.5194/essd-2025-336-CC2
- AC4: 'Reply on CC2', Xuemeng Tian, 27 Nov 2025
  
  The comment was uploaded in the form of a supplement: https://essd.copernicus.org/preprints/essd-2025-336/essd-2025-336-AC4-supplement.pdf
  
  Citation: https://doi.org/10.5194/essd-2025-336-AC4
RC1:
'Comment on essd-2025-336', Zhongkui Luo, 13 Aug 2025
This is an ambitious and important study that, with revisions, should be published.
As a soil ecologist focusing on soil carbon–nutrient dynamics and their interactions under environmental and management changes, I found the work both impressive and valuable. Using comprehensive, high-resolution global datasets, the authors employ spatiotemporal machine learning approaches to map soil organic carbon (SOC), soil pH, soil type, and other variables at a remarkable 30 m resolution worldwide. They also estimate SOC changes over the past two decades. The harmonized legacy soil observations are particularly noteworthy, and the resulting maps will be valuable for diverse applications, such as driving soil carbon models, informing land management, and guiding policy decisions.
That said, there are several areas—both scientific and presentational—where improvements would substantially enhance the manuscript’s rigor, clarity, and impact.
The manuscript uses an excessive number of abbreviations, which interrupts reading flow. Some abbreviations are not defined at first mention—for example, CCC and RMSE in the abstract. The authors should not assume all readers will be familiar with these terms. I recommend carefully reviewing the manuscript to ensure that all abbreviations are defined upon first appearance and that only essential abbreviations are retained. Given the already considerable length of the manuscript, the use of numerous abbreviations does not meaningfully reduce length and may hinder comprehension.
The manuscript is long enough to deter some potential readers. If journal policy allows, I recommend moving certain sections—such as extended details on data collection, preparation, modeling approach, and mapping techniques—into the Supplementary Information. This would allow the main text to focus more on:
- Implications and potential applications of the dataset
- Interpretation of key findings (e.g., variable importance, spatial patterns)
- Broader impacts and future directions

While complexity is sometimes unavoidable, figures should be as simple, clear, and self-explanatory as possible. Specific suggestions:
Figure 1: Reorganize to emphasize the workflow; remove unnecessary logos. Use a consistent layout style (either top-down or left-right) with clearly separated blocks for each step.
Figure 2: The content can be succinctly described in a few sentences; consider removing the figure or replacing it with more impactful visuals.
Figure 5: Overly complex and difficult to interpret; despite repeated attempts, I could not fully understand it.
Figure 6: Contains too much information without adequate caption detail. Consider showing only the left panel with block diagrams; integrate textual explanations, interpretations, and distribution plots into the main text or Methods section.

Scientific Comments:
Motivation for 30 m resolution. The rationale for mapping at 30 m resolution should be articulated more clearly. Higher resolution should serve a clear theoretical or practical purpose—for example, improving global SOC stock estimates, supporting fine-scale land management, or providing critical input to Earth system models. The current introduction touches on these but could better synthesize them into a concise, logically connected research objective. The four research questions presented are somewhat disjointed; consider distilling them into a single, coherent framework.
Multicollinearity of predictor variables. The manuscript does not address multicollinearity among predictors—a significant concern for high-dimensional datasets, especially with overlapping climatic and remote sensing variables (e.g., SAVI, NDVI, GPP, and climate metrics). Multicollinearity can cause overfitting and obscure variable importance. The authors should clarify whether collinearity was assessed or controlled, and if not, explain the rationale. Reducing redundancy could also decrease computation time and improve model interpretability.
Inclusion of bedrock depth. Bedrock depth is a critical factor influencing SOC stocks, particularly in mountainous regions where bedrock often occurs at shallow depths (<1 m). While the authors mention future inclusion, I suggest considering it now—global bedrock depth maps do exist and could be integrated relatively easily. Bedrock depth affects root distribution and carbon inputs to the soil, potentially altering model performance and the relative importance of predictors.
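
The collinearity check raised in the multicollinearity comment above could start with a simple pairwise screen. The sketch below is one possible approach, not the authors' procedure: the 0.9 threshold is an arbitrary assumption, and the predictor names and values are hypothetical.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient between two paired sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlated_pairs(predictors, threshold=0.9):
    """Return predictor pairs whose absolute correlation exceeds the threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(predictors.items(), 2):
        r = pearson(a, b)
        if abs(r) > threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged

# Hypothetical predictor values at five sample locations
predictors = {
    "ndvi": [0.20, 0.35, 0.50, 0.65, 0.80],
    "savi": [0.15, 0.27, 0.38, 0.50, 0.61],
    "precip": [900.0, 300.0, 1200.0, 450.0, 700.0],
}
flagged = correlated_pairs(predictors)
print(flagged)
```

A pairwise screen only catches redundancy between two variables at a time; redundancy spread across several predictors would need a different diagnostic, such as variance inflation factors.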
Overall, this is an excellent and timely study with strong potential impact. Addressing the concerns outlined above—particularly improving structure, clarifying motivations, simplifying figures, and addressing certain methodological points—will greatly strengthen the manuscript’s readability and scientific contribution.
Citation: https://doi.org/10.5194/essd-2025-336-RC1
- AC1: 'Reply on RC1', Xuemeng Tian, 27 Nov 2025
  
  We have compiled our replies in the attached PDF, where we provide point-by-point responses to all of the comments.
  
  Citation: https://doi.org/10.5194/essd-2025-336-AC1
RC2:
'Comment on essd-2025-336', Anonymous Referee #2, 27 Oct 2025

This manuscript represents an important contribution to the field of fine-resolution global digital soil mapping. The methodology and its application potential are commendable, and the work is generally of publishable quality. The paper can be significantly strengthened by addressing several key issues related to the methodological description, which currently lacks sufficient detail, and by improving the consistency of writing and terminology throughout the text.
Specific Comments:
Please add continuous line numbers throughout the manuscript. This will greatly facilitate referencing specific locations during the review and revision process.
Abstract needs better clarification. Please define the abbreviation SOC in P1 Line 5. The spatiotemporal Machine Learning approach is not clearly explained. How is time incorporated into this model? Is it a 3D+T model? Why use 68 % probability to quantify the prediction uncertainty? Soil carbon density (P1 Line 9) and soil carbon (P1 Line 11) should be corrected to soil organic carbon density and soil organic carbon, because many soil samples also contain soil inorganic carbon. It is reasonable that the authors did not consider temporal changes in soil texture fractions and soil types, but bulk density is highly correlated with soil organic carbon, and its temporal changes should therefore be considered if you use 5-year time intervals for soil organic carbon. Please provide the full names of RMSE and CCC at first mention. It is not necessary to report the RMSE in log-scale since this information is not useful. It is not clear why the authors present the most important variables only for soil organic carbon density and pH.
P3 Lines 5-7: This work did not solve this issue during predictive modelling.
P3 Lines 21-23: Authors overlooked the maps from Australia. It would be better to include the work by Grundy et al. (2015). Grundy M.J., Rossel, R.V., Searle, R.D., Wilson, P.L., Chen, C. and Gregory, L.J., 2015. Soil and landscape grid of Australia. Soil Research, 53(8), pp.835-844.
P4 Line 29: This work covers the period from 2000 to 2022+, so how to evaluate the impact of land use conversion for SOC and pH on a scale of 25+ years?
P5 Line 20: the difference between observations and measurements should be better clarified.
P5 Line 21: Once you match the O&M with covariate layers by year, it means that you overlook the legacy effect from environmental covariates (e.g., land use change), which can be quite important for soil properties, such as soil organic carbon. I understand that this consideration would pose a heavy load for computing, but at least you should address this limitation in the discussion.
P5 Lines 25-26: It is not clear whether the authors accounted for the grouping of samples by profile in the performance evaluation to avoid data leakage.
P7 Line 26: Please correct SOC as SOCD or SOCd. Indeed, the full name should be specified in P7 Line 22.
P8 Line 4: in t m^-2?
P10 Line 5: A recently released dataset from Chen et al. (2025) may be helpful for your future work. Chen, Z., Chen, L., Lu, R. et al. A national soil organic carbon density dataset (2010–2024) in China. Sci Data 12, 1480 (2025). https://doi.org/10.1038/s41597-025-05863-3
Figure 3: Please use either SOCD or SOCd in the manuscript.
P15 Lines 28-30: What is the advantage of estimating SOC density directly from SOC content? SOC together with other variables controls the variability of bulk density, so using only SOC to estimate SOC density will limit the prediction accuracy. SOC [kg m^-3] should be corrected to SOC density [kg m^-3].
P16 Lines 11-12: variable resolutions in 1 kilometer resolution? Scale is not an appropriate term here.
P17 Line 15: Since it is a 3D+T model, it is important to demonstrate the time span of soil data to support spatiotemporal modelling.
P24 Lines 14-25: Please also include R² in the accuracy evaluation. Why not report the accuracy of silt content here?
Figure 11: It would also be interesting to show how model performance differs across continents, which would help in planning future work.
P53 Line 11: assessment indicates that (R2) the best achievable. R2 is a typo here?
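
On the 68 % question raised in the abstract comment above, one possible reading is that a central 68 % interval spans the 0.16 and 0.84 quantiles and, for a normal distribution, lies close to the mean plus or minus one standard deviation. This is an assumption for illustration; neither the abstract nor the comment states the rationale. A minimal Python sketch of that arithmetic:

```python
from statistics import NormalDist

def central_interval_quantiles(probability):
    """Return the lower and upper quantile levels of a central interval."""
    tail = (1.0 - probability) / 2.0
    return tail, 1.0 - tail

lower_q, upper_q = central_interval_quantiles(0.68)

# Under a standard normal assumption, the upper limit sits close to +1 standard deviation
z_upper = NormalDist().inv_cdf(upper_q)

# Coverage of exactly +/- 1 standard deviation, for comparison
coverage_one_sd = NormalDist().cdf(1.0) - NormalDist().cdf(-1.0)

print(f"quantile levels: {lower_q:.2f}, {upper_q:.2f}")
print(f"z at upper limit: {z_upper:.3f}")
print(f"coverage of +/- 1 sd: {coverage_one_sd:.4f}")
```

The normal distribution is used here only to relate the interval to a standard deviation; quantile-based intervals themselves do not require normality.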

Citation: https://doi.org/10.5194/essd-2025-336-RC2
- AC2: 'Reply on RC2', Xuemeng Tian, 27 Nov 2025
  
  We have compiled our replies in the attached PDF, where we provide point-by-point responses to all of the comments.
  
  Citation: https://doi.org/10.5194/essd-2025-336-AC2

Data sets

OpenLandMap-soildb Davide Consoli et al. https://doi.org/10.5281/zenodo.15470431

Model code and software

openlandmap/soildb Tomislav Hengl and Xuemeng Tian https://doi.org/10.5281/zenodo.15608971


Short summary

We used satellite data and thousands of soil samples to create detailed global maps showing how soil changes over time. These maps reveal important patterns in soil health, such as a significant global loss of soil carbon in the past 25 years. Our results help track land degradation and support better land restoration efforts. This work provides a new global tool for understanding and protecting soil, a key resource for food, water, and climate.
