the Creative Commons Attribution 4.0 License.
A random forest isoscape model of bioavailable Sr for South America: a focus on southern Brazil
Abstract. In recent years, advances in machine learning have greatly improved the generation of maps showing the geographic distribution of isotope ratios (isoscapes), which have become essential tools for environmental, mobility and provenance studies in both modern and archaeological contexts. Among the various isotopic systems employed, strontium (Sr) is particularly useful because its 87Sr/86Sr ratio in the environment is largely controlled by the underlying geology through the composition of local soils and rocks.
In this work, we present a new dataset of bioavailable 87Sr/86Sr ratios derived from n = 233 plant samples collected across southern Brazil, covering the states of Santa Catarina and Rio Grande do Sul (ca. 370,000 km²). The measured ratios span from 0.70521 to 0.76039 and capture the bioavailable Sr isotope signatures over all major geological units in the region.
We combined these new data with an extensive compilation of published bioavailable Sr measurements from across South America (including plants, fauna, ancient human remains, shells, snails, lichens, water and soils) to construct three random forest Sr isoscapes using different subsets of the combined dataset at the regional and continental scales. The first model incorporates the entire dataset ('All' dataset, n = 883 sites), the second is based on plant+fauna+lichen+human (n = 661 sites) and the third is limited to plant+lichen samples (n = 531 sites). Among the three models, the full dataset model shows lower predictive power, while the plant+fauna+lichen+human and the plant+lichen models yield better results, with similar RMSE (0.0049 and 0.0054) and R² values (ca. 0.76). Compared to existing Sr isoscapes of South America, our models significantly enhance both spatial coverage and resolution of bioavailable Sr predictions, particularly in southern Brazil.
The new bioavailable Sr isotope dataset from Santa Catarina and Rio Grande do Sul states is available at https://doi.org/10.5281/zenodo.17988601 (Scaggion et al., 2025a) and the compiled literature dataset is reported as supplementary material.
Status: final response (author comments only)
- RC1: 'Comment on essd-2026-136', Mael Le Corre, 20 Apr 2026
- AC1: 'Reply on RC1', Tommaso Giovanardi, 20 May 2026
Dear Reviewer,
we want to thank you for your time and effort in reviewing our paper. Here's a detailed response.
In this paper, the authors develop a bioavailable Sr isoscape for South America using a machine learning approach. The model is trained on a compilation of published Sr data across the continent, complemented by new plant samples from two regions in Brazil. This work represents a valuable contribution, as spatially explicit Sr isoscapes remain scarce at the local and continental scales in South America, and such models have strong potential for applications in archaeology, ecology, and provenance studies.
Overall, the manuscript is well written and easy to follow. The introduction provides the necessary background and clearly frames the objectives. The methods are clearly described and rely on a robust and widely used modelling framework. The results are presented in a clear and detailed manner.
I only have a comment regarding the discussion. I was wondering whether there are published Sr datasets from southern Brazil (Santa Catarina or Rio Grande do Sul) that could be used for a qualitative comparison with the isoscape predictions. I am not suggesting a formal validation with statistical tests, but rather a brief and simple consistency check. It would help the reader assess how well the model performs in real-world contexts.
A: We thank the reviewer for the suggestion. To the best of our knowledge, all published data pertaining to the areas of investigation have already been incorporated into the construction of the model proposed here.
I only have a few minor comments:
- L248: Can you provide the description of the predictors that you included in the model? (e.g. as a table in SI). This would allow the reader to understand the variables without having to consult previous publications. It would also make it unnecessary to redefine abbreviations in the results section.
A: We added a table to the main text listing the predictors used and their respective references:
| Variable | Description | Resolution | Type | Source |
|---|---|---|---|---|
| r.m1 | Median bedrock model | 1 km | Discrete | Bataille et al., 2018 |
| r.srsrq1 | 1st quartile bedrock model | 1 km | Discrete | Bataille et al., 2018 |
| r.srsrq3 | 3rd quartile bedrock model | 1 km | Discrete | Bataille et al., 2018 |
| r.age | Terrane age attribute | 1 km | Discrete | Mooney et al., 1998 |
| r.dust | Multi-models average (g.m-2.yr-1) | 1° × 1° | Continuous | Mahowald et al., 2006 |
| r.map | Mean annual precipitation (mm.yr-1) | 30-arc sec | Continuous | Hijmans et al., 2005 |
| r.salt.t | CCSM.3 simulation | 1.4° × 1.4° | Continuous | Mahowald et al., 2006 |
| r.pet | Global Potential Evapo-Transpiration | 30-arc sec | Continuous | Zomer et al., 2008 |
| r.ai | Global Aridity Index | 30-arc sec | Continuous | Zomer et al., 2008 |
| r.elevation | SRTM (m) | 90 m | Continuous | Jarvis et al., 2008 |
| r.ph | Soil pH in H2O solution (x10) | 250 m | Continuous | Hengl et al., 2017 |
| r.clay | Clay (weight %) | 250 m | Continuous | Hengl et al., 2017 |
| r.bulk | Bulk density (kg m−3) | 250 m | Continuous | Hengl et al., 2017 |
| r.GUM | Global unconsolidated sediment map | 1 km | Continuous | Börker et al., 2018 |
| r.ssaw | Multi-models average sea salt wet deposition (kg.ha−1.yr−1) | 1° × 1° | Continuous | Vet et al., 2014 |
| r.ssa | Multi-models average sea salt wet+dry deposition (kg.ha−1.yr−1) | 1° × 1° | Continuous | Vet et al., 2014 |
| r.cec | Cation Exchange Capacity | 250 m | Continuous | Hengl et al., 2017 |
| r.bouguer | WGM2012_Bouguer | 2 min | Continuous | Balmino et al., 2012 |
| r.mat | Mean annual temperature (°C) | 30-arc sec | Continuous | Hijmans et al., 2005 |
| r.maxage_geol | GLiM age attribute (Myrs) | 1 km | Discrete | Hartmann and Moosdorf, 2012 |
| r.minage_geol | GLiM age attribute (Myrs) | 1 km | Discrete | Hartmann and Moosdorf, 2012 |
| r.meanage_geol | GLiM age attribute (Myrs) | 1 km | Discrete | Hartmann and Moosdorf, 2012 |
| r.nfert | Global Nitrogen Fertilization | 30-arc | Continuous | Potter et al., 2010 |
| r.volc | Atmospheric deposition of volcanic material (kg.m-2.s-1) | 0.5° | Continuous | Brahney et al., 2015 |
Balmino, G., Vales, N., Bonvalot, S., and Briais, A.: Spherical harmonic modelling to ultra-high degree of Bouguer and isostatic anomalies. J. Geod. 86, 499–520. https://doi.org/10.1007/s00190-011-0533-4, 2012.
Börker, J., Hartmann, J., Amann, T., and Romero-Mujalli, G.: Terrestrial sediments of the earth: development of a global unconsolidated sediments map database (gum). Geochem. Geophys. Geosyst. 19, 997–1024. https://doi.org/10.1002/2017GC007273, 2018.
Brahney, J., Ballantyne, A. P., Kociolek, P., Leavitt, P. R., Farmer, G. L., and Neff, J. C.: Ecological changes in two contrasting lakes associated with human activity and dust transport in western Wyoming: Dust-P controls on alpine lake ecology, Limnology and Oceanography, 60, 678–695, https://doi.org/10.1002/lno.10050, 2015.
Hartmann, J., and Moosdorf, N.: The new global lithological map database GLiM: a representation of rock properties at the Earth surface. Geochem. Geophys. Geosyst., 13, Q12004 https://doi.org/10.1029/2012GC004370, 2012.
Hengl, T., Mendes de Jesus, J., Heuvelink, G.B.M., Ruiperez Gonzalez, M., Kilibarda, M., Blagotić, A., Shangguan, W., Wright, M.N., Geng, X., Bauer-Marschallinger, B., Guevara, M.A., Vargas, R., MacMillan, R.A., Batjes, N.H., Leenaars, J.G.B., Ribeiro, E., Wheeler, I., Mantel, S., and Kempen, B.: SoilGrids250m: Global gridded soil information based on machine learning. PLoS One 12, e0169748. https://doi.org/10.1371/journal.pone.0169748, 2017.
Hijmans, R.J., Cameron, S.E., Parra, J.L., Jones, P.G., and Jarvis, A.: Very high resolution interpolated climate surfaces for global land areas. Int. J. Climatol. 25, 1965–1978. https://doi.org/10.1002/joc.1276, 2005.
Jarvis, A., Reuter, A., Nelson, A., and Guevara, E.: Hole-filled SRTM for the globe Version 4, available from the CGIAR-CSI SRTM 90m Database. CGIAR CSI Consort.Spat. Inf. 1–9, 2008.
Mahowald N. M., Muhs D. R., Levis S., Rasch P. J., Yoshioka M., Zender C. S., and Luo, C.: Change in atmospheric mineral aerosols in response to climate: Last glacial period, preindustrial, modern, and doubled carbon dioxide climates. J Geophys Res Atmos. 111, D10202, https://doi.org/10.1029/2005JD006653, 2006.
Mooney, W. D., Laske, G., and Masters, T. G.: CRUST 5.1: A global crustal model at 5° × 5°, J. Geophys. Res. Solid Earth 103, 727–747. https://doi.org/10.1029/97JB02122, 1998.
Potter, P., Ramankutty, N., Bennett, E.M., and Donner, S.D.: Characterizing the spatial patterns of global fertilizer application and manure production, Earth Interact. 14, 1–22, https://doi.org/10.1175/2009EI288.1, 2010.
Vet, R., Artz, R. S., Carou, S., Shaw, M., Ro, C. U., Aas, W., Baker, A., Bowersox, V. C., Dentener, F., Galy-Lacaux, C., Hou, A., Pienaar, J. J., Gillett, R., Forti, M. C., Gromov, S., Hara, H., Khodzher, T., Mahowald, N. M., Nickovic, S., Rao, P. S. P., and Reid, N. W.: A global assessment of precipitation chemistry and deposition of sulfur, nitrogen, sea salt, base cations, organic acids, acidity and pH, and phosphorus, Atmos. Environ. 93, 3–100. https://doi.org/10.1016/j.atmosenv.2013.10.060, 2014.
Zomer, R. J., Trabucco, A., Bossio, D. A., and Verchot, L. V.: Climate change mitigation: A spatial analysis of global land suitability for clean development mechanism afforestation and reforestation, Agric. Ecosyst. Environ. 126, 67-80. https://doi.org/10.1016/j.agee.2008.01.014, 2008.
- L254: parameter “mtry” not “mtyr”
A: We amended accordingly.
- L295: What do you mean by “anomalous values”? Higher than expected? And how did you handle these data? Were they removed, as you did for the extreme Sr values of water, or kept in the model?
A: We thank the reviewer for this comment. By "anomalous values" we mean data that deviate from the isotopic range expected for these unconsolidated Quaternary sediments. We expected 87Sr/86Sr ratios of approximately 0.708 to 0.715, whereas some of our samples show unusually radiogenic ratios compared to the rest of the sedimentary units. These values are statistical outliers (Tukey’s fences), as reported in Figure 3. We attribute them to the heterogeneity of the source deposits: fluvial/marine reworking and floodplain inputs produce mixed sedimentary sources enriched in material from older, radiogenic local lithologies. Indeed, these samples come from sediments located near older formations of the Mantiqueira Province and Dom Feliciano Belt that have high Sr isotope ratios, and the higher values are consistent with sediments derived from these formations. The data were kept in the models. Conversely, we removed from our dataset the extreme values (>0.900) from the Parguaza Batholith because they show remarkably high residuals in the RF models.
We have clarified this point in the main manuscript and we report the text here:
'The presence of isotopic outliers in our plant dataset, with ratios higher than those expected for unconsolidated coastal deposits, can be explained by the heterogeneity of the Holocene and Pleistocene deposits (fluvial, marine, aeolian) on which the plants grew, and the characteristics of the parental lithologies. Since these values reflect geological variability, they were retained in the model. In contrast, the extremely radiogenic values for river waters (>0.900) were excluded.'
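For readers unfamiliar with the outlier criterion mentioned in this reply, the following is a minimal sketch of Tukey's fences in Python. The ratios are hypothetical illustration values, not data from the manuscript, and the multiplier k = 1.5 and the quantile method are assumptions; the reply does not state which settings were used.

```python
from statistics import quantiles


def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    # 'inclusive' treats the sample as containing its own min and max;
    # this choice is an assumption, not taken from the manuscript.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, k=1.5):
    """Return the values that fall outside the Tukey fences."""
    lower, upper = tukey_fences(values, k)
    return [v for v in values if v < lower or v > upper]


# Hypothetical 87Sr/86Sr ratios for one sedimentary unit (illustration only).
ratios = [0.7085, 0.7092, 0.7101, 0.7110, 0.7123,
          0.7131, 0.7144, 0.7150, 0.7312]

lower, upper = tukey_fences(ratios)
outliers = flag_outliers(ratios)
print(f"fences: [{lower:.5f}, {upper:.5f}]  outliers: {outliers}")
```

Flagging a value this way only marks it as statistically unusual; whether it is kept or removed remains a geological judgement, as in the reply above.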
- L433-439: This is more out of curiosity and does not need to be included in the manuscript. Have you tested your model with the global dataset provided by Bataille et al. 2020 or more recent updates? Your reported R² is higher than studies using a global dataset to run their RF. I agree that when regional data coverage is good, a regional model is often preferable to a global one, but I would be interested to know how your model compares in that context.
A: This is a well-known issue in the field: whether to use global datasets when modeling local bioavailable Sr isotope variability. In our experience, incorporating data from highly diverse eco-geological settings often leads to worse predictions for the specific area of interest. While this is not universally true, it can be particularly problematic in geologically complex regions (e.g., Madagascar; Bataille et al., 2020). Additionally, the increased computational cost further discouraged us from using a global dataset to train the RF model. We do not claim that our approach is optimal in all cases, but we consider it a cautious and reasonable choice. The lower R² reported for the global isoscape by Bataille et al. (2020) is understandable given the much larger dataset used in their model. Such broad coverage likely reduces overall residuals at the global scale, even if local variability is likely underrepresented. In contrast, our dataset is smaller and derived entirely from a geologically complex region, where higher local variability can lead to lower model fit.
Citation: https://doi.org/10.5194/essd-2026-136-AC1
- RC2: 'Comment on essd-2026-136', Clement Bataille, 21 Apr 2026
This manuscript presents an important new bioavailable 87Sr/86Sr dataset from southern Brazil (n = 233 plant samples) and compiles a continental-scale dataset for South America to generate random forest–based Sr isoscapes. The new field data are clearly valuable, geographically well distributed, and analytically robust. In particular, the analytical methods and metadata for the newly generated samples are exceptionally well documented, which is unfortunately still rare in isoscape studies and deserves explicit recognition. The authors should be congratulated for this. However, while the new dataset itself is very strong, there are some issues related to data compilation, attribution, metadata completeness, and reproducibility that must be addressed before publication. Most of these issues concern adherence to FAIR data principles. In its current form, the manuscript does not yet meet the standards expected for a fully reproducible data paper in Earth System Science Data.
My concerns are outlined below.
- Compilation transparency, provenance, and FAIR compliance
The manuscript relies heavily on a large compilation of previously published bioavailable Sr datasets (1588 datapoints from 883 sites), yet the compilation process itself is insufficiently described and documented.
Specifically, it is unclear how the data were identified, selected, filtered, cleaned, harmonized, and quality-controlled. Neither the manuscript nor the Zenodo link/metadata clearly explains the rationale for how non-local samples (e.g., humans, fauna) were assessed and excluded across all contributing studies. How many were excluded, and why? This sort of sample selection needs clearly defined criteria. How were inconsistent metadata across studies standardized? How were duplicate sampling locations handled, beyond the brief mention of “sites”? How were coordinates obtained when not explicitly reported in the original studies? How was analytical heterogeneity treated, given that some datasets use TIMS, others MC-ICPMS, and others ICPMS, and these techniques differ in precision, accuracy, and susceptibility to interferences? How did the preparation protocol affect data quality? For example, soil digestion methods vary considerably across studies and do not represent the same bioavailable Sr pool.
I recommend that the authors explicitly document the compilation workflow, ideally in a dedicated subsection, following existing best-practice compilation frameworks for isotope data, such as the IsoArcH / IsoBank metadata fields. See, for example, recent compilations like ARDUOUS (e.g. ESSD 2026). These frameworks emphasize transparent provenance, versioning, and machine-readable metadata. Ideally, the compiled dataset should be restructured to conform to an IsoArcH/IsoBank-style schema, with a metadata table describing source, material type, analytical method, coordinate source, uncertainty, and QC status for every entry.
- Missing citations and lack of credit to original data producers
Another important point is that the individual datasets used in the compilation are not cited in the main manuscript. While “Supplement Table S1” is mentioned, this is insufficient. The model presented here is built entirely on years of fieldwork, laboratory work, and interpretation by many research groups. Failing to properly cite each original dataset will discourage future data sharing and technically does not follow FAIR principles. The same issue applies to the predictor layers used in the random forest model. While Dr. Bataille compiled, reprojected, and reorganized these layers and is cited, the primary creators of those datasets must also be properly cited and acknowledged. Many of us, including myself, have learned this “the hard way” during large compilations (e.g. ARDUOUS). I would encourage the authors to give credit to all scientists who made their work possible.
- Treatment of outliers and model-performance comparisons
The discussion of outlier removal and model comparison needs some revision. While it is reasonable to remove extreme 87Sr/86Sr values, doing so also fundamentally alters model performance metrics because RMSE and R² are strongly dependent on the value range and distribution. Models trained on datasets with fewer extreme values will always appear to perform better. As a result, I don’t think it is ideal to say that the plant+lichen model is “better” than the full dataset model in an absolute sense. I think differences in RMSE and R² primarily reflect training-set composition (see Fig. 2), not intrinsic model quality. The same comment is true when comparing with other published models (e.g. Dosseto et al. 2025; Bataille et al. 2020). Sampling strategy matters, but training-set distribution matters far more. For example, it would be easy to produce a bioavailable Sr model across Egypt that would achieve RMSE < 0.001 simply because carbonate geology dominates and is highly predictable and not very variable. Conversely, South America includes cratonic terrains with extremely high values along with young volcanic terrain with extremely low values, inflating RMSE but providing higher R² due to larger variance. So I would recommend that the authors mention some of these points when discussing model performance, which in my view is almost entirely dependent on the selected training-set distribution. Alternatively, the authors could provide a more stratified evaluation if they absolutely want to compare the models with each other.
- Predictor choice and interpretability
I am surprised to see r.GUM included as a predictor. Could you justify its inclusion mechanistically?
- Reproducibility and open science
I also recommend fully following FAIR principles by providing the entire modeling workflow and all scripts (data harmonization, predictor processing, modeling, validation, plotting). The workflow should be archived on OSF or GitHub with a permanent DOI. Ideally, provide a README explaining how the isoscapes can be regenerated from raw inputs.
Citation: https://doi.org/10.5194/essd-2026-136-RC2
- AC2: 'Reply on RC2', Tommaso Giovanardi, 20 May 2026
Dear Reviewer,
we want to thank you for your time and effort in reviewing our paper. Here's a detailed response.
This manuscript presents an important new bioavailable 87Sr/86Sr dataset from southern Brazil (n = 233 plant samples) and compiles a continental-scale dataset for South America to generate random forest–based Sr isoscapes. The new field data are clearly valuable, geographically well distributed, and analytically robust. In particular, the analytical methods and metadata for the newly generated samples are exceptionally well documented, which is unfortunately still rare in isoscape studies and deserves explicit recognition. The authors should be congratulated for this. However, while the new dataset itself is very strong, there are some issues related to data compilation, attribution, metadata completeness, and reproducibility that must be addressed before publication. Most of these issues concern adherence to FAIR data principles. In its current form, the manuscript does not yet meet the standards expected for a fully reproducible data paper in Earth System Science Data.
My concerns are outlined below.
Compilation transparency, provenance, and FAIR compliance
The manuscript relies heavily on a large compilation of previously published bioavailable Sr datasets (1588 datapoints from 883 sites), yet the compilation process itself is insufficiently described and documented.
Specifically, it is unclear how the data were identified, selected, filtered, cleaned, harmonized, and quality-controlled. Neither the manuscript nor the Zenodo link/metadata clearly explains the rationale for how non-local samples (e.g., humans, fauna) were assessed and excluded across all contributing studies. How many were excluded, and why? This sort of sample selection needs clearly defined criteria. How were inconsistent metadata across studies standardized? How were duplicate sampling locations handled, beyond the brief mention of “sites”? How were coordinates obtained when not explicitly reported in the original studies? How was analytical heterogeneity treated, given that some datasets use TIMS, others MC-ICPMS, and others ICPMS, and these techniques differ in precision, accuracy, and susceptibility to interferences? How did the preparation protocol affect data quality? For example, soil digestion methods vary considerably across studies and do not represent the same bioavailable Sr pool.
I recommend that the authors explicitly document the compilation workflow, ideally in a dedicated subsection, following existing best-practice compilation frameworks for isotope data, such as the IsoArcH / IsoBank metadata fields. See, for example, recent compilations like ARDUOUS (e.g. ESSD 2026). These frameworks emphasize transparent provenance, versioning, and machine-readable metadata. Ideally, the compiled dataset should be restructured to conform to an IsoArcH/IsoBank-style schema, with a metadata table describing source, material type, analytical method, coordinate source, uncertainty, and QC status for every entry.
A: We thank the reviewer for highlighting these points regarding transparency and compliance with FAIR principles. The dataset we produced in this study, available on Zenodo, contains useful information for reproducibility purposes, such as coordinates, geological unit, age, ⁸⁷Sr/⁸⁶Sr value, and related uncertainty. The literature dataset was compiled from existing databases and publications and includes all available information reported in those sources, such as sample names, coordinates, material type, and analytical uncertainties. Where such information was not provided in the original sources, we are unable to supply it, consistent with the approach taken in databases such as ARDUOS cited by the reviewer. We acknowledge the value and importance of the IsoArcH metadata model; however, its structure may be less intuitive for some users, as it distributes information across multiple spreadsheets and introduces renamed sample identifiers. In contrast, our approach consolidates all relevant data into a single sheet (with full references provided separately), which we believe offers a more straightforward and user-friendly format for readers.
We acknowledge, however, that the documentation provided for compilation could be made more explicit. In the revised version of the manuscript, we have added a description of the criteria adopted for sample selection and exclusion, duplicate management, harmonization, and treatment of data heterogeneity.
Here we report the text added in section 3.4:
For the continental-scale compilation, we included both environmental samples (plants, lichens, soils, and water) and archaeological human and faunal samples identified as local in the original publications. Non-local individuals were excluded. Metadata relating to coordinates and material categories were harmonized, and duplicates were consolidated by site. Uncertainty values not reported in the original publications or datasets were excluded.
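As an illustration of what "duplicates were consolidated by site" could look like in practice, here is a minimal Python sketch. It is not the authors' script: the record fields (`lat`, `lon`, `sr`), the coordinate rounding used to define a site, and the use of the median as the site value are all assumptions made for this example, and the records are hypothetical.

```python
from collections import defaultdict
from statistics import median


def consolidate_by_site(records, decimals=4):
    """Group measurements sharing (rounded) coordinates and return one
    value per site. Using the median is an assumption for illustration."""
    groups = defaultdict(list)
    for rec in records:
        # Hypothetical site key: coordinates rounded to `decimals` places.
        key = (round(rec["lat"], decimals), round(rec["lon"], decimals))
        groups[key].append(rec["sr"])
    return {key: median(vals) for key, vals in groups.items()}


# Hypothetical records: three measurements at one site, one at another.
records = [
    {"lat": -27.5950, "lon": -48.5480, "sr": 0.7102},
    {"lat": -27.5950, "lon": -48.5480, "sr": 0.7110},
    {"lat": -27.5950, "lon": -48.5480, "sr": 0.7106},
    {"lat": -30.0346, "lon": -51.2177, "sr": 0.7185},
]

sites = consolidate_by_site(records)
for (lat, lon), value in sites.items():
    print(f"site ({lat}, {lon}): 87Sr/86Sr = {value}")
```

Whatever aggregation rule is actually used, stating it explicitly (and the number of datapoints merged per site) is the kind of documentation the reviewer asks for when relating 1588 datapoints to 883 sites.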
Missing citations and lack of credit to original data producers
Another important point is that the individual datasets used in the compilation are not cited in the main manuscript. While “Supplement Table S1” is mentioned, this is insufficient. The model presented here is built entirely on years of fieldwork, laboratory work, and interpretation by many research groups. Failing to properly cite each original dataset will discourage future data sharing and technically does not follow FAIR principles. The same issue applies to the predictor layers used in the random forest model. While Dr. Bataille compiled, reprojected, and reorganized these layers and is cited, the primary creators of those datasets must also be properly cited and acknowledged. Many of us, including myself, have learned this “the hard way” during large compilations (e.g. ARDUOUS). I would encourage the authors to give credit to all scientists who made their work possible.
A: We thank the reviewer for his valuable comments.
We have revised the text taking his suggestions into account to ensure compliance with the FAIR principles.
Here we report the text added in section 3.4:
'All primary datasets used in the compilation, including the open-access IsoArcH repository, Borges et al. (2021), Plomp (2021), Stantis et al. (2024) and Seferidou (2025), have been integrated with all additional sources and reported in Table S1. Overall, the data are from: Brass (1976), Palmer and Edmond (1989), Edmond et al. (1995; 1996), Hieronymus (1995), Henry et al. (1996), Gaillardet et al. (1997), Négrel and Lachassagne (2000), Grove et al. (2003), Knudson et al. (2004; 2013), Brunet et al. (2005), Pasquini et al. (2005), Knudson and Price (2007), Knudson (2008), Martins et al. (2008), Andrushko et al. (2009), Bastos (2009; 2014), Fiege et al. (2009), Hermenegildo (2009), Poszwa et al. (2009), Bastos et al. (2011; 2014; 2015; 2016; 2019; 2021), Laffoon et al. (2012), Lee et al. (2013), Machado (2013), Malaspinas et al. (2014), Pouilly et al. (2014), Santos et al. (2014), Oppitz (2015), Strauss et al. (2016), Barberena et al. (2017; 2019; 2020; 2021), Chala-Aldana et al. (2018), Duran et al. (2018), Slovak et al. (2018), Plomp et al. (2019), Moquet et al. (2020), Serna et al. (2020), Azevedo et al. (2021), Washburn et al. (2021), Fernandez et al. (2022), Quaggio et al. (2022), Torres-Rouff et al. (2022), Avigliano et al. (2023), Seferidou et al. (2023), Silva et al. (2023), Kafino et al. (2024), Martinelli et al. (2025), Loponte et al. (2025), Scaggion et al. (2025a).'
All the articles and datasets are now cited also in the main text.
We would also like to stress, however, that in the original version all source publications were referenced in the Supplementary Information, as is common practice in our field (see, e.g., Bataille et al., 2020, or Wang et al., 2024). We fully recognize the value of the ARDUOS dataset and its contribution to data standardization and accessibility. However, we note that it represents one of several possible approaches to data organization and dissemination, and alternative formats, such as the one adopted here, can also meet transparency and reproducibility requirements.
Finally, we stress that the primary aim of this study is to present newly-measured isotope data and their interpretation. Thus, this manuscript is not ultimately intended as a review paper sensu stricto, although we have made the compiled data fully available.
Treatment of outliers and model-performance comparisons
The discussion of outlier removal and model comparison needs some revision. While it is reasonable to remove extreme 87Sr/86Sr values, doing so also fundamentally alters model performance metrics because RMSE and R² are strongly dependent on the value range and distribution. Models trained on datasets with fewer extreme values will always appear to perform better. As a result, I don’t think it is ideal to say that the plant+lichen model is “better” than the full dataset model in an absolute sense. I think differences in RMSE and R² primarily reflect training-set composition (see Fig. 2), not intrinsic model quality. The same comment is true when comparing with other published models (e.g. Dosseto et al. 2025; Bataille et al. 2020). Sampling strategy matters, but training-set distribution matters far more. For example, it would be easy to produce a bioavailable Sr model across Egypt that would achieve RMSE < 0.001 simply because carbonate geology dominates and is highly predictable and not very variable. Conversely, South America includes cratonic terrains with extremely high values along with young volcanic terrain with extremely low values, inflating RMSE but providing higher R² due to larger variance. So I would recommend that the authors mention some of these points when discussing model performance, which in my view is almost entirely dependent on the selected training-set distribution. Alternatively, the authors could provide a more stratified evaluation if they absolutely want to compare the models with each other.
A: We think there is a misunderstanding. We agree that RMSE and R² are highly dependent on the isotopic distribution and variance of the training dataset and, therefore, cannot be interpreted as absolute indicators of model quality.
We intended to argue that the higher R² value observed in the plant+lichen model reflected the narrower isotopic range and more homogeneous composition of the dataset, rather than an intrinsically superior predictive performance.
This was in fact already reported in the original version of the manuscript at lines 338-341: ‘The lower variance observed in these latter models likely reflects the smaller number and lower variability of samples compared to the ‘All’ dataset, which includes outliers with 87Sr/86Sr higher than 0.818 in water samples (Fig. 3).’
Nevertheless, we have revised the Discussion section to state more explicitly that the differences in RMSE and R² observed when comparing our models are relative and driven by the data selection.
Here we report the sentence added in Discussion:
'We reiterate that these differences in RMSE and R² more accurately reflect the isotopic range and internal heterogeneity of the training datasets, rather than intrinsic differences in model quality.'
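The dependence of R² on the spread of the observations can be illustrated with a small worked example in Python. The observed values and the prediction errors below are hypothetical and chosen only to show the arithmetic: both sets receive exactly the same errors, so their RMSE is identical, yet R² = 1 - SS_res/SS_tot differs sharply because SS_tot scales with the variance of the observed values.

```python
from math import sqrt


def rmse(obs, pred):
    """Root-mean-square error between observations and predictions."""
    return sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def r_squared(obs, pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot


# Hypothetical observed ratios: a narrow-range set and a wide-range set.
narrow_obs = [0.708, 0.709, 0.710, 0.711, 0.712]
wide_obs = [0.705, 0.715, 0.730, 0.745, 0.760]

# Identical (hypothetical) prediction errors applied to both sets.
errors = [0.002, -0.002, 0.002, -0.002, 0.002]
narrow_pred = [o + e for o, e in zip(narrow_obs, errors)]
wide_pred = [o + e for o, e in zip(wide_obs, errors)]

for name, obs, pred in [("narrow", narrow_obs, narrow_pred),
                        ("wide", wide_obs, wide_pred)]:
    print(f"{name}: RMSE = {rmse(obs, pred):.4f}, R2 = {r_squared(obs, pred):.3f}")
```

In this toy case the wide-range set scores a high R² and the narrow-range set a negative one, despite equal RMSE, which is why the two metrics should be read together and relative to the training-set distribution.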
Predictor choice and interpretability
I am surprised to see r.GUM included as a predictor. Could you justify its inclusion mechanistically?
A: The r.GUM variable represents the global distribution of unconsolidated sediments and was included among the covariates in the original framework of Bataille et al. (2020). From a mechanistic perspective, such sediments (widespread in South America) can influence local Sr isotope signatures through weathering, sediment-water interactions, and the eventual incorporation of allochthonous material. In our approach, rather than arbitrarily excluding predictors a priori, we kept the full set of original covariates and allowed the model to evaluate their relative importance after multicollinearity screening. As shown in Fig. 6, r.GUM shows the lowest predictive contribution among the selected variables, indicating that while it may capture some environmental variability, its influence on model performance is limited, if not negligible.
Reproducibility and open science
I also recommend fully following FAIR principles by providing the entire modeling workflow and all scripts (data harmonization, predictor processing, modeling, validation, plotting). The workflow should be archived on OSF or GitHub with a permanent DOI. Ideally, provide a README explaining how the isoscapes can be regenerated from raw inputs.
A: We thank the reviewer for the suggestion. We have included our script in the Zenodo repository, although it does not differ substantially from the approach of Bataille et al. (2020) and has been used by our group in other publications, such as Scaggion et al. (2025), Armaroli et al. (2024) and Gigante et al. (2023).
Citation: https://doi.org/10.5194/essd-2026-136-AC2
Data sets
Bioavailable Sr isotopes of plants from Rio Grande do Sul and Santa Catarina states (Brazil, South America) and related isoscape raster files C. Scaggion et al. https://doi.org/10.5281/zenodo.17988601