This work is distributed under the Creative Commons Attribution 4.0 License.
A random forest isoscape model of bioavailable Sr for South America: a focus on southern Brazil
Abstract. In recent years, advances in machine learning have greatly improved the generation of maps showing the geographic distribution of isotope ratios (isoscapes), which have become essential tools for environmental, mobility and provenance studies in both modern and archaeological contexts. Among the various isotopic systems employed, strontium (Sr) is particularly useful because its 87Sr/86Sr ratio in the environment is largely controlled by the underlying geology through the composition of local soils and rocks.
In this work, we present a new dataset of bioavailable 87Sr/86Sr ratios derived from n = 233 plant samples collected across southern Brazil, covering the states of Santa Catarina and Rio Grande do Sul (ca. 370,000 km²). The measured ratios span from 0.70521 to 0.76039 and capture the bioavailable Sr isotope signatures across all major geological units in the region.
We combined these new data with an extensive compilation of published bioavailable Sr measurements from across South America (including plants, fauna, ancient human remains, shells, snails, lichens, water and soils) to construct three random forest Sr isoscapes using different subsets of the combined dataset at the regional and continental scales. The first model incorporates the entire dataset ('All' dataset, n = 883 sites), the second is based on plant+fauna+lichen+human samples (n = 661 sites) and the third is limited to plant+lichen samples (n = 531 sites). Among the three models, the full-dataset model shows the lowest predictive power, while the plant+fauna+lichen+human and plant+lichen models yield better results, with similar RMSE (0.0049 and 0.0054) and R² values (ca. 0.76). Compared to existing Sr isoscapes of South America, our models significantly enhance both the spatial coverage and resolution of bioavailable Sr predictions, particularly in southern Brazil.
The new bioavailable Sr isotope dataset from Santa Catarina and Rio Grande do Sul states is available at https://doi.org/10.5281/zenodo.17988601 (Scaggion et al., 2025a) and the compiled literature dataset is reported as supplementary material.
Status: open (until 11 May 2026)
- RC1: 'Comment on essd-2026-136', Mael Le Corre, 20 Apr 2026
- RC2: 'Comment on essd-2026-136', Clement Bataille, 21 Apr 2026
This manuscript presents an important new bioavailable 87Sr/86Sr dataset from southern Brazil (n = 233 plant samples) and compiles a continental-scale dataset for South America to generate random forest–based Sr isoscapes. The new field data are clearly valuable, geographically well distributed, and analytically robust. In particular, the analytical methods and metadata for the newly generated samples are exceptionally well documented, which is unfortunately still rare in isoscape studies and deserves explicit recognition. The authors should be congratulated for this. However, while the new dataset itself is very strong, there are some issues related to data compilation, attribution, metadata completeness, and reproducibility that must be addressed before publication. Most of these issues concern adherence to FAIR data principles. In its current form, the manuscript does not yet meet the standards expected for a fully reproducible data paper in Earth System Science Data.
My concerns are outlined below.
- Compilation transparency, provenance, and FAIR compliance
The manuscript relies heavily on a large compilation of previously published bioavailable Sr datasets (1588 datapoints from 883 sites), yet the compilation process itself is insufficiently described and documented.
Specifically, it is unclear how the data were identified, selected, filtered, cleaned, harmonized, and quality-controlled. Neither the manuscript nor the Zenodo record/metadata clearly explains the rationale behind several key steps:
- How were non-local samples (e.g., humans, fauna) assessed and excluded across all contributing studies? How many were removed, and why? This sort of sample selection needs clearly defined criteria.
- How were inconsistent metadata across studies standardized?
- How were duplicate sampling locations handled, beyond the brief mention of “sites”?
- How were coordinates obtained when not explicitly reported in the original studies?
- How was analytical heterogeneity treated? Some datasets use TIMS, others MC-ICP-MS, others ICP-MS, and these techniques differ in precision, accuracy, and susceptibility to interferences.
- How did preparation protocols affect data quality? For example, soil digestion methods vary considerably across studies and do not sample the same bioavailable Sr pool.
I recommend that the authors explicitly document the compilation workflow, ideally in a dedicated subsection that follows existing best-practice compilation frameworks for isotope data, such as the IsoArcH/IsoBank metadata fields; see, for example, recent compilations like ARDUOUS (e.g., ESSD 2026). These frameworks emphasize transparent provenance, versioning, and machine-readable metadata. Ideally, the compiled dataset should be restructured to conform to an IsoArcH/IsoBank-style schema, with a metadata table describing source, material type, analytical method, coordinate source, uncertainty, and QC status for every entry.
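As a concrete illustration of the kind of per-entry metadata record described above, a minimal sketch in Python. The field names here are hypothetical placeholders chosen for illustration, not the actual IsoArcH/IsoBank schema:

```python
# Illustrative only: hypothetical field names, NOT the official
# IsoArcH/IsoBank schema. Each compiled measurement would carry its
# own provenance, method, coordinate source, and QC status.
from dataclasses import dataclass, asdict

@dataclass
class SrRecord:
    source_doi: str         # citation of the original study
    material_type: str      # e.g. "plant", "soil", "human enamel"
    analytical_method: str  # e.g. "TIMS", "MC-ICP-MS", "ICP-MS"
    coordinate_source: str  # "reported" vs. "georeferenced from description"
    sr87_sr86: float        # measured 87Sr/86Sr ratio
    uncertainty_2se: float  # analytical uncertainty (2 SE)
    qc_status: str          # e.g. "retained", "excluded: non-local"

# Hypothetical example entry (values are made up):
rec = SrRecord("10.xxxx/example", "plant", "MC-ICP-MS",
               "reported", 0.71234, 0.00002, "retained")
print(asdict(rec))
```

A flat table of such records (one row per entry) is trivially machine-readable and makes every filtering decision auditable.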
- Missing citations and lack of credit to original data producers
Another important point is that the individual datasets used in the compilation are not cited in the main manuscript. While “Supplement Table S1” is mentioned, this is insufficient. The model presented here is built entirely on years of fieldwork, laboratory work, and interpretation by many research groups. Failing to properly cite each original dataset will discourage future data sharing and does not follow FAIR principles. The same issue applies to the predictor layers used in the random forest model: while Dr. Bataille, who compiled, reprojected, and reorganized these layers, is cited, the primary creators of those datasets must also be properly cited and acknowledged. Many of us, including myself, have learned this “the hard way” during large compilations (e.g., ARDUOUS). I would encourage the authors to give credit to all the scientists who made their work possible.
- Treatment of outliers and model-performance comparisons
The discussion of outlier removal and model comparison needs some revision. While it is reasonable to remove extreme 87Sr/86Sr values, doing so also fundamentally alters model performance metrics, because RMSE and R² depend strongly on the range and distribution of the target values. Models trained on datasets with fewer extreme values will always appear to perform better. As a result, I don't think it is ideal to say that the plant+lichen model is “better” than the full-dataset model in an absolute sense; the differences in RMSE and R² primarily reflect training-set composition (see Fig. 2), not intrinsic model quality. The same comment applies when comparing with other published models (e.g., Dosseto et al., 2025; Bataille et al., 2020). Sampling strategy matters, but training-set distribution matters far more. For example, it would be easy to produce a bioavailable Sr model across Egypt that achieves RMSE < 0.001 simply because carbonate geology dominates and is highly predictable and not very variable. Conversely, South America includes cratonic terrains with extremely high values alongside young volcanic terrains with extremely low values, inflating RMSE but yielding higher R² due to the larger variance. I would therefore recommend that the authors mention some of these points when discussing model performance, which in my view is almost entirely dependent on the distribution of the selected training set. Alternatively, the authors could provide a more stratified evaluation if they absolutely want to compare the models with each other.
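The dependence of RMSE and R² on the target distribution can be demonstrated with a minimal, self-contained sketch using synthetic numbers (not the paper's data): the same per-sample prediction error yields nearly identical RMSE but wildly different R², depending only on the spread of the target values.

```python
# Synthetic demonstration: identical prediction error, different target spread.
# Since R² = 1 - MSE / Var(y), low-variance targets (carbonate-like terrain)
# depress R² while high-variance targets (craton + young volcanics) inflate
# it, even though the per-sample model error is exactly the same.
import numpy as np

rng = np.random.default_rng(0)

def rmse_r2(y_true, y_pred):
    mse = np.mean((y_true - y_pred) ** 2)
    return float(np.sqrt(mse)), float(1 - mse / np.var(y_true))

err = rng.normal(0.0, 0.002, 500)          # same per-sample prediction error
narrow = rng.normal(0.7085, 0.0008, 500)   # low-spread 87Sr/86Sr targets
wide = rng.normal(0.7150, 0.0120, 500)     # high-spread 87Sr/86Sr targets

results = {"narrow": rmse_r2(narrow, narrow + err),
           "wide": rmse_r2(wide, wide + err)}
for name, (rmse, r2) in results.items():
    print(f"{name:6s} RMSE={rmse:.4f}  R²={r2:.2f}")
```

The "wide" case reports an R² near 1 and the "narrow" case a poor (even negative) R², despite the identical error term, which is exactly why cross-model RMSE/R² comparisons require matched training-set distributions.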
- Predictor choice and interpretability
I am surprised to see r.GUM included as a predictor; could you justify its inclusion mechanistically?
- Reproducibility and open science
I also recommend, in keeping with FAIR principles, providing all scripts for the entire modeling workflow (data harmonization, predictor processing, modeling, validation, plotting); the workflow should be archived on OSF or GitHub with a permanent DOI. Ideally, provide a README explaining how the isoscapes can be regenerated from raw inputs.
Citation: https://doi.org/10.5194/essd-2026-136-RC2
Data sets
Bioavailable Sr isotopes of plants from Rio Grande do Sul and Santa Catarina states (Brazil, South America) and related isoscape raster files C. Scaggion et al. https://doi.org/10.5281/zenodo.17988601
In this paper, the authors develop a bioavailable Sr isoscape for South America using a machine learning approach. The model is trained on a compilation of published Sr data across the continent, complemented by new plant samples from two regions in Brazil. This work represents a valuable contribution, as spatially explicit Sr isoscapes remain scarce at the local and continental scales in South America, and such models have strong potential for applications in archaeology, ecology, and provenance studies.
Overall, the manuscript is well written and easy to follow. The introduction provides the necessary background and clearly frames the objectives. The methods are clearly described and rely on a robust and widely used modelling framework. The results are presented in a clear and detailed manner.
I only have a comment regarding the discussion. I was wondering whether there are published Sr datasets from southern Brazil (Santa Catarina or Rio Grande do Sul) that could be used for a qualitative comparison with the isoscape predictions. I am not suggesting a formal validation with statistical tests, but rather a brief and simple consistency check. It would help the reader assess how well the model performs in real-world contexts.
I only have a few minor comments:
- L248: Can you provide a description of the predictors that you included in the model (e.g., as a table in the SI)? This would allow the reader to understand the variables without having to consult previous publications. It would also make it unnecessary to redefine abbreviations in the results section.
- L254: parameter “mtry” not “mtyr”
- L295: What do you mean by “anomalous values”? Higher than expected? And how did you handle these data? Were they removed, as you did for the extreme Sr values of water, or kept in the model?
- L433-439: This is more out of curiosity and does not need to be included in the manuscript. Have you tested your model with the global dataset provided by Bataille et al. 2020 or more recent updates? Your reported R² is higher than studies using a global dataset to run their RF. I agree that when regional data coverage is good, a regional model is often preferable to a global one, but I would be interested to know how your model compares in that context.
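The mtry parameter and the kind of cross-validated RMSE/R² check discussed in the comments above can be sketched in Python. This is a hedged illustration, not the authors' pipeline (which presumably uses R's randomForest): the data are synthetic, and scikit-learn's max_features stands in for mtry, i.e. the number of predictors sampled at each split.

```python
# Hedged sketch, NOT the authors' pipeline: synthetic stand-in data, with
# scikit-learn's max_features playing the role of R randomForest's mtry.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate

rng = np.random.default_rng(1)
X = rng.uniform(size=(300, 8))          # stand-ins for geological predictors
# Synthetic 87Sr/86Sr-like target driven by one predictor plus noise:
y = 0.705 + 0.02 * X[:, 0] + 0.005 * rng.normal(size=300)

rf = RandomForestRegressor(n_estimators=200,
                           max_features=3,   # analogue of mtry
                           random_state=0)
cv = cross_validate(rf, X, y, cv=5,
                    scoring=("neg_root_mean_squared_error", "r2"))
print("CV RMSE:", -cv["test_neg_root_mean_squared_error"].mean())
print("CV R²:  ", cv["test_r2"].mean())
```

Swapping in the Bataille et al. 2020 global training set instead of the regional one, with everything else fixed, would be the cleanest way to answer the comparison question above.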