the Creative Commons Attribution 4.0 License.
Reference maps of soil phosphorus for the pan-Amazon region
João Paulo Darela-Filho
Anja Rammig
Katrin Fleischer
Tatiana Reichert
Laynara Figueiredo Lugli
Carlos Alberto Quesada
Luis Carlos Colocho Hurtarte
Mateus Dantas de Paula
David M. Lapola
Download
- Final revised paper (published on 31 Jan 2024)
- Supplement to the final revised paper
- Preprint (discussion started on 16 Aug 2023)
- Supplement to the preprint
Interactive discussion
Status: closed
RC1: 'Comment on essd-2023-272', Anonymous Referee #1, 28 Sep 2023
This manuscript employs Random Forests to map spatial patterns of various forms of phosphorus (P) within the topsoil profile (0 - 30 cm) across the pan-Amazon region at a resolution of 0.5 x 0.5 degrees. Leveraging a dataset of 108 soil P observations, the authors have generated comprehensive maps depicting the distribution of total, available, organic, and inorganic P forms in the topsoil. The results demonstrate commendable accuracy in estimating these various P forms. The manuscript is generally well-crafted with clearly defined objectives, and the study presents several merits. However, there are some methodological aspects lacking clarity and justification, along with a few minor concerns.
The main shortcomings of this study include:
Resolution Rationale: The manuscript lacks a clear justification for selecting a 0.5-degree resolution for mapping. Providing insight into the reasoning behind this choice would enhance the manuscript's robustness.
Methodological Explanation: Certain methodological aspects, such as the approach to Random Forests model selection and the use of 105 Random Forests models for 108 observations, require more detailed explanation and justification. The reasons given for excluding the primary mineral P and occluded P forms were not solid.
Temporal Representativeness: The temporal scope of the soil P estimates needs to be clarified and discussed. It would be beneficial to address the use of data collected at different periods and its potential impact on the results, especially in the context of changing soil conditions.
Additionally, the manuscript could be strengthened by:
High-Resolution Covariate Exploration: Given that many relevant soil P covariates are available at finer spatial grids, discussing the potential benefits of reproducing this study with higher spatial resolution information would enhance the value of the presented soil P data.
Sensitivity to Spatial Support: An exploration of how soil P predictions might vary with different spatial support levels would provide valuable insights into the robustness of the results.
Finally, on a minor note, it's important to consistently capitalize "Random Forests" throughout the manuscript, as it is the name of the algorithm.
Citation: https://doi.org/10.5194/essd-2023-272-RC1
RC2: 'Comment on essd-2023-272', Anonymous Referee #2, 05 Oct 2023
Dear Authors,
I have reviewed your manuscript titled "Mapping Spatial Patterns of Phosphorus in Pan-Amazon Region Using Random Forests" with great interest. The study explores the distribution of various forms of phosphorus (P) within the topsoil profile across the pan-Amazon region at a resolution of 0.5 x 0.5 degrees, using a dataset of 108 soil P observations. While I find the manuscript to be generally well-crafted and commend the accuracy of your P form estimations, there are some methodological aspects that require clarification and addressing, along with some major concerns.
Lack of Innovation in the Fitting Dataset: It is noted that your data sources are primarily from Hou et al., 2018, and Quesada et al. (2020). However, it is apparent from Supplementary Figure S6 that there is a lack of observation data in the central extensive Solimoes Basin. This gap in the dataset should have been addressed or justified in the manuscript.
Resolution Insufficiency: The manuscript does not provide a clear justification for selecting a 0.5-degree resolution for mapping. Given the availability of higher-resolution climate data and other relevant soil physicochemical properties at finer resolutions, the choice of such a coarse resolution for a relatively small study area requires further explanation.
Lack of Methodological Innovation: The manuscript mentions suboptimal model performance in high-altitude regions. It would be beneficial to explore the possibility of stratifying the analysis, perhaps by altitude or dominant vegetation types. This would demonstrate the flexibility of the Random Forest methodology. Additionally, if there is a lack of representative data in the central part of the study area, consider leveraging transfer learning techniques with data from other similar terrains globally.
Insufficient Model Validation: The manuscript could benefit from a comparison of simulation results with other relevant data sources such as the World Soil Information database, especially when using inputs from various soil databases. This would enhance the robustness of your findings.
Temporal Representativeness: The temporal scope of your soil P estimates should be clarified and discussed, especially considering data collected at different time periods. Exploring the temporal variation of soil P and its relationship with climate change would add depth to your study.
Sensitivity Analysis: It is crucial to conduct sensitivity analyses to assess how soil P predictions might vary with different spatial support levels. This would provide valuable insights into the robustness of your results.
In summary, while your manuscript presents promising results and is well-structured, it lacks some critical clarifications and justifications in terms of data, resolution choice, methodological innovation, validation, temporal analysis, and sensitivity analysis. Addressing these concerns would significantly strengthen your manuscript.
Please consider these comments in your revision. I look forward to seeing an improved version of your work.
Citation: https://doi.org/10.5194/essd-2023-272-RC2
AC1: 'Comment on essd-2023-272', Joao Paulo Darela-Filho, 03 Nov 2023
Dear Editor,
We sincerely thank you for considering our data description paper for publication in Earth System Science Data. We also thank the anonymous referees for their time, insightful comments, and recommendations, which helped us improve our manuscript. After carefully considering all feedback, we believe that the dataset and the data description paper have significantly improved.
In this revised version, we have reconstructed the dataset with a higher spatial resolution (5 arcminutes). Additionally, some of the geospatial datasets used in the methodology have been replaced with more recent products. We have also updated the manuscript to incorporate the suggestions of the anonymous referees. Specifically, we have clarified our model selection approach and addressed other methodological aspects that were pointed out.
Below, we reproduce the comments of Referees #1 and #2, followed by our responses (Author’s Response).
Referee #1
Resolution Rationale: The manuscript lacks a clear justification for selecting a 0.5-degree resolution for mapping. Providing insight into the reasoning behind this choice would enhance the manuscript's robustness.
Author’s Response: We thank the referee for this comment. Our initial choice of resolution was influenced by the primary purpose of these maps, which was to support the parameterization and evaluation of land surface/vegetation models. These models are computationally intensive and generally use a 0.5-degree resolution. However, we acknowledge that other uses of the reference maps, as proposed in the manuscript (Lines 348 and 355), could benefit from a finer resolution. Therefore, we are pleased to offer the set of reference maps at a 5-arcminute resolution.
During this process, we revisited the pre-processing of the predictive data and incorporated more recent datasets that have been published. Now, all variables, apart from temperature, precipitation, and elevation, are sourced from SoilGrids 2.0 [International Soil Reference and Information Centre] (Poggio et al., 2021).
We have added a sentence in Section 2.2 (Line 129) to clarify the rationale behind our chosen resolution.
Poggio, L., de Sousa, L. M., Batjes, N. H., Heuvelink, G. B. M., Kempen, B., Ribeiro, E., and Rossiter, D.: Soilgrids 2.0: Producing Soil Information for the Globe with Quantified Spatial Uncertainty, SOIL, 7, 217-240, https://doi.org/10.5194/soil-7-217-2021, 2021.
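To put the change of resolution in concrete terms, the short sketch below works out how the 5-arcminute grid relates to the original 0.5-degree grid. It is plain arithmetic and not part of the authors' pipeline; the variable names are illustrative only.

```python
from fractions import Fraction

ARCMIN_PER_DEGREE = 60


def cell_size_degrees(arcminutes):
    """Convert a cell size in arcminutes to degrees, exactly."""
    return Fraction(arcminutes, ARCMIN_PER_DEGREE)


coarse = Fraction(1, 2)        # 0.5-degree grid of the preprint
fine = cell_size_degrees(5)    # 5-arcminute grid of the revised dataset

# Number of fine cells along one side of a coarse cell, and per coarse cell.
cells_per_side = coarse / fine
cells_per_coarse_cell = cells_per_side ** 2

print(f"fine cell size: {float(fine):.4f} degrees")
print(f"fine cells per 0.5-degree cell: {cells_per_coarse_cell}")
```

Under these definitions each 0.5-degree cell is covered by a 6 x 6 block of 5-arcminute cells, so the revised maps carry 36 values where the preprint maps carried one.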
Methodological Explanation: Certain methodological aspects, such as the approach to Random Forests model selection and the use of 105 Random Forests models for 108 observations, require more detailed explanation and justification. The reasons given for excluding the primary mineral P and occluded P forms were not solid.
Author’s Response: We thank the referee for this comment. Regarding the model selection approach, we added the following text in Section 2.3 (Line 151):
“We chose this selection approach due to the inherent stochasticity in both the train/test split phase and the training of Random Forest models. In the former, samples from the dataset are randomly assigned to either the training or testing sets. In the latter, stochasticity arises from two factors: (i) bootstrap sampling, where each decision tree is trained on a random sample (with replacement) from the dataset, and (ii) feature randomness during decision tree construction. Unlike standard decision tree construction, which uses the feature that provides the most information gain for a split (or tree branch), Random Forests build each tree based on a random subset of features from the training data. Therefore, by selecting a group of models from a pool, we can capture the inherent stochasticity in the models while choosing the most accurate ones.”
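The three sources of randomness named in the quoted passage can be illustrated with a minimal, standard-library sketch. It is not the authors' code and it does not train a Random Forest; it only shows the random train/test split, the bootstrap sample drawn with replacement, and the random choice of candidate features. The 20 % test fraction and the feature names are assumptions made for illustration, and in common implementations the feature subset is redrawn at every split rather than once per tree.

```python
import random


def train_test_split(n_samples, test_fraction, rng):
    """Randomly assign sample indices to a training and a testing set."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return sorted(indices[n_test:]), sorted(indices[:n_test])


def bootstrap_sample(indices, rng):
    """Draw len(indices) items with replacement, as done for each tree."""
    return [rng.choice(indices) for _ in indices]


def random_feature_subset(features, k, rng):
    """Pick k candidate features without replacement."""
    return rng.sample(features, k)


rng = random.Random(0)  # a different seed gives a different split and model

# 108 observations as in the manuscript; the 20 % test share is an assumption.
train_idx, test_idx = train_test_split(108, 0.2, rng)

# Bootstrap: some training samples repeat, others are left out.
boot = bootstrap_sample(train_idx, rng)

# Hypothetical predictor names, used only for illustration.
features = ["temperature", "precipitation", "elevation", "soil_pH", "clay"]
candidates = random_feature_subset(features, 2, rng)

print(len(train_idx), len(test_idx), len(set(boot)), candidates)
```

Because every step depends on the random generator, repeating the procedure with different seeds yields a pool of distinct models, which is what makes selecting the most accurate ones from that pool meaningful.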
Upon revisiting our model selection, we decided to include the Random Forest models trained on the occluded P form in the production of the reference maps. Although the number of selected models and their accuracy were lower than for the other P forms, we decided to include them due to the positive feedback from the reviewers. As mentioned, the initial choice not to include occluded P in the model fitting was based on the lower number of selected models for this P form. Nonetheless, given the significance of occluded P for understanding ecological processes in the pan-Amazon region, we decided to include it after the review.
Unfortunately, this was not possible for primary mineral P. In our model selection approach, all models trained to predict primary mineral P demonstrated very low accuracy values. We believe that this is caused by the trace amounts of the mineral P form (calcium-bound P) found in most of the samples in the fitting dataset, which in turn is related to the lack of observations in the most P-rich sites. Additionally, the trace amounts of calcium-bound P found in the samples are explained by the elevated pH and advanced weathering stages of these soils. We have added an explanation for the exclusion of mineral P from model fitting and selection in Section 2.3 (Line 178).
Temporal Representativeness: The temporal scope of the soil P estimates needs to be clarified and discussed. It would be beneficial to address the use of data collected at different periods and its potential impact on the results, especially in the context of changing soil conditions.
Author’s Response: The comment raises a crucial point, and we thank the reviewer for pointing it out. The creation of the P reference maps assumes that the pool sizes of the P forms remained constant over the period in which the samples were collected. This significant assumption was not mentioned in the previous version of the manuscript. Given the timescales of P transformations in soils, we believe this to be a reasonable assumption. Unfortunately, data collection in the Amazon poses unparalleled challenges given the human and economic resources available. This impedes continuous survey campaigns aimed at repeated data collection in the region. It is also beyond the scope of our study to investigate the dynamics of P forms in soils on this spatial scale. We have now included this information at the end of subsection 2.1 (Line 116) in the Materials and Methods section. In section 4.3 (Line 346), we propose a potential use for the reference maps in studies investigating the dynamics of P forms in soils.
Additionally, the manuscript could be strengthened by:
High-Resolution Covariate Exploration: Given that many relevant soil P covariates are available at finer spatial grids, discussing the potential benefits of reproducing this study with higher spatial resolution information would enhance the value of the presented soil P data.
Author’s Response: We thank the referee for this comment. As previously mentioned, we have revisited the pipeline for creating the reference maps and rebuilt it with a finer resolution (5 arcminutes). In addition, we have replaced the predictive geospatial data on soil physicochemical attributes with more recent products.
Sensitivity to Spatial Support: An exploration of how soil P predictions might vary with different spatial support levels would provide valuable insights into the robustness of the results.
Author’s Response: The feedback from the referee is appreciated. We acknowledge that an analysis of this nature would require the exclusion or aggregation of some sampled points for the training of the Random Forest models. Given the limited number of samples, we chose to use a multivariate dissimilarity index. This approach allows us to avoid applying the models to data outside the training range, while testing the generality of the selected models using a Monte-Carlo cross-validation. In our view, this is the most effective approach, given the constraints imposed by the small number of samples and the characteristics of the Random Forest algorithm.
While we agree that some variables in the fitting dataset have different spatial supports (for example, pixels for climatic data and soil cores for soil data), it’s important to note that the Random Forest algorithm is not a geostatistical interpolation technique. Therefore, its requirements and assumptions for application in a geospatial context differ.
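The Monte-Carlo cross-validation mentioned in this response can be sketched as repeated random train/test splits, each scored on the held-out samples. The example below is a minimal illustration using only the standard library: a one-predictor least-squares line stands in for the Random Forest, and the data, the 100 repeats, and the 20 % test fraction are all assumptions, not values taken from the manuscript.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (stand-in for a real model)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def mae(y_true, y_pred):
    """Mean absolute error."""
    return statistics.fmean(abs(t - p) for t, p in zip(y_true, y_pred))


def r2(y_true, y_pred):
    """Coefficient of determination."""
    mean_t = statistics.fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def monte_carlo_cv(xs, ys, n_repeats, test_fraction, seed):
    """Score the model on n_repeats independent random train/test splits."""
    rng = random.Random(seed)
    n = len(xs)
    n_test = max(2, round(n * test_fraction))
    scores = []
    for _ in range(n_repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        pred = [a + b * xs[i] for i in test]
        true = [ys[i] for i in test]
        scores.append((mae(true, pred), r2(true, pred)))
    return scores


# Synthetic data with 108 samples; the linear relationship is invented.
data_rng = random.Random(42)
xs = [data_rng.uniform(0, 10) for _ in range(108)]
ys = [3.0 + 2.0 * x + data_rng.gauss(0, 1) for x in xs]

scores = monte_carlo_cv(xs, ys, n_repeats=100, test_fraction=0.2, seed=1)
mean_mae = statistics.fmean(s[0] for s in scores)
mean_r2 = statistics.fmean(s[1] for s in scores)
print(f"mean MAE = {mean_mae:.2f}, mean R2 = {mean_r2:.2f}")
```

Unlike k-fold cross-validation, the splits here are drawn independently, so a sample may appear in several test sets or in none. With few observations, the spread of the scores across repeats is as informative as their mean.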
Finally, on a minor note, it's important to consistently capitalize "Random Forests" throughout the manuscript, as it is the name of the algorithm.
Author’s Response: We thank the referee for noticing it. We have now ensured that the term “Random Forest” is consistently capitalized throughout the manuscript.
Referee #2
Lack of Innovation in the Fitting Dataset: It is noted that your data sources are primarily from Hou et al., 2018, and Quesada et al. (2020). However, it is apparent from Supplementary Figure S6 that there is a lack of observation data in the central extensive Solimoes Basin. This gap in the dataset should have been addressed or justified in the manuscript.
Author’s Response: We concur with the referee regarding the sparse nature of the sampled data. The limitations inherent in data collection across the pan-Amazon region are due to the limited mobility options. Many of the sampled sites can only be accessed through lengthy journeys along rivers and trails in the heart of the forest (Carvalho et al. 2023).
In our view, the world’s largest tropical forest has not received the attention it deserves, despite the tremendous efforts of scientists who spend considerable amounts of time and risk their lives to collect data in the Amazon wilderness. Furthermore, our study’s primary objective was to use available data with an alternative statistical method, to overcome the challenges encountered by other methods that used the same data to map P forms in the region.
We have addressed this issue by adding a paragraph to Section 2.1 in the manuscript (Line 116).
Carvalho, R. L., Resende, A. F., Barlow, J., Franca, F. M., Moura, M. R., Maciel, R., Alves-Martins, F., Shutt, J., Nunes, C. A., Elias, F., Silveira, J. M., Stegmann, L., Baccaro, F. B., Juen, L., Schietti, J., Aragao, L., Berenguer, E., Castello, L., Costa, F. R. C., Guedes, M. L., Leal, C. G., Lees, A. C., Isaac, V., Nascimento, R. O., Phillips, O. L., Schmidt, F. A., Ter Steege, H., Vaz-de-Mello, F., Venticinque, E. M., Vieira, I. C. G., Zuanon, J., Synergize, C., and Ferreira, J.: Pervasive Gaps in Amazonian Ecological Research, Curr Biol, 33, 3495-3504 e3494, https://doi.org/10.1016/j.cub.2023.06.077, 2023.
Resolution Insufficiency: The manuscript does not provide a clear justification for selecting a 0.5-degree resolution for mapping. Given the availability of higher-resolution climate data and other relevant soil physicochemical properties at finer resolutions, the choice of such a coarse resolution for a relatively small study area requires further explanation.
Author’s Response: We value the referee's suggestion and have taken it into consideration. Our original choice for a half-degree resolution was based on the anticipated use of the reference maps for the parameterization and evaluation of land surface/vegetation models. We agree that, given the relatively small study area, the initial choice of spatial resolution was not optimal. To address this issue, we have reconstructed the reference maps at a finer spatial resolution of five arcminutes. Additionally, we have incorporated more recent datasets of soil physicochemical properties. All geospatial datasets with soil properties are now sourced from SoilGrids 2.0 [International Soil Reference and Information Centre] (Poggio et al., 2021). We have added a sentence in Section 2.2 (Line 129) to clarify the rationale behind our chosen resolution.
Poggio, L., de Sousa, L. M., Batjes, N. H., Heuvelink, G. B. M., Kempen, B., Ribeiro, E., and Rossiter, D.: Soilgrids 2.0: Producing Soil Information for the Globe with Quantified Spatial Uncertainty, SOIL, 7, 217-240, https://doi.org/10.5194/soil-7-217-2021, 2021.
Lack of Methodological Innovation: The manuscript mentions suboptimal model performance in high-altitude regions. It would be beneficial to explore the possibility of stratifying the analysis, perhaps by altitude or dominant vegetation types. This would demonstrate the flexibility of the Random Forest methodology. Additionally, if there is a lack of representative data in the central part of the study area, consider leveraging transfer learning techniques with data from other similar terrains globally.
Author’s Response: We acknowledge and appreciate the referee's point of view on this matter. However, we would like to clarify that we did not claim that we observed suboptimal model performance at high altitudes. The model’s performance was evaluated using appropriate metrics (accuracy, R², MAE) and a Monte-Carlo cross-validation. The limited number of observations in these high elevation environments prevented us from applying the fitted models in areas where the predictive variables showed high multivariate dissimilarity between the fitting dataset and the predictive dataset.
The exclusion of some areas in the maps is related to a well-known limitation of the method applied. The Random Forest algorithm can be inaccurate when data outside the ranges used in the training phase is used for testing or prediction. We observed, a posteriori, that the areas excluded after using the Dissimilarity Index (DI) are characterized by high elevation.
Due to the small number of sampled points, especially in the most elevated areas, we believe that a stratified analysis is not an optimal approach as it would require us to train the models on even smaller subsets of data. Moreover, including a categorical variable defining elevation would be redundant, because elevation is already in the dataset.
Regarding the dominant vegetation type, most of the sampled points in the fitting dataset are in forests. A categorical variable defining vegetation type would be severely unbalanced and uninformative. As shown using the DI, despite the small number of observations in the Solimões basin, the region is well represented in terms of all variables when compared to the most elevated areas.
Nonetheless, new data collection campaigns throughout the study area could improve future studies.
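The masking by multivariate dissimilarity described in this response can be illustrated with a small sketch. The exact formulation of the Dissimilarity Index is not given in this discussion, so the version below is an assumption: predictors are standardized with the training mean and standard deviation, and the index of a new location is its distance to the nearest training point divided by the mean pairwise distance among training points. The predictor values and the cutoff of 1.0 are hypothetical.

```python
import math
import statistics


def standardize(rows, means, sds):
    """Scale every predictor with the training mean and standard deviation."""
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def dissimilarity_index(train_rows, new_rows):
    """Nearest-training-point distance over mean pairwise training distance."""
    columns = list(zip(*train_rows))
    means = [statistics.fmean(c) for c in columns]
    sds = [statistics.stdev(c) for c in columns]
    train = standardize(train_rows, means, sds)
    new = standardize(new_rows, means, sds)
    pair_dists = [
        math.dist(a, b) for i, a in enumerate(train) for b in train[i + 1:]
    ]
    mean_dist = statistics.fmean(pair_dists)
    return [min(math.dist(p, t) for t in train) / mean_dist for p in new]


# Hypothetical predictors: (elevation in m, mean annual temperature in deg C).
train_rows = [(100, 26.0), (150, 25.5), (200, 25.0), (120, 26.5), (180, 25.2)]
new_rows = [(140, 25.8), (3500, 8.0)]  # lowland-like cell, high-elevation cell

di = dissimilarity_index(train_rows, new_rows)

THRESHOLD = 1.0  # assumed cutoff, chosen only for this illustration
keep = [d <= THRESHOLD for d in di]
print([round(d, 2) for d in di], keep)
```

In this toy case the lowland-like cell stays within the training range and is kept, while the high-elevation cell lies far outside it and would be masked. This mirrors the reasoning given above for excluding areas where tree-based predictions would amount to extrapolation.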
Insufficient Model Validation: The manuscript could benefit from a comparison of simulation results with other relevant data sources such as the World Soil Information database, especially when using inputs from various soil databases. This would enhance the robustness of your findings.
Author’s Response: We appreciate the referee’s comment, but we find it unclear. The World Soil Information database does not provide maps of P forms in soil that could be used to evaluate our maps. This database is now the source for a subset of the covariates used in constructing the P maps at the new resolution of five arcminutes. As the World Soil Information database represents the state-of-the-art in terms of geospatial soil datasets, we consider that the application of the models using outdated soil datasets would not be of great benefit.
Temporal Representativeness: The temporal scope of your soil P estimates should be clarified and discussed, especially considering data collected at different time periods. Exploring the temporal variation of soil P and its relationship with climate change would add depth to your study.
Author’s Response: We kindly ask Referee #2 to refer to our response to the same issue raised by Referee #1. In that response, we clarified the caveats related to the temporal scope of data collection and stated the main assumption that the size of the P pools did not change significantly during the data acquisition period.
While we agree that the temporal variability of P in soil and its relationship with climate change is an important topic, discussing the impacts of climate change on the dynamics of P forms would require a different methodological approach. Therefore, we believe that this topic is outside the scope of this study.
Sensitivity Analysis: It is crucial to conduct sensitivity analyses to assess how soil P predictions might vary with different spatial support levels. This would provide valuable insights into the robustness of your results.
Author’s Response: We thank the referee for this comment. We kindly ask Referee #2 to refer to our response to the same issue raised by Referee #1 regarding sensitivity to spatial support.
Citation: https://doi.org/10.5194/essd-2023-272-AC1