Global patterns and drivers of soil total phosphorus concentration

. Soil represents the largest phosphorus (P) stock in terrestrial ecosystems. Determines the amount of soil P is a critical 15 first step for identifying sites where ecosystem functioning is potentially limited by soil P availability. However, global patterns 16 and predictors of soil total P concentration remain poorly understood. To address this knowledge gap, we constructed a database 17 of total P concentration of 5,275 globally distributed (semi-)natural soils from 761 published studies. We quantified the relative 18 importance of 13 soil-forming variables in predicting soil total P concentration and then made further predictions at the global 19 scale using a random forest approach. Soil total P concentration varied significantly among parent material types, soil orders, 20 biomes, and continents, and ranged widely from 1.4 to 9,630.0 (median 430.0 and mean 570.0) mg kg -1 across the globe. About 21 two-thirds (65%) of the global variation was accounted for by the 13 variables that we selected, among which soil organic 22 carbon concentration, parent material, mean annual temperature, and soil sand content were the most important ones. While 23 predicted soil total P concentrations increased significantly with latitude, they varied largely among regions with similar 24 latitudes due to regional differences in parent material, topography, and/or climate conditions. Soil P stocks (excluding 25 Antarctica) were estimated to be 26.8 ± 3.1 (mean ± standard deviation) Pg and 62.2 ± 8.9 Pg (1 Pg=1×10 15 g) in the topsoil 26 (0-30 cm) and subsoil (30-100 cm), respectively. Our global map of soil total P concentration as well as the underlying drivers 27 of soil total P concentration can be used to constraint Earth system models that represent the P cycle and to inform 28 quantification of global soil P availability. Raw datasets and global maps generated in this study are available at 29 (He et al., 2021). By constructing a database of total P concentration globally, we quantified the relative importance of multiple soil-forming 374 variables for predicting soil total P concentration and further estimated it at the global scale. Our results indicated that no single variable can be used to predict soil total P concentration. Instead, it is a combination of variables that are needed to reliably 376 predict soil total P concentration, among which SOC, parent material, MAT, and soil sand content are the most important predictors. Soil total P concentration was positively correlated with SOC and negatively correlated with both MAT and soil 378 sand content. Our predicted map captures the latitudinal gradient in potential soil total P concentration expected from our 379 theoretical understanding. We estimated that P stocks in the topsoil (0-30 cm) and subsoil (30-100 cm) of soil of natural ecosystems (excluding Antarctica) were 26.8 and 62.2 Pg, respectively. Our improved global map of soil total P will be an important resource for future work which aims to tackle issues related to P cycling, including predicting future land carbon P to aquatic and ecosystems of to increase food security.

carbon concentration, parent material, mean annual temperature, and soil sand content were the most important ones. While 23 predicted soil total P concentrations increased significantly with latitude, they varied largely among regions with similar 24 latitudes due to regional differences in parent material, topography, and/or climate conditions. Soil P stocks (excluding 25 Antarctica) were estimated to be 26.8 ± 3.1 (mean ± standard deviation) Pg and 62.2 ± 8.9 Pg (1 Pg=1×10 15 g) in the topsoil 26 (0-30 cm) and subsoil (30-100 cm), respectively. Our global map of soil total P concentration as well as the underlying drivers 27 of soil total P concentration can be used to constraint Earth system models that represent the P cycle and to inform 28 quantification of global soil P availability. Raw datasets and global maps generated in this study are available at  In terrestrial ecosystems, to a depth of one meter from the land surface, most of the P is found in the soil (Zhang et 32 2 al., 2021). The amount and form of P determine the supply of soil P to plants, which further regulate the structure and function 33 of global terrestrial ecosystems (Vitousek et al., 2010;Hou et al., 2020;Elser et al., 2007;Hou et al., 2021). Moreover, the 34 amount or total concentration of P in soils determines P concentration in all major forms in soils (Hou et al., 2018a;35 Turner and Engelbrecht, 2011). Therefore, it is important to determine the total concentration of P in soils, which varies up 36 to three orders of magnitude across the globe (Yanai, 1998;Augusto et al., 2010;Zhang et al., 2021). Despite the large variation 37 in soil total P concentration, its global patterns and drivers remain poorly resolved and improving this knowledge gap is needed 38 to better represent the P cycle in Earth system models (Fleischer et (Ringeval et al., 2017) and is crucial for both mapping soil total P concentration in natural 49 terrestrial ecosystems (Reed et al., 2015) and simulating ecosystem functioning (Achat et al., 2016a).

50
While each soil-forming factor can determine soil total P concentration, the roles of some factors (e.g., climate and 51 vegetation) are less understood than other factors (e.g., parent material and soil age). Since P in soil is derived mainly from 52 parent materials, the control of parent material on soil total P concentration has been well recognized (Augusto et al., 2017; 53 Porder and Ramachandran, 2013). Soil chronosequences provide a unique opportunity to isolate the effect of soil age from 54 other soil-forming factors on soil P dynamics, and have shown that soil age negatively impacts soil total P concentration 55 (Wardle et al., 2004;Delgado-Baquerizo et al., 2020;Vitousek et al., 2010;Walker and Syers, 1976 (Shangguan et al., 2014;Yang et al., 2013). These two maps have been used to 67 explore global patterns of soil P supply (Yang et al., 2013), estimate P limitation on future terrestrial C sequestration (Sun et 68 al., 2017), and used as baseline information to quantify P supply in agricultural ecosystems (Ringeval et al., 2017). They are 69 also used frequently in land surface models to benchmark soil P modules Goll et al., 2012). However, the 70 two maps may suffer from large uncertainties due to limited numbers of predictors used and/or low spatial coverage of global

76
To address these issues, we constructed a global database of total P concentration of 5,275 (semi-)natural soils from 761 77 published studies. We defined (semi-)natural ecosystems as ecosystems without any documented significant anthropogenic 78 activities such as tillage, fertilization, and heavy grazing. We then used random forest algorithms to quantify the relative 79 importance of soil-forming variables for predicting soil total P concentration and further predicted it at the global scale. In our 80 predicted map, we did not remove cropland or other heavily influenced areas (e.g., cities and roads), so the predicted map Given massive measurements of soil total P concentration in literature, it is practically infeasible to collect all the 88 measurements in literature. Therefore, we collected soil total P concentration measurements in (semi-)natural terrestrial 89 ecosystems mainly from existing global or regional databases, and additionally from literature with focus on the 90 underrepresented regions identified in global databases, to ensure a good coverage of global terrestrial ecosystems. We defined 91 (semi-)natural ecosystems as ecosystems without any documented significant anthropogenic activities such as tillage, 92 fertilization, and heavy grazing. Forests with a stand age greater than 10 years were considered as (semi-)natural ecosystems. We carefully checked the description of soil sampling in every cited paper for any anthropogenic activities such as tillage, 94 fertilization, and heavy grazing, and excluded such samples. Despite our efforts to exclude soils affected by anthropogenic 95 activities, some soils in our database might be influenced by undocumented anthropogenic activities (e.g., P fertilization in 96 reforested lands), particularly in Western Europe and Eastern USA (e.g., De Schrijver et al., 2012). We compiled the database 97 in four steps, which are described as follows.

98
First, we searched existing global or regional databases that may include soil total P concentration measurements in 99 (semi-)natural ecosystems in the Web of Science using key words "global OR terrestrial OR meta-analysis" AND "soil  Table S1.

107
Second, we used "soil phosphorus" as keywords to search global or regional databases stored in public data repositories  Table S1.

123
Fourth, we searched additional soil total P concentration measurements from underrepresented regions identified in steps 124 5 1-3, from Web of Science using keywords of "soil phosphorus" along with the keywords of the underrepresented regions (listed 125 in detail in Table S2). According to criteria above, we only collected soil total P concentration measurements in (semi-)natural 126 terrestrial ecosystems. In this step, we collected 541 additional site-level measurements of soil total P concentrations from 60 127 additional papers (Table S2; Supplementary Text 1).

128
Following these steps, our database included 5,275 measurements of soil total P concentration at the five factors were directly considered here (Table 1): parent material, climate (i.e. mean annual temperature (MAT), mean 142 annual precipitation (MAP), and biome), vegetation (i.e. net primary production (NPP)), and topography (e.g. slope and 143 elevation). As soil age was rarely reported, we used USDA soil orders as a proxy for age with 3 classes: slightly, intermediately, 144 and strongly weathered (Yang et al., 2013;Smeck, 1985

154
In addition to predictors of soil total P concentration related to soil-forming factors, we collected information about the 155 properties of the soils (e.g. soil organic carbon (SOC), soil pH, soil clay content (Clay) and soil sand content (Sand), and soil 156 depth (Depth); Table 1). These soil properties were used as additional predictors. We extracted predictors from each original 157 publication when available. In cases where information on predictors were not reported, we extracted the missing data from 158 gridded datasets (Table S3) based on the geographical coordinates of the measurement sites.

159
In the random forest model, correlated predictors can be substituted for each other, so that the importance of correlated 160 predictors will be shared, making the estimated importance smaller than the true value (Strobl et al., 2008). Thus, we did not 161 include soil total nitrogen content as it is correlated with SOC (r = +0.84), nor did we include aridity index as it is strongly 162 correlated with MAP (r = +0.82). We also did not include variables that were rarely reported in the referenced studies (e.g. soil 7 extractable aluminum and iron concentrations).

165
Among the 5,275 soil total P measurements, there were 15 extremely high values (> 4000 mg kg -1 ) (Fig. 1B). These high 166 values were likely derived in exceptional geological contexts (Porder and Ramachandran, 2013), or special soils (e.g. very 167 young volcanic soils). We reported these extremely high values while summarizing the database, for example, in Table 2 and   168   Table 3. However, we excluded these 15 measurements from model training and correlation analyses to avoid their possible 169 large influences on the overall relationships between soil total P concentration and other variables.

170
We compared a suite of algorithms against the aforementioned 13 predictors which included three generalized linear 171 models, Cubist model, Boosted tree model, and random forest model ( the performance of the models. In this method, the whole dataset was randomly split into five folds, each of which contained 174 20% of the data. One fold of data was used as test data, while the other four folds were used as training data. Then another fold 175 of data was used as test data, and the remaining ones as training data, and so on and so forth for a total of five times. Averages 176 of five sets of R 2 and RMSE were used as the model R 2 and RMSE, respectively. Based on the five-fold cross-validation 177 method, the random forest algorithm performed the best (R 2 = 0.65) among all five algorithms (Table S4) and was therefore 178 selected for follow-up analyses. Five-fold cross-validation was performed using the R package caret (v. 6.0-86) (Kuhn, 2020).

179
Random forest analysis was performed with the R package caret by applying the embedded R package randomForest version  Table S3). We did not remove cropland or other heavy influenced areas 186 (e.g. cities, roads, etc.), so the predicted map can be used to represent an initial state without direct anthropogenic influence.

187
Here we assume that cropland and other heavily influenced areas in their native states had the same set of relationships as for 188 (semi-)natural lands.
189 Soil depth was used as a covariate, so that the models could predict soil total P concentration for any given depth (Hengl

236
The random forest regression model explained 65% of soil total P concentration variability across all sites, with an RMSE 237 of 288.8 mg kg -1 (Fig. 3B & Table S4). The random forest model revealed that the two most important predictors of soil total 238 P concentration were SOC content and parent material. The remaining predictors showed a lower, but non-negligible influence, 239 11 with MAT and soil sand content having the most noticeable influence (Fig. 3A). Although soil order, biome, elevation, slope, 240 depth, NPP, and pH showed significant influences on soil total P concentration ( Fig. 2 and Fig. S3), their relative importance 241 was lower than the above four predictors. Partial dependent plots (Fig. 4) revealed similar results to Pearson correlation analysis 242 (Fig. S3). The partial dependent plots indicated a significant and positive relationship between soil total P concentration and 243 SOC at a global scale; soil total P concentration was significantly and negatively correlated with MAT and soil sand content 244 (Fig. S4). The Pearson correlation indicated the correlation coefficients between soil total P concentration and the top three 245 continuous predictors MAT, SOC, and soil sand content were -0.23, 0.19, and -0.18, respectively (Fig. S3).

255
In our predicted global map (Fig. 5), we did not remove cropland or other heavily influenced areas (e.g., cities and roads), 256 so the predicted map can be used to represent a natural background without direct anthropogenic influence. The predicted soil 257 total P indicated that the total global P stocks in the topsoil (0-30 cm) and subsoil (30-100 cm) were 26.8 (standard deviation 258 = 3.1) Pg and 62.2 (standard deviation = 8.9) Pg, respectively (excluding Antarctica; Table 4). Estimated area-weighted average 259 soil total P concentrations in the topsoil and subsoil were 529.0 and 502.3 mg kg -1 , respectively. Estimated area-weighted 260 average soil total P content in the topsoil and subsoil were 209.7 and 487.0 g cm -2 , respectively. The estimated global map of soil total P concentration revealed latitudinal patterns (Fig. 5), which were also found from 266 analysis of the site-level data (Fig. S4K). Soil total P concentration significantly increased from the equator to the poles in both  (Fig. S6).

271
Highlands and mountains at low latitudes (e.g., the Tibetan plateaus, Andes, Africa highlands, west India etc.) had high 272 soil total P concentrations. Our map also indicated some regional difference in soil total P, for example, central Australia was 273 low in soil total P compared with east and west Australia. On a larger scale, South America, Oceania, and Africa had the lowest 274 soil total P concentration, while soil total P concentration was highest in Europe, North America, and Asia (Table 4). The 275 estimated soil total P concentrations in the subsoil showed similar patterns to those found in the topsoil (Fig. 5A&C). data that did not represent the heterogeneous conditions found on Earth well, and did not separate natural soils from human-288 managed soils and therefore may not be able to distinguish natural drivers from anthropogenic factors (e.g. land use type, 289 mineral fertilizer). In addition, we mapped soil total P concentration by considering more predictors, at multiple soil depths, 290 and at a higher resolution than previous studies.

322
Further, we provide two explanations for the negative relationship between soil total P concentration and sand content. 0.3-0.5 kg P ha -1 yr -1 by leaching, while coarse sandy soils lose up to 2.0 kg P ha -1 yr -1 (Amberger, 1996).

348
While we found a latitudinal gradient in soil total P concentration, heterogeneity in soil total P concentration at the regional 349 and local scales was large. For example, consistent with Brédoire et al. (2016), we found that the soil total P concentration was 350 higher in Siberia than in northern Europe, both of which have similar latitudes. First, this difference may be due to the fact that 351 glaciation was more regular and intense in Siberia than in northern Europe (Wassen et al., 2021), leading to a more intensive 352 rejuvenation of soils. Second, the warmer and wetter climate in northern Europe may promote weathering which releases P 353 17 from parent material (Goll et al., 2014) and makes it subject to loss (Fig. S8). Regional variation in soil total P concentration 354 may also be attributable to regional variation in parent material. For example, higher soil total P concentration in eastern 355 Australia than in central Australia was probably due to P-enriched basaltic lithologies in eastern Australia (6500-8700 mg kg −1 ) 356 (Viscarra Rossel and Bui, 2016). Moreover, regional differences in soil total P concentration may be related to topography 357 conditions. For example, higher soil total P concentration in the Tibetan plateau than in eastern China may be the result of 358 higher elevation and lower MAT in the Tibetan plateau (Zhang et al., 2005).

360
Despite our unprecedented effort to construct a database and perform global predictions, our study has some limitations.

361
First, some regions were still underrepresented, e.g., northern Canada, Russia, middle Asia, and inner Australia, which may 362 result in a low accuracy of predicted values in these regions (Ploton et al., 2020). Further, our assumption that soils which are 363 or have been in agricultural use can be characterized in their native state by the same relationships as semi(natural) soils might 364 not hold true. For example, as fertile soils are preferred in agriculture and forestry. Second, subsoils (> 30 cm depth) were not 365 well represented in our dataset (14%) and therefore predicted P concentrations of subsoils may suffer from larger uncertainties 366 than those of topsoils (< 30 cm depth). Third, some predictors were largely missing. Map-filled values suffer from large 367 uncertainties, especially for the soil variables. This may cause some uncertainties in the predicted soil total P concentration.

368
Finally, 36% of the variation in soil total P concentration was not explained, despite inclusion of 13 predictors using an 369 advanced machine learning approach. This result may be because of measurement errors and/or methodological constraints.

370
These limitations highlight the need for more measurements of subsoil total P concentration and closely associated variables, 371 especially from the underrepresented regions, as well as more advanced statistical methods for spatial predictions.

373
By constructing a database of total P concentration globally, we quantified the relative importance of multiple soil-forming 374 variables for predicting soil total P concentration and further estimated it at the global scale. Our results indicated that no single 375 variable can be used to predict soil total P concentration. Instead, it is a combination of variables that are needed to reliably 376 predict soil total P concentration, among which SOC, parent material, MAT, and soil sand content are the most important 377 predictors. Soil total P concentration was positively correlated with SOC and negatively correlated with both MAT and soil 378 sand content. Our predicted map captures the latitudinal gradient in potential soil total P concentration expected from our 379 theoretical understanding. We estimated that P stocks in the topsoil (0-30 cm) and subsoil (30-100 cm) of soil of natural 380 ecosystems (excluding Antarctica) were 26.8 and 62.2 Pg, respectively. Our improved global map of soil total P will be an 381 important resource for future work which aims to tackle issues related to P cycling, including predicting future land carbon 382 18 sink potential and P losses to aquatic and marine ecosystems as well as modeling the P needs of crops to increase food security.