A data-driven topsoil δC dataset and the drivers of spatial variability across the Tibetan Plateau

Soil carbon isotopes (δC) provide reliable insights at a long-term scale for studying soil carbon turnover. The Tibetan Plateau (TP), called “the third pole of the earth” is one of the most sensitive areas to global climate change and exhibits an early warning signal of global warming. Although many studies detected the variability of soil δC at site scales, a knowledge gap still exists in the spatial pattern of topsoil δC across the TP. To fill the substantial knowledge gap, we first 25 compiled a database of topsoil δC with 396 observations from published literatures. Then we applied a Random Forest (RF) algorithm – a machine learning approach, to predict the spatial pattern of topsoil δC and β (indicating the decomposition rate of soil organic carbon (SOC), calculated by δC divided by logarithmically converted SOC). Finally, two datasets topsoil δC and β with a fine spatial resolution of 1 km across the TP were developed. Results showed that topsoil δC varied significantly among different ecosystem types (p < 0.001). Topsoil δC was -26.3 ± 1.60 ‰ (mean ± standard 30 deviation) for forests, 24.3 ± 2.00 ‰ for shrublands, -23.9 ± 1.84 ‰ for grasslands, -18.9 ± 2.37 ‰ for deserts, respectively. RF could well predict the spatial variability of topsoil δC with a model efficiency of 0.62 and root mean square error of 1.12 ‰, enabling to derive data-driven δC and β products. Data-driven topsoil δC varied from -28.26 ‰ to -16.95 ‰, https://doi.org/10.5194/essd-2021-411 O pe n A cc es s Earth System Science Data D icu ssio n s Preprint. Discussion started: 20 November 2021 c © Author(s) 2021. CC BY 4.0 License.


Introduction
Soil organic carbon (SOC) is the largest carbon pool in terrestrial ecosystems, containing about 1500 Pg (1 Pg = 10 15 g) carbon within the first meter, which is two-fold higher than that of the atmosphere (Scharlemann et al., 2014). Due to the decomposition of SOC, an amount of 57.2 Pg C was released from soil to the atmosphere (Tang et al., 2020). Thus a small change in SOC could lead to a profound impact on the atmospheric CO2 concentrations and hence climate change (Köchy et 45 al., 2015). Understanding SOC dynamics is of great importance to assess ecosystem carbon balance and its feedbacks to climate change (Averill et al., 2014;Campbell et al., 2009;Wang et al., 2012). However, it is difficult to detect statistically significant changes in soil carbon pool over a short time (Van Groenigen et al., 2014).
Carbon isotopes (δ 13 C) in soil organics provide reliable insights at the long-term scale into studying soil carbon turnover (Acton et al., 2013;Blagodatskaya et al., 2011;Khan et al., 2008;Li et al., 2020). Because the majority of soil organic 50 matter originates from plant residues, soil δ 13 C can well reflect vegetation-related soil formation and dynamics (Ehleringer et al., 2000). Previous studies focused on spatial variability of soil δ 13 C at in-site scale (Acton et al., 2013;Lu et al., 2004;Wang et al., 2012). For example, it is widely observed that soil δ 13 C values increase with the increasing soil depth and increase with the decreasing soil organic carbon (Brunn et al., 2014;Wang et al., 2017;Wang et al., 2012). Climate, edaphic variables, and their combinations have a vital influence on the spatial variability of soil δ 13 C (Garten Jr et al., 2000). 55 However, modelling the spatial patterns of δ 13 C using field observations has not been observed. A better understanding of the spatial variability of soil δ 13 C and its controlling factors at the regional scale is important to understand soil carbon dynamics and potential feedbacks to climate change (Li et al., 2020;Rao et al., 2017;Zhao et al., 2019).
Previous studies found a negative linear correlation between the log-transformed SOC concentration and soil δ 13 C (Acton et al., 2013;Garten Jr and Hanson, 2006). The slope of the linear regression of soil δ 13 C on log-transformed organic carbon 60 concentration is defined as β, a proxy of SOC decomposition (Garten Jr, 2006). The more negative the slope, the larger the decrease in the β value and the faster the turnover rate (Campbell et al., 2009). The method was widely used to study SOC turnover in the forest, grassland, and meadow ecosystems (Gautam et al., 2017;Peri et al., 2012;Zhao et al., 2019;Zhou et al., 2019). Acton et al. (2013) noted that the β should be applied out in well-drained soils characterized by a gradual mixing of litter and root carbon inputs decomposing in the soil profile. Empirical studies found that temperature, precipitation 65 (Acton et al., 2013), and soil properties (Wynn et al., 2006a) are important factors of driving β. However, temperature, precipitation, and soil properties vary greatly with climate zones and biomes, indicating that there is a strong spatial pattern of β. Therefore, detecting the spatial patterns of β is critical to study the SOC turnover to climate sensitivity. Whether β values can be used to constrain rates and controls on SOC turnover has not been fully explored at regional scales, particularly for areas with great sensitivity to climate change, e.g. the Tibetan Plateau (TP). 70 Due to highly data-adaptive and no initial assumptions on functional relationships between target variables and predict variables, machine learning approaches, e.g. Random Forest (RF) (Breiman, 2001), have been widely applied in spatial modelling in ecology and earth sciences using easy-to-measure variables (Tang et al., 2020;Yang et al., 2016). For example, Tang et al. (2020) predicted soil heterotrophic respiration using RF and found that soil heterotrophic respiration increased from 1980 to 2016 at the global scale. However, to date, no studies have used empirical field observations to assess the 75 spatial variability of soil δ 13 C to bridge the knowledge gap between local, regional and global scales.
The TP is the largest and highest plateau with an average altitude of 4000 m above sea level and covers about 2.5×10 6 km 2 on the earth . Soils in the TP store about 4.4 Pg (1 Pg = 10 15 g) carbon within 30 cm (Yang et al., 2009), accounting for 12.4% of total SOC in China's grasslands (Fang et al., 2010). In the last few decades, surface air temperature in the TP has increased by 0.44 o C per decade, which was almost three times the world average (0.16 o C per decade) (Duan 80 and Xiao, 2015). Thus, it is urgent to explore the feedbacks between SOC and climate change under ongoing climate change.
Site-level studies found that the decomposition rate of SOC accelerates with temperature increase, resulting in the release of stored carbon from the soil into the atmosphere (Chang et al., 2012;Dong et al., 2018;Li et al., 2020). However, regional estimates of the sensitivity of SOC decomposition in the TP are still missing.
In this study, we firstly compiled a database of topsoil δ 13 C from published literature based on field observations from 85 the TP and applied a Random Forest algorithm to predict the spatial patterns of topsoil δ 13 C with the linkage of environment variables. The objectives of this study are to: (1) compare the topsoil δ 13 C among different ecosystems; (2) develop gridded products of topsoil δ 13 C and β (named data-driven δ 13 C and β); (3) explore the spatial pattern of topsoil δ 13 C and β over the TP. The outcome can provide an insightful view of the SOC dynamics and turnover across the TP.
2 Materials and methods 90

Data sources
The first topsoil (0 -5 cm) δ 13 C dataset based on field observations was from Lu et al. (2004) and second dataset was Qi (2017). Along different elevation gradients, soil samples were collected away from the villages and leaves, grass roots, and litter were removed before sampling Qi, 2017

Environmental variables 100
Topsoil δ 13 C is affected by multiple environmental factors. To investigate the spatial patterns of δ 13 C, the spatial grid of environmental variables was required. The environmental variables were included: mean annual temperature (MAT) and mean annual precipitation (MAP) with the spatial resolution is 1 km during 2000-2010 from Peng et al. (2019).The elevation data (digital elevation map; DEM) with 1 km spatial resolution was obtained from National Earth System Science Data Center (http://www.geodata.cn). Moderate-resolution Imaging Spectroradiometer (MODIS) products, including normalized 105 difference vegetation index (NDVI) and enhanced vegetation index (EVI) with a spatial resolution of 1 km, leaf area index (LAI) and the fraction of photosynthetically active radiation (FPAR), with a spatial resolution of 500 m, gross primary productivity (GPP) with a spatial resolution of 500 m, evapotranspiration (ET) and potential evapotranspiration (PET) with a spatial resolution of 500 m, land cover type (LCT), with a spatial resolution of 500 m were from https://lpdaac.usgs.gov/.
Due to data availability, NDVI, EVI and GPP covered from 2000 to 2010, while PET covered from 2001 to 2010, 110 and LAI and FPAR covered from 2002 to 2010. The soil organic content, soil pH, soil BD, soil silt content, soil clay content, soil sand content with 250 m spatial resolution were from SoilGrids (Hengl et al., 2017). Before data analysis, the soil and MODIS data were resampled to 1 km using the nearest neighbour method.

Data analysis
One-way analysis of variance (ANOVA) was performed to analyse the significance difference in topsoil δ 13 C among 115 forests, shrublands, grasslands and deserts. If the difference was significant at the 0.05 level, the Tukey-HSD (honestly significant difference) test was applied for multiple comparisons. Tukey-HSD is a post-hoc test based on the studentized range distribution that determines which specific groups' means are different by comparing all possible pairs of means  (Bretz et al., 2016). The correlation analysis was conducted to explore the correlations between topsoil δ 13 C and climate, vegetations and soil factors. 120 The β values, which reflect the sensitivity of SOC decomposition (Garten Jr, 2006), were obtained using a standard least-squares regression analysis between the log10-transformed SOC concentration and δ 13 C: Where δ 13 CSOC is the δ 13 C in SOC, and the β value is the regression coefficient. The α value is a constant number, which was obtained using a standard-least squares regression analysis between the log10-transformed SOC concentration and soil 125 δ 13 C from gathering dataset (Fig. S1). In current study, the spatial resolution 250 m SOC within 0 -5 cm from SoilGrids was used (Hengl et al., 2017). All the analyses were conducted in R 3.6.3 (R Core Team, 2018).

Feature selection
In order to improve the model efficiency and reduce the workload, a recursive feature elimination (RFE) algorithm was 130 used for variable selection (Kuhn, 2008). RFE improves the generalization efficiency by avoiding overfitting while reducing the complexity of the model. In general, RFE is to select features by recursively considering smaller and smaller feature sets: first, all input variables participate in random forest modelling and rank the importance of participating variables; second, a new feature set is obtained by removing the corresponding proportion of unimportant indicators from the current feature variables; third, a new random forest is created with the new feature set, and the variable importance of each feature in the 135 feature set is calculated and ranked. The three steps are repeated until best features remained. Finally, six variables (MAT, MAP, Altitude, NDVI, Vegetation types, pH) were selected to predict topsoil δ 13 C. Figure S1 shows that the relative importance evaluation of all the variables.

Modelling
RF was used to model the spatial patterns of topsoil δ 13 C. RF is an ensemble machine learning algorithm that predicts a 140 response from a set of predictors by creating multiple decision trees and aggregating their results (Breiman, 2001). RF algorithm has two important custom parameters, the number of categorical regression trees and the number of random variables separating the nodes. Model prediction accuracy can be improved by optimizing these two parameters (Liaw and Wiener, 2002). However, RF models are usually insensitive to the number of trees or variables. RF regression can handle a large number of features and aids feature selection according to the importance value of each variable for avoiding over-145 fitting (Bodesheim et al., 2018;Jian et al., 2018;Tang et al., 2020). In the present study, the RF model was trained by caret package (version 6.0-80) by linking RandomForest in R, and then the model was implemented to predict topsoil δ 13 C for each grid with a spatial resolution of 1 km. To evaluate the performance of RF, a 10-fold cross-validation was applied, which meant that the data set was stratified into 10 parts, and each part contained approximately an equal number of samples. The target values for each of these ten parts were predicted using a model trained using the remaining nine parts (Jung et al., 150 2011;Tang et al., 2020). The model efficiency (R 2 ) and root-mean-square error (RMSE) were used to model evaluation (Tang et al., 2020;Yao et al., 2018).

Spatial patterns of the data-driven δ 13 C
Based on the 10-fold cross-validation, R 2 and RMSE were 0.62 and 1.12 ‰, respectively, indicating that RF can well predict the spatial patterns of topsoil δ 13 C (Fig. 4).  The data-driven δ 13 C showed great spatial variation across the TP. The highest topsoil δ 13 C was observed in the north and northwest TP, while the lowest topsoil δ 13 C was in Southeast or South TP. Across the TP, soil δ 13 C varied from -28.26 ‰ to -16.95 ‰, and mean topsoil δ 13 C was -22.26 ‰.

Spatial variability of the data-driven β across the TP
The data-driven β also showed strong spatial variabilities across the TP (Fig. 6). The highest β values were found in northwest and north regions, while the lowest values were observed in the middle-west TP with β lower than -8. Mean β was 190 -2.33 across the TP.
For the first time, we developed a data-driven δ 13 C of topsoil using RF, and found a great spatial pattern in topsoil δ 13 C, with an increasing trend from the southeast to the northwest TP. Such spatial patterns may be primarily associated with 200 vegetation types , because we found that vegetation types was the most important factor in predicting topsoil δ 13 C (Fig. S2). Plant litter was the main source of soil organic matter and plant litter production varied greatly among different ecosystem types. Therefore, different vegetation types with large differences in leaf δ 13 C could lead to a significant impact on topsoil δ 13 C Yang et al., 2015). Plant species growing in dry habitats generally have high leaf δ 13 C values . For example, C4 plants, growing in the high elevation of the TP with a strong ability to 205 adapt to severe drought, had much higher leaf δ 13 C than C3 plants . Meanwhile, the degree of carbon isotope fractionation during the conversion of soil organic matter from different regions and plant residues varied, ranging from 0.5 ‰ to 2 ‰ (Cao et al., 2005). The southeast TP was dominated by forests, and the northwest TP was dominated by deserts and grasslands , which could potentially lead to higher soil δ 13 C in the northwest TP and lower soil δ 13 C in the southeast TP (Fig. 4). 210 Besides vegetation types, climate factors were also important to influence the spatial variation of topsoil δ 13 C (Wang et al., 2013;Rao et al., 2012;Zhao et al., 2017). The data-driven δ 13 C had a significant and negative correlation with MAT (Fig.   3), indicating that temperature could lead to a significant impact on topsoil δ 13 C. Our results were consistent with Rao et al. (2017) and Zhang et al. (2020), who found topsoil δ 13 C decreased with the increase of MAT. However, the data-driven δ 13 C were different from Wang et al. (2012) and Zhang et al. (2020), and they found a positive correlation between topsoil δ 13 C 215 and MAT. The controversial results may be associated with the relative complexity between temperature and topsoil δ 13 C (Rao et al., 2017). First, the temperature can affect topsoil δ 13 C by changing δ 13 C in vegetation and microbial characteristics.
For example, temperature could affect the relative abundance of C3 and C4 plants (Tieszen et al., 1997). Generally, a relatively higher abundance of C4 plants distribution was found in higher MAT and lower MAP areas (Zhang et al., 2003).
Second, temperature could also affect carbon isotope fractionation by modulating the stomatal conductance of plants and the 220 activities of photosynthetic enzymes (Rao et al., 2017;Zhao et al., 2017). With the increase of temperature, plants close to the leaf stomata decrease the intercellular CO2, thus leading to the increase of δ 13 C in plants. Third, temperature could also affect topsoil δ 13 C by regulating the decomposition rate of litter and ecosystem respiration (Cai et al., 2021;Kato et al., 2004). Generally, a lower temperature led to a lower litter decomposition and ecosystem respiration , enriching soil δ 13 C.
Fourth, the temperature can also affect topsoil δ 13 C by changing isotopic fractionation during the microbial decomposition 225 (Garten Jr, 2006). Microbes tend to use lighter 12 C during the decomposition and 13 C component accumulates. Therefore, the combined effects of lower stomatal conductance and lower enzyme activity resulted in a negative correlation between topsoil δ 13 C and temperature (Li et al., 2020;Zhang et al., 2020), and higher topsoil δ 13 C in north and northwest TP (Fig. 4).
Precipitation is another important factor influencing soil δ 13 C. Our results showed that topsoil δ 13 C decreases with increasing precipitation (Fig. 3), which was consistent with previous studies (Murphy and Bowman, 2009;Zhao et al., 2019). 230 The mechanisms of the impact of precipitation on δ 13 C have been well explained. It is generally accepted that because of the lack of water, vegetation will close stomata to reduce transpiration, leading to an increase in δ 13 C (Farquhar et al., 1989). In the last several decades, the TP suffered from a significant increase in MAT and MAP, which may increase the species, microbial quantities and activities (Papatheodorou et al., 2004), accelerating the decomposition rate of 12 C, enriching soil δ 13 C (Li et al., 2020). In the north TP, precipitation was much lower than in the south TP (Fig. S3). Therefore, the change of 235 vegetation types, the closure of stomata due to the lack of precipitation and the increase microbial activities due to increasing MAT and MAP may partly explain the higher topsoil δ 13 C in the north TP compared to other regions of the TP.
Soil factors may influence soil δ 13 C by altering microbial activity, matrix quality, and effectiveness (Wynn et al., 2006b;. Generally, soil δ 13 C decreased with the increase of SOC Yang et al., 2015), and increased with the increase of soil sand content . This is consistent with our study that we found a 240 negative correlation between the data-driven δ 13 C and SOC, and a positive correlation between the data-driven δ 13 C and sand https://doi.org/10.5194/essd-2021-411  (Fig. 3). Meanwhile, soil texture can lead to a significant impact on SOC dynamics and affect soil δ 13 C (Bird et al., 2002). Soil δ 13 C increases with decreasing particle size because carbon enriched in δ 13 C is allocated to microbial biomass and can subsequently be stabilized by the interaction with soil fine mineral phases (Kleber et al., 2011;Sollins et al., 2009).
We found that topsoil δ 13 C was negatively correlated with soil silt content (Fig. 3). This result is consistent with the study 245 from Wang et al. (2012) in the TP.

Spatial patterns of the data-driven β across the TP
Soil carbon turnover is a major determinant of the capacity of soil carbon sequestration (Luo et al., 2003), and a decrease in carbon turnover can sequestrate SOC without an increase in carbon input (Jastrow et al., 2006). Because it is difficult to detect the change in SOC stock over short periods due to the large pool size and huge spatial heterogeneity (Van 250 Groenigen et al., 2014), the predicted β across the TP could provide a reliable method to evaluate the SOC turnover rate over a large spatial scale (Brunn et al., 2014;Gautam et al., 2017). Therefore, understanding the spatial variation of the β values is particularly important.
The β values reflect the turnover rate of SOC in response to microbial activities. The more negative the β values and the faster turnover of SOC (Acton et al., 2013;Zhao et al., 2019). Although many studies have compared β values among 255 different ecosystem types across the TP and suggested that β was a useful proxy for understanding generalized patterns of SOC turnover and the underlying control over soil metabolism , knowledge gaps still exist in the spatial variability of β. This study was the first time to estimate the spatial patterns of β across the TP, which could improve our understanding of the spatial patterns of SOC turnover and contribute to predicting the soil C dynamics and feedback of soil C cycle to climate change. There was a great spatial pattern of the data-driven β across the TP, highlighting the large variability 260 in SOC turnover. The lowest β values were below -10, locating at the east and middle TP, which was much lower than the observed β values ranging from −0.60 to −7.41 Zhao et al., 2019), indicating that SOC turnover was faster in the east and middle TP compared that of other regions and highlighting the need of protecting SOC in the TP under the ongoing climate change.
Understanding how the environmental variables that affects the spatial patterns of β values is a key goal for 265 understanding the SOC dynamics. The temperature and precipitation are important variables that have the significant effect on SOC turnover (Li et al., 2020;Wang et al., 2017). Generally, increasing temperature and precipitation can stimulate the turnover rate of SOC by affecting the soil microbial biomass and enzyme activities (Collins et al., 2008;Conant et al., 2011), and vice versa. Our study found that β values were low in the east TP and high in the north and northwest TP, indicating that SOC turnover rate in the east TP was faster than that in the north and northwest TP. This result may result from differences 270 in climate because MAT and MAP in the north or northwest TP were much lower compared to the east TP ( Fig. S3 and Fig.   S4). Our result also agreed with a previous study that the SOC turnover increased with increasing temperature at a global scale . Besides climate, soil properties also lead to a significant impact on SOC turnover. A previous study indicated that β values were generally negatively correlated with sand content and positively correlated with clay content in the TP (Li et al., 275 2020). Our results generally agreed with the previous study that β values were high in the northwest TP where soil sand content was low, and soil clay content was high ( Fig. S5 and Fig. S6). It is widely accepted that soil texture could affect SOC turnover by changing soil water-holding capacity, water movement, gas diffusion (Kaiser et al., 2015;Yiqi and Zhou, 2010;. Meanwhile, soil pH can also affect SOC turnover by altering microbial community and enzyme activity, along with substrate availability (Priha et al., 2001). Therefore, the spatial patterns of β values were jointly controlled by 280 climate and soil properties, and detecting the dominant environmental control on β values can enhance the predictive power for addressing the spatial patterns of SOC turnover, as well as for understanding the future response of SOC to climate change.

Limitations
In this study, based on the topsoil δ 13 C field observations dataset, we developed a data-driven δ 13 C of topsoil using a RF 285 algorithm and analysed its spatial pattern across the TP, however, limitations still remain in a few aspects. First, the RF algorithm builds a model based on the training dataset, which is usually limited by data in terms of quantity, quality, and representativeness. In many ecological studies around the world, uneven data distribution has always been a well-known problem (e.g., Jung et al. (2011) and Xu and Shang (2016)). The study sites of topsoil δ 13 C were mainly concentrated in the eastern and northern TP, while there were a lack of topsoil δ 13 C field observations in the western and northwest TP. 290 Therefore, the uneven coverage of observations was an important source of uncertainty to predict topsoil δ 13 C, which may cause a bias in the RF model towards the areas with more observations. In the future studies, increasing the number of field observations in the eastern and northern TP could improve the ability to evaluate spatial pattern of topsoil δ 13 C across the TP.
Second, our dataset was from the topsoil within 0 -5 cm, although it is generally accepted that topsoil generally had higher carbon content and more sensitive to environmental change compared to subsoils. Therefore, modelling soil δ 13 C for deeper 295 soils would greatly improve our understanding of soil carbon dynamics and its response to carbon-climate feedbacks across the TP.

Data availability
There were three datasets in our study. The first dataset was topsoil δ 13 C from field observations. The second and third datasets were data-driven δ 13 C and β with a spatial resolution of 1 km using RF algorithm. The datasets were publicly 300 available for scientific purposes and freely downloaded at https://doi.org/10.6084/m9.figshare.16641292.v2 (Tang, 2021).

Conclusions
Gridded data-driven δ 13 C and β of topsoil with a spatial resolution of 1 km were developed based on field observations using RF. Our results showed that topsoil δ 13 C varied significantly among ecosystem types, indicating that vegetation types led to a significant impact on topsoil δ 13 C. Data-driven δ 13 C of topsoil varied from -28.29 ‰ to -16.95 ‰ and δ 13 C was 305 increasing from southeast to northwest. Similarly, strong spatial variabilities were observed in data-driven β and increased from eastern to northwest, indicating that SOC turnover was higher in the east TP compared to that of the northwest TP. The data-driven δ 13 C and β of topsoil could provide an independent benchmark for biogeochemical models to study SOC turnover and terrestrial carbon-climate feedbacks under ongoing climate change in the TP.

310
Competing interests. The authors declare that they have no conflict of interest.