SoilKsatDB: global database of soil saturated hydraulic conductivity measurements for geoscience applications

The saturated soil hydraulic conductivity (Ksat) is a key parameter in many hydrological and climate models. Ksat values are primarily determined from soil basic properties and may vary over several orders of magnitude. Despite the availability of Ksat datasets in the literature, significant efforts are required to combine the data before it can be used for specific applications. In this work, a total of 13,258 Ksat measurements from 1,908 sites were assembled from the published literature and other sources, standardized (i.e., units made identical), and quality-checked in order to obtain a global database of soil saturated hydraulic conductivity (SoilKsatDB). The SoilKsatDB covers most regions across the globe, with the highest number of Ksat measurements from North America, followed by Europe, Asia, South America, Africa, and Australia. In addition to Ksat, other soil variables such as soil texture (11,584 measurements), bulk density (11,262 measurements), soil organic carbon (9,787 measurements), moisture content at field capacity (7,382) and wilting point (7,411) are also included in the dataset. To show an application of SoilKsatDB, we derived Ksat pedotransfer functions (PTFs) for temperate regions and laboratory-based soil properties (sand and clay content, bulk density). Accurate models can be fitted using a Random Forest machine learning algorithm (best concordance correlation coefficient (CCC) equal to 0.74 and 0.72 for measurements from temperate areas and for laboratory measurements, respectively). However, when these Ksat PTFs are applied to soil samples obtained from tropical climates and field measurements, respectively, the model performance is significantly lower (CCC = 0.49 for tropical and CCC = 0.10 for field measurements). These results indicate that there are significant differences between Ksat data collected in temperate and tropical regions and Ksat measured in the laboratory or field.
The SoilKsatDB dataset is available at https://doi.org/10.5281/zenodo.3752721 (Gupta et al., 2020) and the code used to extract the data from the literature and the applied random forest machine learning approach are publicly available under an open data license.


Introduction
The soil saturated hydraulic conductivity (Ksat) describes the rate of water movement through saturated soils and is defined as the ratio between water flux and hydraulic gradient (Amoozegar and Warrick, 1986). It is a key variable in a number of hydrological, geomorphological, and climatological applications, such as rainfall partitioning into infiltration and runoff (Vereecken et al., 2010), optimal irrigation design (Hu et al., 2015), and the prediction of natural hazards including catastrophic floods and landslides (Batjes, 1996; Gliński et al., 2000; Zhang et al., 2018). Accurate measurements of Ksat in the laboratory and field are laborious, time-consuming, and often scale dependent (Youngs, 1991). Infiltrometer measurements in the field also allow Ksat to be measured in forests and other types of structured soils; however, so far Ksat values have been measured mainly for agricultural soils (Romano and Palladino, 2002).
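The definition above corresponds to Darcy's law for saturated flow. As a brief worked form (the symbols q, H, z, Q, L, A, and ΔH are notation introduced here for illustration and do not appear in the source):

```latex
% Darcy's law: flux density q is proportional to the hydraulic gradient dH/dz
q = -K_{sat}\,\frac{dH}{dz}
\qquad\Longrightarrow\qquad
K_{sat} = \frac{|q|}{|dH/dz|}

% Constant-head test on a core of length L and cross-section A,
% with steady discharge Q under a head difference \Delta H:
K_{sat} = \frac{Q\,L}{A\,\Delta H}
```

Under these assumptions, doubling the discharge measured at the same head difference doubles the inferred Ksat, which is why small flow-path differences between samples (for example a single open macropore) can shift Ksat substantially.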
Efforts to produce reliable and spatially refined datasets of hydraulic properties date back to the 1970s with the proliferation of distributed hydrologic and climatic modeling. These early notable works also provided basic databases (some of which are used in this study) for Australia (McKenzie et al., 2008; Forrest et al., 1985), Belgium (Vereecken et al., 2017; Cornelis et al., 2001), Brazil (Tomasella et al., 2000, 2003; Ottoni et al., 2018), France (Bruand et al., 2004), Germany (Horn et al., 1991; Krahmer et al., 1995), Hungary (Nemes, 2002), the Netherlands (Wösten et al., 2001), Poland (Glinski et al., 1991), and the USA (Rawls et al., 1982). A detailed discussion of the available datasets for Ksat and other hydro-physical properties is provided in Nemes (2011). Collaborative efforts have resulted in the compilation of multiple databases, including the Unsaturated Soil Hydraulic Database (UNSODA) (Nemes et al., 2001), the Grenoble Catalogue of Soils (GRIZZLY) (Haverkamp et al., 1998), and the Mualem catalogue (Mualem, 1976). These databases, however, focused on soil types and not on the spatial context of Ksat mapping. In an effort to provide spatial context, Jarvis et al. (2013) and Rahmati et al. (2018) published global databases for soil hydraulic and soil physical properties. Likewise, the European soil data center also started projects such as SPADE (Hiederer et al., 2006) and HYPRES (Wösten et al., 2000) for generating spatially referenced soil databases for several countries. Since HYPRES only includes western European countries, Weynants et al. (2013) gathered data from 18 countries and developed the European HYdropedological Data Inventory (EU-HYDI) database. This dataset is, however, not publicly available and was not included in this compilation.
The datasets mentioned above cover almost all climatic zones except tropical regions, where Ksat values can be significantly different due to the strong local weathering processes and different clay mineralogy (Hodnett and Tomasella, 2002). Recently, Ottoni et al. (2018) published a dataset named HYBRAS (Hydrophysical Database for Brazilian Soils), improving the coverage of South American tropical regions. In addition, Rahmati et al. (2018) recently published the Soil Water Infiltration Global database (SWIG) with information on Ksat for the whole globe. In the SWIG database, some Ksat values were extracted from the literature and other Ksat values were deduced from infiltration time series. In contrast to laboratory measurements that determine Ksat as the ratio of flux density to gradient, infiltration-based methods determine Ksat by fitting infiltration dynamics to parametric models of the infiltration process; for a review on analytical models characterizing the infiltration process see Kutílek et al. (1988), Youngs (1991), and Vereecken et al. (2019).
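To make the idea of fitting infiltration dynamics to a parametric model concrete, the following is a minimal sketch that fits Philip's two-term infiltration equation, I(t) = S·sqrt(t) + A·t, by ordinary least squares. The choice of Philip's model, the function name `fit_philip`, and the sample data are assumptions made here for illustration; the source does not state which infiltration model was used for the SWIG-derived values.

```python
import math

def fit_philip(times, cum_infiltration):
    """Least-squares fit of I(t) = S*sqrt(t) + A*t (no intercept).

    times: elapsed times (e.g. hours); cum_infiltration: cumulative depth (e.g. cm).
    Returns (S, A): sorptivity-like and steady-rate-like coefficients.
    """
    # Normal equations for the two regressors x1 = sqrt(t) and x2 = t
    s11 = sum(t for t in times)            # sum of x1^2
    s12 = sum(t ** 1.5 for t in times)     # sum of x1*x2
    s22 = sum(t ** 2 for t in times)       # sum of x2^2
    b1 = sum(math.sqrt(t) * i for t, i in zip(times, cum_infiltration))
    b2 = sum(t * i for t, i in zip(times, cum_infiltration))
    det = s11 * s22 - s12 ** 2
    S = (b1 * s22 - b2 * s12) / det
    A = (s11 * b2 - s12 * b1) / det
    return S, A

# Hypothetical, noise-free infiltration series generated with S = 2.0, A = 0.5
times = [0.25, 1.0, 4.0, 9.0]
depths = [2.0 * math.sqrt(t) + 0.5 * t for t in times]
S_hat, A_hat = fit_philip(times, depths)
```

The coefficient A is often interpreted as a fraction of Ksat, but the exact factor depends on the assumptions behind the model and is not specified here, so this sketch stops at the fitted coefficients.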
The increasing demand for highly resolved descriptions of surface processes requires commensurate advances in representation of Ksat in modern Earth System Model (ESM) applications. Several existing Ksat datasets either lack coordinates or provide them with unknown accuracy, which limits their application in spatial modeling. For example, the SWIG dataset lacks information on soil depth and assigns entire watersheds to a single coordinate. Similarly, the UNSODA dataset does not provide coordinates and soil texture information for all samples. For a few locations, HYBRAS uses a different coordinate system. Taken together, these limitations imply that preparing spatially referenced global Ksat datasets for large-scale applications requires serious effort to compile, standardize, and quality-check all (publicly available) literature.
The objective of the work here is to provide a new global standardized Ksat database (SoilKsatDB) that can be used for geoscience applications. To do so, a total of 13,267 Ksat measurements were compiled, standardized, and cross-checked to produce a harmonized compilation which is analysis-ready (i.e., it can directly be used to test various Machine Learning algorithms for spatial analysis). We compiled data from existing datasets and, to improve the spatial coverage in regions with sparse data, we conducted a literature search to include Ksat measurements in geographic areas that were not yet included in other existing databases. In the manuscript, we first describe the data compilation process and then describe methodological steps used to spatially reference, filter, and standardize the existing datasets. As an illustrative application of the dataset, we derive pedotransfer functions (PTFs) for different climatic regions and measurement methods and discuss their transferability to other regions/measurement methodologies. We fully document all importing, standardization, and binding steps using the R environment for statistical computing (R Core Team, 2013), so that we can collect feedback from other researchers and increase the speed of further updates and improvements. The newly created data set (SoilKsatDB) can be accessed via https://doi.org/10.5281/zenodo.3752721.
2 Methods and materials
We searched for soil hydraulic conductivity datasets using "saturated hydraulic conductivity database", "Ksat", and "hydraulic conductivity curves" as keywords. The collected datasets are listed in Table 1 together with the number of Ksat observations for each study. They can be classified into three main categories, namely: i) existing datasets (in form of tables) published and archived with a DOI in peer-reviewed publications, ii) legacy datasets in paper/document format (e.g., legacy reports, PhD theses, and scientific studies) and iii) on-line materials.
Existing datasets include published datasets such as HYBRAS (Ottoni et al., 2018), UNSODA (Nemes et al., 2001), SWIG (Rahmati et al., 2018), and the soil hydraulic properties over the Tibetan Plateau (Zhao et al., 2018), from which we extracted the required information as described in Table 2a. The major challenge with making the existing datasets compatible for binding (standardization, removing redundancy) was to obtain the locations for a particular sample as well as the corresponding measurement depths. For instance, the UNSODA database does not provide information on the geographical locations. To fill the gaps and make the data suitable also for spatial analysis, we used Google Earth to find the coordinates based on the given location (generally an address or a location name). Moreover, all datasets were cross-checked to avoid redundancy. For example, the UNSODA data includes the data of Vereecken et al. (2017) and Richard and Lüscher (1983/87) while the SWIG database includes the measurements of Zhao et al. (2018). Hence we removed these from the UNSODA and SWIG databases and used the original sources.
In the case of legacy datasets (non-digital tabular format, non-peer-reviewed data), we invested a significant effort to digitize, clean and cross-check the data to extract Ksat values. Two datasets were also collected directly from project websites: the NASA project providing data on hydraulic and thermal conductivity (retrieved from https://daac.ornl.gov/FIFE/guides/Soil_Hydraulic_Conductivity_Data.html and described in Kanemasu (1994)) and the Florida database (http://soils.ifas.ufl.edu) from Grunwald (2020).
There are many biomes and climatic regions, such as desert dunes, peatlands and frozen soils, for which very few Ksat measurements were publicly available. We have intensively searched for additional data for these areas and found 41 studies (each containing fewer than 5 Ksat measurements) to cover these regions.
We thus digitized Ksat values from these studies (shown either in bar charts or line plots), georeferenced the maps where necessary, and then converted the data into tabular form. In some cases, we also contacted colleagues that worked in these regions to retrieve additional data.
Figure 1. Spatial distribution of Ksat measurements based on (red) laboratory and (blue) field measurements, respectively, in the SoilKsatDB.
A total of 1,910 locations are shown on the map.

Georeferencing Ksat values and definition of spatial accuracy
Georeferencing of Ksat measurements is important for using the data for local, regional, or global hydrological and land surface models. Although many studies provided information on the geographical location of the measurements, studies conducted particularly in the 1970s and 1980s only provided the name of the locations and approximate distance from a reference location. A limited accuracy of the position value may affect the application of the Ksat value in a spatially distributed model. For example, in case of a location with contrasting hydraulic properties, it must be known to which subregion the measured value can be assigned and the user must know if the given location is accurate enough. For that purpose we assigned each Ksat value to one of seven 'accuracy classes' ranging from highest (0-100 m) to lowest (more than 10,000 m or no available information (NA)) accuracy. For example, Forrest et al. (1985), Zhao et al. (2018), and Ottoni et al. (2018) provided exact coordinates of the locations, thus we assigned a location accuracy of 0-100 m (i.e., highly accurate; see Table 3 for more details). For other references, we digitized provided maps or sketches with locations of the points. We first georeferenced these maps using ESRI ArcGIS software (v10.3) and then digitized the coordinates from georeferenced images. Some of the documents we digitized (e.g. Nemes et al. (2001)) provided the names of specific locations, and hence we used Google Earth to obtain the coordinates. We estimate that the spatial location accuracy of these points is roughly between 0 and 5 km. Similarly, spatial maps in jpg format (e.g. Becker et al. (2018)) were geo-referenced with 100-500 m location accuracy. In contrast, a few studies (e.g. Yoon (2009)) provided the exact location of the sampling with assumed location accuracy of 10-20 m.
In the SWIG database, the information related to location (coordinates for each point) was missing, so we went through each publication referenced in Rahmati et al. (2018) and added coordinates.

Standardization
The database was cleaned to remove about 700 unrealistically low Ksat values as outliers (less than 10⁻¹⁴ m/day, deduced from infiltration time series in the SWIG database). Moreover, in the SWIG database, soil depth information was not available, so we assumed that infiltration experiments were conducted in the topsoil and assigned a depth of 0-20 cm. Furthermore, we computed sand (particles > 50 µm), silt (2-50 µm), and clay fraction (< 2 µm) for the UNSODA database based on the available particle-size data, assuming a log-normal distribution, as described in Nemes et al. (2001).
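The cleaning rules above can be sketched in a few lines. This is an illustrative Python outline only (the authors document their workflow in R); the record layout and the field names `ksat_m_day`, `ksat_cm_day`, `top_cm`, `bottom_cm`, and `source` are hypothetical and are not the column names of the published database.

```python
MIN_REALISTIC_KSAT_M_DAY = 1e-14   # outlier threshold quoted in the text
M_TO_CM = 100.0                    # m/day -> cm/day

def standardize(records):
    """Drop unrealistically low Ksat values, convert units, fill SWIG depths."""
    cleaned = []
    for rec in records:
        if rec["ksat_m_day"] < MIN_REALISTIC_KSAT_M_DAY:
            continue  # treated as an outlier and removed
        out = dict(rec)
        out["ksat_cm_day"] = rec["ksat_m_day"] * M_TO_CM
        # SWIG has no depth information: assume topsoil, 0-20 cm
        if out.get("source") == "SWIG" and out.get("top_cm") is None:
            out["top_cm"], out["bottom_cm"] = 0, 20
        cleaned.append(out)
    return cleaned

# Hypothetical input records
records = [
    {"id": 1, "source": "SWIG", "ksat_m_day": 1e-16, "top_cm": None, "bottom_cm": None},
    {"id": 2, "source": "SWIG", "ksat_m_day": 0.5, "top_cm": None, "bottom_cm": None},
    {"id": 3, "source": "UNSODA", "ksat_m_day": 1.2, "top_cm": 10, "bottom_cm": 30},
]
cleaned = standardize(records)
```

Keeping the original value alongside the converted one makes it easy to audit which records were altered by the depth assumption.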

Statistical modeling of Ksat
To show a possible application of the database, we computed various PTFs. The PTF models were fitted using a random forest (RF) machine learning algorithm (Breiman, 2001) in the R environment for statistical computing (R Core Team, 2013). We fitted the RF model for log-transformed (log10) Ksat values as a function of primary soil properties. In this application, PTFs for Ksat were built based on bulk density and sand and clay content. The observed correlation between these primary soil properties and Ksat motivates their use as key variables for the estimation of PTFs. Organic carbon (OC) was not used to build the PTFs because (i) this information was missing for 15% of measurements and (ii) the correlation between OC and Ksat was poor (Pearson's correlation coefficient of 0.005). We derived two PTFs for Ksat:
1. PTF for temperate regions: the map of Ksat locations was overlaid on the Köppen-Geiger climate zone map (Rubel and Kottek, 2010; Hamel et al., 2017) and the measurements were then divided by climatic region (temperate, tropical, boreal, and arid) to account for differences in climate and related weathering processes (Hodnett and Tomasella, 2002). With such validation, we intend to discuss the transferability of PTFs across different regions.
2. PTF for laboratory measurements: we apply the PTF deduced from laboratory data for prediction of Ksat measured in the field. We expect differences because field measurements scan larger soil volumes that may contain soil structural pores.
Table 2b. Example of Ksat database structure with key variables (from left to right: unique ID, reference, longitude and latitude (decimal degree), minimum and maximum accuracy (m), top and bottom of soil sample (cm), horizon designation, bulk density (g cm⁻³), moisture content at field capacity and wilting point (%), soil textural class, clay, silt and sand content (%), soil organic carbon content (%), soil acidity, saturated hydraulic conductivity measured in lab or field (cm day⁻¹), source of the data, location id and mean soil depth). NA is 'no value'. Column names are explained in Table 2a.
Table 5. Mean values of soil hydro-physical properties for each soil textural class. The number of samples (N) is given in parentheses under each soil variable for each soil texture class. N values marked with * correspond to undefined soil texture class. BD = bulk density (g/cm³), OC = soil organic carbon content (%), FC = moisture content at field capacity (% vol), WP = moisture content at wilting point (% vol), Ksat_l and Ksat_f are laboratory and field Ksat (cm/day), respectively. For Ksat the geometric mean is reported (due to the sensitivity of the arithmetic mean to a few extreme values). For all other properties the arithmetic mean is provided.
The relative importance of the covariates for modeling Ksat was assessed by the node impurity, which, for RF regression problems, is computed as the decrease of residual sum of squares (RSS) when a particular covariate splits the data at the nodes of a tree (Hastie et al., 2009, sections 10.13.1, 15.3.2). The variable that provides maximum decline in RSS (and consequently increase in node purity) is considered as the most important variable; the variable with the second largest RSS decrease is considered the second most important variable, and so on. Furthermore, the accuracy of the predictions was evaluated using bias, root mean square error (RMSE, in log-transformed Ksat measurement), and concordance correlation coefficient (CCC) (Lawrence and Lin, 1989).
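The RSS-based notion of node impurity can be illustrated for a single split. The sketch below is a simplified, single-node illustration in Python, not the random forest implementation used by the authors; the function names and the toy sand/Ksat values are hypothetical.

```python
def rss(values):
    """Residual sum of squares around the mean of a node."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def rss_decrease(x, y, threshold):
    """Decrease in RSS when covariate x splits response y at `threshold`."""
    left = [yi for xi, yi in zip(x, y) if xi <= threshold]
    right = [yi for xi, yi in zip(x, y) if xi > threshold]
    return rss(y) - (rss(left) + rss(right))

# Hypothetical node: sand content (%) and log10 Ksat (cm/day)
sand = [10, 20, 60, 80]
log_ksat = [0.5, 0.7, 2.0, 2.2]
gain_mid = rss_decrease(sand, log_ksat, 40)   # separates low- from high-Ksat samples
gain_low = rss_decrease(sand, log_ksat, 10)   # isolates a single sample
```

In a forest, such decreases are accumulated over all nodes where a covariate is used and averaged over trees, so a covariate that repeatedly yields large decreases ranks as more important.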

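Bias and RMSE are defined as:
\[ \mathrm{bias} = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right) \tag{1} \]
\[ \mathrm{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2 } \tag{2} \]
where y and ŷ are observed and predicted Ksat values, respectively, and n is the total number of cross-validation points.
The CCC is a measure of the agreement between observed and predicted Ksat values and is computed as
\[ \mathrm{CCC} = \frac{2 \rho \, \sigma_{\hat{y}} \sigma_{y}}{\sigma_{\hat{y}}^2 + \sigma_{y}^2 + \left( \mu_{\hat{y}} - \mu_{y} \right)^2} \tag{3} \]
where µŷ and µy are predicted and observed means, σŷ² and σy² are predicted and observed variances, and ρ is the Pearson correlation coefficient between predicted and observed values. CCC is equal to 1 for a perfect model.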
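The three evaluation metrics can be written compactly as follows. This is a minimal Python sketch rather than the authors' R code; it assumes bias is taken as predicted minus observed and uses population (1/n) variances and covariance in the CCC, following Lin's concordance coefficient. The function names are chosen here for illustration.

```python
import math

def bias(obs, pred):
    """Mean of (predicted - observed); sign convention is an assumption."""
    return sum(p - o for o, p in zip(obs, pred)) / len(obs)

def rmse(obs, pred):
    """Root mean square error."""
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(obs, pred)) / len(obs))

def ccc(obs, pred):
    """Concordance correlation coefficient (population moments)."""
    n = len(obs)
    mu_o, mu_p = sum(obs) / n, sum(pred) / n
    var_o = sum((o - mu_o) ** 2 for o in obs) / n
    var_p = sum((p - mu_p) ** 2 for p in pred) / n
    cov = sum((o - mu_o) * (p - mu_p) for o, p in zip(obs, pred)) / n
    # 2*cov equals 2*rho*sigma_o*sigma_p
    return 2 * cov / (var_o + var_p + (mu_o - mu_p) ** 2)

# Hypothetical log10 Ksat values: predictions shifted by a constant offset
observed = [1.0, 2.0, 3.0]
predicted = [2.0, 3.0, 4.0]
```

In this example the Pearson correlation is perfect, yet the CCC is well below 1 because of the constant offset, which is the reason CCC is preferred over correlation alone for judging agreement.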

Data coverage of SoilKsatDB
Based on the literature search and data compilation, we have assembled a total of 13,267 values of Ksat from 1,910 locations (each location has a unique location_id) across the globe. Moreover, the database contains a total of 13,295 Ksat values because a few studies have reported both field and lab measurements for the same location. Figure 1 shows the global distribution of the sites used in this study. Most data originate from North America, followed by Europe, Asia, South America, Africa, and Australia. With respect to climatic regions, 10,093 Ksat measurements were taken in temperate regions (8,296 contained texture and bulk density information and were used to build the PTF) and 1,443, 1,113, 582, and 36 in tropical, arid, boreal, and polar regions, respectively, as shown in Figure 2b. The points are often spatially clustered, with the biggest cluster of points (1,103 locations with 6,532 Ksat measurements) in Florida (Grunwald, 2020). The Ksat database includes 4,133 values from field measurements and 9,162 values from laboratory measurements. In particular, different types of infiltrometers (e.g., mini-infiltrometer, tension infiltrometer, double ring infiltrometer) and permeameters (e.g., Guelph permeameter, Aardvark permeameter) were used for the field measurements, whereas constant or falling head methods were mainly used in laboratory analyses (Table 4). Out of the 13,267 Ksat measurements, 11,591 had information on soil texture, 11,269 on bulk density, 9,787 on organic carbon, 7,389 on field capacity, and 7,418 on wilting point, while for 8,994 measurements information for all soil basic properties (bulk density, soil texture, and organic carbon) was available (Figure 2a).
The methods used to compute these soil properties (as much as we could extract from the literature and existing databases) are listed in the CSV file sol_ksat.pnts_metadata.csv available at https://doi.org/10.5281/zenodo.3752721. Note that in addition to 11,591 soil texture values, 75 measurements have soil texture information with a total (sand+silt+clay) less than 98% or greater than 102%. We did not use these values in the PTF development but included them in the database as "Error" class in the soil texture column.
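The texture consistency rule described above is simple to express in code. The following Python sketch is illustrative only; the function name `texture_label` and its arguments are hypothetical, while the 98% and 102% bounds and the "Error" label come from the text.

```python
def texture_label(sand, silt, clay, texture_class, low=98.0, high=102.0):
    """Return the texture class, or "Error" if the fractions do not sum to ~100%."""
    total = sand + silt + clay
    if total < low or total > high:
        return "Error"   # kept in the database but excluded from PTF fitting
    return texture_class

# Hypothetical samples (percent sand, silt, clay)
ok_label = texture_label(40.0, 40.0, 20.0, "loam")          # sums to 100
bad_label = texture_label(40.0, 40.0, 25.0, "loam")         # sums to 105
edge_label = texture_label(38.0, 30.0, 30.0, "clay loam")   # sums to exactly 98
```

Flagging rather than deleting such records preserves the associated Ksat values for users who do not need texture.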

Statistical characteristics of SoilKsatDB
The distribution of measurements based on soil texture classes is shown on the USDA soil texture triangle in Figure 3a. The database covers all textural classes, with a high clustering in sandy soils due to the numerous samples from Florida (Grunwald, 2020), while only a few measurements belong to the silt textural class. The increase in Ksat values in clayey and loamy soils for field methods (compared to laboratory methods) is likely due to the effect of soil structure. ANOVA with post-hoc Tukey's HSD test showed that the mean values for all broad soil texture classes are significantly different from each other, except for field Ksat values of clayey soils and field Ksat values of sandy soils (see Table A1). Figure 3c shows a violin distribution plot; likewise, Figure 3d shows the violin distribution of Ksat based on soil texture classes. The arithmetic mean of Ksat was highest for the sand and loamy sand soils (i.e., 2.68 and 1.99, respectively, in log10 cm/day), while the lowest mean values were found for silt and silty loam (i.e., 1.12 and 1.15, respectively, in log10 cm/day). Table A2 shows that the Ksat values in sand and loamy sand soil texture classes are significantly different from all other soil texture classes. However, silt, silty clay, and silty clay loam classes are not significantly different from clay, sandy clay, and sandy clay loam Ksat values. Average values of Ksat and other hydro-physical properties are shown in Table 5. Higher average organic carbon and bulk density values were observed in clayey and loamy soils compared to sandy soils. Ksat values obtained from field measurements were on average higher than those obtained from laboratory measurements. In particular, for the clay texture class much lower Ksat values were observed for laboratory (mean Ksat ≈ 8 cm/day) compared to field (mean Ksat ≈ 110 cm/day) measurements (Table 5).
Figure 3b further illustrates the higher range of Ksat values obtained for finer texture soils (clay and loam) compared to coarser soils (sand).
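Because Ksat spans several orders of magnitude, averaging in log10 space (equivalent to a geometric mean) is much less sensitive to a few extreme values than the arithmetic mean. The short Python sketch below illustrates the difference; the function name and the sample values are hypothetical.

```python
import math

def geometric_mean(values):
    """Geometric mean computed via the arithmetic mean of log10 values."""
    logs = [math.log10(v) for v in values]
    return 10 ** (sum(logs) / len(logs))

# Hypothetical Ksat values (cm/day) with one extreme observation
ksat = [5.0, 8.0, 12.0, 5000.0]
geo = geometric_mean(ksat)
arith = sum(ksat) / len(ksat)
```

Read this way, a mean of 2.68 in log10 cm/day for sand corresponds to a geometric mean of about 479 cm/day.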

Ksat PTFs derivation
As a test application of SoilKsatDB, two PTFs were derived for Ksat (i.e., for measurements taken in temperate regions and based on laboratory measurements) using basic soil properties as covariates. General trends between Ksat and soil properties are shown in partial correlation plots in Figure 4. The figure indicates that Ksat decreases with clay content and bulk density, and increases with sand content. Figure A1 shows the relative importance of the covariates used to build PTFs for the measurements from temperate regions and for laboratory-based measurements. Clay content was found to be the most important variable followed by sand and bulk density for the temperate climate PTF. On the other hand, sand content was the most important variable followed by clay and bulk density for the laboratory-based Ksat PTF. CCC, bias, and RMSE were respectively equal to 0.70, 0.002, and 0.69 for the temperate-based PTF, and 0.73, -0.0004, and 0.65 for the laboratory-based PTF.
As we will discuss in more detail in the next section, PTF models derived for temperate and laboratory-based Ksat values underestimated Ksat for tropical- and field-based Ksat values, respectively (see Figure 6b).
The Ksat values were, on average, higher for the field measurements compared to laboratory measurements for most soil texture classes (Table 5 and Figures 3b and 5). The difference between laboratory- and field-based Ksat values and the larger range of Ksat values for fine-textured soils are probably related to the effect of biologically induced soil structure that might be neglected in laboratory measurements. The omission of soil structure in many laboratory samples limits the possibility to properly reproduce field observations that are likely to be more affected by the presence of biopores (Fatichi et al., 2020) and other soil structural characteristics, such as cracks. In other words, variability in the Ksat values depends on the consideration (and existence) of soil structure by the measurement methods. Soil structural pores change the pore size distribution and subsequently affect Ksat values (Tuller and Or, 2002). Such an effect is more likely to be neglected in laboratory measurements than in field studies due to the small size of most laboratory samples. Presence or absence of large structural pores depends on the scale of measurements (which is usually larger in the field). Mohanty et al. (1994), for example, compared three field methods and one laboratory method and found that the sample size affects the measurement of Ksat due to the presence and absence of open-ended pores. Similarly, Ghanbarian et al. (2017) showed that the sample dimensions (e.g., internal diameter and height) also impact Ksat. The authors further developed a sample dimension-dependent PTF, which performed better than other PTFs available in the literature. Likewise, Braud et al. (2017) used three field methods for Ksat measurements and found significant variation between these measurement methods. Davis et al. (1996) also highlighted the necessity to choose the most appropriate scale of measurement for a particular soil sample when undertaking conductivity measurements. They tested small cores (73 mm wide and 63 mm high) and large cores (223 mm wide and 300 mm high) using the constant head method in the laboratory and found a difference of 1 to 3 orders of magnitude.

Temperate vs tropical soils: effect of soil formation processes
PTFs obtained for temperate soils performed poorly for tropical soils (Figure 6), with Ksat being underestimated by the temperate-based PTFs. This result is in agreement with Tomasella et al. (2000) who derived PTFs using data from tropical Brazilian soils, which did not properly capture observations in temperate soils. We argue that the significant differences for tropical and temperate soils are due to the differences in the soil-forming processes that also define the clay type and mineralogy.
In fact, Oxisols (highly weathered clay soils as a result of high rainfall and temperature in tropical regions) are characterised by inactive (non-swelling) clay minerals. In contrast to tropical soils, active (smectite) and moderately active clay minerals (illite) are the dominant clay minerals in temperate regions. These swelling clay minerals retain water within internal structures with very low hydraulic conductivity. Therefore, such a difference in clay mineralogy is likely responsible for the underestimation of Ksat in tropical soils from PTFs based on measurements in temperate areas. In addition, soil structure formation processes may be different in tropical and temperate regions (perennial activities of vegetation in the tropics) which would also lead to differences between measured Ksat values for the two climatic regions.

Limitations of SoilKsatDB
We made an effort to combine laboratory and field data from across the globe. However, we acknowledge that there are still gaps in some regions, such as Russia and higher northern latitudes in general, which may result in uncertainties in Ksat estimates in such regions. The SoilKsatDB could also be of limited use for fine-resolution applications because many data points were characterized by limited spatial accuracy and missing soil depth information. Specifically, the spatial accuracy of many points is between tens of meters and several kilometers (see the methodology sections regarding the extraction of the spatial locations using Google Earth). Many of the records in the SoilKsatDB come from legacy scientific reports and the original authors could not be traced and contacted, hence we advise using these data with caution. In addition, in the SWIG database, the soil depth and measurement method information were not provided, and often one location was used to represent an entire watershed. We tried to revisit each publication and extract the most accurate coordinates of assumed sampling locations. In addition, we assumed that most of the samples were obtained from field measurements as authors used different infiltrometers to compute Ksat, so there might be a few points in our SoilKsatDB that belong to laboratory measurements and that we have incorrectly assigned to field measurements. Moreover, the field measurements in the database are a mix of many Ksat measurement methods.
For each measurement, a location accuracy (0-100 m = highly accurate, >10,000 m = least accurate) was assigned based on the sampling location accuracy. The location accuracy can be used as a weight or probability argument in Machine Learning for Ksat mapping. We are aware that this was a rather subjective decision; a more objective way to assign weights would be to use the actual spatial positioning errors. Because these were not available for most of the datasets, we have opted for the definition of a location accuracy estimated from the available documentation.

Further developments
The advancement in remote sensing technology opens the doors to link the hydraulic properties with global environmental data. Satellite-based maps of environmental characteristics such as local information on vegetation, climate, and topography for specific areas, which are often ignored by basic PTFs, can be incorporated. For example, Sharma et al. (2006) developed PTFs using environmental variables such as topography and vegetation and concluded that these attributes, at fine spatial scales, were useful to capture the observed variations within the soil mapping units. Likewise, Szabó et al. (2019) used a random forest machine learning algorithm for mapping soil hydraulic properties and incorporated local environmental information such as vegetation, climate, and topography.

Data availability
All collected data and related soil characteristics are provided online for reference and are available at https://doi.org/10.5281/zenodo.3752721 (Gupta et al., 2020).

6 Summary and conclusions
We compiled a comprehensive global dataset of Ksat measurements (N = 13,267) by importing, quality controlling, and standardizing tabular data from existing soil profile databases and legacy reports, as well as scientific literature. The SoilKsatDB covers a broad range of soil types and climatic regions and hence is useful in global models. A larger variation in Ksat values was observed for fine-textured soils compared to coarse-textured soils, indicating the effect of soil structure on Ksat. Moreover, Ksat values obtained from field measurements were generally higher than those from laboratory measurements, likely due to the impact of soil structural pores in field measurements.
The new database was used to develop PTFs using RF algorithms for Ksat values obtained for temperate climates and for laboratory measurements. PTFs developed for a certain climatic region (temperate) or measurement method (laboratory) could not be satisfactorily applied to estimate Ksat for other regions (tropical) or measurement methods (field) due to the role of different soil forming processes (inactive clay minerals in tropical soils and impact of biopores in field measurements).
There are still some gaps in the geographical representation of the data, especially in Russia and the higher northern latitudes, that could induce uncertainty in global modeling. Therefore, the data set can be further improved by covering the missing areas thus allowing better accuracy in modeling applications.
The SoilKsatDB was developed in R software and is available via https://doi.org/10.5281/zenodo.3752721. We have made code and data publicly available to enable further developments and improvements.
Appendix A
Table A1. Listed p-values under 95 percent confidence interval for each class to show the significant difference in Ksat between texture classes (see Figure 3b). The significant difference was computed using ANOVA (Analysis of Variance) with post-hoc Tukey's HSD (Honestly Significant Difference) test. The values highlighted in yellow show the significant difference in Ksat between two soil texture classes.
Figure A1. Importance of the variables for developing the PTFs for Ksat using the random forest algorithm. The x-axis displays the average increase in node purity (the larger the value, the more important the covariate). (a) Clay content was the most important variable followed by sand and bulk density for the Random Forest model built on data from temperate regions. (b) Sand content was the most important variable followed by clay and bulk density for the Random Forest model based on laboratory measurements.