SoilErosionDB: A global database for surface runoff and soil erosion evaluation

Soil erosion is a major threat to soil resources, continuing to cause environmental degradation and social poverty in many parts of the world. Many field and laboratory experiments have been performed over the past century to study spatio-temporal patterns of soil erosion caused by surface runoff under different environmental conditions. However, these historical data have never been integrated in a way that can inform current and future efforts to understand and model soil erosion at different scales. Here, we designed a database (SoilErosionDB) to compile field and laboratory measurements of soil erosion caused by surface runoff, with data coming from sites across the globe. SoilErosionDB includes 18 columns for soil erosion-related indicators and 73 columns for background information that describe factors such as latitude, longitude, climate, elevation, and soil type. Currently, measurements from 99 geographic sites and 22 countries around the world have been compiled into SoilErosionDB. We provide examples of linking SoilErosionDB with an external climate dataset and using annual precipitation to explain annual soil erosion variability under different environmental conditions. All data and code to reproduce the results in this study can be found at: Jian, J., Du, X., Stewart, R., Tan, Z. and Bond-Lamberty, B.: jinshijian/SoilErosionDB: First release of SoilErosionDB, Zenodo, doi:10.5281/zenodo.4030875, 2020b. All data are also available through GitHub: https://github.com/jinshijian/SoilErosionDB.


Soil is an essential natural resource for sustainable human development that is continually threatened by erosion and related land degradation processes (Borrelli et al., 2017; Poesen, 2017). Soil erosion is a geomorphic process that occurs when soil particles, soil aggregates, organic matter, and rock fragments detach from their original positions and are transported to other locations (Morgan, 1988; Toy et al., 2002). Erosion is a naturally occurring process affected by both abiotic (e.g., rainfall, runoff, wind, and snow avalanches) and biotic drivers (e.g., animal trampling and tree fall), and may have significant consequences for carbon cycling (Berhe et al., 2007; Ito, 2007; Lal, 2003; Tan et al., 2020; Yue et al., 2016).
Erosion rates have been substantially increased by human activities such as cropland cultivation, mining, building construction, and deforestation (Borrelli et al., 2017; García-Ruiz et al., 2015; Poesen, 2017; Vanwalleghem et al., 2017). For instance, large-scale farmland development in forested and grass-covered prairies of the United States of America led to the Dust Bowl of the 1930s, resulting in widespread ecosystem damage and economic losses (Schubert et al., 2004). In China, population and demographic shifts have increased demand for food and other resources, leading to increases in cultivated croplands and concomitant soil erosion problems (Guo et al., 2015; Xu et al., 2010; Zhang et al., 2012; Zhao et al., 2013).
Many field and laboratory experiments have characterized soil erosion under different environmental conditions. Some have used soil erosion runoff plots, with nations and regions such as the USA (>10,000 plots), Europe (>8,000 plots), China (3,525 plots) and Brazil (>2,500 plots) all having widespread installations (Poesen, 2017). These experiments have led to more than 31,000 different peer-reviewed studies (searched at https://www.sciencedirect.com using "Erosion and runoff" as keyword). However, there has not yet been a successful effort to compile the data from those studies into a single, coherent dataset.
To bridge this research gap, we developed a global soil erosion database (SoilErosionDB) to standardize and compile historical soil erosion-related measurements. The database can be used to support evaluation and parameterization of global soil erosion models, statistical modeling, non-point pollution evaluation, and cropland management recommendations. It can also be used to perform synthesis analyses, such as meta-analyses, and may inspire future efforts to better understand spatial and temporal patterns of soil erosion.

We designed SoilErosionDB following the FAIR principles, i.e., Findable, Accessible, Interoperable, and Reusable (Wilkinson et al., 2016). All data, quality assurance/quality control (QA/QC) code, and analysis code are immediately available through a GitHub repository (https://github.com/jinshijian/SoilErosionDB), and each release will be issued a DOI through Zenodo to ensure reusability. Versions follow an "x.y.z" format, where x is the major version number, y is the minor version number, and z is the patch number. We update the major version number only if the database changes its structure; we expect this to happen at an approximately decadal time step. We update the minor version number whenever the database has a significant data update; this usually happens at annual time steps. The patch number will be updated relatively often, whenever the database has an important documentation update or data correction. We also made efforts to ensure interoperability so that SoilErosionDB can easily link to external datasets. For an example of linking SoilErosionDB to an external climate dataset, see the "Linkages to external data sources" section below.
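The "x.y.z" versioning rule described above can be sketched in a few lines of Python. This is only an illustration: the function names and the change labels ("structure", "data", "patch") are hypothetical, and resetting the lower-order numbers to zero on a bump is an assumption borrowed from common versioning practice, not something stated for SoilErosionDB.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Split an 'x.y.z' string into (major, minor, patch) integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def bump(version: str, change: str) -> str:
    """Return the next version string for a given kind of change.

    The change labels are hypothetical names for the three cases
    described in the text.
    """
    major, minor, patch = parse_version(version)
    if change == "structure":   # database structure changed -> major
        return f"{major + 1}.0.0"          # assumption: reset minor and patch
    if change == "data":        # significant data update -> minor
        return f"{major}.{minor + 1}.0"    # assumption: reset patch
    if change == "patch":       # documentation update or data correction
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change}")
```

For example, `bump("1.2.3", "data")` yields the version string for the next significant data update under these assumptions.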

Publication collection
Publications were collected through an online literature search using "runoff, erosion" as keyword in ScienceDirect (https://www.sciencedirect.com/). We placed no restrictions on literature type, i.e., both peer-reviewed articles and non-peer-reviewed documents such as theses, dissertations, and conference collections were included. We initiated our search on January 10, 2020 and found 31,235 papers, with the number of published papers increasing by year (Figure 1a). The following criteria were used to determine whether an article should be included in SoilErosionDB: (1) measurements were made in the field, in the laboratory with a rainfall simulation experiment, or with indirect methods (Table 1); (2) soil erosion was reported in units that could be converted to t ha⁻¹ year⁻¹ or g m⁻² hour⁻¹; and (3) articles were published in English or Chinese language journals after 1960. We applied no other filtering criteria or restrictions to the literature. Note that we did not include a unit constraint for leaching data because a variety of leaching types (e.g., soil organic carbon loss, organic matter loss, and total nitrogen loss) were reported in papers, with those measurements reported in different units.

Database structure design
SoilErosionDB (i.e., the "SoilErosionDB.xlsx" file in the GitHub repository) has 12 data sheets. The core part is the "SoilErosionDB" data sheet, with 18 columns for soil erosion, surface runoff, and nutrient leaching records (Table 2) and 73 columns for background information (Table 3). The "DataBase_fields" sheet describes all 91 columns in the "SoilErosionDB" data sheet. The "UnitsConverter" sheet contains a units converter to standardize all surface runoff and soil erosion measurements into the same units (i.e., soil erosion in t/ha/yr or g/m²/hr; runoff in mm/yr or mm/hr). The "CountryCode" sheet holds the international country codes used in Site_ID. The "Slope" sheet provides a converter for transforming slope from % to °. The "Quality_flag" sheet describes the quality control flags of measurements collected from papers (see Table 4 for details). The "Meas_method" sheet describes soil erosion measurement methods reported in papers (see Table 1 for details). The "Biome" sheet describes different biome types. The "IGBP" sheet describes all 20 International Geosphere-Biosphere Programme (IGBP) (Townshend, 1992) vegetation types reported in papers. The "Manipulation" sheet includes descriptions of and comments about 17 manipulation types used in SoilErosionDB (see Table 5 for details). The "ReferenceList" sheet holds reference details for all papers we compiled into SoilErosionDB, and the "LiteratureSearch" sheet describes literature search details, such that users can reproduce the literature search results based on the description.
We read through each publication and compiled measurements and background information into SoilErosionDB. Currently we have collected and processed data from 124 papers that included measurements taken between 1980 and 2017 (Figure 1b).
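The kind of standardization performed by the "UnitsConverter" and "Slope" sheets can be illustrated with a minimal Python sketch. The contents of those sheets are not reproduced here, so the set of source units, the dictionary `TO_T_HA_YR`, and both function names are assumptions made for illustration; the conversion factors themselves follow from 1 ha = 10,000 m² and 1 km² = 100 ha, and the slope conversion uses the arctangent of rise over run.

```python
import math

# Multiplicative factors to convert annual soil erosion to t/ha/yr.
# The list of source units is illustrative, not the sheet's actual content.
TO_T_HA_YR = {
    "t/ha/yr": 1.0,
    "kg/ha/yr": 1e-3,   # 1 t = 1000 kg
    "g/m2/yr": 1e-2,    # 1 g/m2 = 10 kg/ha = 0.01 t/ha
    "kg/m2/yr": 10.0,   # 1 kg/m2 = 10 t/ha
    "t/km2/yr": 1e-2,   # 1 km2 = 100 ha
}


def erosion_to_t_ha_yr(value: float, unit: str) -> float:
    """Convert an annual soil erosion value to t/ha/yr."""
    return value * TO_T_HA_YR[unit]


def slope_percent_to_degrees(slope_percent: float) -> float:
    """Convert slope from percent (rise/run * 100) to degrees."""
    return math.degrees(math.atan(slope_percent / 100.0))
```

A slope of 100% corresponds to 45°, which is a convenient sanity check for the second function.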
Each column in SoilErosionDB corresponds to a background information, surface runoff, soil erosion, or nutrient leaching indicator. When a site's location (latitude and longitude) was not reported, we estimated site coordinates according to the site name or the maps provided in the paper. For the soil erosion-related indicators, i.e., surface runoff, soil erosion, and nutrient leaching, data were either read directly from tables or digitized from figures. We used Data Thief (version III) (Flower et al., 2016) whenever we had to obtain values from figures. Replications and standard deviation (SD) information were usually obtained directly from the original papers. When a confidence interval (CI), coefficient of variation (CV), or standard error (SE) was reported rather than SD, we calculated SD using equations 1-3 of Jian et al. (2020a).
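As an illustration of how SD can be recovered from other dispersion statistics, the sketch below uses the standard textbook relations between SD and SE, CV, and a normal-approximation confidence interval. These are assumptions: they may not match equations 1-3 of Jian et al. (2020a) exactly, and the function names and the default z-value of 1.96 (a 95% interval) are hypothetical choices.

```python
import math


def sd_from_se(se: float, n: int) -> float:
    """SD from standard error: SE = SD / sqrt(n)."""
    return se * math.sqrt(n)


def sd_from_cv(cv_percent: float, mean: float) -> float:
    """SD from coefficient of variation given in percent: CV = SD / mean * 100."""
    return abs(mean) * cv_percent / 100.0


def sd_from_ci(lower: float, upper: float, n: int, z: float = 1.96) -> float:
    """SD from a symmetric confidence interval, assuming normality.

    The interval half-width is z * SD / sqrt(n); z = 1.96 assumes a 95% CI.
    """
    return math.sqrt(n) * (upper - lower) / (2.0 * z)
```

For small numbers of replicates, a t-distribution quantile would be a more careful choice than the fixed z-value used here.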

Surface runoff, soil erosion, and nutrient leaching measurements
The fields, units, and explanations of the surface runoff, soil erosion, and nutrient leaching measurements are presented in Table 2. SoilErosionDB has been designed to hold surface runoff and soil erosion measurements in terms of both annual amounts and instantaneous rates. However, nutrient leaching was organized in a different way: the "Leaching" column holds values, the "Leaching_unit" column holds the unit of measurement reported in the original paper, and the "Leaching_type" column records the leaching type reported in the original paper.
The background information (Table 3) includes descriptive data about sites and experimental design. Soil erosion measurement methods, quality control flags, and manipulations are further described in Tables 1, 4, and 5. Specifically, Table 1 describes 16 soil erosion measurement methods reported in the literature; Table 4 describes 10 quality control flags that help the developer record necessary information for quality control; and Table 5 describes manipulation information, which is useful for further analysis of how treatments affect surface runoff and soil erosion.

Technical validation
We carefully checked the data against the original papers to ensure fidelity. We used the Mendeley bibliography management software (https://www.mendeley.com) to ensure papers were not compiled into the database multiple times by different contributors. Each paper was first carefully read by the data collector, and any useful records were compiled into SoilErosionDB. Then a data quality checker compared the data in the database against the original paper. In particular, we paid attention to the methods sections, figures, and tables, where most of the surface runoff, soil erosion, nutrient leaching, and background information was located.

In addition, we developed an R Markdown file (ErosionDB_validation.Rmd in the GitHub repository) to examine the data quality of SoilErosionDB. The file was created using R version 3.6.1 (R Core Team, 2020). For the latitude and longitude inputs, we plotted sites by individual country (currently a total of 22 countries, Figure 2), then compared the sites with that country's boundaries to ensure that no sites fell outside. For any sites that appeared to be mislocated, we went back to the original paper and corrected the coordinates in the database. For all numeric columns in SoilErosionDB (except "Unique_ID" and "Study_number"), we plotted a histogram and checked whether extreme values were included in the database. Figure 3 shows an example using the histograms of annual surface runoff and annual soil erosion.
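The two checks described above were done visually in the authors' R Markdown file. As a rough programmatic analogue, the Python sketch below swaps in two simpler techniques: a rectangular bounding-box test in place of a comparison against true country boundaries, and the 1.5 × IQR rule in place of inspecting histograms. Both substitutions, the function names, and any bounding-box values a user supplies are assumptions made for illustration, not the authors' method.

```python
import statistics
from typing import List, Sequence, Tuple


def outside_bbox(lat: float, lon: float,
                 bbox: Tuple[float, float, float, float]) -> bool:
    """Return True if a site falls outside (lat_min, lat_max, lon_min, lon_max).

    A bounding box is only a coarse stand-in for a country boundary.
    """
    lat_min, lat_max, lon_min, lon_max = bbox
    return not (lat_min <= lat <= lat_max and lon_min <= lon <= lon_max)


def flag_extremes(values: Sequence[float], k: float = 1.5) -> List[float]:
    """Return values lying more than k * IQR beyond the quartiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]
```

Values flagged this way are candidates for re-checking against the original paper, not automatic errors, since erosion data are often strongly right-skewed.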

Linkages to external data sources
Climate variables (e.g., temperature and precipitation) are important factors affecting surface runoff and soil erosion; however, many papers did not report that information. Therefore, we linked SoilErosionDB with a 0.5° × 0.5° resolution global climate data product (Willmott and C. J., 2000) to obtain annual temperature, mean annual temperature (MAT), annual precipitation, and mean annual precipitation (MAP) based on site latitude and longitude. The MAT and MAP were calculated based on records between 1961 and 2015.
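Linking a site to a gridded product by latitude and longitude amounts to finding the grid cell that contains the site. The Python sketch below shows one way to do this, assuming the grid is keyed by cell-center coordinates (e.g., 37.25, -80.25 for a 0.5° grid); that layout, the function names, and the dictionary structure of `grid` are assumptions for illustration and are not taken from the authors' code or the climate product's documentation.

```python
import math
from typing import Dict, Optional, Tuple


def grid_cell_center(lat: float, lon: float,
                     resolution: float = 0.5) -> Tuple[float, float]:
    """Return the center of the grid cell that contains (lat, lon).

    Assumes cell edges fall on multiples of `resolution`. A site exactly at
    the northern or eastern domain edge would need special handling.
    """
    def snap(x: float) -> float:
        return math.floor(x / resolution) * resolution + resolution / 2.0

    return snap(lat), snap(lon)


def lookup_climate(lat: float, lon: float,
                   grid: Dict[Tuple[float, float], Dict[str, float]]
                   ) -> Optional[Dict[str, float]]:
    """Look up climate values for a site; returns None if the cell is absent."""
    return grid.get(grid_cell_center(lat, lon))
```

With a grid stored as a dictionary mapping cell centers to values such as MAT and MAP, `lookup_climate(37.2, -80.4, grid)` would return the entry stored under the cell centered at (37.25, -80.25).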
The results showed that temperature and precipitation data from the global climate dataset are highly correlated with those reported in the literature (Figure 4). Furthermore, we analyzed whether annual precipitation obtained from the external climate dataset can be used to explain annual soil erosion variability in SoilErosionDB. We found that annual precipitation from the global climate dataset can explain ~7% of variability in annual soil erosion (Figure 5, R² = 0.07, p = 0.01). We presume that linking SoilErosionDB with other external data sources, e.g., leaf area index, vegetation type, climate type, and soil properties, can lead to increased explanatory power for spatial and temporal variability of soil erosion.
(Figure 1 to Figure 5) and described the analysis for this study. All data processing and data visualization were conducted using R (version 3.6.1).
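To make concrete what an R² value such as the 0.07 reported for Figure 5 measures, the following Python sketch computes the coefficient of determination for a simple linear regression of one variable on another. It is a generic illustration written in Python rather than the R used for the analysis; the function name is hypothetical, and the sketch relies on the fact that, for a one-predictor least-squares fit, R² equals the squared Pearson correlation.

```python
from typing import Sequence


def r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """R² of the least-squares line y = a + b*x (squared Pearson correlation)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)   # variation in the predictor
    syy = sum((yi - mean_y) ** 2 for yi in y)   # variation in the response
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)
```

An R² of 0.07 therefore means that a straight-line relationship with annual precipitation accounts for about 7% of the variance in annual soil erosion, leaving most of the variability to other factors.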

Usage notes
We suggest that users download the data and code directly from Zenodo (http://doi.org/10.5281/zenodo.4030875), because Zenodo provides a DOI and a fixed release that generates the same results for all users. Using the data and code from Zenodo also avoids run errors caused by new measurements added during database updates. The data and code on GitHub, by contrast, are for development purposes: as new records are added to the database, output results may differ from those generated using older versions, and the code may even fail to run. Users are encouraged to contact the SoilErosionDB development team before using the data from GitHub for analysis. Contributing as a data quality checker is a great first step toward understanding the data; with the provided R code, users can also explore the database, as the code explains the analysis and the data in detail.

Future directions and contribution notes
We have decided to share this work at this initial database development stage for two reasons: 1) we want to receive feedback from the community about how to improve the data structure to ensure optimal usage; 2) the large number of potentially relevant papers that have been or will be published makes it important to expand the development team. Thus, we welcome and invite scientists and data users who are interested in developing SoilErosionDB to download the dataset and consider contributing published or unpublished data. Our long-term goal is to update SoilErosionDB by including measurements from newly published papers every year.

Author contributions
Xuan Du and Jinshi Jian conceived the design of the data framework and compiled the data from papers into SoilErosionDB. Xuan Du and Jinshi Jian wrote the manuscript, and all authors revised and approved the manuscript.

Competing interests
The authors declare no conflicts of interest.