Harmonized Soil Database of Ecuador (HESD): data from 2009 to 2015

One of the largest challenges with soil information around the world is how to harmonize archived soil data from different sources and how to make it accessible to soil scientists. In Ecuador, there have been two major projects that have provided soil information


Introduction
There is an increasing need for updated soil datasets globally. These datasets are required to develop soil monitoring baselines, soil protection and sustainable land use strategies, and a better understanding of the soil response to global environmental change. Soil datasets are one of the most critical inputs for Earth system models (ESMs) to address different processes, such as the terrestrial carbon sinks and sources of greenhouse gases (Luo et al., 2016; Pfeiffer et al., 2020). Furthermore, access to spatially explicit, consistent, and reliable soil data is essential for digital soil mapping and for evaluating the status of soil resources with increased resolution to respond to and assess global issues (FAO, 2015; FAO and ITPS, 2015; Pfeiffer et al., 2020). Unfortunately, one of the biggest challenges for digital soil mapping is the limited available information (e.g., soil profile descriptions, soil sample analysis, hard soil data) representing soil variability across the world.
Over the last few years, there has been a growing focus on improving the quality and quantity of soil data as well as access to soil data and information (Díaz-Guadarrama et al., 2022; Smith et al., 2022; Pfeiffer et al., 2020; Orgiazzi et al., 2018; Hengl et al., 2017). In particular, these efforts have endeavored to increase access to harmonized products containing comparable and consistent datasets. Global initiatives such as the World Soil Information Service (WoSIS; Batjes et al., 2020) or SoilGrids250m (Hengl et al., 2017), for global pedometric mapping, have provided increasing soil information to multiple users. Arrouays et al. (2017) affirm that over 800 000 soil profiles have been collected into databases during the past decades, but only a small number of these (117 000) are accessible or shared with the international community. According to Batjes et al. (2020), large numbers of soil profiles stored in many country-specific databases are not yet standardized and harmonized according to a global standard and are not shared; therefore, they are not available for use at a national level, let alone at a global level.
As acquiring new soil data is laborious and expensive, legacy soil databases and historically collected soil information are extremely valuable (Gray et al., 2015; Arrouays et al., 2017). This information is useful to test how soils change over time, but it usually comes from various projects that used different procedures, laboratory methods, standards, scales, taxonomic classification systems, and georeferencing systems. Therefore, data must be retrieved, compiled, and processed into standard, consistent, and harmonized datasets, which is a challenging process (Arrouays et al., 2018).
It is necessary to have consistent and spatially explicit information on different soil properties and attributes, such as soil organic carbon (SOC) content, yet there is a severe deficit in coherent information at regional, national, and global levels (Arrouays et al., 2017). Rossiter (2016) highlights important barriers limiting the interoperability of soil databases with global soil modeling assessments, such as the scarce availability of soil datasets and the lack of harmonization efforts to bring multiple soil data structures into usable formats for diverse applications (e.g., digital soil mapping). Interoperability is defined as the collective effort of sharing information that can be used to produce and apply newly gained knowledge, and it is achieved by removing conceptual, technological, organizational, and cultural barriers (Vargas et al., 2017), which are common in soil-science-related communities. Efforts to increase interoperability in soil science must come from various individuals and institutions, including government ministries/agencies, the scientific community, landowners, civil society groups, and business owners.
It is vital to model the status of soil resources globally at an increasingly detailed resolution in order to evaluate and better respond to global and local issues, such as soil salinization, land degradation, and desertification (Pfeiffer et al., 2020; FAO, 2015; Hengl et al., 2014). Harmonizing soil databases will improve the estimation of current and future potential land productivity, help identify practical limitations for land management, and identify land degradation risks, particularly soil erosion (Nur Syabeera et al., 2020). It will also contribute scientific knowledge that can aid with planning a sustainable transformation of agricultural production and with guiding policies to address emerging land competition issues around soil security, food production, bioenergy demand, and biodiversity threats (Montanella et al., 2016; FAO, 2015; McBratney et al., 2014). Thus, national- to global-scale harmonized soil databases are of critical importance for natural resource management, making progress towards eradicating hunger and poverty, and addressing food security and sustainable agricultural development, especially concerning the threats of global climate change and the need for adaptation and mitigation (FAO/IIASA/ISRIC/ISS-CAS/JRC, 2009).
In Ecuador, there have been two main efforts that have collected national soil information: one by the Instituto Espacial Ecuatoriano (IEE) and one by the Ministerio de Agricultura y Ganadería within the Sistema Nacional de Información de Tierras Rurales e Infraestructura Tecnológica (MAGAP-SIGTIERRAS) program (Tracasa-Nipsa, 2015). These projects have comparable methodologies, but there are substantial differences, especially with respect to how the soil information is structured and presented. We have identified over 13 500 soil profiles (and 51 713 measured soil horizons) in Ecuador (Loayza et al., 2020) that can be used to support a national framework on pedometric (or digital soil) mapping (Guerrero et al., 2014). We highlight that this soil information in Ecuador has not been available to the scientific community to date, and only 94 Ecuadorian soil profiles are currently included in global soil information systems, such as WoSIS (Batjes et al., 2020).
The main objective of this study is to synthesize and harmonize available soil profile information collected by the IEE and MAGAP-SIGTIERRAS across Ecuador between 2009 and 2015. In this way, we developed a new soil database with the purpose of constructing a national soil information system following international standards for archiving and sharing soil data. Thus, this dataset can easily be integrated into global soil information systems. In addition, we provide an integrated framework combining various data analysis tools to convert legacy soil information in an analog format into digital information that is useful for further analyses and data sharing.
The original IEE data were available as a collection of portable document format (PDF) files, where each PDF represented one soil profile containing morphological and analytical information. In contrast, soil morphological and analytical information from MAGAP-SIGTIERRAS was stored in different files in PDF format. We unified the information from IEE and MAGAP-SIGTIERRAS into one harmonized database (Fig. 2) using a unique field: the profile identifier (ID_PER). Given the size of the database, manual extraction of the original information was not feasible. Therefore, we developed an automated workflow using two programming languages, Python and R, to optimize the extraction of soil data and information from the original-format datasets.

Extracting data from PDF files
The available soil profiles were divided into two groups depending on their original source (i.e., IEE or MAGAP-SIGTIERRAS). Specialized data handling libraries, such as pandas (McKinney, 2011), openpyxl (Python Software Foundation, 2010), or PDF-Tools (Tracker Software Products, 2011), were used to automate this task. The first step to extract data was to convert the information from PDF format to a data format such as .xlsx or .txt. The data extracted contained categorical information about the profile morphological description and tabular information with chemical and physical properties for each available soil horizon. The target information extracted for MAGAP-SIGTIERRAS or IEE was organized using the pandas Python library and exported to the Harmonized Soil Database of Ecuador (HESD).
Data from MAGAP-SIGTIERRAS presented a homogeneous structure, which simplified data extraction. The structure of the IEE information presented many irregularities that varied across the collection. Irregularities included the following: the number of fields and variables in the tables, table headers, and differences in categorical or descriptive fields. These structural differences between MAGAP-SIGTIERRAS and IEE hindered the design of a single extraction methodology; therefore, we applied two approaches, as explained below.

The MAGAP-SIGTIERRAS approach
The homogeneous structure of the MAGAP-SIGTIERRAS dataset allowed for the development of a methodological approach based on regular expression queries. Each query sought a target variable or information contained in the text.
First, all files from MAGAP-SIGTIERRAS were stored in a specific directory. Then, iteratively, each file was converted into a .txt file, preserving the format of the tables, using the pdftools R package (Ooms, 2022). Once the files were converted, regular expressions were applied over the text to extract the key variables; to perform this process, in-house scripts were used that required adaptation depending on the structure of the original database (Supplement A). The results of the regular-expression-based queries were stored in a data frame that held the information for a single file. Next, the resulting data frame was appended to a target data frame (i.e., final data frame) that contained all of the processed information from all available files. Once all of the files were processed, the final data frame was converted to a .csv file.
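The extract-then-append loop described above can be illustrated with a minimal Python sketch. This is not the authors' in-house script (which is provided in Supplement A): the field labels ("Profile ID:", "Longitude:", "Latitude:") and the helper names are assumptions made for illustration only, and the text is assumed to have already been converted from PDF.

```python
import csv
import io
import re

# Hypothetical field labels; the real reports may use different wording.
PATTERNS = {
    "ID_PER": re.compile(r"^Profile ID:\s*(\S+)", re.MULTILINE),
    "CORX": re.compile(r"^Longitude:\s*(-?\d+(?:\.\d+)?)", re.MULTILINE),
    "CORY": re.compile(r"^Latitude:\s*(-?\d+(?:\.\d+)?)", re.MULTILINE),
}


def extract_record(text):
    """Apply one regular expression per target variable to one file's text."""
    record = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(text)
        # Fields that are not found stay as None so they can be reviewed later.
        record[field] = match.group(1) if match else None
    return record


def records_to_csv(texts):
    """Extract one record per text and serialize all records as CSV text."""
    rows = [extract_record(t) for t in texts]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(PATTERNS), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


sample = "Profile ID: P001\nLongitude: -78.5\nLatitude: -0.2\n"
print(records_to_csv([sample]))
```

Keeping one pattern per target variable makes it straightforward to adapt the queries when a source file deviates from the expected layout.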

The IEE approach
Here, we aimed to convert the information stored in the PDF (text and tables) to .xlsx format, where each sheet contained the text blocks or tables of the original PDF document. Our only option to extract the information with this format was the open-access program Smallpdf v 0.19.1. In this way, each sheet corresponded to the description of a group of morphological, chemical, or physical soil properties.
The conversion was not always successful due to inconsistencies among datasets. Examples of inconsistencies are merged rows, joint characters inside the variable descriptions, inconsistent labeling of the tables, or a different number of tables per file. Therefore, a Python 3.10.2 script was written to overcome these difficulties and successfully extract the data. The goal was to read the .xlsx files and transfer the information into another file whose tables were designed with the target structure of the HESD (see Supplement D).
The rationale of the script was to generate a data frame for every sheet in an .xlsx file, where each sheet corresponds to a table with a chemical or physical description for a soil profile. After creating a data frame for each table, all of the data frames were merged in a standard data frame for the .xlsx file; finally, the file data frame was appended to a general data frame that contained the information for all of the .xlsx files. The files were then converted to a .csv format for the next phase of correction and harmonization. Scripts and diagrams explaining the methodology used for each case can be found in the Supplement (see Supplement B, D).
https://doi.org/10.5194/essd-15-431-2023 Earth Syst. Sci. Data, 15, 431-445, 2023
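The sheet-by-sheet merging logic can be sketched as follows. This is a simplified illustration using plain Python dictionaries rather than the pandas data frames used in the original workflow; the sheet names, the example values, and the choice of ID_HOR as the matching key are assumptions, and the actual scripts are those in Supplement B and D.

```python
from collections import defaultdict


def merge_sheets(sheets, key="ID_HOR"):
    """Merge per-sheet row lists into one row per horizon, matched on `key`."""
    merged = defaultdict(dict)
    for rows in sheets.values():
        for row in rows:
            merged[row[key]].update(row)
    return [merged[k] for k in sorted(merged)]


def build_general_table(workbooks, profile_key="ID_PER"):
    """Append the merged table of every workbook to one general table."""
    general = []
    for profile_id, sheets in workbooks.items():
        for row in merge_sheets(sheets):
            general.append({profile_key: profile_id, **row})
    return general


# Hypothetical workbook: sheet names and values are invented for illustration.
workbooks = {
    "P001": {
        "chemical": [{"ID_HOR": 1, "PHAQ": 6.4}, {"ID_HOR": 2, "PHAQ": 6.6}],
        "physical": [{"ID_HOR": 1, "ARCILLA": 32.0}, {"ID_HOR": 2, "ARCILLA": 41.0}],
    }
}
table = build_general_table(workbooks)
print(table)
```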

Soil data correction and harmonization
All of the data obtained from the original sources went through a manual review process by an expert pedologist to minimize the data extraction errors and provide a curated harmonized dataset. Once the original databases were merged, the two subsets of the final database (profile information subset and horizon information subset) were manually revised a second time by the expert to detect any potential errors and inconsistencies. All fields in the database were checked using basic descriptive statistics, such as minimum, maximum, average, and standard deviation values, to verify the consistency of the data and the soil properties (e.g., pH range and C/N ratio). In some fields, it was necessary to change the units of measurement during harmonization, either by standardizing the original datasets (i.e., IEE and MAGAP-SIGTIERRAS) or by converting all units to the International System of Units. The organic carbon (CO), organic matter (MO), and total nitrogen (NTOT) variables were transformed to grams per kilogram (g kg⁻¹). The level of precision in the expression of each variable was standardized (maximum of two significant figures). Finally, some errors were found and corrected, such as duplicated information, missing data, mismatches between the information and its horizon, and formatting typos. Special attention was paid to the quantitative information of the analytical variables, for which frequency histograms were plotted to identify outliers or physical inconsistencies, such as excessively low pH values (i.e., < 3), extremely high C/N ratios (i.e., > 35), or zero values assigned where no determination had been performed. All inconsistencies that could not be resolved were reclassified as "without data".
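The consistency checks described above can be illustrated with a short Python sketch. It is a simplification under stated assumptions: the source does not give the original units, so the percent-to-g kg⁻¹ conversion is only an example; the thresholds (pH < 3, C/N > 35) come from the text, but blanking both CO and NTOT when the ratio is implausible is a choice made here for illustration, and in the actual workflow an expert reviewed flagged values before reclassifying them.

```python
import statistics

NO_DATA = None  # stands in for the "without data" class


def percent_to_g_per_kg(value):
    """Convert a mass percentage to g/kg (1 % = 10 g/kg); assumed source unit."""
    return None if value is None else value * 10.0


def describe(values):
    """Minimum, maximum, mean, and sample SD of the non-missing values."""
    data = [v for v in values if v is not None]
    return {
        "min": min(data),
        "max": max(data),
        "mean": statistics.mean(data),
        "sd": statistics.stdev(data) if len(data) > 1 else 0.0,
    }


def screen_horizon(row):
    """Return a copy of `row` with implausible values replaced by NO_DATA."""
    cleaned = dict(row)
    ph = cleaned.get("PHAQ")
    if ph is not None and ph < 3:
        cleaned["PHAQ"] = NO_DATA
    co, ntot = cleaned.get("CO"), cleaned.get("NTOT")
    # Only compute the C/N ratio when both values exist and NTOT is non-zero.
    if co is not None and ntot:
        if co / ntot > 35:
            cleaned["CO"] = cleaned["NTOT"] = NO_DATA
    return cleaned


print(screen_horizon({"PHAQ": 2.1, "CO": 20.0, "NTOT": 2.0}))
```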

Soil dataset overview
The HESD contains information from 13 542 soil profiles with over 51 713 measured soil horizons, including 92 different edaphic variables. With over 4.7 million records that include numeric (e.g., clay content, organic matter, and soil pH) and class soil properties (e.g., horizon designation and geology), the HESD represents the most complete data compilation for mainland Ecuador.
The structure of the database is based on the Soil Organic Carbon Mapping Cookbook (FAO, 2018) and represents a complete soil data compilation for Ecuador, considering the effective soil depth (ESD). The ESD corresponds to the solum of the soil profile, which includes the surface and subsurface horizons with presence of roots and biological activity (Soil Survey Staff, 1975). Because a single structure could not couple the profile and the soil horizon information, the data were divided into two datasets linked by a unique identifier. Thus, the resulting relational database can easily be queried and augmented for future synthesis studies.
The common identifier linking these dataset tables is the ID_PER field, which records the unique name assigned to each profile in the database. Both files (.csv) can easily be imported into statistical software such as R, after which they can be joined using the unique ID_PER. The first dataset contains information associated with the soil profile and its environmental characteristics (Table 1). Table 1 lists the variables in the profile dataset, with the soil profile information (e.g., classification, humidity and temperature regime, rockiness, and effective depth) and the site-level data containing the environmental information (forming factors): landscape attributes, land cover type, and slope.
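As an illustration of how the two tables can be linked, the following Python sketch joins horizon rows to their profile rows on ID_PER using only the standard library. The miniature CSV contents are invented for the example, and only the field names ID_PER, ID_HOR, ALT, and PHAQ are taken from the database description; the real files contain many more columns.

```python
import csv
import io


def read_rows(csv_text):
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def join_on_profile(profiles, horizons, key="ID_PER"):
    """Attach profile-level fields to every horizon row that shares `key`."""
    by_id = {p[key]: p for p in profiles}
    joined = []
    for h in horizons:
        profile = by_id.get(h[key])
        if profile is not None:  # horizons without a matching profile are skipped
            joined.append({**profile, **h})
    return joined


# Invented miniature tables for illustration.
profiles_csv = "ID_PER,ALT\nP001,2850\nP002,15\n"
horizons_csv = "ID_PER,ID_HOR,PHAQ\nP001,1,6.4\nP001,2,6.6\nP002,1,7.1\n"
joined = join_on_profile(read_rows(profiles_csv), read_rows(horizons_csv))
print(len(joined))
```

With real files, the same functions could be fed the text read from the two .csv files; values remain strings here and would need numeric conversion before analysis.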
The second dataset contains information associated with the soil profiles, divided into horizons and including qualitative and quantitative information. The dataset contains morphological information such as the designation or depth of the soil horizon, the presence or absence of roots, and the abundance of rock fragments. In addition, there are more than 30 variables related to soil physical properties (e.g., texture, bulk density) and chemical properties (Table 2). We highlight that there is information regarding the soil organic fraction, the cation exchange capacity, the electrical conductivity and sodium exchange capacity, and other soil properties (e.g., soil drainage and soil tilth), all of which is relevant for the evaluation of soil health (USDA, 2022).

Exploratory analyses of the HESD
We performed an exploratory analysis of some variables included in the HESD as an example of its usability. Soil variables behave differently as soil depth increases, and Fig. 3 shows examples of soil property-depth relationships (for organic carbon, pH, electrical conductivity in water, total clay, cation exchange capacity (CIC), and effective soil depth (PRES)).
For example, organic carbon has higher values at the surface and gradually decreases as soil depth increases. In contrast, pH ranges between 6 and 7 with an average of ~6.5, and this value is maintained as soil depth increases. Figure 3 provides further examples of how different soil properties vary as soil depth increases. Information in the HESD can be used to evaluate how land use and management could affect soil properties (Beillouin et al., 2022). Table 3 shows the results of the statistical analysis of different variables within two different ecosystems: cropland and forest. Although the HESD presents the most complete information at the national level, we recognize that there are still information gaps. The two original projects from which the soil information was extracted were focused on agricultural areas; therefore, the HESD does not represent all ecosystems across Ecuador equally. For example, the HESD has 9675 soil profile descriptions for cropland and only 3694 for forest. These two are the most representative ecosystems at the national level. We highlight that forest ecosystems show evidence of higher SOC (27.9 g kg⁻¹) than cropland ecosystems (24 g kg⁻¹). Thus, forest ecosystems have a higher concentration of carbon but are not always well represented in the national database.
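A per-ecosystem summary of the kind reported in Table 3 (mean, standard deviation, and coefficient of variation) could be computed along the following lines. This is a minimal sketch, not the analysis code used for the paper: the field name LAND_COVER is hypothetical, the sample values are invented, and the use of the sample (n - 1) standard deviation is an assumption.

```python
import statistics
from collections import defaultdict


def summarize_by_group(rows, group_field, value_field):
    """Count, mean, sample SD, and CV (%) of `value_field` for each group."""
    groups = defaultdict(list)
    for row in rows:
        if row[value_field] is not None:  # skip "without data" entries
            groups[row[group_field]].append(row[value_field])
    summary = {}
    for name, values in groups.items():
        mean = statistics.mean(values)
        sd = statistics.stdev(values) if len(values) > 1 else 0.0
        summary[name] = {
            "n": len(values),
            "mean": mean,
            "sd": sd,
            "cv": 100.0 * sd / mean if mean else None,
        }
    return summary


# Invented values; LAND_COVER is a hypothetical field name.
rows = [
    {"LAND_COVER": "cropland", "CO": 10.0},
    {"LAND_COVER": "cropland", "CO": 20.0},
    {"LAND_COVER": "cropland", "CO": 30.0},
    {"LAND_COVER": "forest", "CO": 30.0},
    {"LAND_COVER": "forest", "CO": None},
]
summary = summarize_by_group(rows, "LAND_COVER", "CO")
print(summary["cropland"])
```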

Spatial distribution and environmental representativeness of the database
Two different analyses were carried out with the HESD: one focused on the representativeness of the data within the different biogeographical sectors and one focused on the probability of the spatial representativeness at the national level.
To undertake these analyses, we used the maximum entropy approach (Maxent program; Phillips et al., 2020), which has been applied to assess the spatial representativeness of environmental observatory networks (Villarreal et al., 2019, 2018).

Representativeness index of Ecuadorian biogeographic sectors
The first analysis to test the representativeness was done considering the 15 biogeographic sectors of Ecuador (Fig. 4). We clarify that each biogeographic sector represents a group of plant communities that share floristic affinity at least at the genus level and mainly at the species level; thus, these sectors define homogeneous environmental units (Ministerio de Ambiente del Ecuador, 2013). We calculated the representativeness index for each sector as the number of data points divided by the percentage of the national territory covered by that sector. Based on this calculation, the higher the representativeness index, the better represented the sector is in the database (Pfeiffer et al., 2020). Table 4 shows the data points compiled in this work, by region, province, biogeographic sector, and the representativeness index for each biogeographic sector.
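The index calculation described above amounts to a single division per sector, as in the following Python sketch. The sector names and numbers are hypothetical and are not taken from Table 4; the function name is likewise introduced here for illustration.

```python
def representativeness_index(n_points, coverage_percent):
    """Number of data points divided by the sector's coverage percentage."""
    if coverage_percent <= 0:
        raise ValueError("coverage_percent must be positive")
    return n_points / coverage_percent


# Hypothetical sectors: (number of data points, % of national territory).
sectors = {
    "sector_a": (1000, 10.0),
    "sector_b": (300, 20.0),
}
indices = {
    name: representativeness_index(n, pct) for name, (n, pct) in sectors.items()
}
# Sectors ordered from best to worst represented.
ranking = sorted(indices, key=indices.get, reverse=True)
print(indices, ranking)
```

Because the index normalizes by area share, a large sector with many points can still rank below a small, densely sampled one.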
The biogeographic sector with the highest representativeness index is Western Cordillera of the Andes with 24.7 %, followed by Jama-Zapotillo (16.7 %), Northeastern Cordillera of the Andes (11.4 %), Southeastern Cordillera of the Andes (9.7 %), and Páramo (7.6 %) (Table 4). These areas are found mainly in the western part of Ecuador. The last four biogeographic sectors are grouped in what we call the "Northern Andes" province in the Andes region. In Ecuador, this zone encompasses the Andes mountain range that extends from north to south (Clapperton, 1993). The Andes, in the biogeographic sector of Páramo, has a mean SOC of 45 g kg⁻¹. This sector is distributed in a valley and is almost uninterrupted over the forest line of the eastern and western mountain ranges of the Andes (Hofstede, 1999) at around 3400-3700 m a.s.l. This biogeographic sector occupies 23 452 km² (9.4 % of the national territory) (Table 4) and is probably the largest soil carbon reservoir in Ecuador. Despite the importance of Páramo as a large pool of SOC, its representativeness index is not as high as we expected (109.8). This is probably because a large part of the region is within some of the national protected areas, which are zones that were not considered by the original projects.
Most of the data are concentrated in the southwestern part of the country. In contrast, few soil data are available for the eastern section of the country, mainly in the Amazonian region (representativeness index of 31.4), but the mean carbon concentration (17.7 g kg⁻¹) in this region is higher than that of the littoral region (3579 observations, 15.5 g kg⁻¹ SOC). This may be because the organic soil layer of the tropical forest is no deeper than 10 cm, limiting carbon accumulation in soil, and the decomposition of the litter is so rapid that the plant material reaching the soil surface is, in most cases, oxidized before it can be incorporated into the soil matrix (Hofstede, 1999).

Spatial representativeness using the Maxent approach
The second analysis was performed using the Maxent approach (Yackulic et al., 2012). This analysis provides an estimate (with most values between zero and one) that can be interpreted as the probability of presence or the probability of an area being represented by the spatial information included in the HESD. This analysis allowed us to compare the spatial representativeness of the HESD with the soil information currently available in WoSIS (Batjes et al., 2020), and we demonstrate that the HESD contributes to filling the spatial soil information gaps across Ecuador, particularly across the coast and in the highlands (as shown in Fig. 5). As evidenced in Table 4, there are areas not yet fully represented with available data in the HESD, such as in the eastern part of the country (Amazonia) and in a part of the Esmeraldas Province (northwest), but a greater representativeness is evident with the HESD compared with that of the current WoSIS database.
The HESD shows a clustered distribution, with some areas better represented than others, due to the methodology used in the original projects that was biased to cropland areas (Table 4). We highlight that the original soil collection efforts (i.e., IEE and MAGAP-SIGTIERRAS) were not focused on biogeographical sectors but rather on populated areas or areas designated for agriculture, and did not consider pro- ) and all sectors in the Amazon region (Fig. 5, Table 4). We propose that the HESD can be updated as new soil data become available (at the local to national level) to gradually fill soil spatial information gaps and better represent the entire geographical range of Ecuador's territory.

Further considerations
The HESD aims to increase the quantity, quality, and accessibility of soil information across Ecuador and the Latin American region. The HESD facilitates the exchange and use of soil data collected within the context of collaborative efforts at different scales (global, national, and local). Globally, the HESD has the structure to be considered for use in different international projects, including the Global Soil Organic Carbon Map (GSOCmap), a project of the FAO Global Soil Partnership (GSP), and the GlobalSoilMap.Net project.
The proposed methodology demonstrates the possibility of transforming soil information that has previously been stored in formats that are not easily accessible for data analysis (e.g., in PDFs or scanned paper sheets) into usable formats for soil spatial variability studies at the regional to the national scale. We propose a systematic method for the organization of national soil information to reduce errors when generating new data in the future (Yigini et al., 2018; Baritz et al., 2008). We have substantially improved the publicly available spatial representation of soil information in Ecuador to support current soil information initiatives such as the WoSIS (Batjes et al., 2020), the GlobalSoilMap.Net project, and the FAO Global Soil Partnership, thereby increasing access to soil information across the world. The HESD includes information on more than 70 edaphic properties for Ecuadorian soils. It is evident that data gaps exist in certain areas, and there is a need to incentivize future soil survey programs to increase the sampling in underrepresented areas. The HESD could support the generation of new soil-related knowledge which could help to assess food production challenges, threats to soil security and soil health, climate change mitigation, and land degradation.
Author contributions. DA, MG, and CO conceptualized the study and developed the methodology. FB and PD developed the code and scripts to extract the soil information. RV and VO were responsible for writing, reviewing, and editing the manuscript. WJ helped with the original resources. CO was responsible for funding acquisition. DA prepared the manuscript with contributions from all co-authors.
Competing interests. The contact author has declared that none of the authors has any competing interests.
Disclaimer. Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Review statement. This paper was edited by Giulio G. R. Iovine and reviewed by three anonymous referees.

Figure 1. Spatial distribution of soil profiles in Ecuador compiled in the HESD.

Figure 2. Overview of the workflow for data extraction and harmonization of the database structure. The following abbreviations are used in the figure: ID_PER - profile identifier, ID_NAC - profile identifier in the provenance collection, COLP - source project, CORX - longitude coordinates, CORY - latitude coordinates, ALT - altitude, ID_HOR - horizon identifier, ORDHOR - horizon number, and HMOR - morphological horizon.

Figure 3. Variation in the concentration of soil variables with respect to depth (cm): (a) profile average of organic carbon (CO); (b) profile average of pH H2O; (c) profile average of electric conductivity in water (CEAQ); (d) profile average of total clay (ARCILLA); (e) profile average of the cation exchange capacity (CIC); and (f) profile average of the effective depth (PRES). The blue area represents the range of variation in the properties.

Figure 4. Map of the biogeographic sectors of Ecuador, extracted from the "Sistema de clasificación de Ecosistemas del Ecuador Continental" (Ministerio de Ambiente del Ecuador, 2013).

Figure 5. (a) National representativeness (an estimate between zero and one of the probability of presence) of soil information using the HESD. (b) National representativeness of the soil information available from the World Soil Information Service (WoSIS).

Table 1. The HESD profile variable names, codes, descriptions, and units.
(a) USDA soil taxonomy (ST) developed by the United States Department of Agriculture and the National Cooperative Soil Survey (Soil Survey Staff, 1999).
(b) Guidelines for soil description, 4th edition, Food and Agriculture Organization (FAO) of the United Nations, Rome, 2006 (Food and Agriculture Organization of the United Nations, 2006).

Table 2. The HESD horizon coding conventions and soil property names, units of measurement, and descriptions.

Table 2. Continued.
SAR), calculated from the concentrations of Na+, Ca2+, and Mg2+ in soil solution (using the SPM)
(a) Guidelines for soil description, 4th edition, Food and Agriculture Organization (FAO) of the United Nations, Rome, 2006 (Food and Agriculture Organization of the United Nations, 2006).
(b) The USDA system classifies soils into 12 soil texture classes.

Table 3. Statistical analysis of key variables in the HESD. The two most nationally representative types of ecosystems were selected: cropland (9675 data points) and forest (3694 data points).
The abbreviations used in the table are as follows: CO - organic carbon, PHAQ - pH H2O, CEAQ - electric conductivity in water, ARENA - sand total, ARCILLA - clay total, CIC - cation exchange capacity, and PRES - effective depth. SD denotes standard deviation, and CV is the coefficient of variation.
In terms of SOC, these regions present the highest mean values (27.8 g kg⁻¹).

Table 4. Distribution of SOC data points per ecosystem sector (vegetation formation) according to the Ministerio del Ambiente del Ecuador (2013).