Improving the Latin America and Caribbean Soil Information System (SISLAC) database enhances its usability and scalability

Spatial soil databases can help model complex phenomena in which soils are a decisive factor, for example, evaluating agricultural potential or estimating carbon storage capacity. The Latin America and Caribbean Soil Information System, SISLAC, is a regional initiative promoted by the Food and Agriculture Organization's (FAO) Latin America and the Caribbean Soil Partnership to contribute to sustainable management of soil. SISLAC includes data from 49 084 soil profiles distributed unevenly across the continent, making it the region's largest soil database. In addition, there are other soil databases in the region with about 40 000 soil profiles that can be integrated into SISLAC and improve it. However, some problems hinder its use, such as the quality of the data and their high dimensionality. The objective of this research is to evaluate the quality of the SISLAC data and the other available soil databases to generate a new improved version that meets the minimum quality requirements to be used for different purposes or practical applications. The results show that 15 % of the existing soil profiles had an inaccurate description of the diagnostic horizons and 17 % of the additional profiles already existed in SISLAC; therefore, a total of 32 % of profiles were excluded for these two reasons. Further correction of an additional 4.5 % of existing inconsistencies improved overall data quality. The improved database consists of 66 746 profiles and is available for public use at https://doi.org/10.5281/zenodo.7876731 (Díaz-Guadarrama and Guevara, 2023). This revised version of SISLAC data offers the opportunity to generate information that helps decision-making on issues in which soils are a decisive factor. It can also be used to plan future soil surveys in areas with low density or where updated information is required.


Introduction
Soil is a three-dimensional natural body consisting of strata called horizons when there are chemical, biological, and even physical relations (i.e., transference of components or products of their alteration among them) or simply layers when they are a consequence of successive deposition of different sediments. Both horizons and layers are a mixture of degraded mineral materials, organic material, air, and water (Bockheim et al., 2005). Soil is a product of the soil itself (such as information on the soil at a point at a site), climate, organisms, topography, the parent material, time, and spatial position, together also known as the SCORPAN factors of soil formation (McBratney et al., 2003). Soil provides various ecologic or productive contributions besides its obvious importance as a critical factor in food production, for example, in urban ecosystem services (such as water buffering capacity of open areas), human health (breakdown of toxic contaminants), or climate regulation through carbon storage (Otte et al., 2012). Its sustainable management is of the utmost importance in the main environmental challenges such as food security, climate change, and the loss of biodiversity (Dewitte et al., 2013). Soil data are an essential starting point to reach an adequate level of knowledge about soil status, raise awareness about its importance, and preserve this valuable resource (Bouma et al., 2012). Digital soil data (such as soil profiles) are in great demand as inputs to, for example, estimate the potential of agricultural land (Amirinejad et al., 2011; Bini et al., 2013; Owusu et al., 2020); in addition, their availability is key to assessing soil functions such as water and climate regulation, energy supply, and biodiversity (Greiner et al., 2017; Varón-Ramírez et al., 2022). Greater diffusion of soil information has substantial benefits in disciplines such as agricultural sciences because it allows for better estimation of current and future crop productivity or the identification of constraints and risks of land degradation (FAO and IIASA, 2009; Hopmans et al., 2021; Paterson et al., 2015). The Food and Agriculture Organization (FAO) indicates that more and better soil data can further achievements in the fight against poverty and hunger and advance sustainable development (FAO, 2017).
Earth Syst. Sci. Data, 16, 1229-1246, 2024, https://doi.org/10.5194/essd-16-1229-2024
Technological advances and increased computing capabilities have led to the development of soil databases on both a regional and global scale (Hendriks et al., 2019; Keskin et al., 2019; Rossiter, 2018). Global databases exist such as the World Soil Information Service, WoSIS (Batjes et al., 2017, 2020), or the World Inventory of Soil Emission Potentials, WISE (Batjes, 2016); regional databases such as the Africa Soil Profiles database (Leenaars, 2013); and national ones such as SISINTA in Argentina (Angelini et al., 2018), the Harmonized Soil Database of Ecuador 2021 (Armas et al., 2022), or IRAKA in Colombia (Araujo-Carrillo et al., 2021). These datasets are an example of efforts at different levels to have soil profile data that help to support decision-making on problems involving this resource's management. Organizations such as the FAO's Global Soil Partnership (GSP) and the Latin America and the Caribbean Soil Partnership (LACS) emphasize the need to preserve such data as, in some parts of the world, soil survey data are the only source of information available (Beaudette and O'Geen, 2009; Hengl and Macmillan, 2019).
The mentioned databases allow scientists to generate information on soil properties such as soil organic carbon (SOC). SOC is one of the most important chemical properties related to soil fertility and climate regulation and the key to multiple functions in ecosystem services (Owusu et al., 2020). Global projects such as the FAO's Global Soil Organic Carbon Map (FAO and ITPS, 2018); national projects in Brazil (Gomes et al., 2019), Ghana (Owusu et al., 2020), and Cameroon (Silatsa et al., 2020); or regional projects in Andalusia, Spain (Armas et al., 2017), or in páramo ecosystem soils in Colombia (Gutierrez et al., 2020) have been some of the works that have estimated SOC (in its vertical or horizontal dimensions) from soil databases.
The Latin America and Caribbean Soil Information System, SISLAC, is an initiative coordinated and financed by the FAO's Global Soil Partnership (GSP) to contribute to the sustainable management of this resource in the region (SISLAC, 2013). SISLAC (Fig. 1a) has data on almost 50 000 soil profiles and 140 000 horizons and layers, making it the most extensive database in the region. The data include a description of the site for each profile, its spatial location, its comprising layers, its physical and chemical properties, its data provider, and its metadata. In addition to SISLAC, there are other soil databases available in the region that should be analyzed and integrated with it in order to improve it.
When analyzing the available data, it is evident that some of them present inconsistencies due to the high heterogeneity of the sources that provide them. These inconsistencies can be due to, for example, old descriptions using obsolete description systems or errors in transferring information from the field to the office. If they are not corrected, the analysis results will have a high degree of uncertainty and inaccuracy, particularly because the performance of a model depends on the quality of the training data (Garg et al., 2020). Data quality is a multidimensional concept involving management, analysis, quality control, storage, and presentation (Chapman, 2005). It is closely related to the potential use of the data and their ability to meet user needs (English, 1999), which Krol (2008) calls "use aptitude".
Therefore, this research aims to evaluate the quality of SISLAC data and existing soil databases in terms of logical consistency to generate a new version of the SISLAC database that meets the minimum requirements of completeness in the description of profile horizons.

Data and methods
The flow diagram (Fig. 2) shows the work carried out, consisting of four phases: the first comprises a revision of the spatial correspondence, the second an identification of spatially duplicated profiles, the third a validation of errors in the description of horizons, and the fourth a correction of minor inconsistencies.

Study area
The study area (Fig. 1a) is composed of the Latin American and Caribbean countries listed in Table 1, where since 2016 we have had a soil database representative of such a diverse region. In the same figure, the number of profiles per region can be seen aggregated in orange circles.

Data
The SISLAC database, which can be downloaded from the official site (http://54.229.242.119/sislac/es, last access: 6 March 2024), consists of 49 084 profiles (with a total of 139 746 horizons). The number of these by country is detailed in Table 1. Validations are applied to 100 % of the data.
Profile attributes are detailed in Table 2, where the name of the attribute is listed in the first column, its description in the second, and its data type in the third. The location is given in geographic coordinates in the WGS 84 datum. The attributes of horizons and layers are listed in Table 3 in the same way as those of the profiles.
The additional available databases are listed in Table 4, detailing the country, link to the data, number of profiles, license of use, and spatial reference system. The data of the new version are in geographic coordinates, EPSG:4326; data expressed in a different system are reprojected. As with the SISLAC data, 100 % of the data are analyzed. The total number of profiles analyzed is 96 783. These databases contain more or fewer attributes than the SISLAC structure; only the attributes within the SISLAC structure are processed.
The above databases have different structures and attributes.

Quality assessment and improvement of SISLAC data
The evaluation of the quality and the improvement of the data were carried out in three stages, the first two for the site data and the third for the different layers. The first stage consisted of checking that the profiles were in the correct location (spatial correspondence). It was carried out by spatial intersection between the profiles (points) and the cartography of the countries (polygons). This correspondence was verified based on the country_code attribute of the profiles, and those that coincided with their respective country were considered valid (Fig. 3a). Those that did not coincide were verified one by one; those that were within the limits of their country, considering the cartographic scale of the reference information, the precision of the equipment with which the coordinate was taken, or the reference systems under which the original data were taken, were considered valid (Fig. 3b). Others had the coordinates inverted (Fig. 3c); their latitude and longitude values were exchanged, and their correspondence was verified again. Finally, the profiles outside their zone that could not be corrected for having the wrong location were excluded (Fig. 3d).
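The spatial correspondence check can be sketched as follows. This is a minimal illustration, not the processing code of this work: it assumes a country outline given as a simple list of (longitude, latitude) vertices, uses a plain ray-casting test in place of a GIS intersection, and the names point_in_polygon, check_location, and example_outline are hypothetical. Profiles returned as "review" would still need the one-by-one verification described above.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (longitude, latitude), WGS 84


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray-casting test: True if the point lies inside the polygon ring."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point count.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def check_location(lon: float, lat: float, country: List[Point]) -> str:
    """Classify a profile location as 'valid', 'swapped', or 'review'."""
    if point_in_polygon((lon, lat), country):
        return "valid"
    # Exchange latitude and longitude and test again (inverted coordinates).
    if point_in_polygon((lat, lon), country):
        return "swapped"
    return "review"


# Hypothetical rectangular outline used only for illustration,
# not a real country boundary.
example_outline = [(-69.6, -22.9), (-57.5, -22.9), (-57.5, -9.7), (-69.6, -9.7)]

print(check_location(-65.0, -17.0, example_outline))  # valid
print(check_location(-17.0, -65.0, example_outline))  # swapped
print(check_location(-40.0, -10.0, example_outline))  # review
```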
The second stage consisted of verifying whether there were overlapping profiles and whether the values of their attributes were different. For this, the number of times each pair of coordinates is repeated was checked across the whole dataset. Unlike in the previous validation, these profiles cannot be arbitrarily excluded since the correct profile cannot be determined. Those with duplicity were therefore marked so that users of the data can use the ones they consider appropriate. A new attribute in the profiles (perfil_duplicado of binary type) indicates if the profile has duplicity (TRUE) or is unique (FALSE). With respect to the additional databases, the existence of these profiles in SISLAC was also verified. If this occurred, the profiles with the highest number of valid attributes were kept in the new database.
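A minimal sketch of this duplicate check is given below, assuming each profile is a dictionary; the keys longitude and latitude and the helper names are hypothetical, and only perfil_duplicado comes from the database itself. Exact coordinate equality is assumed as the duplication criterion.

```python
from collections import Counter
from typing import Any, Dict, List

Profile = Dict[str, Any]


def flag_duplicates(profiles: List[Profile]) -> List[Profile]:
    """Mark profiles that share their coordinate pair with another profile."""
    counts = Counter((p["longitude"], p["latitude"]) for p in profiles)
    for p in profiles:
        p["perfil_duplicado"] = counts[(p["longitude"], p["latitude"])] > 1
    return profiles


def keep_most_complete(candidates: List[Profile]) -> Profile:
    """Among profiles describing the same site, keep the one with most non-empty attributes."""
    return max(candidates, key=lambda p: sum(v is not None for v in p.values()))


sample = [
    {"id": 1, "longitude": -74.1, "latitude": 4.6},
    {"id": 2, "longitude": -74.1, "latitude": 4.6},
    {"id": 3, "longitude": -58.4, "latitude": -34.6},
]
for profile in flag_duplicates(sample):
    print(profile["id"], profile["perfil_duplicado"])
```

Flagging rather than deleting mirrors the choice described above: the data user decides which of the coincident profiles to keep.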
The third stage consisted of validating the description of the horizons or layers of each profile, verifying that $u_i < v_i \leq u_{i+1}$, where $u_i$ is the upper limit and $v_i$ the lower limit of layer $i$. That is, the upper limit of a layer must be lower than its lower limit, and the lower limit must be lower than or equal to the upper limit of the next layer. Gaps may exist, but there is never overlap between layers. Gaps can occur for reasons such as the data not being taken at the site, the loss of data in the office, or an error or omission in transcription. Errors, those in which the structure could not be corrected, were validated first so that these profiles could be excluded. Table 6 lists the three applied rules, their description, and an example of each.
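The three error rules can be sketched as below. This is an illustrative reading of the rules rather than the code used in this work: layers are assumed to be dictionaries with top and bottom limits (None when empty), the function name find_layer_errors and the error labels are hypothetical, and layers with inverted limits are skipped in the overlap test because they are treated later as correctable inconsistencies.

```python
from typing import Dict, List, Optional

Layer = Dict[str, Optional[float]]


def find_layer_errors(layers: List[Layer]) -> List[str]:
    """Return the error types found in the layers of one profile."""
    errors: List[str] = []

    # Empty limits: a layer with neither upper nor lower limit.
    if any(l["top"] is None and l["bottom"] is None for l in layers):
        errors.append("empty_limits")

    described = [l for l in layers if l["top"] is not None and l["bottom"] is not None]

    # Duplicated layers: same limits but different attribute values.
    seen: Dict[tuple, Layer] = {}
    for l in described:
        key = (l["top"], l["bottom"])
        if key in seen and seen[key] != l:
            errors.append("duplicated_layers")
            break
        seen.setdefault(key, l)

    # Overlap: a lower limit deeper than the upper limit of the next layer.
    ordered = sorted(
        (l for l in described if l["top"] < l["bottom"]),
        key=lambda l: (l["top"], l["bottom"]),
    )
    for upper, lower in zip(ordered, ordered[1:]):
        same_limits = (upper["top"], upper["bottom"]) == (lower["top"], lower["bottom"])
        if not same_limits and upper["bottom"] > lower["top"]:
            errors.append("layers_overlap")
            break
    return errors


print(find_layer_errors([{"top": 0, "bottom": 20}, {"top": 20, "bottom": 50}]))  # []
print(find_layer_errors([{"top": 0, "bottom": 20}, {"top": 15, "bottom": 40}]))  # ['layers_overlap']
```

A profile for which the returned list is not empty would be excluded under the procedure described above.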
After excluding the profiles with errors, the existence of inconsistencies was validated. Unlike errors, these can be corrected by guidelines that do not alter the structure of the profile. Table 7 lists the rules applied, their description, and the guideline for their correction. For a better understanding of the content of Table 7, Table 8 illustrates each inconsistency and its correction.
The PCA included profile characteristics (soil variables), profile depth, number of profile horizons, and profile classification according to the World Reference Base for Soil Resources (IUSS Working Group WRB, 2007). The soil variables used were effective cation exchange capacity (ECEC), pH, organic carbon (OC), and clay and sand content. These variables were selected because they are those with the highest number of records in the database. To represent the soil profile at each site using the values registered by horizon, the mean, minimum (min), and maximum (max) of each variable were calculated. The profile depth was identified as the maximum value of each site's "bottom" variable. Finally, the profile classification was obtained from the most probable soil group layer from SoilGrids at a spatial resolution of 250 m. In total, 18 variables (17 quantitative and 1 qualitative) were included in the PCA. Those soil groups with fewer than 100 profiles were removed from the dataset, and a total of 27 960 soil profiles (those with complete cases) distributed in the LAC region were analyzed. The PCA was performed with the FactoMineR package in R (Lê et al., 2008).

Table 6. Layer error validation. In the example, the layers with errors are highlighted in bold; for the first and third case, the last layers of the profiles are the ones with errors, while in the second case both layers have errors because the limits have no data.
Validation | Description | Example
Duplicated layers | Layer limits are duplicated, and the values of the attributes are different. |
Empty limits | Upper and lower limits do not contain data. |
Layers overlap | Layers overlap in a profile. |

Table 7. Description of the validation of inconsistencies and the guidelines for their correction.
Validation | Description | Correction guideline
Organic layer | The first layer is described in the opposite direction, and the normal description begins from the second layer. This layer is commonly known as organic. | Invert the values of the first layer and rescale subsequent limits based on the thickness of the organic layer.
Inverted layer | The values of the limits of a layer are inverted; this is verified by also considering the previous and following layers. | Invert the values of the layer.
Continuous final layer | The value of the lower limit of the last layer is empty. | Assign the value of the upper limit of the last layer plus 10. Defined by expert judgment to guarantee a minimum thickness in these layers.
Duplicate layer | Horizon that presents duplicate layers in all its attributes. | Delete the duplicate layers.
Upper limit is null | The upper limit of a layer is null; in addition, the lower limits of that layer and the previous one are not null. | Assign the lower-limit value of the previous layer.
Lower limit is null | The lower limit of a layer is null; in addition, the upper limits of that layer and the next are not null. The last layer is not validated. | Assign the value of the upper limit of the next layer.
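The correction guidelines of Table 7 can be sketched as follows. This is a simplified illustration under stated assumptions, not the code used in this work: layers are dictionaries with top and bottom limits ordered from the surface downwards, the function name correct_layers is hypothetical, an organic layer is assumed to be detectable as an inverted first layer whose lower value equals the upper limit of the second layer, and the inverted-layer rule is applied without the neighbour check mentioned in the table.

```python
from typing import Dict, List, Optional

Layer = Dict[str, Optional[float]]
MIN_FINAL_THICKNESS = 10  # the "plus 10" guideline for a continuous final layer


def correct_layers(layers: List[Layer]) -> List[Layer]:
    """Apply the correction guidelines to the layers of one profile."""
    if not layers:
        return []

    # Duplicate layer: delete layers identical in all their attributes.
    fixed: List[Layer] = []
    for layer in layers:
        if layer not in fixed:
            fixed.append(dict(layer))

    # Null upper or lower limit: borrow the adjacent limit of the neighbour.
    for i, layer in enumerate(fixed):
        prev_l = fixed[i - 1] if i > 0 else None
        next_l = fixed[i + 1] if i < len(fixed) - 1 else None
        if (layer["top"] is None and layer["bottom"] is not None
                and prev_l is not None and prev_l["bottom"] is not None):
            layer["top"] = prev_l["bottom"]
        if (layer["bottom"] is None and layer["top"] is not None
                and next_l is not None and next_l["top"] is not None):
            layer["bottom"] = next_l["top"]

    # Continuous final layer: give the last layer a minimum thickness.
    last = fixed[-1]
    if last["bottom"] is None and last["top"] is not None:
        last["bottom"] = last["top"] + MIN_FINAL_THICKNESS

    # Organic layer (assumed detection rule, see lead-in).
    first = fixed[0]
    if (len(fixed) > 1
            and None not in (first["top"], first["bottom"], fixed[1]["top"])
            and first["top"] > first["bottom"] == fixed[1]["top"]):
        thickness = first["top"] - first["bottom"]
        first["top"], first["bottom"] = first["bottom"], first["top"]
        for layer in fixed[1:]:
            if layer["top"] is not None:
                layer["top"] += thickness
            if layer["bottom"] is not None:
                layer["bottom"] += thickness

    # Inverted layer: exchange limits recorded in the wrong order.
    for layer in fixed:
        if None not in (layer["top"], layer["bottom"]) and layer["top"] > layer["bottom"]:
            layer["top"], layer["bottom"] = layer["bottom"], layer["top"]
    return fixed


profile = [{"top": 5, "bottom": 0}, {"top": 0, "bottom": 20}, {"top": 20, "bottom": None}]
print([(l["top"], l["bottom"]) for l in correct_layers(profile)])
```

The order of the steps is a design choice of this sketch: null limits are filled before the organic-layer shift so that every limit of the profile is rescaled together.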

Quality assessment and improvement of SISLAC data
With the first validation, 2726 profiles were found that did not match their country. Table 9 lists these profiles at the country level. As can be seen, Bolivia has the largest number of these with 2472 profiles (90 % of the cases). After the review, it was identified that 2471 of those cases (from Bolivia) had the coordinates inverted, so their latitude and longitude values were exchanged and the validation was repeated.
With the second part of the validations, 1989 duplicate profiles were identified in SISLAC. Table 10 lists the country and the number of these profiles. The largest number is concentrated in Brazil, 1680, which is 84.5 % of the total and 21 % of the total profiles provided by that country (of 7842). As described in the previous section, the profiles with duplicity were marked: their perfil_duplicado field contains the value TRUE. In addition, profiles that already existed in SISLAC were excluded from the available databases. In Argentina, the number of these profiles was 3374 out of 6180, in Ecuador 4633 out of 13 542, and in Mexico 7274 out of 13 965.
Regarding the revision of the horizons from SISLAC, 7380 errors were found (in 7357 profiles). Table 11 details the number of these by country and type. Most were present in profiles from Mexico, Paraguay, and Brazil. Profiles with empty limits were the main error with 6831 cases. Those 7357 profiles were excluded for being inconsistent. On the other hand, in the additional data, 61 profiles from Argentina, 13 from Chile, and 67 from Ecuador were found to have overlapping horizons, and 6493 profiles from Mexico had empty limits, so they were also excluded. An additional issue arose with the data from Mexico: the SISLAC data (12 223 profiles) were the same as those of series I and II (13 965). The former had fewer attributes and an incorrect spatial location, and for that reason all the data from Mexico were replaced by the valid profiles from series I and II.
Inconsistencies are described in Table 12. Most were found in Paraguay, Argentina, and Colombia. The main causes were the null lower limit, continuous final horizon, and duplicate horizon. All of these were corrected according to the established guidelines. Although 5474 inconsistencies were found, these correspond to 2215 profiles, so there were profiles with more than one inconsistency; for example, although in Paraguay there are 4066 inconsistencies, these are present in 931 profiles, the total number of profiles from that country.
Finally, the following tables summarize the results obtained. First, Table 13 lists the countries with a change in the number of profiles. As can be seen, there was an increase in the first five countries, since the available databases correspond to these countries, while for the following countries profiles were excluded due to errors in their descriptions. In addition, Table 14 lists the sources of the data that contribute to this new version of SISLAC; as can be seen, there are almost 10 000 profiles obtained from WoSIS, and the rest are contributed by institutions in the countries of the region. To conclude, Table 15 shows the initial and final percentage of records with valid values for the soil property attributes, showing that SOC, pH, clay, silt, and sand are the attributes with the highest percentage.
From SISLAC, after the processes carried out, of the 49 084 initial profiles 15 % were excluded and another 4.5 % were corrected so that they met the minimum integrity requirements; in addition, 17 % of the profiles in the other databases already existed in SISLAC. Of the 96 783 total profiles analyzed, 32 % were excluded due to erroneous descriptions or because they already existed in the SISLAC data. The revised version consists of 66 746 profiles made up of 192 568 horizons and layers.
In the PCA, five dimensions have eigenvalues greater than 1 (Table 16). These first five dimensions explain 86.49 % of the total variance in the dataset. The first two dimensions express 52.52 % of the total variance, which means that 52.52 % of the individuals' (or variables') total cloud variability is explained by the plane formed by the first dimension (Dim 1) and the second dimension (Dim 2). The first dimension is associated with soil texture (clay and sand content) and the cation exchange capacity variables (Fig. 5a). On the other hand, the second dimension (23.79 % of variance explained) captures the variability in pH, organic carbon, and cation exchange capacity (Fig. 5a). The third dimension (Dim 3; 16.28 % of variance explained) comprises profile depth, number of profile horizons, and cation exchange capacity (Fig. 5b). The organic carbon content and pH variables represent the fourth dimension (Dim 4; 9.72 % of variance explained) (Fig. 5b).

Characterization of
The qualitative factor map shows the distance between soil groups in the first plane (Fig. 6). In the first plane, the soil groups most different from others are Arenosols, Calcisols, Solonetz, Acrisols, and Lixisols (Fig. 6a). The Arenosols are those most correlated with Dim 1, representing the soil's texture and cation exchange capacity. Meanwhile, Calcisols, Lixisols, Solonetz, and Acrisols are most correlated with Dim 2, which represents the soil's pH and organic carbon. On the other hand, in the second plane, the soil groups most different from others are Solonetz, Calcisols, Phaeozems, Lixisols, Ferralsols, and Gleysols (Fig. 6b). The Solonetz, Phaeozems, Ferralsols, and Lixisols correlate most with Dim 3, which represents profile attributes such as profile depth and number of horizons; meanwhile, Solonetz, Calcisols, and Arenosols are most correlated with Dim 4, which represents principally organic carbon content and the pH of the soil.

Discussion
This work made it possible to determine that the main problems in the SISLAC profiles occur systematically in some countries.In addition, we were able to incorporate new data to improve this database and make a greater number of soil profiles of the region available to the soil community.

Quality assessment and improvement of SISLAC data
As shown in Table 11, the most frequent error in the profiles was due to empty limits, which occurred mainly in Mexico and Paraguay with 67 % and 25 % of the total errors, respectively. In Mexico, these errors correspond to 40 % of the profiles provided, while in Paraguay they correspond to 65 %. On the other hand, most of the inconsistencies (Table 12) are found in Argentina, Paraguay, and Colombia with 44 %, 42 %, and 12 % of the total, respectively. Although all these inconsistencies were corrected, it is observed that, for example, of the total profiles provided by Paraguay (2830), only nine contain SOC values, whereas all the rest have the attribute empty. This represents a limitation for any type of analysis with these data. The validations were defined by expert judgment; they coincide with those described in the works of Batjes (1995) and Leenaars (2013) and were applied to all the elements. For the horizons, it was guaranteed that they were correctly described since, as these authors indicate, if they are not adequately described, in-depth analyses cannot be carried out because the analysis tools may fail or a high degree of uncertainty may be generated.
In the profiles of the available databases, the data had a correct description of the profiles, so most of them are incorporated into SISLAC.In these, the main attributes available were SOC, pH, clay, silt, and sand.With these data, an increase in the database of more than 50 % was achieved, since the revised SISLAC database had just over 42 000 records and the number of soil profiles in the new version exceeds 66 000 from the entire Latin American region.

Latin American and Caribbean (LAC) soils
A principal component analysis (PCA) considering the profile attributes and soil variables with the highest number of records (SOC, pH, ECEC, and clay and sand content; number of horizons; and profile depth) was carried out to characterize the new SISLAC database. A way to validate the database information was to relate those profile attributes and soil variables with a soil classification. In the database, just 37 % of the soil profiles have a taxonomic classification, 26 % based on the USDA (profiles in Argentina, Colombia, and Ecuador, principally) taxonomic classification system. Therefore, it was necessary to identify the most probable soil group from a unified global source (SoilGrids 2.0) for the 27 960 soil profiles with complete records for the soil variables included in the PCA. Even if it is not a field-based taxonomic classification of each soil profile, the SoilGrids product represents the global tendency of the world soils (Poggio et al., 2021). Some soil groups are separated from others and strongly correlated with Dim 1 or Dim 2 according to soil variables. As expected, soil groups characterized by the variables included in the PCA are those most differentiated in the analysis. Soil groups characterized by textural attributes such as Arenosols (high content of sand) are strongly correlated with Dim 1, which represents the sand and clay content of the mineral soil. Meanwhile, soil groups characterized by accumulation of salts such as Calcisols (high content of calcium) or Solonetz (high content of exchangeable sodium) are correlated with Dim 2 due to the effect of salts on the pH of the soil; similarly, those soil groups with an accumulation of organic matter such as Phaeozems (dark superficial layers) are also mostly correlated with Dim 2, which represents organic matter characteristics too.
On the other hand, the most represented soil groups in the new SISLAC database that are not characterized by the variables included in the PCA are not differentiated from other soil groups. Cambisols (which are identified by evidence of pedogenetic alteration but not stronger alteration or accumulation processes), Andosols (which are identified by their relationship between Fe and Al, bulk density, and phosphate retention), and Ferralsols (which are identified by Fe or Mn accumulation in the soil profile) are the soil groups most represented in the database (57 % of the total soil profiles). These soil groups appear at the central portion of the factor maps in the PCA and do not show a specific correlation with dimensions.
The PCA showed the relation between soil variables in the new SISLAC database and soil groups (from a different source), making this new database's value and potential use evident. However, it is essential to highlight that this PCA was made with 42 % (27 960) of the total soil profiles in the new SISLAC database (66 746). This analysis does not represent regions with few complete data such as Central America (Guatemala, Honduras, Nicaragua, Cuba, and Dominican Republic among others) and South America (Chile, Peru, Bolivia, Paraguay, and south of Venezuela and Brazil).

Limitations and future directions
A factor not considered in this work was the validation of the attributes of the horizon properties in a simple or combined way to identify outliers, for example, using Tukey's rule (Pham et al., 2019) or checking whether they are out of range (pH values less than 0 or greater than 14). This omission was due to the fact that a large part of the horizons did not have assigned values. As shown in Table 15, only four attributes (SOC, pH, clay, and sand) exceed 65 % of records with values, while another two (silt and effective cation exchange capacity) have just over 50 % of data with values. The other attributes do not exceed 20 %, and there are even three properties with less than 6 % of data with values, which are inorganic carbon, coarse fragments, and water retention.
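As an illustration of the two checks mentioned above, which were not applied in this work, the following sketch flags values outside Tukey's fences and pH values outside the 0 to 14 range. The function names and the sample values are hypothetical, and the quartiles are computed with the default method of Python's statistics.quantiles, which is one of several common conventions.

```python
from statistics import quantiles
from typing import List


def tukey_outliers(values: List[float], k: float = 1.5) -> List[float]:
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def out_of_range_ph(values: List[float]) -> List[float]:
    """Return pH values that are physically implausible (< 0 or > 14)."""
    return [v for v in values if v < 0 or v > 14]


# Hypothetical pH records for the horizons of a few profiles.
ph_values = [5.1, 5.4, 5.8, 6.0, 6.2, 6.5, 6.9, 15.2]
print(tukey_outliers(ph_values))
print(out_of_range_ph(ph_values))
```

Such checks only become informative once an attribute has enough records with values, which is the limitation noted above.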
A possible reason why the profiles have been provided incomplete may be the one mentioned by Arrouays et al. (2017) or Rossiter (2004) about how privacy or data ownership policies, in addition to institutional, legal, and cultural factors, prevent data from being fully shared.Breaking down those barriers would allow that data to be used by a larger number of global users.
Given the importance of these databases, new efforts are needed to collect data from other sources, such as research centers or universities, in order to strengthen this database or other databases. This revised version of SISLAC data offers the potential to generate information that helps decision-making on issues in which soils are decisive. It can also be used to plan future soil surveys in areas with low density or where updated information is required. Another possible use of these data may be to improve existing information (in scale and depth), such as the Global Soil Organic Carbon Map (FAO and ITPS, 2018), or to generate new information such as that presented by Gutierrez et al. (2020) using SISLAC data.
In summary, from the total dataset, 38 % of profiles were excluded and another 4.5 % were corrected, and from the available databases nearly 24 000 soil profiles were incorporated. This work tried to exclude as few profiles as possible given their importance in areas with low spatial density. Furthermore, as mentioned by Hengl and Macmillan (2019), these data are the only thing available at this time in many places, so their availability is important. Knowing the level of integrity of the data, what the main problems are, and where they occur can help the countries involved know where to put more effort to have more reliable data. In that sense, this work may contribute to supporting soil conservation efforts, increasing food and water security, maintaining healthy ecosystems, and reducing climate change's impact. The source code used for data processing is also available in the same repository.

Conclusions
This work was successful in improving the SISLAC database, thus generating a revised version of the database in which all the soil profiles have high quality and completeness to be efficiently used in multiple applications (e.g., digital soil carbon mapping and reporting). In the revised SISLAC database, 15 % of soil profiles were excluded (e.g., horizon information duplicated or overlapped) and 4.5 % of the soil profiles were adjusted to the same data structure. With the available soil databases, it was possible to increase the database by more than 50 %; initially the valid SISLAC profiles were around 41 000, so the additional profiles represent more than 25 000 records. SISLAC is a product of the cooperation of national institutions of the countries of the region. Investing efforts in the collection of additional data, for example, those produced in universities or research centers, could lead to an increase in the volume of the revised version of SISLAC (as new and better data become available), and these in turn may allow for a new generation of spatial information on soil properties to improve what is currently available.
Review statement. This paper was edited by Dalei Hao and reviewed by Jinshi Jian and José Lucas Safanelli.

Figure 1. (a) SISLAC interface; each number in the orange circles indicates the number of profiles in that area (from the SISLAC web page). (b) Vertical variability in the percentage of organic carbon in Latin America.

Figure 2. Flowchart of this research. The blue box shows the validation processes applied to 100 % of the data.

Figure 3. Example of criteria found in spatial validation (the source for panels a, b, and d is ESRI, 2022; the source for panel c is the SISLAC web page).

Table 8. Illustration of inconsistencies and their correction guidelines. In the second column, the layers with inconsistencies are shown in bold; in the third column, also in bold, it is shown how to correct them using the established guidelines. In the first case all profile limits are modified and for the rest only those of the layer with the inconsistency.
After validation, the correct location of the profiles with inverted coordinates was verified, and they were considered valid. A total of 36 profiles (1.3 % of those reviewed) were excluded for having an erroneous location, as presented in Fig. 3d: 3 from Colombia and 33 from Paraguay. A total of 96 747 profiles (of the initial 96 783 considering SISLAC and the additional databases) passed the second validation.

Figure 4. Frequency of soil profiles by soil group according to the World Reference Base (WRB).

Figure 5. Map of variables for the first plane. The quantitative variables are pH, effective cation exchange capacity (ECEC), organic carbon (OC), clay and sand content, the number of horizons (num_horizons), and profile depth (depth) (Dim: dimension).

Figure 6. Quality factor map with soil group according to the World Reference Base (WRB). (a) The first plane and (b) the second plane of the principal component analysis (Dim: dimension).

Financial support. This research has been supported by the UNESCO International Geoscience Programme (grant no. 765) and the Consejo Nacional de Ciencia y Tecnología's (CONACyT) Ciencia de Frontera 2023 scholarship (grant no. CF-2023-I-1846).
Table 5 lists the SISLAC attributes found in those databases that are added to this one. As can be seen, SOC is the common attribute in all of the databases, followed by clay, silt, sand, and pH.

Table 1. Initial profiles and their layers by country. The countries are ordered by the number of profiles, and those with fewer than 100 profiles were grouped together (n/a: not applicable).

Table 2. Profile attributes and attributes related to the site description.

Table 3. Layer attributes. The measured attributes are numerical attributes (excluding the top and bottom, which are the limits of each layer); for each measured attribute, the last column gives the percentage of records with valid data (n/a: not applicable).

Table 4. List of databases available for incorporation into the new version of SISLAC.

Table 7 ,
Table 8 below illustrates the described inconsistency (middle column) and how it was corrected (third column).
function evaluation, soil process recognition, and soil impact identification in the ecosystems. Finally, in this research, we present a brief characterization of Latin American and Caribbean (LAC) soils through a principal component analysis (PCA).

Earth Syst. Sci. Data, 16, 1229-1246, 2024, https://doi.org/10.5194/essd-16-1229-2024. S. Diaz-Guadarrama et al.: Improving the Latin America and Caribbean Soil Information System database.

Table 5. SISLAC physical and chemical property attributes available in the databases. The attribute in common is soil organic carbon (SOC). The databases of Argentina, Ecuador, and Mexico have the most attributes in common (Y: yes and N: no).

Table 9. Spatial validation results by country, sorted in descending order of the number of inconsistencies (second column); the third column indicates how many profiles were excluded, and the fourth column indicates how many were considered valid after being reviewed one by one.

Table 10. Profiles from SISLAC with spatial duplication by country.

Table 11. Layer error validation; only countries with errors are listed. The number of profiles with errors may be smaller than the number of errors per country because one profile may have more than one type of error.

Table 12. Layer inconsistency validation; the validation for a null bottom limit was the only one that found no records with the inconsistency.

Table 13. Details of the SISLAC data validation processes; total numbers of layers are in parentheses. Errors caused the profile to be excluded, whereas inconsistencies were corrected.

Table 14. Number of profiles contributed by each data source to the new version of SISLAC (NA: not available).

Table 15. Percentage of valid records for soil properties, showing that SOC is the attribute with the highest number of valid records followed by pH, clay, silt, and sand.

Table 16. Decomposition of the total inertia obtained from the principal component analysis based on profile characteristics of 28 460 sites of the new version of the SISLAC database (Dim: dimension).
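For readers unfamiliar with how a decomposition of total inertia such as the one in Table 16 is obtained, the following is a minimal sketch and not the authors' code. It assumes a normalized PCA (a PCA on the correlation matrix, so each variable contributes one unit of inertia); the paper does not state this here. The data are synthetic stand-ins, the column names are borrowed from the Figure 5 caption, and the function name `inertia_decomposition` is hypothetical.

```python
import numpy as np


def inertia_decomposition(X):
    """Eigenvalues, percentage of inertia, and cumulative percentage for a
    normalized PCA of X (rows: sites, columns: quantitative variables).

    Assumption: variables are standardized, so total inertia equals the
    number of variables.
    """
    X = np.asarray(X, dtype=float)
    # Standardize each column (population standard deviation, ddof=0).
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Correlation matrix of the variables.
    R = (Z.T @ Z) / Z.shape[0]
    # eigvalsh returns eigenvalues of a symmetric matrix in ascending order,
    # so reverse them to get Dim 1, Dim 2, ... in decreasing importance.
    eigenvalues = np.linalg.eigvalsh(R)[::-1]
    percent = 100.0 * eigenvalues / eigenvalues.sum()
    return eigenvalues, percent, np.cumsum(percent)


# Synthetic stand-in data; these are NOT SISLAC values.
rng = np.random.default_rng(0)
variables = ["pH", "ECEC", "OC", "clay", "sand", "num_horizons", "depth"]
X = rng.normal(size=(200, len(variables)))
# Make the synthetic sand column negatively related to clay, for illustration.
X[:, 4] = -X[:, 3] + 0.3 * rng.normal(size=200)

eigenvalues, percent, cumulative = inertia_decomposition(X)
for dim, (ev, p, c) in enumerate(zip(eigenvalues, percent, cumulative), start=1):
    print(f"Dim {dim}: eigenvalue={ev:.3f}  inertia={p:.1f} %  cumulative={c:.1f} %")
```

Under this assumption the eigenvalues sum to the number of variables, so each percentage is the share of total inertia carried by one dimension, and the first plane corresponds to the sum of the first two rows.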