Harmonized in situ JECAM datasets for agricultural land use mapping and monitoring in tropical countries

The availability of crop type reference datasets for satellite image classification is very limited for complex agricultural systems such as those observed in developing and emerging countries. Indeed, agricultural land use is very dynamic, agricultural censuses are often poorly georeferenced, and crop types are difficult to photo-interpret directly from satellite imagery. In this paper, we present nine datasets collected in a standardized manner between 2013 and 2020 in seven tropical and subtropical countries within the framework of the international JECAM (Joint Experiment for Crop Assessment and Monitoring) initiative. These quality-controlled datasets are distinguished by in situ data collected at field scale by local experts, with precise geographic coordinates, and following a common protocol. Altogether, the datasets comprise 27 074 polygons (20 257 crop and 6 817 non-crop) documented by detailed keywords. These datasets can be used to produce and validate agricultural land use maps in the tropics, and also to assess the performance and robustness of classification methods for cropland and crop types/practices across a large range of tropical farming systems. The dataset is available at https://doi.org/10.18167/DVN1/P7OLAP.
https://doi.org/10.5194/essd-2021-125. Earth System Science Data Discussions (Open Access). Preprint. Discussion started: 25 May 2021. © Author(s) 2021. CC BY 4.0 License.


Introduction
Land use and land cover (LULC), and their changes, are key information for studying and monitoring carbon and water cycles and threats to biodiversity, and also for setting up land use planning and public policies. In particular, accurate mapping of cropland and associated cropping practices is of primary importance for food security, agricultural and environmental monitoring, and land management. However, cropland and crop type mapping using Earth observation data is still challenging because it requires large sets of training and validation data, and because land use (field limits and content) generally changes annually, or even seasonally. While large datasets on cropping practices are available in the Global North, this is not the case in most developing and emerging countries. In these countries, cropland and crop types can be particularly difficult to map (Waldner et al., 2015) because the fields are often small to medium in size (Fritz et al., 2015), the crops are easily confused with natural vegetation and fallows, and cropping systems are typically highly variable in time and space. Each farming system has its own specificities in terms of crop type and composition, field size, cropping calendar, irrigated/rainfed mode and other practices. It is thus necessary to adapt the classification approaches (satellite data and algorithms as well as training and validation in situ data) to the large variability of the farming systems in the world (Dixon et al., 2001), and thus to have access to appropriate training data.
The arrival of Sentinel-1 and Sentinel-2 satellite image time series, the emergence of new classification algorithms in the domain of machine learning and artificial intelligence, and easy access to pre-processed images and image processing tools on web platforms have democratized image processing and opened up new avenues for LULC mapping over large areas. Following this trend, large benchmark datasets acquired all over the world using satellite image annotation tools have multiplied to train algorithms and validate remote sensing-derived products (Long et al., 2020). However, these datasets have a broad LULC nomenclature, and agricultural land use is often reduced to a single class due to difficulties in discriminating cropping practices from satellite images. The main data sources currently available for agricultural land use mapping in the Southern countries are listed below.
At global and continental scales, there are initiatives that freely distribute land cover reference datasets (see the review by Tsendbazar et al. (2015)), such as GOFC-GOLD (Global Observation for Forest and Land Cover Dynamics; http://www.gofcgold.wur.nl/sites/gofcgold_refdataportal.php), which brings together existing global datasets from before 2015, or the LULC reference dataset (150 000 samples; 300 m to 1 km resolution) (Fritz et al., 2017) and the cropland dataset (36 000 points; 300 m resolution) (Laso Bayas et al., 2017b), both collected through crowdsourcing campaigns using the Geo-Wiki tool (photo-interpretation of very high spatial resolution satellite images). Lately, the LandCoverNet dataset has been released for the African continent (Alemohammad et al., 2020) with 130 million labelled 20 m pixels in 1 980 image chips (256 x 256 pixels), spanning 66 tiles of Sentinel-2 acquired in 2018. These data are used to validate global (Hoskins et al., 2016) [...] compare results based on disparate sources of data, using various methods, over a variety of local or regional cropping systems.
Data are acquired following a given protocol and nomenclature (see Defourny et al. (2014)). The experiment has been operating since 2013, and some in situ datasets produced at field scale have been used in different benchmarking mapping studies (Waldner et al., 2016; Inglada et al., 2015).
The aim of this data paper is to share with the community harmonized in situ agricultural land use datasets, mostly acquired within the JECAM initiative at local scale and focusing on emerging/developing countries in the tropics. These datasets include data collected on nine sites in seven countries of the tropical belt (Figure 1) with various farming systems (Figure 2; Table 1). The acquisition protocol has been adapted from Defourny et al. (2014) to take into account the characteristics of tropical agriculture (e.g. small field size, accessibility). Information on crop type and cropping practices was collected locally, at the field level, with a detailed nomenclature. The acquisition period is between 2013 and 2020, and the number of monitoring years per site is between 1 and 7.

Study sites
Except for Cambodia, the study sites belong to the JECAM network (http://www.jecam.org/) and cover several hundred square kilometres each. Table 1 provides a synthesis of the database by site.
The JECAM Burkina Faso study site is a 60 km x 60 km area located around the commune of Koumbia, Tuy province, in the South-West of the country. The climate is tropical. The absence of significant relief and the relatively good soil and climate conditions have favoured the densification of cropped surfaces, which span the majority of the area: arable lands cover more than 60% of the site, the remaining surface being either unsuitable for cultivation (e.g. rocky) or protected for nature conservation. The landscape is characterized by an alternation of large cropland areas, made up of a patchwork of diversified small cropped fields (about 1 ha), and areas covered by natural vegetation. With the exception of a few lowland rice plots, all crops are rainfed, and hence cultivated during the rainy season, which lasts from May to October (around 1000 mm average annual rainfall). Main crops are more or less equally distributed between cash crops (mainly cotton) and staple crops, with a significant predominance of cereal crops (maize, sorghum, millet, and locally rice) over oilseed crops (sesame, groundnuts) and legumes (peas/cowpeas, soybeans).
The JECAM Madagascar study site is a 60 km x 60 km zone located in the Vakinankaratra region, around the city of Antsirabe, in the central highlands of the country. It is characterized by mountainous terrain, from 1200 to 1500 m in altitude, with terraced rice-growing valleys positioned between grassy hills and rocky outcrops. The climate is subtropical with a rainy season from December to February. The average annual precipitation is 1300 mm. The growing season lasts from October to June.
Cultivated crops are diversified, although maize and rice predominate. Fruit production is also present in the area. The mean size of an agricultural field in the area is very small (about 0.05 ha), but contiguous fields of the same crop type occasionally give rise to larger single-crop patches. Rice is mainly grown in irrigated areas, but has recently mingled with other rainfed crops on slopes (called tanetys). Other main crops are carrots, potatoes, sweet potatoes, soybeans and cassava.
The JECAM São Paulo site in Brazil is a large area of 90 km x 130 km located in São Paulo State, close to the city of Botucatu. It has a relatively smooth relief with slopes mostly <5%. The climate of the region is classified as humid subtropical with a dry winter. Average temperature is 19°C and average annual precipitation is 1400 mm, with a rainy season from December to March. The area is diversified and can be divided into four main agricultural sub-regions: (1) in the South-West, annual crops (maize, wheat, soybean) including summer crops (growth cycle from October to May) and winter crops (June to September), some of them irrigated with centre pivot systems; (2) in the Centre, forest plantations for wood production; (3) in the East, pastures; and (4) in the North, sugarcane, which has variable planting and harvesting dates: the first sugarcane cycle starts between September and March and lasts around 12-18 months. In this region, sugarcane reaches maximal growth in April.
After the first harvest, the cycle of the ratoon sugarcane starts, with the annual cut between April and December. Natural forests, [...]
The JECAM Kenya study site is a 25 km x 10 km area located about 50 km north of Nairobi, including the towns of Kangema and Muranga, in the central province of Kenya. It is set in a very hilly landscape with steep slopes and strong local relief variations, in a general toposequence following an East-West altitude gradient from 1000 m to 2800 m. The climate is wet tropical, somewhat tempered by the altitude and regulated by two rainy seasons (from March to May or June, and from October to November), with 1200 to 2000 mm annual rainfall depending on the altitude. The permanent moisture and good natural drainage of a rich volcanic loam allow for intensive agriculture, mainly based on perennial crops (mostly banana, various fruits, coffee, and tea) associated with dairy farming and rainfed horticultural as well as food crops (e.g. French beans, cabbage, maize, cassava). The latter are cropped all year long, except in January and July, which are dry months, and without a defined seasonal calendar (maize, for instance, can have three cycles per year). The mean size of an agricultural field in the area is very small (about 0.08 ha), resulting in a patchwork landscape of heterogeneous fields with a great diversity of structures.
The Cambodia study site corresponds to a 30 km radius buffer area around Wat Pi Chey Saa Kor, Kom Poung Kor village, Kandal Province, where the ecology of the fruit bat Pteropus lylei was recently investigated (Choden et al., 2019). The area is characterized by a tropical climate with a rainy season from May to October. The annual rainfall is between 1000 and 1500 mm. Two main rivers, the Mekong and the Bassac, cross the area. In this flat region, rice is the dominant crop, mainly grown in irrigated areas from May to October. Fruit plantations (mango, sapodilla) and natural wetlands are also present. The mean field size is small (around 1 ha). The population lives in villages along the roads, composed of small houses with backyards planted with fruit trees.
The JECAM South Africa study area is a 60 km x 60 km site located in the Mpumalanga province in the North-East of the country, close to the Mozambique border, and corresponds mostly to a subsistence agriculture area. The climate is subtropical with a rainy season from November to February. The annual rainfall is between 600 and 800 mm. The site is characterized by a bush-clad plain between the Drakensberg Mountains (West) and savannahs (East), with several wildlife reserves (e.g. Kruger Park). The study area is characterized by smallholder agriculture (fields generally smaller than 1 ha) with diversified crops: cereals, groundnuts, potatoes, vegetables, fruit crops. Important timber plantations are present in the western part of the site.

Data collection
The acquisition protocol is based on the JECAM guidelines (Defourny et al., 2014), with adaptations to account for some characteristics of tropical agriculture (mainly small field size and accessibility). Field surveys were conducted yearly in each study zone, either around the growing peak of the cropping season, for the sites with a main growing season linked to the rainy season such as Burkina Faso, or seasonally, for the sites with multiple cropping (e.g. the São Paulo site). Except for Senegal, where a stratified sampling plan for field surveys was used (Ndao et al., 2021), the GPS waypoints were gathered following an opportunistic sampling approach (called the "windshield survey") along the roads or tracks according to their accessibility (which can be difficult during the rainy season, leading to fewer surveys on secondary roads or tracks in some study areas), while ensuring the best representativity of the cropping systems in place (Defourny et al., 2014; Waldner et al., 2019). [...] Pleiades or SPOT 6/7 images ordered just before the surveys, or PlanetScope images). This equipment allowed in situ recording of attributes relative to each waypoint on data entry forms (with automatic filling of IDs or dates, and scrollable lists for other attributes to avoid data entry errors). For each waypoint, a set of attributes corresponding to the cropping practices (crop type, cropping pattern, management techniques) was recorded. An attribute referred to as "Keywords" was also created in order to associate various generic terms (land cover, crop group, crop type, cropping practice, etc.) with each polygon. This attribute has two objectives: (i) facilitating keyword search for the user, and (ii) allowing users to create their own nomenclature (hierarchic or not) with different levels of detail, so that the nomenclature can be tailored to their needs.
These terms are based on the FAO land use definitions (FAO, 2020) and JECAM hierarchic nomenclature (Defourny et al., 2014), which were adapted to take into account the diversity of the farming systems in the surveyed sites. All these attributes are described in Table 2.
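As an illustration of the second objective of the "Keywords" attribute, the following minimal sketch shows how a user might derive a two-level nomenclature from keyword strings. The keyword values, the comma separator and the mapping rules are assumptions made for the example only; the actual vocabulary is the one documented in Table 2 and in the dataset itself.

```python
# Sketch only: the keyword strings and the comma separator are assumptions,
# not the published vocabulary (see Table 2 and the dataset for actual values).
from typing import List, Tuple

# Hypothetical user-defined rules, checked in order: keyword -> (level 1, level 2).
RULES: List[Tuple[str, Tuple[str, str]]] = [
    ("maize", ("cropland", "cereals")),
    ("rice", ("cropland", "cereals")),
    ("cotton", ("cropland", "cash crops")),
    ("fallow", ("cropland", "fallow")),
    ("water body", ("non-crop", "water")),
]


def parse_keywords(raw: str) -> List[str]:
    """Split a raw Keywords string into normalized (lower-case, trimmed) terms."""
    return [term.strip().lower() for term in raw.split(",") if term.strip()]


def classify(raw: str) -> Tuple[str, str]:
    """Return the (level 1, level 2) class of the first rule whose keyword is present."""
    terms = set(parse_keywords(raw))
    for keyword, label in RULES:
        if keyword in terms:
            return label
    return ("unclassified", "unclassified")


if __name__ == "__main__":
    for raw in ("Cropland, Cereals, Maize, Rainfed", "Non-crop, Water body", "Built-up"):
        print(raw, "->", classify(raw))
```

Because the rules are an ordered list, a user who needs a finer or coarser legend only has to edit the mapping, not the records themselves.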
In the specific case of the Burkina Faso, Senegal-Niakhar and Brazil-São Paulo sites, the same fields were revisited each year in order to study crop rotations and fallow practices in the region. For the South Africa site, some points were collected by helicopter using the Producer Independent Crop Estimates System (PICES; Fourie, 2009) method developed by the National Crop Statistics Consortium. Flights were performed at an average altitude of 500 feet and at a low flying speed, making it possible to record GPS points and to determine land use using a GPS tablet associated with a GIS interface and a recent VHSR image. Only clearly identifiable land covers have been kept in the database.

Post-processing
Once the waypoints were acquired, the boundaries of each field or non-crop entity were digitized on the VHSR images in the QGIS software, and the class labels (and other attributes, see Table 2) were attached to the polygon database. Additional non-crop polygons were added by CAPI (Computer Assisted Photo Interpretation) of the VHSR images for the built-up areas, water bodies, wetlands, mineral surfaces, and natural forest classes (land covers clearly identifiable on images). To avoid digitizing errors, this step was performed by the same operator who did the field surveys. If there was nevertheless doubt about the delineation of a given entity (e.g. fuzzy boundaries, high heterogeneity), the entity was removed from the database. Finally, the topology of each entity was controlled externally.

Data Records
This database, which contains 27 197 records, is a geographic layer in Shapefile format. Each record corresponds to a polygon with 16 attributes (Table 2).

Table 2. Description of the attributes recorded for each polygon of the database.
* For each field in the Tocantins site, the operator recorded the crop type of the 2 seasons (summer / winter) by observing the crop residues on the field or by interviewing the farmers. Consequently, the acquisition date of those polygons does not always correspond to the actual land cover of the field. The user must refer to the SOS and EOS dates to identify the season corresponding to the crop type recorded.
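The Tocantins footnote implies that a crop label should be paired with a satellite image only if the image date falls within the season of that label, rather than relying on the acquisition date. A minimal sketch of this check is given below; the field names ("crop", "acq_date", "sos", "eos") and all values are hypothetical, and the actual attribute names are those listed in Table 2.

```python
# Sketch only: field names ("crop", "acq_date", "sos", "eos") and all values
# are hypothetical; the actual attribute names are those listed in Table 2.
from datetime import date
from typing import Dict, List


def in_season(record: Dict, image_date: date) -> bool:
    """True if image_date falls between the record's start (SOS) and end (EOS) of season."""
    return record["sos"] <= image_date <= record["eos"]


def labels_for_image(records: List[Dict], image_date: date) -> List[str]:
    """Crop labels that are valid for an image acquired on image_date."""
    return [r["crop"] for r in records if in_season(r, image_date)]


# Two records for the same hypothetical field, both entered during one visit.
field_records = [
    {"crop": "soybean", "acq_date": date(2018, 6, 10),
     "sos": date(2017, 10, 15), "eos": date(2018, 2, 20)},
    {"crop": "maize", "acq_date": date(2018, 6, 10),
     "sos": date(2018, 3, 1), "eos": date(2018, 7, 15)},
]

if __name__ == "__main__":
    print(labels_for_image(field_records, date(2017, 12, 1)))
    print(labels_for_image(field_records, date(2018, 6, 10)))
```

In this invented example, the soybean record was entered in June 2018 but describes the preceding summer season, so it is selected for a December 2017 image and not for an image taken on the acquisition date.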

Quality Checking
Due to the nature of the dataset (in situ observation), independent validation is not possible. However, quality control was performed all along the data chain, from acquisition to post-processing, to ensure the quality of the datasets and their homogeneity across sampled years and locations.
First, the acquisition protocol was described in a technical guide provided to the field teams so that nothing was forgotten during the campaigns. The dropdown lists in the data entry form reduced input and post-processing errors. The surveys were carried out by agronomists with geoprocessing skills, accompanied by a national researcher or technician with expertise in the local farming systems.
Second, during the post-processing step, the orthorectification of the VHSR images used to digitize the fields was checked from one year to the next, for multi-year sites, and corrected if necessary by taking homologous points. The fields were then manually digitized on the VHSR images, and the photos taken in situ were used whenever necessary. Doubtful data were discarded and removed from the dataset.
Finally, the fact that the same person performed the whole acquisition and processing chain, from waypoint collection to polygon labelling, minimizes errors and contributes to the overall quality of the datasets.

Representativeness of data sets
Because of their small size, these sites cannot be considered representative of the entire country in which they are located; however, they are claimed to be representative of an area that extends beyond the JECAM site. In order to specify the extent of this representative area, we referred to existing zoning maps. We used the two reference maps available for Southern countries: the FEWS-NET livelihood zones map (https://fews.net/fews-data/335) and the FAO farming systems map (http://www.fao.org/farmingsystems/mapstheme_01_en.htm). The livelihood zones are produced at national scale and are available for 38 developing countries. The zones are defined as geographical areas within which people share broadly the same patterns of livelihood (i.e., broadly the same production system, the same income-earning opportunities and patterns of trade) (see Grillo and Holt (2009) for more details). Farming system maps are available for the Global South (covering 130 countries). The classes are defined as a population of individual farm systems that have broadly similar resource bases, enterprise patterns, household livelihoods and constraints (Dixon et al., 2001; Auricht et al., 2014).
Although these two maps were not produced for the same purposes, they are derived using similar criteria (agro-climatology, elevation, landscape, dominant pattern of farm activities, etc.) that are closely related to the agricultural land use as recorded in the database. Table 3 gives, for both maps, the type and extent of the zones in which our JECAM study sites are located. Unfortunately, the livelihood maps are available for only four of the JECAM countries presented here.

Table 3. Livelihood zone (FEWS-NET; livelihood type, year of production, extent in km²) and farming system (FAO) in which each JECAM study site is located.

With a mean zone size of around 20 000 km² (Table 3), we are reasonably confident that our JECAM sites are representative of the livelihood zone they belong to. The datasets presented here can thus be used to train or validate land cover maps of the corresponding zones. The farming system zones are much larger (between 300 000 km² and 2 Mkm²) and include a larger diversity of environmental and farming conditions; under these conditions it cannot be argued that the JECAM sites are representative of such large areas. The JECAM datasets therefore need to be complemented with other datasets belonging to the same farming system class before being used for training land cover classification algorithms. However, they can still be used for algorithm / product validation or comparison.
It is also important to mention that other agro-ecological zonings (AEZ) can be used (even if only a few are directly related to agricultural land use), or that each user can produce their own AEZ and use it to delineate the area in which the JECAM dataset can be used to train classification algorithms.
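To give an order of magnitude for the difference between the two zonings, the sketch below computes the share of a zone's area covered by a site. It assumes a 60 km x 60 km site footprint, which is the size reported for some of the sites only, and uses the zone sizes quoted above. Areal coverage is only a crude indicator: the representativeness argument rests on the homogeneity of the zone, not on the area ratio alone.

```python
# Sketch only: 60 km x 60 km is the footprint reported for some sites;
# the zone areas are the orders of magnitude quoted in the text.
SITE_KM2 = 60 * 60

ZONES_KM2 = {
    "livelihood zone (mean)": 20_000,
    "farming system zone (smallest)": 300_000,
    "farming system zone (largest)": 2_000_000,
}


def coverage_percent(site_km2: float, zone_km2: float) -> float:
    """Share of a zone's area covered by the site, in percent."""
    if zone_km2 <= 0:
        raise ValueError("zone area must be positive")
    return 100.0 * site_km2 / zone_km2


if __name__ == "__main__":
    for name, area in ZONES_KM2.items():
        print(f"{name}: {coverage_percent(SITE_KM2, area):.2f} %")
```

Under these assumptions, a site covers about 18 % of an average livelihood zone, but only between 0.18 % and 1.2 % of a farming system zone.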

Validation through study cases
In addition, the in situ JECAM dataset and its derived land use/land cover products have been used in a wide spectrum of studies covering several aspects of agricultural monitoring, which attests to the good quality of the dataset and to its good spatial representativeness of farming systems in tropical countries.
First, site-specific studies have been conducted to test several methodological aspects. For instance, land use maps combining a supervised object-based approach with multi-source high spatial resolution time series were developed in Madagascar (Lebourgeois et al., 2017) and in Brazil (de Oliveira Santos et al., 2019). The Brazilian site (São Paulo) was also included in a broader study presenting an inter-comparison of several cropland mapping methodologies over 5 JECAM sites (Brazil, Ukraine, Russia, Argentina and China) that contrast in terms of growing conditions, characteristics and cropping practices (Waldner et al., 2016). Very recently, following the rapid dissemination of up-to-date artificial intelligence approaches, Gbodjo et al. (2020) and Ienco et al. (2020) tested the potential of deep learning architectures for land cover mapping in Senegal (Niakhar) and Burkina Faso, respectively.
Second, in situ data from the Burkina Faso and Madagascar sites have been included as test sites in the Sen2-Agri system. The Sen2-Agri system is an operational processing system that provides several agricultural products from Sentinel-2 and Landsat-8 time series along the cropping season. The two sites have been included in preliminary studies preparing the Sen2-Agri system processing chain (Bontemps et al., 2015; Valero et al., 2016), while the Madagascar site was considered later in the demonstration phase of the system at local scale (http://www.esa-sen2agri.org/systemdemonstration/).
Lastly, the different in situ data and the derived products have been used in studies covering different aspects of agricultural monitoring. For instance, a semi-automated clustering approach has been proposed for cropping system mapping over the Tocantins region in Brazil (Bellón et al., 2018). Using the land use maps derived from the Burkina Faso site and the Senegal site (Niakhar), remote sensing-based statistical crop yield models have been proposed for maize (Leroux et al., 2019) and pearl millet (Leroux et al., 2020b). Based on the land use maps derived from the Niakhar and Nioro sites in Senegal, Ndao et al. (2021) proposed an approach to characterize agricultural landscape heterogeneity in agroforestry parklands, which was then used to analyse how far agricultural landscape diversity contributes to household food security (Leroux et al., 2020a).

Data availability
The dataset is ready for use on any GIS software, and can be filtered by region, year or key words. It is distributed with a CC-

Conclusion
Accurate mapping of cropland and associated cropping practices in smallholder farming systems of tropical countries is crucial for the improvement of agricultural monitoring systems at local and/or global scales. The essential prerequisite for reaching such objectives is the availability of in situ datasets representative of the diverse agricultural practices in tropical countries. This paper presented a harmonized in situ crop type dataset acquired between 2013 and 2020 over nine sites spread over seven tropical countries. This dataset, collected in the framework of the JECAM initiative, is unique and very valuable because it is produced at the field scale, based on in situ observation, quality-controlled, and standardized across various tropical cropping systems, including smallholder farming systems. These characteristics allow this dataset to be used as a benchmark to assess the performance and robustness of newly developed classification algorithms for cropland and crop type/practices mapping in diverse and documented agricultural conditions. In addition, this dataset can also be used to validate the cropland class of existing global or national LULC products, in particular those recently produced with Sentinel/Landsat image time series, as well as some crop type and practice classes (fallow, double cropping). Ultimately, it should become part of public online platforms for sharing datasets and algorithms, as promoted by the JECAM network and by Long et al. (2020), who encourage the sharing of datasets for remote sensing applications, and should more broadly be made available to the scientific community, land use planners and agricultural monitoring agencies. This dataset will be further enriched with new ground surveys that are already planned at many of the presented sites.