LUCAS Copernicus 2018: Earth Observation relevant in-situ data on land cover throughout the European Union

Abstract. The Land Use/Cover Area frame Survey (LUCAS) is a regular in-situ land cover and land use ground survey exercise that extends over the whole of the European Union. LUCAS was carried out in 2006, 2009, 2012, 2015, and 2018. A new LUCAS module specifically tailored to Earth Observation was introduced in 2018: the LUCAS Copernicus module, which surveys the land cover extent up to 51 meters in the four cardinal directions around a point of observation. This paper first summarizes the LUCAS Copernicus protocol to collect homogeneous land cover on a surface area of up to 0.52 ha. Secondly, it proposes a methodology to create a ready-to-use dataset for Earth Observation land cover and land use applications with high-resolution satellite imagery. As a result, a total of 63 364 LUCAS points distributed over 26 level-2 land cover classes were surveyed on the ground. Using the homogeneous extent information in the four cardinal directions, a polygon was delineated for each of these points. Through geo-spatial analysis and by semantically linking the LUCAS core and Copernicus land cover observations, 58 428 polygons are provided with level-3 land cover (66 specific classes including crop type) and land use (38 classes) information as inherited from the LUCAS core observation. The open-access dataset supplied with this manuscript (https://doi.org/10.6084/m9.figshare.12382667.v3 (d'Andrimont, 2020)) provides a unique opportunity to train and validate decametric sensor-based products such as those obtained from the Copernicus Sentinel-1 and -2 satellites. A follow-up of the LUCAS Copernicus module is already planned for 2022, when a simplified version of the module will be carried out on the 150 000 LUCAS points for which in-situ surveying is planned. This guarantees continuity in the effort to find synergies between statistical in-situ surveying and the need to collect in-situ data relevant for Earth Observation in the European Union.


Introduction
The Land Use/Cover Area frame Survey (LUCAS) is an evenly spaced in-situ land cover and land use data collection exercise that extends over the whole of the European Union (EU) (Gallego and Delincé, 2010; Eurostat, 2018c). LUCAS has been carried out in 2006, 2009, 2012, 2015, and 2018. During these five campaigns, a total of 1 351 293 points at 651 780 unique locations were surveyed and 5.4 million photos were collected. On each of these surveyed points, observations were recorded on up to 109 variables. The combination of the information collected in the five LUCAS surveys has resulted in the most comprehensive in-situ database on land cover and land use in the EU (d'Andrimont et al., 2020).
LUCAS in-situ data collection was designed for EU-wide standardized reporting of land cover and land use area statistics, not for training and validation of remote sensing algorithms. The LUCAS activity is complementary to the CORINE Land Cover (CLC) inventory, which collects land cover data by interpreting satellite images and orthophotos. In addition, in 2018 the Copernicus High Resolution Layers (HRL) were produced to provide information about different land cover characteristics. Five HRLs describe some of the main land cover characteristics: impervious (sealed) surfaces (e.g. roads and built-up areas), forest areas, grasslands, water and wetlands, and small woody features.
In the scientific community, LUCAS has been widely used for soil studies thanks to the topsoil survey module (Orgiazzi et al., 2018). LUCAS data have also proven valuable for land cover and land use research, and for remote sensing specifically (e.g. Esch et al., 2014). According to Weigand et al. (2020), LUCAS in-situ data are a suitable source for classifying high-resolution Sentinel-2 imagery at a large scale; the authors tested the accuracy of different pre-processing approaches of the LUCAS data based on positioning and semantic selection. These studies highlight the interest and value of LUCAS in-situ data for the remote sensing research community. Nevertheless, the LUCAS core protocol has major limitations in terms of spatial scale and representativeness when it comes to collecting in-situ data for calibration, training, and/or validation of EO products.
While LUCAS survey data have been valuable in providing in-situ observations relevant for remote sensing, as highlighted above, the LUCAS survey was designed to collect statistics and thus has inherent shortcomings when used in the context of EO. In 2018, a new LUCAS module specifically tailored to Earth Observation (EO) was introduced: the LUCAS Copernicus module. The Copernicus module was designed to improve the value of LUCAS in-situ surveying for EO, for three specific reasons described hereafter.
Second, remotely sensed observations of the Earth are increasingly frequent and offer finer spatial and spectral detail; in the case of the observations by the fleet of Sentinel satellites of the EU's Copernicus Earth Observation Program, they are also accessible to everyone. These remote observations need complementary in-situ observations. At the same time, there is an enormous and continuing growth in a variety of services relying on geo-location. In this context, it is fair to say that we are witnessing a renewed recognition of the importance of in-situ data for EO.
Therefore, the third motivation is that the in-situ data collected by the LUCAS Copernicus module should be representative and comprehensive. The availability of such data-sets acquired with transparent protocols is key to assessing the quality of EO products resulting from various public and commercial activities. Thus, the Copernicus module gives the opportunity to further integrate the classical LUCAS survey purpose of collecting statistically representative information with the need to collect in-situ data to produce better EO-derived products, specifically for the EU's Copernicus program. The Copernicus module equips the EU with an in-situ dataset specifically suited to EO land applications, making it possible to develop consistent land monitoring at EU level.

LUCAS 2018 Copernicus Protocol
In the LUCAS core protocol, the surveyor aims to get as close as possible to the theoretical point. The surveyor then provides the so-called LUCAS core observations for the LUCAS theoretical grid point from the location that the surveyor was actually able to reach. Thus, although typically close to each other, the nominal geolocation of a LUCAS point may not exactly coincide with the actual observation location, which is not recorded for LUCAS core points.
As an illustrative example, the observation is made from an unknown location and assigned to the LUCAS nominal point shown in red in Figure 1. The exact geolocation of the surveyor observation is recorded only in the corresponding LUCAS-Copernicus entry (green point in Figure 1). The LUCAS theoretical grid point observation is representative of a circle of 1.5 meter radius. In some specific cases, the window of observation is extended to a 20 meter radius whenever the land cover at the point is heterogeneous (Eurostat, 2018d). This occurs in areas such as permanent crops (

More specifically, the following additional data are collected on the LUCAS Copernicus surveyed points: (i) the measured location of the observation, and (ii) the land cover (level 2) extent up to 51 m from the point in the four cardinal directions (N, W, S, E), as well as the neighboring LC. Note that the surveyor records 51 m to indicate that the land cover is homogeneous for more than 50 m. However, as the exact extent is not reported, we conservatively set it to the minimum extent of 51 m. Figure 1 illustrates the Copernicus protocol for one point and the respective collected data are shown in Table 2. On the basis of these LUCAS Copernicus observations, a quadrilateral polygon with homogeneous LC can be constructed.
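The construction of the quadrilateral from the four recorded extents can be illustrated with a minimal Python sketch. This is an illustration under stated assumptions, not the R workflow used for the published dataset: it assumes the point coordinates are already expressed in a projected, metric coordinate system with the y axis pointing north, and the function names `copernicus_quadrilateral` and `shoelace_area` are hypothetical. With all four extents at 51 m, the two diagonals measure 102 m and the area is 5202 m², consistent with the maximum of about 0.52 ha quoted for the protocol.

```python
from typing import Dict, List, Tuple

# Recorded value meaning "homogeneous for more than 50 m" in the protocol.
MAX_EXTENT_M = 51


def copernicus_quadrilateral(
    x: float, y: float, extents: Dict[str, float]
) -> List[Tuple[float, float]]:
    """Return the N, E, S, W vertices of the homogeneous-LC quadrilateral.

    Assumption: (x, y) is in a metric projected CRS with +y pointing north.
    `extents` maps "N", "E", "S", "W" to the recorded extent in metres.
    """
    for direction in ("N", "E", "S", "W"):
        d = extents[direction]
        if not 0 <= d <= MAX_EXTENT_M:
            raise ValueError(f"extent {d} m in direction {direction} is out of range")
    return [
        (x, y + extents["N"]),
        (x + extents["E"], y),
        (x, y - extents["S"]),
        (x - extents["W"], y),
    ]


def shoelace_area(vertices: List[Tuple[float, float]]) -> float:
    """Planar polygon area in square metres (shoelace formula)."""
    total = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        total += x0 * y1 - x1 * y0
    return abs(total) / 2.0


if __name__ == "__main__":
    full = copernicus_quadrilateral(0.0, 0.0, {"N": 51, "E": 51, "S": 51, "W": 51})
    print(shoelace_area(full) / 10_000, "ha")
```

Because the two diagonals are perpendicular, the area also equals half the product of the N-S and E-W diagonal lengths, which gives a quick sanity check on any constructed polygon.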
As part of the Copernicus module, the surveyor collects 13 additional variables and three types of observations (Table 2): the level-2 LC (one variable); the extent of the Copernicus land cover (LUCAS LC classification at level 2) registered at the point reached in the field (four variables); the next land cover (up to 50 m) (four variables) and the breadth of the next land cover (four variables). The breadth corresponds to the % of the width of the land cover in this sector, as visible on the landscape photo (i.e. landscape photos taken in each cardinal direction: N, E, S, W). This means that the breadth is 100% if the next LC is seen all over the photo from one side to the other. If the next land cover is not visible on the photo because it is completely behind a linear feature (e.g. hedge) or because it is completely hidden by the terrain, then the next land cover is to be recorded but the breadth is 0%. For more information about the breadth and the next land cover, see Eurostat (2018d).
Table 2. LUCAS Copernicus observations on 13 variables: land cover (LC) at LUCAS legend level 2 (here B3 is "Non-permanent industrial crops"), the extension of the LC in the four cardinal directions (up to 50 m, 51 means more than 50 m), the breadth of the next LC in the four cardinal directions (%) and the next LC in the four directions (here, E2 means "Grassland without tree/shrub cover" in the N and W). Figure 1 shows how this information is used to build the geometries of the Copernicus polygon with homogeneous LC. The radial distance "d" is measured between the Copernicus point and the next LC, with "888" and "8" meaning "not relevant".
The following sections describe how the LUCAS Copernicus data are prepared and cleaned to obtain the ready-to-use dataset provided with this manuscript. The following workflow was done in R (Code and Data availability, see Section 8).

Adding an explicit LUCAS land cover and land use legend
The LUCAS land cover classification is hierarchical and contains four levels briefly described hereafter (for a detailed description, see the Technical reference document C3 Classification (Eurostat, 2018e)). To facilitate the usability of the data, in addition to the code describing the land use or land cover (e.g. B21 or U112), an explicit legend label was added to the dataset provided with this manuscript. This was done by adding nine explicit label fields (Table 4) to the data for the LC and LU legend.
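Attaching an explicit label to a legend code amounts to a simple lookup, sketched below in Python. The sketch is illustrative only: the lookup contains just the level-2 codes and labels quoted in this paper, and the field names `LC` and `LC_LABEL` as well as the function `add_label` are hypothetical, not the names of the nine label fields listed in Table 4. Codes missing from the lookup are left unlabelled rather than guessed.

```python
from typing import Dict, Optional

# Level-2 codes and labels quoted in this paper only (not the full legend).
LEVEL2_LABELS: Dict[str, str] = {
    "B3": "Non-permanent industrial crops",
    "E2": "Grassland without tree/shrub cover",
    "F2": "Sand",
    "F3": "Lichens and moss",
    "F4": "Other bare soil",
    "G1": "Inland water bodies",
    "G2": "Inland running water",
    "H1": "Inland wetlands",
    "H2": "Coastal wetlands",
}


def add_label(
    record: Dict[str, Optional[str]],
    code_field: str = "LC",          # hypothetical field name
    label_field: str = "LC_LABEL",   # hypothetical field name
) -> Dict[str, Optional[str]]:
    """Return a copy of `record` with an explicit label next to its code."""
    labelled = dict(record)
    code = record.get(code_field)
    # Unknown or missing codes get None instead of an invented label.
    labelled[label_field] = LEVEL2_LABELS.get(code) if code else None
    return labelled


if __name__ == "__main__":
    print(add_label({"POINT_ID": "1", "LC": "B3"}))
```

Returning a copy keeps the original coded record untouched, so the code and the label can always be checked against each other.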
In the results section, details on the hierarchical legend structure classes are also provided: Table 3 for legend level-2, Figure 5 for legend level-3 ("LC1" field in the data), and Table A2 for the level-3 distribution of the Copernicus polygons over the 40 Land Use sub-classes ("LU1" field in the data).
(Table 1). The quadrilateral diagonals can measure up to 102 m, but are smaller if the surveyor found a field boundary within 51 m of the LUCAS-Copernicus point.

Resulting LUCAS Copernicus Data
The 63 287 Copernicus polygons surveyed are published along with this paper. They are distributed among 26 level-2 LC classes (Table 3) in eight level-1 LC classes (see map in Figure 3).

Table 3 (recovered rows): level-2 land cover class, code, and count.

Land cover class        Code   Count
Sand                    F2        30
Lichens and moss        F3         3
Other bare soil         F4      1503
Inland water bodies     G1         1
Inland running water    G2         9
Inland wetlands         H1       164
Coastal wetlands        H2         6

in the provided dataset) and are thus flagged as "COPERNICUS_CLEANED" in the data. For these polygons, the more detailed level-3 land cover class of LUCAS core can be inherited by the LUCAS Copernicus polygon ("COPERNICUS_CLEANED" is "TRUE"). Figure 4 illustrates the variety in shapes of the constructed quadrilateral Copernicus polygons as projected on top of satellite imagery for different land cover types. The resulting polygons are distributed over 66 specific LC classes as shown in Figure 5. Similarly, the level-3 Land Use (LU) is also available, distributed in 38 classes organised in four main classes (see Table A2).

Public data and usage note
With this paper we provide LUCAS Copernicus polygons constructed at 63 287 locations. In addition, we provide a dataset that benefits from inheriting attributes collected on those same points via the LUCAS core protocol. This results in 58 428
The LUCAS Copernicus module is also planned to be carried out during the LUCAS 2022 survey. However, a simplified protocol has been designed for the LUCAS 2022 survey. In this protocol, the observations on the LC and on the distance of homogeneous LC from the point remain, but observations on the neighboring LC and on the breadth of the neighboring LC have been discarded. Despite this simplification, the coverage of the 2022 LUCAS Copernicus module will be expanded to the 150 000 LUCAS points for which in-situ surveying is planned.

Conclusions
For the first time, the LUCAS 2018 survey contained a module that was specifically tailored to the needs of EO. The LUCAS Copernicus module collected homogeneous land cover data over areas with a size relevant to 10-m satellite sensors. A total of 63 364 Copernicus polygons were obtained across the EU representing 66 land cover type classes at LUCAS legend level-2.

A follow-up of the LUCAS Copernicus module is planned for 2022, when a simplified version of the module is to be carried out on the 150 000 LUCAS points for which in-situ surveying is planned. This guarantees continuity in the effort to find synergies between statistical in-situ surveying and the need to collect in-situ data relevant for Earth Observation in the European Union.
The LUCAS Copernicus 2018 dataset is provided as a polygon shapefile along with a csv table containing 109 attributes.
Among the 109 attributes (list in Table 4), 97 attributes are the original fields as described in Eurostat (2019c), nine attributes are the legend-explicit LC and LU obtained as described in Section 3.1 and three attributes are obtained as described in the previous Section 4.
To use the data, the attribute "POINT_ID" should be used to join the attribute table of the shapefile and the csv table. While the Copernicus-related level-2 LC can be used for every polygon, the level-3 LC and LU, along with other LUCAS core information, should be used only for polygons with "COPERNICUS_CLEANED" as "TRUE", as described in the previous section.
Eurostat (2019c), 9 are the legend-explicit LC and LU attributes obtained as described in d'Andrimont et al. (2020) and 3 attributes are obtained as described in Section 8.
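The usage rule above (join on "POINT_ID", then use level-3 information only where "COPERNICUS_CLEANED" is "TRUE") can be sketched in Python with the standard library alone. Everything except the field names "POINT_ID", "COPERNICUS_CLEANED", "LC1" and "LU1" is an assumption made for illustration: the column name `CPRN_LC` for the Copernicus level-2 LC, the toy rows, the placeholder geometries and the function `join_attributes` are hypothetical, and the example does not read the actual shapefile or csv table.

```python
import csv
import io
from typing import Dict, List, Optional

# Toy stand-in for the csv attribute table (invented rows; CPRN_LC is a
# hypothetical column name for the Copernicus level-2 land cover).
CSV_TEXT = """POINT_ID,CPRN_LC,COPERNICUS_CLEANED,LC1,LU1
1001,B2,TRUE,B21,U112
1002,E2,FALSE,B21,U112
"""

# Toy stand-in for the shapefile: POINT_ID -> placeholder geometry.
POLYGONS: Dict[str, str] = {
    "1001": "polygon-1001",
    "1002": "polygon-1002",
    "1003": "polygon-1003",  # no matching row in the csv table
}


def join_attributes(
    polygons: Dict[str, str], csv_text: str
) -> List[Dict[str, Optional[str]]]:
    """Join polygons to csv rows on POINT_ID and mask level-3 fields
    for polygons that are not flagged COPERNICUS_CLEANED == TRUE."""
    rows = {row["POINT_ID"]: row for row in csv.DictReader(io.StringIO(csv_text))}
    joined = []
    for point_id, geometry in polygons.items():
        row = rows.get(point_id)
        if row is None:
            continue  # polygon without attributes: skipped in this sketch
        cleaned = row["COPERNICUS_CLEANED"].strip().upper() == "TRUE"
        joined.append(
            {
                "POINT_ID": point_id,
                "geometry": geometry,
                "level2_lc": row["CPRN_LC"],  # usable for every polygon
                "LC1": row["LC1"] if cleaned else None,  # level-3 LC
                "LU1": row["LU1"] if cleaned else None,  # level-3 LU
            }
        )
    return joined


if __name__ == "__main__":
    for record in join_attributes(POLYGONS, CSV_TEXT):
        print(record)
```

Masking the level-3 fields instead of dropping the polygon keeps the level-2 LC available for every polygon, which matches the distinction made in the usage note.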