LUCAS Copernicus 2018: Earth Observation relevant in-situ data on land cover and use throughout the European Union

The Land Use/Cover Area frame Survey (LUCAS) is an evenly spaced in-situ land cover and land use ground survey exercise that extends over the whole of the European Union. LUCAS was carried out in 2006, 2009, 2012, 2015, and 2018. A new LUCAS module specifically tailored to Earth Observation (EO) was introduced in 2018: the LUCAS Copernicus module. The module surveys the land cover extent up to 51 meters in four cardinal directions around a point of observation, offering in-situ data compatible with the spatial resolution of high-resolution sensors. However, because the Copernicus module has so far seen only marginal use, the goal of the paper is to facilitate its uptake by the EO community. First, it summarizes the LUCAS Copernicus protocol to collect homogeneous land cover on a surface area of up to 0.52 ha. Secondly, it proposes a methodology to create a ready-to-use dataset for Earth Observation land cover and land use applications with high-resolution satellite imagery. As a result, a total of 63 364 LUCAS points distributed over 26 level-2 land cover classes were surveyed on the ground. Using homogeneous extent information in the four cardinal directions, a polygon was delineated for each of these points. Through geo-spatial analysis and by semantically linking the LUCAS core and Copernicus module land cover observations, 58 428 polygons are provided with level-3 land cover (66 specific classes including crop type) and land use (38 classes) information as inherited from the LUCAS core observation. The open-access dataset provides a unique opportunity to train and validate decametric sensor-based products such as those obtained from the Copernicus Sentinel-1 and -2 satellites. A follow-up of the LUCAS Copernicus module is already planned for 2022. In 2022, a simplified version of the LUCAS Copernicus module will be carried out on 150 000 LUCAS points for which in-situ surveying is planned. This guarantees continuity in the effort to find synergies between statistical in-situ surveying and the need to collect in-situ data relevant for Earth Observation in the European Union.

products is underpinned by the availability and thematic representativity of precisely geo-located in-situ observations. Such in-situ data is essential to train and validate algorithms applied to EO products. Comprehensive and thematically rich in-situ data can lead to better classifiers and more accurate multi-temporal land surface mapping.
Second, remotely sensed observations of the Earth are increasingly frequent and offer finer spatial and spectral detail and, in the case of the observations by the fleet of Sentinel satellites of the EU's Copernicus Earth Observation Program, are accessible to everyone. These remote observations need complementary in-situ observations. At the same time, there is an enormous and continuing growth in a variety of services relying on geo-location. In this context, it is fair to say that we are witnessing a renewed recognition of the importance of in-situ data for EO. Therefore, the third motivation is that the in-situ data collected by the LUCAS Copernicus module should be representative, comprehensive, precisely geo-located, available over larger areas, available across political borders, and open access. Free and open accessibility is in fact essential for contributing to the creation of common in-situ datasets and protocols, as currently pursued by e.g. the Land Product Validation (LPV) of the Working Group on Calibration and Validation of the Committee on Earth Observation Satellites (CEOS) and by the Joint Experiment for Crop Assessment and Monitoring (JECAM). The availability of such datasets acquired with transparent protocols is key to assessing the quality of EO products resulting from various public and commercial activities. Thus, the Copernicus module gives the opportunity to further integrate the classical LUCAS survey purpose of collecting statistically representative information with the need to collect in-situ data to produce better EO-derived products, specifically for the EU's Copernicus program. The Copernicus module equips the EU with an in-situ dataset specifically suited to EO land monitoring applications, enabling consistent land monitoring at EU level.
While data from the Copernicus module have been available since 2019, they have not been used in EO applications (to the best of our knowledge). This study reduces the complexity of the data to ease the uptake of LUCAS Copernicus data by the remote sensing community.
This manuscript describes and provides the LUCAS Copernicus data in a ready-to-use format. More specifically, this study (i) describes the LUCAS 2018 Copernicus in-situ survey protocol, (ii) presents a methodology to produce polygons from the surveyed data to be used in EO studies, (iii) proposes a method to inherit more detailed information from the LUCAS core, and (iv) highlights the added value of the survey in order to derive a simplified protocol for the LUCAS Copernicus module that will be integrated in future LUCAS surveys (e.g. in 2022).

The survey consists of a two-phase sampling. In the first phase, 1.1 million geo-referenced points are systematically drawn, forming a 2 × 2 km grid, i.e. one point every 2 km in the EU. The points are then stratified according to land cover classes to allow the second-phase sampling. In 2018, this resulted in 337 854 points on which statistical information is collected by surveyors in the field or by photo interpretation in the office. The sampling design methodology used for the LUCAS 2018 survey is described in detail in Scarnò et al. (2018). The grid is static, includes 1 090 863 points stratified according to land cover class, and is available in csv format from Eurostat (2019b). For a detailed description of the grid data see Eurostat (2018a), and for technical details about the stratification see Eurostat (2018b).
In 2018, the campaign involved more than 1300 actors, including more than 900 surveyors, and lasted for 23 months. The actual in-situ data collection occurred between March and September 2018. The raw data have been available online since May 2019 (Eurostat, 2019a) as a downloadable csv table with 97 columns and 337 854 records (Table 4 presents the attribute names of the 97 original fields; a record descriptor is available in Eurostat (2019c); the detailed survey instructions in Eurostat (2018d)). Out of the 337 854 points surveyed in 2018, 23% had been included in three previous surveys (2009, 2012, and 2015), 25% had already been surveyed once or twice before (e.g. in 2009 and 2015), and the remaining 52% were new entries. In the LUCAS 2018 survey, 70.45% of the points were surveyed in-situ, and 29.54% were obtained through the interpretation of detailed ortho-photos (Table 1 and Table A1).
The sample of the Copernicus module was drawn from the 337 854 LUCAS points sampled in 2018 (combining in-situ and photo-interpreted points, Table A1) as a third-phase sampling nested in the two-phase sampling scheme. The Copernicus module was planned for 90 620 points and actually executed for 63 364 points (Table 1). For 27 256 (30.08%) of the planned points, the surveyors did not manage to reach the point to make the observation, for example due to natural or human-made obstacles. The Copernicus module was therefore carried out in-situ for a total of 63 364 points, corresponding to 69.92% of the planned Copernicus points.
Table 1. LUCAS 2018 points totaling 337 854 points. The points were either surveyed In-situ (238 014, 70.45%), Photo-Interpreted in the office (99 803, 29.54%) or not surveyed (i.e. "In situ PI not possible", "Out of national territory" or "Out of EU28").
The Copernicus module was collected for a subset of the in-situ points, in addition to the LUCAS core protocol that is collected for every in-situ point. In the LUCAS core protocol, the surveyor aims to get as close as possible to the theoretical point. The surveyor then provides the so-called LUCAS core observations for the LUCAS theoretical grid point from the location that the surveyor was actually able to reach. Thus, although the two are typically close to each other, the nominal geolocation of a LUCAS point may not exactly coincide with the actual observation location, which is not recorded for LUCAS core points. As an illustrative example, in Figure 1 the observation is made from an unknown location and assigned to the LUCAS nominal point (in red). The exact geolocation of the surveyor observation is recorded only in the corresponding LUCAS-Copernicus entry (green point in Figure 1). The LUCAS theoretical grid point observation is representative for a circle of 1.5 meter radius. In some specific cases, the window of observation is extended to a 20 meter radius whenever the land cover at the point is heterogeneous ( W, S, E), as well as the neighboring LC. Note that the surveyor records 51 m to indicate that the land cover is homogeneous for more than 50 m. However, as the exact extent is not reported, we conservatively set it to the minimum extent of 51 m. Figure 1 illustrates the Copernicus protocol for one point, and the respective collected data are shown in Table 2. On the basis of these LUCAS Copernicus observations, a quadrilateral polygon with homogeneous LC can be constructed.
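As a rough illustration of how the four recorded extents translate into a quadrilateral, the sketch below places one vertex per cardinal direction at the recorded distance from the Copernicus point and computes the enclosed area with the shoelace formula. It is a minimal sketch in Python, not the R workflow used to produce the dataset. The function names `copernicus_polygon` and `polygon_area` are hypothetical, coordinates are assumed to be local metres (x towards east, y towards north) relative to the Copernicus point, and special "not relevant" codes are assumed to have been filtered out beforehand.

```python
def copernicus_polygon(north, east, south, west):
    """Return the four vertices (x, y), in metres, of the quadrilateral.

    Assumption: each argument is the homogeneous land cover extent in metres
    in that cardinal direction, with 51 standing for "more than 50 m".
    """
    for d in (north, east, south, west):
        if not 0 <= d <= 51:
            raise ValueError(f"extent out of range: {d}")
    # One vertex per cardinal direction, listed clockwise starting from north.
    return [
        (0.0, float(north)),
        (float(east), 0.0),
        (0.0, -float(south)),
        (-float(west), 0.0),
    ]


def polygon_area(vertices):
    """Area in square metres of a simple polygon (shoelace formula)."""
    total = 0.0
    # Pair each vertex with the next one, wrapping around to the first.
    for (x1, y1), (x2, y2) in zip(vertices, vertices[1:] + vertices[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


# Maximum case: 51 m in every direction, so both diagonals measure 102 m.
full = copernicus_polygon(51, 51, 51, 51)
print(polygon_area(full) / 10_000)  # area in hectares
```

In the maximum case the area is 5202 m², i.e. about 0.52 ha, which is consistent with the maximum surface area quoted for the protocol. Georeferencing the vertices would additionally require a metric projection, which is left out of this sketch.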
As part of the Copernicus module, the surveyor collects 13 additional variables and three types of observations (Table 2): the level-2 LC (one variable); the extent of the Copernicus land cover (LUCAS LC classification at level 2) registered at the point reached in the field (four variables); the next land cover (up to 50 m) (four variables) and the breadth of the next land cover (four variables).
The breadth corresponds to the % of the width of the land cover in this sector, as visible on the landscape photo (i.e. landscape photos taken in each cardinal direction: N, E, S, W). This means that the breadth is 100% if the next LC is seen all over the photo from one side to the other. If the next land cover is not visible on the photo because it is completely behind a linear feature (e.g. hedge) or because it is completely hidden by the terrain, then the next land cover is to be recorded but the breadth is 0%. For more information about the breadth and the next land cover, see Eurostat (2018d).
(Table 2) are used to build the geometry of the Copernicus polygon. As the LUCAS theoretical point is inside the Copernicus polygon, the LC legend of the LUCAS theoretical observation (here B32 - Rape and turnip rape) could be inherited by the Copernicus polygon (B3 - Non-permanent industrial crops) as described in Section 4. The background RGB imagery is obtained from "Map data ©2019 Google".
The following sections describe how the LUCAS Copernicus data are prepared and cleaned to obtain the ready-to-use dataset provided with this manuscript. The workflow was implemented in R (Code and Data availability, see Section 8).

Adding an explicit LUCAS land cover and land use legend
The LUCAS land cover classification is hierarchical and contains four levels briefly described hereafter (for a detailed description, see the Technical reference document C3 Classification (Eurostat, 2018e)). The land cover classification system is subdivided into eight main level-1 land cover categories: Artificial Land, Cropland, Woodland, Shrubland, Grassland, Bare Land, Water and Wetlands. The legend level-2 contains 26 classes (e.g. 8 under level-1 B Cropland) and level-3 comprises 73 classes (e.g. 9 under level-2 B1 Cereals). Only a limited number of observations have level-4 land cover information, distributed into 205 classes ("LC1_SPEC" field in the data). Similarly, the Land Use comprises 40 subclasses.
Table 2. Example of information collected by the Copernicus protocol (for point with ID 45223358). The Copernicus protocol collects observations on 13 variables: land cover (LC) at LUCAS legend level 2 (here B3 is "Non-permanent industrial crops"), the extent of the LC in the four cardinal directions (up to 50 m, 51 means more than 50 m), the breadth of the next LC in the four cardinal directions (%) and the next LC in the four directions (here, E2 means "Grassland without tree/shrub cover" in the N and W). Figure 1 shows how this information is used to build the geometries of the Copernicus polygon with homogeneous LC. The radial distance "d" is measured between the Copernicus point and the next LC, with "888" and "8" meaning "not relevant".
To facilitate the usability of the data, in addition to the code describing the land use or land cover (e.g. B21 or U112), an explicit legend label was added to the dataset provided with this manuscript. This was done by adding nine explicit label fields (Table 4) to the data for the LC and LU legend.
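The hierarchy can be made concrete with a small sketch that splits a land cover code into its levels and checks whether a LUCAS core level-3 code falls under a given Copernicus level-2 code. This is an illustrative Python sketch under stated assumptions, not the procedure used for the dataset: it assumes that a level-3 code is a letter followed by two digits and that its first two characters form the level-2 code (as in B32 and B3), and it assigns the letters A to H to the eight level-1 categories in the order listed above. The names `LEVEL1_LABELS`, `legend_levels` and `consistent_with_level2` are hypothetical.

```python
# Assumed letter assignment, following the order of the eight level-1 categories.
LEVEL1_LABELS = {
    "A": "Artificial Land",
    "B": "Cropland",
    "C": "Woodland",
    "D": "Shrubland",
    "E": "Grassland",
    "F": "Bare Land",
    "G": "Water",
    "H": "Wetlands",
}


def legend_levels(lc_code):
    """Split a level-2 or level-3 LC code into (level1, level2, level3).

    level3 is None when only a level-2 code (two characters) is given.
    """
    code = lc_code.strip().upper()
    if len(code) not in (2, 3) or code[0] not in LEVEL1_LABELS:
        raise ValueError(f"unexpected land cover code: {lc_code!r}")
    level3 = code if len(code) == 3 else None
    return code[0], code[:2], level3


def consistent_with_level2(core_lc1, copernicus_lc2):
    """True if the core level-3 code belongs to the Copernicus level-2 class."""
    return legend_levels(core_lc1)[1] == legend_levels(copernicus_lc2)[1]


level1, level2, level3 = legend_levels("B32")
print(LEVEL1_LABELS[level1], level2, level3)
print(consistent_with_level2("B32", "B3"))
```

For the example point of Table 2, the core observation B32 is consistent with the Copernicus class B3, whereas it would not be consistent with a neighboring E2 class.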
In the results section, details on the hierarchical legend classes are also provided: Table 3 for legend level-2, Figure 5 for legend level-3 ("LC1" field in the data), and Table A2 for the 40 Land Use subclasses ("LU1" field in the data), as shown by the level-3 distribution of the Copernicus polygons.
(Table 1). The quadrilateral diagonals can measure up to 102 m, but are smaller if the surveyor found a field boundary within 51 m of the LUCAS-Copernicus point.

Quality check
While the Copernicus protocol was implementable for 63 364 polygons, several surveyed polygon locations (i.e. the LUCAS Copernicus location as defined by "GPS_LAT" and "GPS_LONG") were either missing or wrong. Missing locations were flagged for 67 polygons ("GPS_PREC"=8888 or "GPS_LAT"=0 or "GPS_LONG"=0). In addition to the missing GPS measured locations, some macro errors were flagged by selecting polygons for which the longitude and latitude differences between the GPS measured location ("GPS_LAT", "GPS_LONG") and the theoretical location ("TH_LAT", "TH_LONG") are larger than 0.1 degree (i.e. about 7.1 km in the center of the EU). This flags 10 polygons, all of which are wrongly located because of the "GPS_EW" field (i.e. GPS Observation East/West). In total, this location quality check flags and removes 77 polygons (67 with missing locations and 10 with macro errors), resulting in a final total of 63 287 polygons.
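The location quality check can be sketched as a simple record filter. The snippet below is a minimal Python sketch, not the R code of the published workflow. It assumes that each record is a dictionary with numeric values for the fields named above, that coordinates are signed decimal degrees, and that a location is treated as a macro error when either the latitude or the longitude difference exceeds the threshold. The function name `location_flag` and the sample records are hypothetical.

```python
def location_flag(record, max_diff_deg=0.1):
    """Classify a record as 'missing', 'far' or 'ok' based on its location."""
    # Missing GPS location: sentinel precision value or zero coordinates.
    if (
        record["GPS_PREC"] == 8888
        or record["GPS_LAT"] == 0
        or record["GPS_LONG"] == 0
    ):
        return "missing"
    # Macro error: measured location too far from the theoretical grid point.
    lat_diff = abs(record["GPS_LAT"] - record["TH_LAT"])
    lon_diff = abs(record["GPS_LONG"] - record["TH_LONG"])
    if lat_diff > max_diff_deg or lon_diff > max_diff_deg:
        return "far"
    return "ok"


# Hypothetical records for illustration only.
records = [
    {"GPS_PREC": 5, "GPS_LAT": 50.001, "GPS_LONG": 4.500, "TH_LAT": 50.0, "TH_LONG": 4.5},
    {"GPS_PREC": 8888, "GPS_LAT": 0, "GPS_LONG": 0, "TH_LAT": 48.0, "TH_LONG": 2.0},
    # Wrong east/west sign on the longitude, as a "GPS_EW" error would produce.
    {"GPS_PREC": 4, "GPS_LAT": 50.000, "GPS_LONG": -4.500, "TH_LAT": 50.0, "TH_LONG": 4.5},
]

kept = [r for r in records if location_flag(r) == "ok"]
print([location_flag(r) for r in records], len(kept))
```

Applied to the full dataset, a filter of this kind would be expected to reproduce the reported bookkeeping: 63 364 polygons minus 67 missing and 10 wrongly located leaves 63 287.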

Resulting LUCAS Copernicus Data
The 63

Land cover class (level-2)                Code   Polygons
Shrubland with sparse tree cover          D1     1308
Shrubland without tree cover              D2     1546
Grassland with sparse tree/shrub cover    E1     2078
Grassland without tree/shrub cover        E2     13 053
Spontaneously re-vegetated surfaces       E3     2608
Rocks and stones                          F1     35
Sand                                      F2     30
Lichens and moss                          F3     3
Other bare soil                           F4     1503
Inland water bodies                       G1     1
Inland running water                      G2     9
Inland wetlands                           H1     164
Coastal wetlands                          H2     6
TOTAL                                            63 287

in the provided dataset) and are thus flagged as "COPERNICUS_CLEANED" in the data. For these polygons, the more detailed level-3 land cover class of LUCAS core can be inherited by the LUCAS Copernicus polygon ("COPERNICUS_CLEANED" is "TRUE"). Figure 4 illustrates the variety in shapes of the constructed quadrilateral Copernicus polygons as projected on top of satellite imagery for different land cover types. The resulting polygons are distributed over 66 specific LC classes as shown in Figure 5. Similarly, the level-3 Land Use (LU) is also available, distributed in 38 classes organised in four main classes (see Table A2).
information, should be used only for polygons with "COPERNICUS_CLEANED" as "TRUE" as described in the previous section.

LUCAS module
In addition to the LUCAS core variables collected, other specific LUCAS protocols called "modules" were carried out on demand, such as (i) the transect of 250 m to assess transitions of land cover and existing linear features (2009, 2012, 2015), (ii) the topsoil module (2009, 2012 (partly), 2015 and 2018), (iii) the grassland module (2018), and