EcoDes-DK15: High-resolution ecological descriptors of vegetation and terrain derived from Denmark’s national airborne laser scanning 2 data set 3

11 Biodiversity studies could strongly benefit from three-dimensional data on ecosystem structure derived from contemporary 12 remote sensing technologies, such as Light Detection and Ranging (LiDAR). Despite the increasing availability of such data 13 at regional and national scales, the average ecologist has been limited in accessing them due to high requirements on computing 14 power and remote sensing knowledge. We processed Denmark’s publicly available national Airborne Laser Scanning (ALS) 15 data set acquired in 2014/15 together with the accompanying elevation model to compute 70 rasterized descriptors of interest 16 for ecological studies. With a grain size of 10 m, these data products provide a snapshot of high-resolution measures including 17 vegetation height, structure and density, as well as topographic descriptors including elevation, aspect, slope and wetness 18 across more than forty thousand square kilometres covering almost all of Denmark’s terrestrial surface. The resulting data set 19 is comparatively small (~94 GB, compressed 16.8 GB) and the raster data can be readily integrated into analytical workflows 20 in software familiar to many ecologists (GIS software, R, Python). Source code and documentation for the processing workflow 21 are openly available via a code repository, allowing for transfer to other ALS data sets, as well as modification or re-calculation 22 of future instances of Denmark’s national ALS data set. We hope that our high-resolution ecological vegetation and terrain 23 descriptors (EcoDes-DK15) will serve as an inspiration for the publication of further such data sets covering other countries 24 and regions and that our rasterized data set

such as Denmark's nationwide data set "DHM/Punktsky" -outputs from many survey flights are co-registered and merged, 48 resulting in very large point clouds with hundreds of billions of points and data volumes of multiple Terabytes 49 (Geodatastyrelsen, 2015). For further information on ALS data acquisition, we recommend Vo  global assessment of Essential Biodiversity Variables (EBVs). Finding ways of making regional and nationwide ALS data 76 more accessible to the average ecologist is therefore not only a critical priority for accelerating research on regional biodiversity 77 patterns and species -habitat relationships, but also for the facilitation of global assessments such as those carried out by 78

IPBES (2019) and alike. 79
To open up opportunities for researchers and practitioners not familiar with ALS processing or without access to the required 80 facilities, we present a new national ALS based data set for Denmark primarily aimed at ecological research with possible uses 81 in other disciplines. With a grain size of 10 m, these ecological descriptor (EcoDes) rasters provide a snapshot of high-82 resolution measures of vegetation height, structure and density, as well as topographic descriptors including elevation, aspect, 83 slope and wetness for almost all of Denmark's terrestrial surface between spring 2014 and summer 2015 (DK15). In this 84 publication, we a) describe the source data and outline the processing workflow (Sect. 2.1-2.3), b) summarise the data set's 85 main characteristics (Sect. 3.1-3.2), c) describe each descriptor in detail and highlight its use and limitations (Sect. 3.3-3.4), d) 86 provide guidance on data access and illustrate how the data could be used in an example of ecological landscape classification 87 (Sect. 4). We finish by e) briefly discussing the general limitations of the data set and processing workflow, as well as providing 88 perspectives on how the presented data can be complemented with other data sources (Sect. 5). We hope that ease of access 89 Table 1: Overview of the data sources used for generating the EcoDes-DK15 data set. Three versions of the DHM/Pointcloud 120 were merged to obtain a point cloud data set that contained no vegetation points scanned after 2015 and as little vegetation 121 points before 2014 as possible. DHM/Terrain tiles were matched sources from the same data source as the corresponding point 122 cloud tiles. A copy of the source data is archived on the internal long-term data storage at Aarhus University and is available 123 on request. For further information see documentation on GitHub code repository and Sect. 3  geolocation. However, a small number of tiles (n = 30) in the DHM/Point-cloud data sets did not have corresponding tiles in 127 the DHM/Terrain data sets, these were removed prior processing resulting in the total of 49673 tiles shown in Table 1. 128

Processing 129
We processed the source data using OPALS 2. 3  dtm.py. The dk_lidar modules also contain two further files, common.py a script containing specifications of common functions 161 used by the points.py and dtm.py, as well as settings.py which is used to set global processing options, specify file paths etc. 162 Finally, two helper scripts are provided progress_monitor.py which facilitates progress monitoring and estimates the time 163 remaining and debug.py a script for testing the workflow for a single tile. Together the Python scripts and modules allow to 164 generate the ecological descriptor outputs from the two input data sets. Further documentation of the dk_lidar modules and 165 workflow scripts can be found on the GitHub repository associated with this publication: 166 https://github.com/jakobjassmann/ecodes-dk-lidar. 167 to 57.75 °N, 8.07 °E to 15.20 °E). The data is projected in ETRS89 UTM 32 N based on the GRS80 spheroid (EPSG: 25832). 171 The data set is available as GeoTIFFs with 10 m grain size via a data repository on Zenodo (see Sect. 6). For each descriptor 172 the nation-wide data are split into 49673 raster tiles of 1 km x 1 km with a 10 m grain size based on 25-fold aggregations of 173 the 0.4 m national grid of Denmark. A virtual raster mosaic (VRT) file is provided for each descriptor (except the 174 point_source_counts, point_source_ids and point_source_proportion descriptors), and a file containing the tile footprint 175 geometries can be used for geographical sub-setting of the data. We also provide masks for inland water and the sea. 176

177
The final data set consists of just under 94 GB of data (compressed for download 16.8 GB). To reduce the size of the data set 178 we converted numerical values from floating point precision to 16-bit integers where possible. In some cases, this required us 179 to stretch the values by a set factor to maintain information content beyond the decimal point. The descriptor conversion factors 180 are available as a csv file provided with the data set and in Table 2. Missing data (NoData) is denoted by a value of -9999 181 throughout the data set. 182

Overview and file naming convention 183
An overview of the eighteen terrain and vegetation structure descriptors as well as the auxiliary data provided can be found in 184 Table 2. Generally, the descriptor names in Table 2 reflect the prefix of the file name of a GeoTiff file within the data set. This 185 prefix is followed by a suffix representing the unique identifier for each tile based on the UTM coordinates of the tile (see 186 Sect. 3.4.3 for more detail). When working with the complete data set, tiles from the same descriptor are grouped within a 187 folder using the same descriptor name as used for the file name prefix. For example, for the tile with the unique id "6239_446" 188 the GeoTiff for the "dtm_10m" descriptor can be found in "dtm_10m/dtm_10m_6239_446.tif". The exceptions are the point 189 counts, vegetation proportions and point source information, please see the relevant sections below for more detail.

Completeness of the data set 202
The processing of the data set was almost completely successful. Processing failed on average for only 18 out of the 49673 203 tiles per descriptor with a maximum of 65 tiles failing for the canoy_height, normalized_z_mean and normalized_z_sd 204 descriptors. The majority of these tiles were located on the fringes of the data set, including sand spits, sandbanks etc, we 205 therefore did not attempt re-processing of those tiles. Instead, we generated NoData rasters for all missing descriptor -tile 206 combinations (i.e. we assigned -9999 to all cells in those tiles). We provide a text file listing the affected "NoData" tiles in the 207 folder of each descriptor (the file is named empty_tiles_XXX.txt, where XXX is the descriptor name). 208

Elevation-model derived descriptors 209
The following descriptors were solely derived from the 0. An orthophoto and the tile location relative to Denmark are shown in (a). The terrain model (dtm_10m) is illustrated in (b). 215 purposes, we amplified the altitude above sea-level by a factor of two in the 3D visualisations and divided the solar radiation 219 values by 10 5 . The 3D raster visualisations were generated using the rayshader v0.19.2 package in R (Morgan-Wall, 2020). 220 Orthophoto provided by the Danish Agency for Data Supply and Efficiency (https://sdfe.dk/hent-data/fotos-og-geodanmark-221 data/). 222

Elevation (dtm_10m) 223
We aggregated the 0.4 m DEM by mean to match the 10 m x 10 m national grid of the remainder of the data set. We used 224 gdalwarp to carry out the aggregations. Values represent the elevation above sea level in metres (DVR90, EPSG: 5799) 225 multiplied by a factor of 100, rounded to the nearest integer and converted to 16-bit integer. 226

Aspect (aspect) 227
The topographic aspect describes the orientation of a slope in the terrain and may, amongst other things, be related to plant 228 growth through light and moisture availability. We calculated the aspect in degrees, with 0° indicating North, 90° East, 180° 229 South and 270° West. Values represent the aspect derived from a 10 m aggregate of the elevation model (aggregated by mean 230 with 32-bit floating point precision). Calculations were carried out using gdaldem binaries and the "aspect" option, which by 231 default uses Horn's method to calculate the aspect (Horn, 1981). To avoid edge effects, all calculations were done on a mosaic 232 that included the focal tile and all available directly neighbouring tiles (maximum eight). The mosaic was cropped back to the 233 extent of the focal tile upon completion of the calculations. We then converted the value for each cell from radian to degrees, 234 multiplied it by a factor of 10, rounded to the nearest integer and stored the results as a 16-bit integer. Finally, we assigned a 235 value of -10 (-1°) to all cells where the slope was 0° (flat). Limitations in the aspect arise in relation to edge effects that occur 236 where a neighbourhood mosaic is incomplete for a focal tile (i.e., less than eight neighbouring tiles), such as for tiles along the 237 coastline or at the edge of the covered extent. For those tiles, no aspect can be derived for the rows or columns at the edge of 238 the mosaic. The cells in those rows and columns have no neighbouring cells and were assigned the NoData value (-9999). 239 Please also note that we calculated the aspect descriptor from the 10 m aggregate of the DTM/Terrain data set rather than

Slope (slope) 245
The topographic slope describes the steepness of the terrain and amongst other things may be related to moisture availability, 246 exposure and erosion. We derived the topographic slope in degrees with a 10 m grain size from a mean aggregate of the the focal tile and all available directly neighbouring tiles (maximum eight). The mosaic was cropped back to the extent of the 250 focal tile upon completion of the calculations. The value for each cell was converted from radian to degrees, multiplied by a 251 factor of 10, rounded to the nearest integer and stored as a 16-bit integer. Limitations in the slope arise in relation to edge 252 effects that occur where a neighbourhood mosaic is incomplete for a focal tile (i.e., less than eight neighbouring tiles), such as 253 for tiles along the coastline or at the edge of the covered extent. For those tiles, no slope can be derived for the rows or columns 254 at the edge of the mosaic. These cells in those rows and columns have no neighbouring cells and gdaldem assigns the NoData 255 value (-9999) to these cells. Please also note that we calculated the slope descriptor from the 10 m aggregate of the 256 DTM/Terrain data set rather than deriving it from the 0.4 m original resolution rasters and then aggregating it. The latter 257 approach could represent the aspect/slope at the original resolution better (Grohmann, 2015; Moudrý et al., 2019), but would 258 create inconsistencies within how the remaining DTM/Terrain descriptors are calculated in this dataset. 259

Landscape openness mean (openness_mean) 260
Landscape openness is a landform descriptor that indicates whether a cell is located in a depression or elevation of the 261 landscape. We calculate the landscape openness following Yokoyama et al.

Landscape openness difference (openness_difference) 277
In addition to the mean of the landscape openness, we also derived a landscape openness difference measure. This difference 278 measure is an indicator of whether a cell is part of a linear feature in the landscape that runs in one cardinal direction, such as a ridge or valley, therefore providing additional information to the landscape openness_mean descriptor. We calculated the 280 landscape openness difference based on the 10 m mean aggregate of the elevation model (32-bit floating point precision) and 281 with a search radius of 50 m. We chose these parameters as we consider them best suited to capture the relatively narrow 282 valleys and ridgetops common in the Danish landscape. First, we generated a mosaic including the focal tile and all available 283 tiles in the direct neighbourhood (max. eight neighbouring tiles) to reduce edge effects in subsequent calculations. We then 284 calculated the minimum and maximum of the positive landscape openness from all eight cardinal directions for all cells in the 285 mosaic using the OPALS Openness module with a search radius of 50 m (feature = 'positive', kernelSize = 5 , selMode = 1 286 for minimum and selMode = 2 for maximum). Next, we converted the minimum and maximum values from radian to degrees 287 and calculated the difference between the maximum and minimum value. We rounded the result to the nearest full degree. For 288 the cases where the neighbourhood mosaic was incomplete, i.e., containing less than eight neighbouring tiles, we masked out 289 all cells within the first 50 m of all edges with a missing neighbourhood tile. The final output mosaic was then cropped to the 290 extent of the focal tile and stored as a 16-bit integer GeoTIFF. As a consequence of the edge effect related masking, focal tiles 291 on the edges of the data set, such as those on coastlines or at the edge of the coverage area, have no data available for the first 292 50 m. 293

Solar Radiation (solar_radiation) 294
Incident solar radiation is a key parameter for plant growth as it represents the electromagnetic energy available to plants 295 required for photosynthesis. However, in the comparatively flat country of Denmark, shading by other vegetation likely exerts 296 a larger influence on photosynthetic activity than terrain related shading. Here, the impact of incident solar radiation on the 297 local climate likely plays a more important role for determining plant growth due to its influence on drought/water dynamics 298 where L is the centre latitude of the cell in degrees, S is the slope of the cell in degrees and A is the aspect of the cell in degrees. 304 The resulting estimate is given in: MJ x 100 -1 m -2 x yr -1 (McCune and Keon, 2002). Slope and aspect for each 10 m x 10 m 305 grid cell were sourced from the slope and aspect rasters. We saved the result as 32-bit integers. Due to propagation from the 306 calculation of slope descriptor, no solar radiation values can be calculated for cells found right on the edge of the data set, for 307 example in tiles situated along the coastline or at the edge of the sampling extent. 308 but this characteristic is probably better captured in our solar radiation descriptor (see above) that was developed to improve 311 shortcomings in the heat load index (McCune and Keon, 2002). However, in a previous study (Moeslund et al., 2019) we show 312 that -in Denmark -the index was moderately correlated with soil moisture, and can therefore serve as a useful indicator of the 313 amount of moisture available to plants. We calculated the heat load index based on the aspect rasters (described above) 314 following the equation specified in McCune and Keon (2002) using gdal_calc: 315 where A is the aspect in degrees. We stretched the result by a factor of 10000, rounded to the nearest integer and stored it as a 318 16-bit integer. As the heat_load_index is not meaningfully defined for flat cells (slope = 0° / aspect = -1°), we set the value of 319 those cells to NoData (-9999). Finally, for cells that are located on the outermost edges of the data set the heat_load_index is 320 not defined due to propagation of the NoData value assigned to the aspect in those cells. 321

Topographic wetness index (twi) 322
The topographic wetness index (TWI) provides a proxy measure of soil moisture or wetness based on the hydrological flow 323 modelled through a digital terrain model. Here, we derived the TWI following the method recommended by Kopecký et al. 324 (2020). We based our calculations on the aggregated 10 m elevation model (dtm_10m, 16-bit integer) and used a 325 neighbourhood mosaic (max. 8 neighbours) for each focal tile to derive the TWI. The exact procedure is detailed in the next 326 paragraph. As such the index values calculated by us only consider a catchment the size of one tile and all its neighbours (for 327 non-edge tiles this is a 3 km x 3 km catchment, for edge tiles it is smaller depending on the completeness of the neighbourhood 328 mosaic). We then cropped the resulting output back to the extent of the focal tile, stretched the TWI values by a factor of 1000, 329 rounded to the next full integer and stored the results as a 16-bit integer. 330 We calculated the TWI using SAGA GIS v. 7.8.2 binaries. First, we sink-filled the neighbourhood mosaic of the terrain model 331 using the ta_preprocessor 5 module and the option "MINSLOPE 0.01" (Wang and Liu, 2006). Second, we calculated the flow 332 accumulation based on the sink-filled neighbourhood mosaic of the terrain model (from step one) using the ta_hydrology 0 333 module with options "METHOD 4" and "CONVERGENCE 1.0" (Freeman, 1991;Quinn et al., 1991). Third, we derived the 334 flow width and specific catchment area based on the sink-filled neighbourhood mosaic of the terrain model (from step one) 335 and the flow accumulation (from step two) using the module ta_hydrology 19 (Gruber and Peckahm, 2008;Quinn et al., 1991). 336 Fourth, we calculated the slope based on the sink-filled neighbourhood mosaic of the terrain model (from step one) using the 337 ta_morphometry 0 module with option "METHOD 7" (Haralick, 1983). Finally, we derived the TWI based on the specific documentation (SAGA-GIS Tool Library Documentation v7.8.2, 2021). 341 The TWI descriptor calculated for EcoDes-DK15 is subject to two main limitations: edge effects and small catchment size. 342 Tiles with incomplete neighbourhoods (i.e., less than 8 direct neighbours are available) will suffer from edge effects in the 343 direct vicinity of the relevant border and overall due to a reduced catchment size. Furthermore, even in the ideal case of the 344 neighbourhood being complete, for most cells flow accumulation is therefore only calculated for the direct neighbourhood of 345 a focal tile, comprising a 3 km x 3 km catchment area. While we hypothesize that, due to the relatively low variation in 346 topography in Denmark, the TWI based on this comparably small catchment area will serve as a reasonable proxy for terrain-347 based wetness in most cases, it may be less reliable in areas with exceptionally high variation in topography or for lakes and 348 rivers with large catchments. In addition, we would like to point the reader towards the general limitations of the TWI as a 349 proxy for soil moisture or terrain wetness as for example discussed by Kopecký et al. (2020). These general limitations should 350 be taken into account when interpreting the TWI values provided in EcoDes-DK15. 351

Point-cloud derived descriptors 352
The DHM/Point-cloud point cloud was pre-classified into eleven point categories (Geodatastyrelsen, 2015) following the 353 ASPRS LAS 1.3 standard (ASPRS, 2011). For the EcoDes-DK15 data set, we restricted the analysis to six of these classes, 354 including ground points ("Terraen") -class 2, water points ("Vand") -class 9, building points ("Bygninger") -class 6, as well 355 as low ("lav"), medium ("mellemhøj") and high vegetation ("høj vegetation") -classes 3, 4 and 5, respectively. We to four different sources. We therefore recommend using the two amplitude descriptors with care, and -if possible -in 412 conjunction with information on the point source ids contained in the point_source_info descriptors described below. 413

Canopy height (canopy_height) 414
Canopy height is a key parameter of vegetation structure related to biomass and ecosystem functioning. We derived the canopy 415 height in metres as the 95th-percentile of the normalised height above ground of all vegetation points within each 10 m x 10 416 m cell using the OPALS Cell module. The resulting canopy heights were multiplied by a factor of 100, rounded to the nearest 417 integer and stored as 16-bit integers. In cases where there were no vegetation points in any given cell, we set the canopy height 418 value of the cell to zero metres. Please note that the canopy height is therefore also set as zero metres even if there are no points 419 present in the cell at all (such as ground or water points). Furthermore, our algorithm calculates the canopy height even if there 420 is only a small amount of vegetation points in a cell. In rare cases, this might lead to erroneous canopy-height readings if 421 vegetation is found on artificial structures or points have been mis-classified. For example: A tall communications tower can 422 be found just south of Aarhus and returns from the tower were miss-classified as vegetation. The resulting canopy height for 423 this cell is calculated as > 100 m above ground, which would not make sense if interpreted as a height of the vegetation above 424 ground. For such cases, the building proportion descriptor may be used to separate cells with artificial structure from those 425 with vegetation only. See also the "normalized_z" descriptor below for a closely related measure. 426

Normalised height -mean and standard deviation (normalized_z_mean and normalized_z_sd) 427
Similar to the canopy height descriptor, the normalised height describes the structure properties of the point cloud above 428 ground. The key difference between the two descriptors is that for the normalised height we also included non-vegetation points (buildings & ground) and derived the summary statistic as the mean rather than the 95%-quantile. For the normalised 430 height descriptor, we also provide a measure of variation in form of the standard deviation. Specifically, we calculated the 431 normalised mean and the standard deviation of the mean height above ground (normalised z attribute) for all points in each 10 432 m x 10 m grid cell using the OPALS Cell module. The results were multiplied by 100, rounded to the nearest integer and stored 433 as 16-bit integers. We used the normalised z attribute generated during the ingestion of the point cloud reflecting the height of 434 a point relative to the ground level determined by the DHM/Terrain raster. Here, all points refer to all points belonging either 435 to the ground, water, building or vegetation class. By definition the normalised height mean will be highly correlated with the 436 "canopy_height" descriptor for cells where mainly vegetation points are present. We kept the American spelling of the 437 descriptor name for legacy reasons with previous versions of the data set.  water_point_count_-01m-01m -1 m to 1 m water points (class 9) ground_and_water_point_count_-01m-01m -1 m to 1 m ground and water points (classes 2,9) vegetation_point_count_00m-50m 0 m to 50 m vegetation points (classes 3,4,5) building_point_count_-01m-50m -1 m to 50 m building points (class 6) total_point_count_-01m-50m -1 m to 50 m ground, water, vegetation and building points (classes 2,3,4,5,6,9)

Vegetation density or total vegetation proportion (vegetation_density) 476
Vegetation density is an important component of ecosystem structure. Here, we calculated the vegetation density as the ratio 477 between the vegetation returns across all vertical height bins (vegetation_point_count_00m-50m) and the total point count 478 (total_point_count_-01m-50m). Calculations were done using gdal_calc based on the two point count rasters (Sect. 3.3.5). 479 Results were multiplied by 10000, rounded to the nearest integer and stored as 16-bit integers. In addition to actual difference 480 between vegetation density in a cell, the vegetation_density descriptor is also influenced by the canopy properties, e.g. a dense 481 upper layer will prevent penetration of the light beam to lower layers or even the ground, and the points sources within a cell, 482 e.g. multiple sources from different viewing angles provide a more complete estimate of the vegetation density. These 483 additional influences are important to keep in mind when interpreting the vegetation_density descriptor. 484

Canopy openness, or ground and water proportion (canopy_openness) 485
Canopy openness is an important ecological descriptor particularly of forest canopies, as it describes the amount of light 486 penetrating through to the levels of the canopy. To some degree the canopy openness serves as the inverse for the vegetation 487 density. For EcoDes-DK15, we calculated the canopy openness of a 10 m x 10 m cell as the proportion of the ground and water points (ground_and_water_point_count_-01m-01m) to the total point count (total_point_count_-01m-50m) within the cell. 489 The raster calculations were done using gdal_calc. Results were multiplied by 10000, rounded to the nearest integer and stored 490 as 16-bit integers. Please note that the same considerations as for the vegetation_density descriptor (Sect. 3.3.7) regarding 491 canopy properties and differences in point sources between the cells apply when interpreting the canopy_openness descriptor. 492 In addition, it is important to note that building points will reduce the canopy openness the same way that vegetation points 493 would. 494

Building proportion (building_proportion) 495
In a densely populated country such as Denmark, buildings form an important part of the landscape. For ecological studies the 496 distance to buildings, their presence, absence or density may be of relevance. The building_propotion descriptor of EcoDes-497 DK15 provides a proxy for how much building infrastructure can be found within a 10 m cell. We calculated the descriptor as 498 the number of building points (building_point_count_-01m-50m) divided by the total number of points (total_point_count_-499 01m-50m) within each cell using gdal_calc. Results were multiplied by 10000, rounded to the nearest integer and stored as 500 16-bit integers. While most returns from three dimensional infrastructure are classified as buildings in the DHM/Point-cloud, 501 we would like to highlight that many roads are classified as ground (class 2) and some structures such as pylons and power 502 lines were assigned a separate class (not described in (Geodatastyrelsen, 2015). These structures are therefore not included in 503 the building_proportion descriptor. We would further like to note that the majority of building points are likely based on returns 504 from the roofs of the buildings. Walls and other vertical structures are probably represented at a lower frequency in the point 505 clouds. Finally, we would like to point the reader to the "DCE Basemap" (Levin, 2019) which may assist in the identification 506 of basic land cover types that include buildings and other manmade structures. 507

Auxiliary data 508
In addition to the terrain and point cloud derived descriptors we provide three sets of auxiliary data with EcoDes-DK15. These 509 are four layers of ALS point source information, a mask for inland water and a sea mask, as well as a shapefile of the footprints 510 of the 1 km x 1 km tiles in the data set and their unique identifier. 511

Point source information 512
The point source attribute of the DHM/Point-cloud represents differences between sensor units or aircrafts that may have been 513 used during the nationwide LiDAR campaign, differences in the acquisition time and date and differences in the viewpoint or 514 acquisition angle of the cells. To aid in interpretation of descriptors that may be particularly influenced by point source, like 515 the amplitude descriptors or the vegetation proportions, we provide summary information about the point sources within each 516 10 m x 10 m cell. We summarised this information in four descriptors, the "point_source_counts", "point_source_ids", 517 "point_source_nids" and "point_source_proportions". For each tile (file name suffix = tile id), these descriptors are found in 518 four subfolders bundled up in the parent "point_source_info" folder.
point_source_ids -Multi-layer raster containing one 16-bit integer layer for each point source id found in a tile. If a point 520 with a given point source id is present the value of the cell is set to the point source id (an integer number) in the respective 521 layer for the point source id, otherwise the value of a cell is set to 0. This multilayer raster can be used to match the file names 522 of the point_source_counts and point_source_proportions rasters to a given point source id. Point source ids were extracted 523 using Opals Cell. Calculations were carried out using gdal_calc. The final proportions were multiplied by a factor of 10000, rounded to the 537 nearest integer and stored as 16-bit integers. 538

Water masks (inland_water_mask and sea_mask) 539
We also provide rasterized water masks for use cases that require masking inland water bodies or the sea. To represent all 540 permanent lakes in Denmark, we merged three shapefiles containing (1) lakes protected by the Danish nature protection 541 legislation ( §3, available at https://arealinformation.miljoeportal.dk), (2) other valuable lakes (available on request at the 542 Danish Farming Agency in the "good farming and environmental condition" data set) and (3) a layer containing the remaining 543 rather small lakes and ponds (GeoDanmark, https://kortforsyningen.dk/). The combined shapefile is provided on the GitHub 544 code repository (see below). We then burned the geometries within the shapefile into the 10 m x 10 m grid using gdal_rasterize. 545 The masks are binary, a cell value of 1 indicates land and a value of -9999 (NoData) indicates sea or inland water, respectively. 546 When using the masks please consider that the shape, presence and absence of water bodies and coastlines may fluctuate over 547 time. We created the masks to present a snapshot of the water bodies as close as possible to the time point of the DHM/Point-548 cloud acquisition (spring 2014 -summer 2015), but inaccuracies may still arise. When combining the data with more recent 549 observations, keep in mind that inland water bodies and coastlines may have changed since then. Finally, while we aimed to produce the inland water mask to be as comprehensive as possible, some small ponds and water bodies may have been missed.
The time point at which the source data was collected may be of interest to certain applications that are using EcoDes-DK15 555 for the end-users conducting ecological research. Furthermore, determining the date_stamps was not possible for a proportion 572 of tiles where the GPSTime in the source data was not converted from seconds per GPS week to GPS time in seconds since 6 573 January 1980. A post-hoc conversion is not possible without the knowledge of the exact GPS week number, which is not 574 provided in the source data. In these cases, we assigned the NoData value to the date_stamps. The majority of the tiles affected 575 is located in the areas around Mols Bjerge and Sønderborg (Fig. 5). However, from auxiliary information about the source data 576 sets we know that these areas were surveyed April-May 2015 and October 2014, respectively. 577

Footprint file (tile_footprints.shp) 587
To assist data access and creation of data subsets, we have produced an ESRI shapefile containing the footprints of all 1 km x 588 1 km tiles in the EcoDes-DK15 data set. The shapefile was generated based on the "dtm_10m" rasters and the tile identifier of 589 each footprint geometry is specified in the "tile_id" attribute column. 590 4. Data access and ecological use case example 591

Data access and handling 592
Depending on the extent of the study, it may be preferable to work with a subset of the data set rather than the nationwide VRT 593 files (Fig. 6). We suggest starting by identifying the relevant EcoDes-DK15 descriptors of interest, then retrieving the relevant 594 data from the repository and decompressing the archives (instructions provided on data repository). If the study area of interest 595 covers a large fraction of Denmark's extent and sufficient processing power is available, the nationwide VRT data should 596 provide the most convenient access to the selected descriptors. However, if the study area does not cover a large proportion of 597 Denmark, then we suggest sub setting the data using the tile footprints to decrease demands on computational resources. After 598 sub setting, local / regional VRT files or mosaics can be generated if needed. We provide an example R script illustrating how 599 this sub setting could be done for the use case example shown in the next section on the code repository 600 (manuscript/figure_7/subset_data set.R). We have also made the resulting subset available as a "teaser" (5 MB) to help the 601 reader assess the value of EcoDes-DK15 without having to commit to the multi-gigabyte download of the complete data set 602 (see Sect. 6). 603 604 Figure 6: Schematic chart of two possible approaches for accessing and integrating EcoDes-DK15 data into ecological studies. 606 The first step is to identify which descriptors are of interest, these descriptors can then be downloaded from the Zenodo 607 repository and decompressed. Next a decision needs to be made whether the whole data set (nationwide) or only a subset of 608 the tiles is required (e.g., a regional study). As the whole data set is relatively large (~94 GB), storage and processing limitations 609 need to be taken into account when planning data processing and handling. If a subset of tiles is sufficient for a study, the 610 provided tile footprints can be used to identify which tiles are required based on a geometry (e.g., a shapefile) of the study 611 region(s). Finally, for easy data handling in subsequent analysis, a mosaic of the selected tiles can be created. For nationwide 612 use we provided virtual mosaics (VRT files) containing all tiles for the descriptors. An R script illustrating how the sub setting 613 can be done for a regional study can be found on the GitHub repository: https://github.com/jakobjassmann/ecodes-dk-614 lidar/blob/master/manuscript/figure_7/subset_dataset.R. 615 Figure 7 illustrates a use case for the EcoDes-DK15 data set with an example of an ecologically motivated landscape 617 stratification of the "Husby Klit" old-dune protected area in western Denmark. We developed this stratification for a group of 618 Master's projects carrying out vegetation monitoring in the area. Our aim was to capture the variation in the dominant 619 vegetation based on vegetation structure as well as the variation in fine-scale topography created by the dune systems across 620 the landscape. In addition to using the descriptors already provided, the stratification required us to derive a topographic  (Weiss, 2001) from the terrain model (dtm_10m). Following (Weiss, 2001) we then classified each cell based on 633 the scaled TPI into three categories. A scaled TPI below a value of -0.5 was classified as a "trough or lower-slope", a scaled 634 TPI between -0.5 and 0.5 as "mid-slope or flat", and a scaled TPI above 0.5 as a "ridge or top". For the vegetation structure 635 component (b), we calculated the proportion of returns in three simplified height bins: 1) 0 m to 1.5 m, 2) 1.5 m to 3.0 m and 636 3) 3.0 m -50 m. Here we included both ground and vegetation returns as the divisor for the standardisation, but not the returns 637 from buildings or water. Based on a priori knowledge we deduced that there are three dominant vegetation communities within 638 the protected area: communities dominated by grass and heath with vegetation growth generally below 1.5 m, communities 639 dominated by shrubs and small trees (including the invasive Pinus mugo) with vegetation growth predominantly below 3.0 m, 640 and communities dominated by trees (including the native Pinus sylvestris), generally with growth above 3.0 m. We used this 641 knowledge to assign the three vegetation classes based on the proportion of point returns in the simplified height bins. For the 642 "grass and heath" class we used a strict cut off with no points present above 1.5 m. For the "shrubs and small trees" class we 643 used a fuzzy cut off allowing the proportion of points in the 3.0 m and above bin to reach up to 10% of the maximum proportion 644 found in this heigh bin. All remaining cells were then assigned to the "trees" class. Finally, we combined the two classifications 645 into one as illustrated in c). Panel d) shows the location of the protected area within Denmark. The 3D raster visualisations 646 were generated using the rayshader v0.19.2 package in R (Morgan-Wall, 2020). 647

Discussion -limitations and future perspectives 648
Our data set demonstrates how the complex information in ALS point cloud data sets spanning more than 40.000 km 2 , can be 649 condensed into a compact data set of rasterized descriptors of interest for ecological studies. For the whole of Denmark, we 650 provide 70 raster layers representing eighteen measures that describe a snapshot of vegetation height, structure and density, as 651 well as topography and topography-derived habitat characteristics, including slope, aspect, solar radiation and wetness for the 652 time period 2014-2015. These measures are of direct relevance for ecological research on species' habitat characteristics, 653 distribution modelling, biodiversity and conservation applications. Condensing the ALS derived information into a compact 654 set of raster descriptors makes it more accessible to the community of ecological researchers and practitioners, allowing them 655 to access information on the vertical structure of vegetation and terrain otherwise difficult to obtain for large extents such as 656 those of a whole country. 657 We would like to highlight some key ecological and physical limitations that should be kept in mind when using the data or 658 derivatives. Firstly, we were able to only carry out a simple qualitative assessment of the errors in the EcoDes-DK14 data set 659 within the scope of this project. All descriptors should therefore be seen as proxies for the geographical and biological 660 properties they describe. Errors in the original point cloud and DTM will have propagated through to the final descriptors and 661 future studies are needed to assess to which degree the proxy measures correlate with in-field data. Furthermore, the EcoDes 662 data set is a snapshot in time representing the state of the vegetation in the one and a half years between spring 2014 and 663 summer 2015 (with some exception in western Jutland, where the data is from 2013). Like anywhere on Earth, the landscapes 664 of Denmark may change over time and by the time point of publication of this data set over 5 years may have passed since the 665 collection of the source data. External data sources containing information about on-going or past changes (such as satellite 666 imagery -see below) might help overcome this bias. Additionally, the geographical differences in the timing of the point cloud 667 collection across the country (see Sect 6.3.4) may introduce noise and could affect cross-comparability of the data between 668 regions, for example due to seasonal differences in foliage (see e.g., Leiterer et al., 2015). Furthermore, there are implicit 669 limitations in spatial scale due to the set grain size of the data set. We chose a 10 m x 10 m grid for efficiency in computation 670 and data handling, as well as to overcome limitations in the density of the source point cloud (four to five points per m 2 ). Our 671 data set might therefore not serve well for capturing some ecological relevant variation in terrain and vegetation structures at 672 scales below the 10 m x 10 m grain size. We believe that our data set is nonetheless valuable in providing ecologically relevant 673 information at the geographical extent of Denmark. 674 While some of the descriptors in the presented data set such as elevation, slope and vegetation height are quite straightforward 675 to interpret, the ecological meaning of other descriptorsfor example those related to vegetation structuremay not be as 676 obvious as they are influenced by multiple ecological and sensing methodology related factors. The amplitude, point count 677 and point proportion descriptors are amongst those measures. For example, while the (non-calibrated) amplitude in the 678 DHM/Point-cloud source data may generally relate to the reflectance properties of the surface that generated the return, the 679 incident light angle, scattering and subsequent generation of echoes may result in several different surfaces generating similar 680 amplitude signatures. Furthermore, the point counts may be influenced by a whole suite of factors, including incident light 681 angle, scattering, density of flight strips covering a given cell, as well as canopy properties -most importantly the penetration 682 ability. While standardising the point counts as proportions to the total counts may help to account for some of these factors, 683 it is likely that notable uncertainties will remain even in the proportions especially for lower layers of the canopy. Nonetheless, 684 we believe that these measures can be informative if appropriate care is taken in their interpretation. 685 Two code developments could enhance the EcoDes-DK15 processing workflow in efficiency and transferability: using gdal 686 Python bindings and switching to an open-source point cloud handler. First, for practical reasons we reverted to using gdal 687 binaries rather than the Python bindings as we encountered issues with the gdal bindings provided by the OPALS shell on our 688 computational server. Solving this issue and using the bindings instead of the binaries could reduce hard drive access time and 689 overheads from launching subprocesses and therefore potentially speed up the raster manipulations in the workflow. However, 690 as the point cloud processing takes the majority of time (we estimate 75-80%) we did not invest further resources to do so in 691 the first development round. Secondly, while our Python source code is open source and freely available, OPALS itself requires 692

Code availability 722
The source code for the processing pipeline is openly available under a simplified BSD license via GitHub: 723 https://github.com/jakobjassmann/ecodes-dk-lidar 724

Conclusions 725
Open data sets like EcoDes-DK15 will allow ecologists with limited computational resources and little expertise in handling 726 LiDAR point clouds to use large-scale ALS data for their research. We see our efforts not only as a first step for providing 727 ready-to-use descriptors of local vegetation and terrain features, but also for providing an example workflow and tools that 728 allow for the replication of the processing. We have described and documented the measures of terrain and vegetation structure 729 contained in the data set and pointed out possible applications and limitations. We are confident that EcoDes-DK15 provides 730 a meaningful collection of ecological descriptors at a 10 x 10 m resolution for the extent of a whole country and we encourage 731 the community to use our workflow and collection of codes as inspiration to process other large-scale ALS data sets in a similar 732 manner. Ultimately, we hope the publication of this data set will help facilitate the uptake of ALS-derived information by 733 ecological researchers and practitioners in Denmark and beyond. 734

Author contributions 735
All co-authors developed the data set with focus on its ecological relevance, providing input on the ecological meaning, spatial 736 scale and calculation of the descriptors. JJA developed the code with input from JEM. JJA carried out the computations. JJA 737 led the writing of the manuscript, with all co-authors contributing to the manuscript in a collaborative manner. SN provided 738 funding and supervision for this project. 739

Competing interests 740
The authors declare that they have no conflict of interest. 741

Acknowledgements 742
We would like to thank Andràs Zlinszky for his contributions to earlier versions of the data set, Charles Davison for feedback 743 regarding data use and handling, as well as Matthew Barbee and Zsófia Koma for sharing their insights on the source data 744 merger and Zsófia's script to generate summary statistics for the different versions of the DHM point clouds. Funding for this 745 work was provided by the Carlsberg Foundation (Distinguished Associate Professor Fellowships) and Aarhus University 746