Development of a standard database of reference sites for validating global burned area products

Over the past two decades, several global burned area products have been produced and released to the public. However, the accuracy assessment of such products largely depends on the availability of reliable reference data that currently do not exist on a global scale or whose production require a high level of dedication of project resources. The important lack 20 of reference data for the validation of burned area products is addressed in this paper. We provide the first publicly available Burned Area Reference Database (BARD) that was created by compiling existing reference BA datasets from different international projects. BARD contains a total of 2,661 reference files derived from Landsat and Sentinel-2 imagery. All those files have been checked for internal quality and are freely provided by the authors. To ensure database consistency, all files were transformed to a common format and were properly documented by following metadata standards. The goal of generating 25 this database was to facilitate BA algorithm developers and product testers reference information that would help to develop or validate new BA products. BARD is freely available at: https://doi.org/10.21950/BBQQU7 (Franquesa et al., 2020).

2 comparing our results to reference data, assumed to represent the actual conditions of the target variable at the satellite overpass time. In the case of global studies, it is very difficult to generate reference data for the wide variety of planetary conditions, 35 thereby complicating validation. Some of the global variables (e.g. temperature and surface radiation) can be validated from ground sensor networks, such as weather stations, buoys or Aerosol Robotic NETwork (AERONET) sensors. Other variables are more difficult to validate, as they require generating global reference data that are based on higher-resolution sensors than those used to obtain the global product. This is the case of land cover or burned area products, which require first designing a sample strategy using statistically valid protocols and then extracting from the selected sites the reference polygons to be 40 compared with the global datasets. Despite the time and effort required to derive reference datasets, accuracy assessment is a critical part of any global RS project and making these reference datasets publicly available will facilitate product comparison and lower the burden of validating future products.
Several global burned area (BA) products have been produced in the last two decades, providing an estimation of fire activity worldwide (Chuvieco et al., 2019). The first of these products was the Global Burned Area (GBA2000), based on daily 45 VEGETATION (VGT, 1 km resolution) images acquired in the year 2000 and was generated by the Joint Research Centre of the European Union (Grégoire et al., 2003). The same year, the European Space Agency developed the GLOBSCAR BA product, also at 1 km 2 , derived from daytime ERS-2 (European Remote Sensing Satellite) ATSR-2 (Along Track Scanning Radiometer) data (Simon et al., 2004). Other 1 km resolution global BA products released by European projects include the L3JRC (Tansey et al., 2008) covering the period from 2000 to 2007; GlobCarbon (Plummer et al., 2006), produced from 1998 50 to 2007; and the Copernicus GIO_GL1_BA products. These three products were derived from VGT images, although in the GlobCarbon project, ATSR images were used as well. More recently, the FireCCI (Climate Change Initiative) project (https://esa-fire-cci.org, last access: 25 March 2020), part of the European Space Agency (ESA) CCI programme, has generated three global BA products, based on Medium Resolution Imaging Spectrometer (MERIS) at 300m resolution (FireCCI41: Alonso-Canas and  and Moderate Resolution Imaging Spectroradiometer (MODIS) 250m data (FireCCI50: 55 Chuvieco et al., 2018 andFireCCI51: Lizundia-Loiola et al., 2020). NASA (National Aeronautics Space Administration) released in mid-2008 the MCD45A1 product derived from 500 m MODIS imagery (Roy et al., 2008), which has now been superseded by MCD64A1 at the same resolution but with a different BA algorithm approach (Giglio et al., 2009;. These global BA products have been validated by comparing them with reference data generated from medium resolution sensors (such as those on board the Landsat, SPOT (Satellite Pour l'Observation de la Terre), or Sentinel-2 missions). These 60 reference data were typically derived from multitemporal pairs of images to properly date the validation period.
According to the representativeness of samples used to perform product validation, the CEOS-WGCV Land Product Validation (LPV) subgroup defined four validation stages with the level of sampling effort and statistical rigor increasing at each stage (https://lpvs.gsfc.nasa.gov/, last access: 25 March 2020). Early validation exercises were subjected to a first stage validation, usually based on small samples of reference sites that were not selected using a probability sampling design, but rather by a 65 purposeful or convenience selection based on data availability or expert knowledge to ensure diverse wildfire conditions were included in the sample (Tansey et al., 2004;Roy et al., 2005). Roy and Boschetti (2009), for instance, reported validation results for the MCD45A1 product using a set of 11 Landsat scenes distributed across southern Africa. Chuvieco et al. (2008) validated a regional product for Latin America using 19 Landsat scenes and 9 China-Brazil Earth Resources Satellite (CBERS) scenes that were donated by regional space agencies when access to the Landsat archive was not yet free and open to the 70 public, thereby limiting the number of selected validation sites. The MCD64A1 Collection 5 was not formally validated, and the most recent MCD64A1 Collection 6 products were first validated using a set of 108 Landsat scenes distributed across a wide range of fire-affected ecosystems but not selected via probability sampling (Giglio et al., 2018). A recent study has provided a validation of the MCD64A1 product implementing a probability sampling design and using Landsat-8 Operational Land Imager (OLI) images, but only for a single year . Previous statistical validation of NASA and 75 FireCCI BA products were conducted by Padilla et al. (2014; using a set of 105 randomly selected Landsat scenes for a single year (2008) and by Chuvieco et al. (2018) using a multitemporal reference dataset of 12 years. Other projects covering large areas have been developed in the USA using Landsat data across six years (Vanderhoof et al., 2017) and Africa using Sentinel-2 Multispectral Instrument (MSI) images (Roteta et al., 2019) where validation sites were selected through probability sampling. In all cases, reference datasets were created based on independent interpretation of BA, controlled by visual 80 inspection. The importance of applying probability sampling to collect reference data has been highlighted by different authors as a critical feature of the sampling design protocol to achieve statistically rigorous assessment (Stehman, 2001;2009;Olofsson et al., 2014;Stehman and Foody, 2019). Thus, in contrast to such reference data collected by convenience, ease of access, or other methods that lack randomization, data collected through probability sampling makes it possible to obtain rigorous estimates of accuracy. 85 The main bottleneck for validating global BA products or global BA algorithms is the generation of reference BA datasets. To facilitate the activity of BA algorithm developers, this paper aims to present and deliver to the scientific community the Burned Area Reference Database (BARD), a set of reference BA perimeters that can be used as reference data for validation of BA products or to help the development of BA algorithms (obviously, the same files cannot be used for both training and validating an algorithm). These validation files were compiled from different international projects and years, therefore the resulting 90 database will facilitate the assessment of BA algorithms in a wide range of ground conditions. The BARD includes the following datasets of reference data: FireCCI global (2008) The paper presents the methods that were used to generate the BA reference data paying particular attention to the sampling design and reference data retrieval methods applied to the different BARD datasets. The data specifications to transform all the files to a common standard format and file structure are then presented. Finally, a detailed description of each dataset included in BARD is provided and the main dataset features are then summarized to facilitate a general overview. 4 2 Methods 100

Selection of validation sites: sampling design
High-quality reference data generation is an expensive and time-consuming task, which constrains the total number of validation sites that can be established in any validation exercise. For this reason, sampling design is critical to make the most of the resources available and ensure the highest precision of accuracy estimates given the available resources to generate reference data. Padilla et al. (2014; implemented a stratified random sampling design that allowed for global BA 105 accuracy inferences for the first time. Boschetti et al. (2016) Boschetti and Roy (2008).
The sampling design protocols to validate BA products were therefore developed considering the rarity and ephemeral nature of the BA, which is indeed a special case of land-cover change (Stehman and Foody, 2019). When selecting samples for obtaining probability inferences, the allocation of samples should follow a probability sampling design, to compute unbiased population estimates. For BA product validation, this implies selecting samples considering the spatial and temporal 115 dimension. The spatial dimension of sampling units is usually defined by the Thiessen scene areas (TSAs) constructed by Cohen et al. (2010) and Kennedy et al. (2010) specifically for use with Landsat WRS-2 frames (Worldwide Reference System, Fig. 1a). The key advantage of TSAs is that they provide non-overlapping Landsat-like frames, which allow for a convenient computation of unbiased estimators (Gallego, 2005). The temporal dimension of sample units is defined by the acquisition dates of the pre-and post-fire images. For example, in Boschetti et al. (2019), the validation period (1 year) was divided into 120 equal temporal size sampling units using the 16-day Landsat 8 acquisition interval, thus allowing for the temporal random selection of the reference images. This temporal partitioning, also makes it possible to intensify the sample in strata that comprise the fire season and where burning is more likely to occur (Stehman and Foody, 2019). However, longer period intervals (>100 days) are used to define sampling units to allow a long temporal overlap of reference data with the BA product, which helps to disentangle the spatial errors from the temporal errors of the BA product (Roteta et al., 2019;Lizundia-Loiola 125 et al., 2020).
In any case, sample units are then stratified to properly represent the variety of conditions that affect the accuracy of BA products. This stratification is usually based on (a) major Olson biomes (Olson et al., 2001) (Fig. 1b) and (b) the BA extent provided by a global BA product considered to be reliable or active fire detections, assigning each sample unit to high or low BA strata based on a threshold that can be specifically adapted to each biome stratum as in Padilla et al. (2017) or simply set 130 as the 20 th quantile of the cumulative distribution of active fire counts as in Boschetti et al. (2016;. One of the advantages of the stratified sampling design adopted for BA maps validation previously mentioned was that it allows for rigorous estimates of global BA accuracy. However, another key advantage of stratified random sampling design that should be strongly emphasized is that it makes it possible to increase the sample size of an initial global sample for specific regions or rare land-cover classes (Stehman et al., 2012). This is the case of the CONUS Landsat Burned Area (1988-2013) 135 dataset where reference sites for the CONUS extent were augmented based on the initial sample of the FireCCI global (2008) dataset.
Stratified random sampling design was applied to several datasets included in BARD: FireCCI global (2008) the only dataset of BARD that was obtained through convenience sampling rather than probability sampling.
To report BA accuracy from these stratified sample datasets, users should apply the proper estimation formulas detailed in the associated articles (see Table 2) and use the additional information as the stratum of each sampled unit and the stratum sizes of the stratified sampling, provided in the metadata files and tables of appendix A, respectively.

Reference data generation methods 145
Following the recommendations of the CEOS Calibration/Validation group, all the burn perimeters of BARD were derived from multitemporal comparison of medium resolution satellite imagery (Landsat TM (Thematic Mapper)/ETM+ (Enhanced Thematic Mapper plus)/OLI or Sentinel-2 MSI). Burned patches included in the files are only those that occurred in between the two satellite images used to generate the reference data (Fig. 2). The procedures implemented to obtain those burned patches are diverse, depending on the dataset, but all include a semi-automatic procedure (e.g. Bastarrika et al., 2011) and then 150 a visual inspection to confirm that the detected perimeters were actually burned areas. In some cases, the semiautomatic classification was enhanced with polygons manually digitized. In several cases, this visual inspection was confirmed by another interpreter to double check the quality. When parts of the scene could not be observed or interpreted because of clouds or sensor problems (i.e. Scan Line Corrector (SLC)-off problems of ETM+), either in the pre-or post-fire images, they were classified as no-data. This was done to make sure that only areas with reliable data were included in the reference files. 155 Regarding 'unburned' category of reference data, different criteria were applied to label seas and inland water bodies in the It should be noted that reference data are not just high accuracy BA products generated by well-designed algorithms using 160 medium-or high-resolution imagery. Rather, reference data following international standards should provide reliable burned area but also the unburned surface of the interpreted geographic region and the unobserved/unmapped areas within the region, as shown in Fig. 2c.

6
Like the sampling units from which reference data are derived, reference data can be defined by its spatial and temporal dimension. The spatial dimension is a function of the geographic extent interpreted to obtain the reference data, where the size 165 varies depending on the criteria adopted in each project. For example, reference data from the FireCCI global (2003)(2004)(2005)(2006)(2007)(2008)(2009)(2010)(2011)(2012)(2013)(2014) dataset were spatially defined by a frame of 30 x 20 km located at the centre of the Landsat images, whereas the entire Landsat scenes were used in the case of the CONUS Landsat Burned Area (1988-2013) dataset. The spatial extent used in the datasets included in BARD will be specified in section 2.4 where a detailed description of each dataset is provided.
The temporal dimension of the reference data represents the period defined by the acquisition date of the pre-and post-fire 170 images used to generate them. Regarding the temporal length of the reference data, the FireCCI project adopted the terms 'short unit' (SU) and 'long unit' (LU). The former refers to those reference data derived from a pair of consecutive images separated by 16 days or less (the temporal span between two Landsat acquisitions). The latter is defined by a series of consecutive SUs covering at least 100 days. LUs allow for long temporal overlaps between validation and product data, reducing or minimizing the impact of the product's temporal reporting accuracy in the accuracy estimates (Padilla et al., 2018). 175 The combined use of SUs and LUs is useful to assess such and contextualize impact (Lizundia-Loiola et al., 2020). A LU BA map consists in the combination of consecutive SU maps (Fig. 3). A pixel classified as no-data in any of the SU maps is kept as such in the LU BA map. This is to ensure that any pixel available data is observed frequently (every 16 days or less) and an eventual burn is not missed due to simply a fast recovery of the vegetation. The permanently observed pixels, were classified as burned in the LU if they were detected as burned in any SU of the time series covered by the LU. The presence of no-data 180 (e.g. due to clouds) in a single image may reduce drastically the spatial cover of available data in the resulting LU. Therefore, BA maps are generated for every single SU, but the BA map for a LU is generated by accumulating the consecutive SUs of the same TSA. The length of the LU would depend on the existing cloud-free consecutive SUs. For example, if 8 consecutive SUs, all covering the same temporal length (e.g. 16 days) are cloud free and the 9 th image has 90% of the area cloud covered, the LU would include only the first 8 SU maps, even if SU were generated for the 9 th and 10 th consecutive images. 185 As burning is detected on any given single image in between the period covered by two satellite acquisitions, all burned patches are dated based on the second reference image of a multitemporal pair. Therefore, SUs will have the same date for all the burned patches, while LU reference data will have burned patches from different dates as multiple pairs of images are used to build the LU (Fig. 3).
Among the datasets included in BARD, SUs were used in the FireCCI global (2003-2014) dataset as part of the sampling 190 design, and LUs were used for the FireCCI Africa (2016) dataset. Reference data from the rest of the FireCCI project datasets (FireCCI global (2008) and FireCCI Africa S2 (2016)) and CONUS Landsat Burned Area (1988-2013) dataset, were retrieved from a single pair of images with a variable time lapse between pre-and post-fire images. Thus, the temporal length of those reference data was determined by the availability of suitable images and the duration of the burned signal. The NOFFi Greece (2016-2018) reference data were obtained considering a time-series of Sentinel-2 images, but with variable length and non-195 consecutive time-series step.

Data specifications
Each dataset of BARD is organised in three folders with associated files including: (a) 'metadata', which contains a .csv file containing the file name of all the reference files included in the dataset, along with additional information such as the temporal length (days), the total number of images interpreted (n_images), the area (m 2 ) of each mapped category ('burned', 'unburned' 200 and 'unobserved'), the land surface and total area of each reference data file. For those datasets where a stratified random sampling design was used, the .csv file also specifies the stratum of each sampled unit and the size (tsa_area)  yyyymmdd (year, month, day). The first date corresponds to the pre-fire date, which is the date of the first image used for BA detection; the second one refers to the post-fire date, which is the date of the last image used for generating the reference fire perimeters. 215 The following attribute fields are included in the shapefiles (Table 1)     area: Area in square meters (m 2 ) calculated on the WGS84/UTM Cartesian plane.

FireCCI global (2008) 235
The FireCCI global 2008 reference dataset was created using a stratified random sampling design ( (Padilla et al., 2014;, Table A1). Two levels of spatial stratification were used to select the spatial units based on TSAs derived from the Landsat World Reference System 2 (WRS-2). Spatial units were first stratified across seven aggregated Olson biomes (Olson et al., 2001). Each biome was stratified into high and low BA extent based on the Global Fire Emissions Database (GFED) Version 3 (Giglio et al., 2009;2010). A total of 101 images from Landsat-5 TM and 109 for Landsat-7 ETM+ satellite sensors were 240 used to retrieve BA perimeters. The complete scene was used for Landsat-5 TM images, whereas only the centre of Landsat-7 ETM+ scenes were interpreted in order to avoid data SLC gaps. BA perimeters were derived using a semi-automatic algorithm developed by Bastarrika et al. (2011), where high burn severity pixels were selected to train core burned area, and adjacent lower burn severity pixels were added to the core detected patches using a region-growing algorithm.
The FireCCI global 2008 dataset includes 105 reference data files, derived from single pair of images, for the year 2008. The 245 temporal length of reference data varies between 8 and 144 days: 79% of image pairs were separated by 32 days or less, 16% between 32 and 100 days, and 5% by more than 100 days with a maximum time gap between the pre-and post-fire image of 144 days. The total area of reference data is 1.76•10 6 km 2 , of which 1.35% corresponds to burned category, 88.35% to unburned and 10.30% to unobserved category. The location and temporal length of the reference data is shown in Fig. 4. This reference dataset is compliant with CEOS-LPVS Stage 3. 250

FireCCI global (2003-2014)
The FireCCI global (2003-2014) dataset covers a period of 12 years, from 2003 to 2014 (Padilla et al., 2018), and was generated in the framework of the FireCCI project with the collaboration of the Copernicus Global Land Service (CGLS). The reference data were derived from consecutive Landsat images separated by 8-16 days for each selected TSA and year. A total of 585 images from Landsat-5 TM, 1564 from Landsat-7 ETM+ and 209 from Landsat-8 OLI satellite sensors were used to retrieve 255 BA perimeters. The sampling units were selected following a stratified random sampling design (Table A2). The total population of sample units were defined spatially by TSAs and temporally by the dates of Landsat images available, filtering out those with a cloud cover greater than 30%. For each calendar year, the sample units were stratified by Olson biomes (Olson et al., 2001) and BA based on MCD64A1 (Giglio et al., 2009). The threshold used to assign the high/low BA strata was defined where ℎ is the sample size to be selected in stratum h, ℎ is the stratum size and ̅̅̅̅ ℎ the BA mean in stratum h.
Finally, a spatial subset window of 30 x 20 km located at the centre of the images was applied for interpretation and BA 265 reference data retrieval. The reference perimeters were extracted from a dedicated Random Forest algorithm, trained for each sampling site, and output maps were visually inspected by two interpreters (Padilla et al., 2018).
The FireCCI global (2003-2014) dataset includes 1200 reference data files from 722 different TSAs and 12 years, from 2003 to 2014. The temporal length of reference data varies between 8 and 16 days. The total area of reference data is 0.72•10 6 km 2 , of which 3.85% corresponds to burned category, 71.85% to unburned, and 24.29% to unobserved category. The location and 270 total number of reference data in each TSA are shown in Fig. 5. This reference dataset is compliant with CEOS-LPVS Stage 3.

FireCCI Africa (2016)
The FireCCI Africa reference dataset consists of LU BA maps and was generated for the year 2016 from Landsat imagery (Padilla et al., 2018). It was also generated in the framework of the FireCCI project with the collaboration of the CGLS. The 275 sampling was designed with long units and it was similar to that for the FireCCI global (2003-2014) dataset, as mentioned in the previous section (Table A3). The only difference was the sample size, 50 units instead of 100 units per year. Note that each unit here is much larger, as it consists of multiple image pairs. Two reference perimeter datasets are released: (a) Reference data at SU level, 1052 files with 8-16 day BA maps; and (b) Reference data at LU level, 50 files. The temporal length covered at each LU varies from 24 to 256 days (Fig. 6b): 18% of the LUs cover a temporal length below 50 days, 34% between 50 and 280 100 days, and 48% are above 100 days. As mentioned in Section 2.2., LUs were defined to be at least 100 days long, although the presence of clouds reduced the actual temporal periods with available data. The total area of LU reference data is 0.023•10 6 km 2 , of which 15.72% corresponds to burned category, 49.61% to unburned, and 34.67% to unobserved category. The location, number of image pairs, and temporal length of the LUs reference data are shown in Fig. 6. This reference dataset is compliant with CEOS-LPVS Stage 3. 285

FireCCI Africa S2 (2016)
The FireCCI Africa S2 BA reference dataset was created to perform an initial validation assessment of the Small Fire Database Fire_cci v1.1 product (FireCCISFD11) produced for the year 2016 for the whole Sub-Saharan Africa (Roteta et al., 2019).
Reference data were generated from the comparison of two Sentinel-2 MSI images at 20 m resolution per reference site.
Systematic sampling was used to select 52 validation sites based on Sentinel-2 tiles (110 x 110 km) over Sub-Saharan Africa. 290 BA was mapped with the BAMS methodology, which is a semi-automated algorithm (Bastarrika et al., 2014). In short, training polygons for the burned category were defined in each tile, and burned seeds were detected. Then, burned pixels were grown out from these seeds until all pixels for each burned patches were detected. The results were visually analysed to determine the accuracy of the classification and new training polygons were defined if needed. This was done sequentially until all burned areas were mapped and no commission or omission errors were visually detected. Finally, if there was noise created by 295 unmasked clouds and cloud shadows, it was edited and removed manually.
The temporal length of the reference data varies between 10 and 120 days: 86% of the pairs of images were separated by less than 50 days and 14% by more than 50 days with a maximum time lapse of 120 days. The total area of reference data is 0.63•10 6 km 2 , of which 8.87% corresponds to burned category, 72.42% to unburned, and 18.71% to unobserved category. The location and temporal length of the reference data are shown in Fig. 7. This reference dataset is compliant with CEOS-LPVS 300 Stage 1.

CONUS Landsat Burned Area (1988-2013)
CONUS Landsat Burned Area (1988-2013) reference dataset (Vanderhoof et al., 2017; extends across the contiguous United States (CONUS) and was generated to validate the Landsat Burned Area product (Hawbaker et al., 2017;. The sampling design was adapted from the methods used by the ESACCI FireCCI project. Existing FireCCI validation TSAs (n=9) 305 within CONUS were augmented with an additional 19 TSAs for a total of 28 TSAs. The TSAs were stratified across the major Olson biomes (Olson et al., 2001) including (1) temperate forest, (2) Mediterranean forest, (3) temperate grassland and savannah, (4) tropical and subtropical grasslands and savannah, and (5) xeric/desert shrub. TSAs selected within each biome were meant to represent high and low burned areas as specified by the Global Fire Emissions Database (GFED) version 3 (Table A4). Systematic sampling was applied to select 6 validation years spaced out in 5-year increments (2013, 2008, 2003, 310 1998, 1993 and 1988).
A total of 269 images from Landsat-5 TM, 10 from Landsat-7 ETM+, and 56 from Landsat-8 OLI were used to derive the BA extent. Landsat reference images were limited to those with a geometric Root Mean Square Error (RMSE) < 10 m, <20% cloud cover, and available as a L1T Surface Reflectance product. Time lapse between images was not limited to 16 days and only two images (pre-and post-fire) were used to retrieve BA reference data for each validation site and year. The pre-and post-315 fire image pairs did not specifically represent a probability sample within a year but were designed to target changes incurred over the peak fire season. Peak fire season was determined using the distribution of total burned area by month as derived from the MCD45 burned area product (2001)(2002)(2003)(2004)(2005)(2006)(2007)(2008)(2009)(2010)(2011)(2012)(2013)(2014)(2015). The FMask from the Landsat surface reflectance product was applied to mask clouds, cloud shadows, snow and open water from each image used (Zhu and Woodcock, 2014). For Landsat-7 ETM+ images, SLC off pixels were masked. The low-, medium-and high-intensity development classes (i.e. urban areas) were masked using 320 the National Land Cover Database (NLCD, https://www.mrlc.gov/national-land-cover-database-nlcd-2016) (Homer et al., 2015) to reduce spectral confusion between burned areas and impervious surfaces. Similarly, agricultural burns were not used to train the reference data burn classification, therefore the accuracy of the reference dataset in agricultural areas is unknown.
If this is of concern to users, then users can mask the 'cultivated crops' land cover type from the reference data using the

NLCD. 325
Burned area maps were generated using BAMS (Bastarrika et al., 2014). The Normalized Burn Ratio (NBR), Mid-infrared Burned Index (MIRBI), Global Environmental Monitoring Index (GEMI) and Normalized Difference Vegetation Index (NDVI) were calculated for the pre-and post-fire images and utilized in a supervised classification. The algorithm was trained on manually selected polygons containing (1) clearly burned pixels and (2) spectrally similar but less distinct burned pixels.
The algorithm applied a region-growing function between the two types of training polygons, while cut-off values for each 330 variable were extracted from the training polygons. Each classified burned area was then manually edited. When available, the analysts utilized ancillary datasets (e.g. Monitoring Trends in Burn Severity (MTBS, Eidenshink et al., 2007), MODIS active fire points (MOD14 collection 5, Giglio et al., 2009), MODIS burned area (MCD45A1 collection 5, Roy et al., 2008), and aerial imagery) to improve the confidence in their selection of training pixels and manual edits. To maximize the accuracy of the reference dataset, each image pair was classified into burned area extent and visually evaluated and edited independently 335 by three different analysts. A pixel was then classified as burned if it was identified as burned by two of the three analysts.
Additional processing details can be found in Vanderhoof et al. (2017).
The CONUS Landsat Burned Area (1988-2013) dataset includes 168 reference data files from 28 Landsat path/rows and six years (1988,1993,1998,2003,2008,2013). The temporal length of reference data varies between 16 and 288 days: 37% of pairs of images were separated by less than 50 days, 35% between 50 and 100 days, and 28% by more than 100 days with a 340 maximum time lapse between the pre-and post-fire image of 288 days. The total area of reference data is 5.23•10 6 km 2 , of which 0.12% corresponds to burned category, 82.33% to unburned, and 17.55% to unobserved category. Location of reference sites based on TSAs is shown in Fig. 8. With the publication of Hawbaker et al. (2020), this reference dataset is compliant with CEOS-LPVS Stage 4.

NOFFi Greece (2016-2018) 345
The reference data were obtained using the perimeters produced by the National Observatory of Forest Fires (NOFFi) (http://epadap.web.auth.gr, last access: 25 March 2020) and, specifically, its Object-based Burned Area Mapping (OBAM) service, implemented by the Laboratory of Forest Management and Remote Sensing (FMRS) of the Aristotle University of Thessaloniki. NOFFi-OBAM is an on-demand service, meaning that it is activated after large wildfire events and under explicit requests by the local forest offices. It relies solely on Sentinel-2 imagery and is employed only for fires within Greece. The 350 NOFFi-OBAM algorithm is designed to map fire perimeters and follows a supervised learning approach using a post-fire Sentinel-2 (Level-1C) image, although a pre-fire image is also used for photo-interpretation purposes. The methodology applied to retrieve the fire perimeters is fully described in Tompoulidou et al. (2016). Non-probability sampling design was applied for this dataset; reference sites were selected by convenience based on images previously processed in the NOFFi-OBAM service. 355 The NOFFi-OBAM fire perimeters were used as the basis for creating the reference data for the NOFFi Greece reference dataset considering the burned area mapping years 2016, 2017 and 2018. For each Sentinel-2 tile ID (e.g. T34SDH) in which fire perimeters were available, the whole time-series of images were visually checked and the date range for the reference file creation was defined from the first pre-fire image to the last post-fire image. Small fires within the specific time series that were not mapped from the NOFFi-OBAM service were explicitly digitized. Since NOFFi-OBAM only serves Greece, areas 360 outside Greece's official land boundaries (e.g. seas and land areas of neighboring countries) were masked and classified as unobserved surfaces (category = 2). Some burned scars in overlapping border tiles were mapped by using images from those neighboring tiles only if the post-fire image used for the mapping was inside the time span of the former tile ID. For example, the file 'NOFFi_RD_T34SGH_20160710_20160730.shp', includes polygons with preImg/postImg from T35SCK. This can be identified from the preImg, postImg, and tile columns of the file. Clouds and cloud shadows were manually digitized and 365 masked (category = 2), considering the last postImg. Although a non-probability sampling design was applied for this dataset, the NOFFi-OBAM service has been activated for all wildfires greater than 100 ha during the period 2016-2018 and, in many cases, for smaller (or even much smaller) wildfires. Therefore, the dataset contains a representative set of Sentinel-2 tiles that are frequently affected by wildfires in Greece, at least for the given time-period.
The NOFFi Greece dataset includes 34 reference data files from 25 different Sentinel-2 tiles. The temporal length of reference 370 data varies between 5 and 132 days. The total area of reference data is 0.41•10 6 km 2 , of which 0.10% corresponds to burned category, 25.83% to unburned, and 74.08% to unobserved category. As shown in Fig. 9, most of the surface of the tiles from this dataset corresponds to sea surface that was labelled as 'no-data' (section 2.2.), this is the reason the unobserved category is so high compared to the rest of the datasets. The location and temporal length of the reference data as well as the number of images used in each reference site are shown Fig. 9. This reference dataset is compliant with CEOS-LPVS Stage 1. 375

Data availability
The BARD compiled in this effort is freely available on the e-cienciaDatos repository (https://doi.org/10.21950/BBQQU7 (Franquesa et al., 2020)). All burned area reference data files have been visually checked, reprojected and reformatted to provide a uniform set of attributes and metadata descriptions to maximize the ease with which these reference files can be used to evaluate global burned area products. A summary of the data included in each dataset is described in Table 2 (1988-2013), and NOFFi Greece (2016. Plans are underway to expand the Burned Area Reference Database with new reference files that the FireCCI project produces, and we encourage future contributions from the scientific community.

Conclusions 385
BARD is the first publicly available database that compiles and standardizes previously generated validation reference data.
Reference datasets included in this database were produced throughout the life of the FireCCI project since 2010, and other initiatives as Landsat Level-3 Science Products and NOFFi projects have joined and contributed to this effort. BARD gathers 13 and compiles a total of 2661 standardized shapefiles representing reference burned area data generated from approximately 4500 Landsat and Sentinel-2 images and 8 million square kilometres of interpreted land surface. Reference data were produced 390 following the recommendations of the CEOS Calibration/Validation group and visually inspected by two or more experienced interpreters to ensure the accuracy of the data. As BARD is a compilation of datasets that were produced in different projects and years in which different methods were applied (e.g. different sampling methods, sensors, years or region extent), it is highly recommended that the user clearly understands the characteristics of the dataset or datasets that best suits their needs.
BA reference database and future updates remedy the lack of an extensive global and regional, multitemporal validation dataset 395  and, certainly, can serve as a valuable source for validation of existing products and developing new BA algorithms, particularly those requiring large amounts of training data.

Competing interests
The authors declare that they have no conflict of interest.   Time period between both images is 8 days: from 12 June to 20 June 2014. Only the land surface that burns between the two dates is classified as burned, while burned scars in the pre-fire image are assigned to the unburned category. Unobserved pixels on either pre-or post-fire image due to the presence of clouds, cloud-shadows, SLC-gaps or smoke plumes are classified as no-data.

565
Figure 3: Schematic process of long unit reference data generation. Consecutive image pairs are selected from the multitemporal image series at same location (left: Landsat-8 RGB (7,5,4) images time series) to derive the correspondent short unit reference data files (e.g. Image t0 and t1 to obtain the reference data t0-t1). From the union of the different short units we generate the long unit reference data (right). The long unit t0-t3 includes all the burned scars that occurred between the first image (t0) and the last image interpreted (t3), burned scars from the first image (t0) are not included or mapped. Unobserved areas in any of the images are labeled 570 as no-data in the final long unit reference data. Colours (orange-t1, red-t2, brown-t3) represent the dates in which the burned area patches were observed.      (1988,1993,1998,2003,2008,2013 1988,1993,1998,2003,2008,2013 United States  Table 3: Summary of the total area (km 2 ) of the 3 mapped categories (burned, unburned and no-data) and percentage of each category respect the total area mapped for each dataset. Additionaly, the total land surface and percentage respect the total area 615 interpreted is provided. The region extent and the total number of reference files included in each dataset is also indicated.