Eddy covariance flux towers measure the exchange of water, energy,
and carbon between the land and atmosphere. They have become
invaluable for developing theory and evaluating land models. However, flux
tower data as measured (even after site post-processing) are not directly
suitable for land surface modelling due to data gaps in model forcing
variables, inappropriate gap-filling, formatting, and varying data quality.
Here we present a quality-control and data-formatting pipeline for tower
data from FLUXNET2015, La Thuile, and OzFlux syntheses and the resultant
170-site globally distributed flux tower dataset specifically designed for
use in land modelling. The dataset underpins the second phase of the Protocol for the Analysis of Land Surface
Models (PALS) Land Surface Model Benchmarking Evaluation Project (PLUMBER), an international model
intercomparison project encompassing
The global network of flux towers now encompasses
Several global multi-site collections such as FLUXNET2015 (Pastorello et al., 2020) have been released that provide valuable opportunities for evaluating LSMs across multiple climates and biomes. Whilst these collections overcome many limitations of raw flux tower data, the data are not provided in a format directly usable in land surface modelling. The datasets require varying levels of gap-filling, unit conversions, and data formatting to be applicable for modelling exercises and are missing key metadata, such as measurement height and vegetation characteristics. Most importantly, not all flux tower data releases provide temporally continuous meteorological observations, which are essential for forcing LSMs. FLUXNET2015 overcomes this key limitation by providing fully gap-filled meteorological observations but includes long periods of gap-filling at some sites, resulting in missing diurnal and/or seasonal cycles. Extended periods of synthesized meteorological variables are problematic in model applications not only because they bias model estimates at concurrent time steps but also because they bias future model predictions due to model state memory, such as soil moisture. As such, the data quality requirements for land modelling present a challenge that is not yet met by standard flux tower data releases.
Here we present a collection of 170 globally distributed flux tower sites
collated from three data releases (FLUXNET2015, La Thuile, and OzFlux) that
results from applying land-surface-model-focused quality control and
ancillary data collation. By combining multiple data sources, we were able
to maximize the number of available sites to enable model evaluation against
a wider range of climate and vegetation conditions. The dataset covers the
period 1992–2018 (although the majority of site records end in 2014).
Individual site records span 1 to 21 years, giving a total of 1040 site
years. The dataset provides quality-controlled, fully gap-filled
meteorological variables for forcing LSMs, together with a comprehensive set
of flux variables for model evaluation. The data are provided in the
Assistance for Land-surface Modelling Activities (ALMA;
We collated data for 223 flux towers from three flux tower data collections.
We first obtained all available Australian sites from the OzFlux network
(Isaac et al., 2017). We then obtained all
Tier 1 (open data policy) globally distributed sites from FLUXNET2015
(November 2016 release;
Pastorello et
al., 2020), excluding sites available in OzFlux. For all FLUXNET2015 sites,
data from the “FULLSET” release were used. Finally, additional sites that
were not present in OzFlux or FLUXNET2015 were taken from the La Thuile free
fair-use release (
We undertook multiple processing steps to derive the final, quality-controlled dataset. The data were first pre-processed with the FluxnetLSM R package (Ukkola et al., 2017) to convert the files to ALMA-formatted NetCDF files with consistent units and variable conventions. The data were subsequently screened using expert judgement to only retain periods of good-quality meteorological data. Additional corrections were then made to meteorological data to remove outliers and non-physical values and gap-fill any remaining missing values. The flux variables were not screened, but additional latent and sensible heat flux estimates were calculated to correct for energy balance closure. Finally, we derived two independent leaf area index time series for each site from remotely sensed data to account for uncertainties in satellite-derived LAI. A flowchart of the processing pipeline is shown in Fig. 1, with each step described in detail below.
A flowchart describing the data processing pipeline. The dark boxes show the main data processing steps, with the lighter boxes detailing the actions taken within each main step.
The three datasets differ in file format, units, and variable naming conventions. We used the FluxnetLSM R package (Ukkola et al., 2017), which was designed to translate flux tower data for use in land surface modelling. The package was used to process the data into ALMA-formatted, CF-compliant NetCDF files with consistent variable names and units so that they are readily usable in land surface modelling (see Table 1 for ALMA conventions and variables included in the final dataset). In addition, FluxnetLSM was used to further gap-fill meteorological and flux variables and to include additional site metadata, such as elevation, reference and vegetation canopy heights, and vegetation type (following the International Geosphere–Biosphere Programme (IGBP) classification), in the NetCDF files. While some of this information could be obtained from FLUXNET or regional networks, we supplemented the site metadata available in FluxnetLSM with information from publications and site principal investigators. These metadata were collected to inform modelling choices and are included in the final NetCDF files. FluxnetLSM makes the processing fully reproducible and provides a documented framework to replace the ad hoc processing methods used in many previous flux tower collections for LSMs. The version of FluxnetLSM used for processing is documented in the NetCDF file metadata.
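To illustrate the kind of unit harmonization this step involves, the following minimal Python sketch converts air temperature and precipitation to the SI-based units used by ALMA (kelvin and kg m-2 s-1). It is not FluxnetLSM code (FluxnetLSM is an R package); the function name, argument names, and output keys are assumptions made for illustration only.

```python
import numpy as np


def to_alma_units(tair_degc, precip_mm, timestep_s):
    """Convert two forcing variables to ALMA-style units (illustrative only).

    tair_degc  : air temperature in degrees Celsius
    precip_mm  : precipitation accumulated over each time step, in mm
    timestep_s : length of one time step in seconds (e.g. 1800 for half-hourly)
    """
    # Celsius to kelvin
    tair_k = np.asarray(tair_degc, dtype=float) + 273.15
    # 1 mm of water over 1 m2 has a mass of 1 kg, so dividing the
    # per-step accumulation by the step length gives kg m-2 s-1
    precip_flux = np.asarray(precip_mm, dtype=float) / timestep_s
    # Output keys are hypothetical labels chosen for this sketch
    return {"Tair": tair_k, "Precip": precip_flux}


converted = to_alma_units([0.0, 25.0], [0.9, 0.0], timestep_s=1800)
```

Keeping the time step as an explicit argument matters because the parent datasets mix half-hourly and hourly records, and the same accumulated amount corresponds to a different flux in each case.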
Variables provided in the final dataset (please note not all sites provide
all flux variables). Variable naming conventions follow the ALMA format
where available. FLUXNET variable derivation details can be found at
FluxnetLSM was run separately for each parent dataset. OzFlux was first
pre-processed to remove incomplete years as land surface models require
whole years of data for spinning up soil water and temperature states. To
achieve this, the data were first gap-filled to complete days, and
incomplete years were then removed using the FluxnetLSM function
“
FluxnetLSM can be used to screen the data for missing and gap-filled time
steps, but this option was not used. Instead, the allowed level of
missing and gap-filled data was set to 100 % for all datasets and variables to
allow subsequent manual visual data screening (Sect. 2.2.2). However, the
gap-filling methods for meteorological variables were set differently for
each dataset. FLUXNET2015 provides continuous, downscaled ERA-Interim
estimates for all meteorological variables; these were used to gap-fill all
missing time steps in the meteorological variables (setting
met_gap-fill to “
Flux variables were gap-filled using statistical methods for all datasets. As for the meteorological variables, short gaps of up to 4 h were gap-filled using linear interpolation. Longer gaps (up to 30 d for OzFlux and FLUXNET2015 and 365 d for La Thuile) were gap-filled using a linear regression of each flux variable against incoming shortwave radiation, air temperature, and humidity (relative humidity or vapour pressure deficit). This approach has been shown to outperform a range of LSMs across a broad range of metrics in out-of-sample tests (see Abramowitz, 2012; Best et al., 2015). In the absence of air temperature or humidity data, the linear regression was constructed against shortwave radiation only. Separate linear models were fitted for daytime and nighttime data. Further details of all gap-filling methods can be found in Ukkola et al. (2017).
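The two-stage approach described above can be sketched in Python as follows. This is a simplified illustration rather than the FluxnetLSM implementation: the function names are hypothetical, the maximum gap length is passed in as a number of time steps (4 h would be 8 steps for half-hourly data), and the upper limit on the length of regression-filled gaps is not enforced.

```python
import numpy as np


def fill_short_gaps(y, max_gap_steps):
    """Linearly interpolate interior runs of NaN no longer than max_gap_steps."""
    y = np.asarray(y, dtype=float).copy()
    isnan = np.isnan(y)
    good = ~isnan
    if good.sum() < 2:
        return y
    idx = np.arange(y.size)
    interp = np.interp(idx, idx[good], y[good])
    i, n = 0, y.size
    while i < n:
        if not isnan[i]:
            i += 1
            continue
        j = i
        while j < n and isnan[j]:
            j += 1
        # Only fill gaps bounded by observations on both sides
        if i > 0 and j < n and (j - i) <= max_gap_steps:
            y[i:j] = interp[i:j]
        i = j
    return y


def fill_long_gaps(flux, predictors, is_day):
    """Fill remaining NaNs using separate day and night linear regressions.

    predictors : list of arrays (e.g. shortwave radiation, air temperature,
                 humidity), each the same length as flux
    is_day     : boolean array marking daytime steps
    """
    flux = np.asarray(flux, dtype=float).copy()
    cols = [np.ones(flux.size)] + [np.asarray(p, dtype=float) for p in predictors]
    X = np.column_stack(cols)
    is_day = np.asarray(is_day, dtype=bool)
    pred_ok = ~np.isnan(X).any(axis=1)
    for mask in (is_day, ~is_day):
        train = mask & pred_ok & ~np.isnan(flux)
        target = mask & pred_ok & np.isnan(flux)
        # Skip if there is nothing to fill or too little data to fit
        if train.sum() <= X.shape[1] or not target.any():
            continue
        coef, *_ = np.linalg.lstsq(X[train], flux[train], rcond=None)
        flux[target] = X[target] @ coef
    return flux
```

When air temperature or humidity is unavailable, the same function can be called with shortwave radiation as the only predictor, which mirrors the fallback described in the text.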
We screened the original dataset of 223 sites to only retain sites and time
periods with good-quality meteorological forcing data. This was done to
ensure models were forced with data that were largely observed to avoid
biasing the model flux estimates. We screened sites manually using expert
judgement rather than an automated process so that we could balance
data quality against time series length. During screening, we
prioritized five key meteorological variables in site selection that have
the largest influence on LSM simulations: incoming shortwave radiation
(SW
Figure 2 presents examples of how the selection criteria were applied at
three sites. AU-Lit shows a site where no adjustments to the time period
were required. All key meteorological variables are largely observed, with
only 3.3 %–5.1 % of the 2-year time series gap-filled. As such, the full
time series was selected for this site. BE-Bra shows an example where a
subset of the years were excluded from the final dataset due to a heavily
gap-filled year (2003) in the middle of the time series. During 2003, four
key variables (SW
Examples of meteorological data pre-screening plots for three sites (AU-Lit, BE-Bra, and US-Tw2). For each site different processing approaches were used and sections of these data discarded.
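Gap-filled percentages of the kind quoted for these example sites can be computed from per-time-step quality-control flags. The Python sketch below shows one way to do so and to locate the longest run of consecutive years below a chosen gap-filling level. It is only an aid to this kind of screening: the sites in this dataset were screened manually using expert judgement, so the threshold, the flag convention (0 meaning observed), the function names, and the example values are all assumptions made for illustration.

```python
import numpy as np


def gapfilled_percent_by_year(years, qc_flags, observed_flag=0):
    """Percentage of time steps per year whose QC flag is not 'observed'."""
    years = np.asarray(years)
    filled = np.asarray(qc_flags) != observed_flag
    return {int(y): 100.0 * filled[years == y].mean() for y in np.unique(years)}


def longest_good_period(percent_by_year, max_percent):
    """Return (first_year, last_year) of the longest run of consecutive years
    with a gap-filled percentage at or below max_percent, or None."""
    best = None
    run_start = None
    prev = None
    for year in sorted(percent_by_year):
        ok = percent_by_year[year] <= max_percent
        if ok and run_start is not None and year == prev + 1:
            pass  # the current run continues
        elif ok:
            run_start = year  # a new run starts here
        else:
            run_start = None  # a poor year ends the current run
        if ok and (best is None or (year - run_start) > (best[1] - best[0])):
            best = (run_start, year)
        prev = year
    return best


# Made-up example: four time steps per year, with one heavily gap-filled year
years = np.repeat([2002, 2003, 2004, 2005], 4)
flags = np.array([0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0])
percent = gapfilled_percent_by_year(years, flags)
period = longest_good_period(percent, max_percent=30.0)
```

A fixed threshold like this cannot weigh record length against data quality in the way the manual screening did, which is why it is presented only as a starting point.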
After selecting the final sites, meteorological variables were further
corrected for anomalous values, step changes, and missing data. These
corrections mainly applied to CO
For atmospheric CO
For OzFlux sites, unphysical values existed in the dataset that were
corrected. These included negative Precip, SW
Latent (
The EBC-corrected fluxes were obtained by multiplying
The fluxes were then corrected using a two-step method. First, for each time
step, a moving window of
We obtained two independent remotely sensed leaf area index (LAI) time
series for each site to account for large uncertainties in
satellite-derived LAI estimates (Zhu et al., 2016). The
LAI time series can be used to force LSMs that do not include a predictive
carbon cycle and require prescribed LAI as an input. The standardized LAI
time series are also useful for reducing the degrees of freedom in evaluation
studies by allowing the models to be driven by the same LAI estimates, and
they allow LAI-driven model errors to be minimized at sites where observed
and modelled LAI diverge strongly. The LAI data were derived from Moderate
Resolution Imaging Spectroradiometer (MODIS) and Copernicus Global Land
Service products as these products provide long-term records at high (
We used the MODIS product MCD15A2H, which is derived from a combination of
the Terra and Aqua sensors at 500 m spatial resolution and 8-daily temporal
resolution, starting in July 2002. The LAI data and associated standard
deviation and QC flags were obtained using the R package
We used the Copernicus Global Land Service LAI v2.0.2, which provides LAI
estimates at 1 km spatial resolution and 10-daily temporal resolution for
the period 1999–2017. The estimates have been derived from SPOT-VGT and
PROBA-V sensors (Smets et al., 2019). The 10-daily data were
first aggregated to monthly values by taking the maximum of the three 10-daily
values for each month, following the maximum composite procedure, to remove low
values, e.g. due to cloud contamination. The data were then smoothed
spatially by averaging each pixel with its surrounding pixels (with each
pixel representing the mean of nine pixels). The monthly values were then
extracted for each site using the pixel containing the site. If the value
for the pixel containing the site was missing, the value from the nearest
non-missing pixel was used. To remove non-physical short-term variability,
the monthly site time series was then smoothed using a cubic smoothing
spline. A monthly climatology was then calculated and an anomaly time series
calculated by removing the climatology from the monthly LAI time series. The
anomaly time series was smoothed by taking a rolling mean over a window of
Both Copernicus and MODIS LAI are provided for each site, but we selected one as the preferred LAI time series to serve as the default for LSMs that rely on prescribed LAI. Overall, we selected MODIS as the default due to its higher spatial resolution. Where MODIS was deemed unrealistic for the site because of its magnitude, seasonal cycle, or non-physical short-term variations (judged against site data where available), Copernicus was selected instead. Table S1 summarizes the selected LAI time series for each site. The preferred LAI variable is called “LAI” in the final NetCDF files and the alternative time series “LAI_alternative”.
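The monthly maximum compositing and the climatology and anomaly decomposition described above for the Copernicus LAI can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the processing code used for the dataset: the function names and array layout are hypothetical, and the spatial averaging, cubic spline smoothing, and rolling-mean smoothing of the anomalies are deliberately left out.

```python
import numpy as np


def monthly_max_composite(lai_10daily):
    """Maximum of the 10-daily values within each month, ignoring NaN.

    lai_10daily : array of shape (n_months, 3), one row per month
    """
    return np.nanmax(np.asarray(lai_10daily, dtype=float), axis=1)


def climatology_and_anomaly(monthly, start_month=1):
    """Split a monthly series into a 12-value climatology and anomalies.

    monthly     : 1-D array of monthly LAI values
    start_month : calendar month (1-12) of the first element
    """
    monthly = np.asarray(monthly, dtype=float)
    month_index = (np.arange(monthly.size) + start_month - 1) % 12
    clim = np.full(12, np.nan)
    for m in range(12):
        values = monthly[month_index == m]
        # Leave the climatology as NaN for months with no valid data
        if values.size and not np.all(np.isnan(values)):
            clim[m] = np.nanmean(values)
    anomaly = monthly - clim[month_index]
    return clim, anomaly


# Made-up two-year series with a constant offset between the years
series = np.concatenate([np.arange(1.0, 13.0), np.arange(3.0, 15.0)])
clim, anom = climatology_and_anomaly(series)
```

Separating the two components allows the anomalies to be smoothed without flattening the mean seasonal cycle.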
Selected and excluded sites.
The final dataset includes 170 globally distributed sites shown in Fig. 3a. The majority of the sites are located in North America, Europe, and
Australia, with 3 sites located in South America, 4 in Africa, and 11 in
Asia. The excluded sites are largely located in data-rich regions and as
such did not significantly change the global distribution of sites. The
dataset covers the period 1992–2018, with a total of 1040 site years.
Individual site records span 1 to 21 years, with a median record length of
4.5 years (Fig. 3b). A total of 39 sites cover
The sites cover a wide range of biomes, ranging from grasslands and savannas
to forest ecosystems (Fig. 3c). The majority of sites are located in
grassland (40), forested (89), and cropland (17) ecosystems. A total of 22 sites are
located in savanna and shrubland ecosystems and 10 sites in wetlands. The
sites also cover a wide range of climates, with Fig. 3d showing the sites
within the global range of mean annual precipitation (MAP) and mean annual
temperature (MAT) from the Climatic Research Unit (CRU) TS 4.02 dataset
(Harris et al.,
2014). The sites capture the global climatic range well, but only a limited
number of sites were available in wet tropical environments with high MAP
and MAT and very cold environments (MAT
Excluded years from selected sites.
For the selected sites, the original time series was shortened at multiple sites to exclude periods of poor-quality meteorological data. The number of years excluded at each site is shown in Fig. 4. Regionally, the average number of years excluded was similar over North America (mean: 1.8; median: 1) and Europe (1.9, 1), whereas fewer years were removed over Australia (0.7, 0) (see sub-panels in Fig. 3a for region definitions). The number of excluded years was also similar across the FLUXNET2015 (mean: 2.0; median: 1) and La Thuile (1.3, 1) datasets but lower for OzFlux (as for Australia). Overall, there were no systematic spatial variations in the number of years excluded.
For the selected sites, our data screening reduced the mean record length by
1.7 years (median: 1), ranging from 0 to 12 years for individual sites
(Fig. 4b). A total of 283 site years were removed. The majority of sites
(139 out of 170) had 0–2 years removed, while only 11 sites had
Flux tower observations do not commonly close the energy balance, with the
sum of latent and sensible heat fluxes underestimated relative to available
energy (Leuning et al.,
2012; Wilson et al., 2002). This problem is particularly common at sites
with heterogeneous land cover
(Stoy et al., 2013) but is also
driven by other factors such as unaccounted energy storage and mesoscale
circulation impacts
(Panin and
Bernhofer, 2008; Leuning et al., 2012). As LSMs balance all energy fluxes,
latent and sensible heat fluxes were corrected for energy balance closure to
aid model evaluation. In total, corrected fluxes are available for 143 sites,
which reported all required variables to perform the correction (
At the corrected sites, the instantaneous EBC (i.e. the ratio
(
The corrected variables should provide a more robust basis for evaluating
model biases but rely on the assumption that the measured Bowen ratio is
correct. Another limitation of the corrected fluxes is a larger proportion
of missing data as the corrected fluxes are only provided for time steps for
which the correction could be performed using our method detailed in Sect. 2.2.4. As such, 9.2 % of the corrected
The processing codes are available at
The final dataset is available at
We have presented a quality-controlled flux tower dataset for 170 sites for use in land surface modelling. Whilst the dataset was developed with land surface modelling in mind, it is also suitable for other applications requiring a large collection of sites with good-quality meteorological data. In our site selection, we prioritized long continuous periods of high-quality meteorological observations to derive a consistent dataset across individual sites. In doing so, shorter good-quality periods were discarded for some sites (e.g. BE-Bra in Fig. 2); future work might revisit these choices to retain additional data periods. FluxnetLSM provides one possible reproducible tool for automated data screening to achieve this for the FLUXNET2015, La Thuile, and OzFlux releases.
The meteorological data were screened and fully gap-filled using multiple
criteria. This screening should allow model simulations to be produced that
are less strongly biased by high levels of gap-filling and other data
quality issues that affect the original data collections. We did not quality
control the flux variables used for model evaluation. This was to enable
model evaluation at multiple timescales, ranging from sub-daily to
interannual. This also allows models to be evaluated against individual
weather and climate events, such as heatwaves and drought. The lack of
screening leads to a much higher proportion of gap-filled data in the flux
variables, which should be taken into account when selecting sites for
individual applications. For example, 31 % of all the
Model evaluation against long periods of gap-filled data should thus be
avoided, particularly at shorter timescales. Depending on the
gap-filling methods, these periods often reflect climatological conditions
at the site and do not represent diurnal and seasonal variations well. This
can be particularly problematic at sites with high seasonal or interannual
variability in the variables of interest. Longer (daily to monthly scale)
data gaps in flux variables were gap-filled using the regression method
based on SW
The dataset additionally provides two alternative LAI time series for each site. These can be used as inputs to LSMs that require prescribed LAI. Alternatively, they can be used to evaluate simulated LAI in models that predict it or to verify whether model biases arise from feedbacks involving predicted LAI. However, it should be noted that the remotely sensed LAI estimates are uncertain at site scales, with large differences between Copernicus and MODIS LAI at many sites. This reflects both the methodological difficulties inherent in estimating LAI from satellites and the fact that the satellite data may be drawn from a different footprint from the one that influences the fluxes measured at the site (De Kauwe et al., 2011). LAI is a key model property and has a strong influence on simulated fluxes. As such, more accurate LAI estimates would be highly valuable for constraining models. In particular, where site-level LAI is measured, including these data in future flux tower collections would allow the large-scale remote sensing LAI estimates used to drive models or evaluate model-simulated LAI to be better constrained. Additionally, the inclusion of detailed site properties in future collections would strongly benefit model evaluation. This includes information on vegetation composition and crop cycles, disturbance events such as fire, soil properties, and irrigation. Furthermore, models ideally require parameters such as reference height and canopy height to reduce model–observation mismatches arising from model inputs. Key metadata were collected from multiple sources for this data collection, but the inclusion of site characteristics in future data releases would allow for more direct access to these metadata.
Finally, whilst our dataset includes a large number of globally distributed
flux tower sites, the flux tower network includes
The supplement related to this article is available online at:
AMU, GA, and MGDK designed the dataset and contributed to the quality control process. AMU developed the processing codes with input from GA and created the final dataset. AMU prepared the manuscript with contributions from GA and MGDK.
The contact author has declared that neither they nor their co-authors have any competing interests.
Publisher’s note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Anna M. Ukkola, Martin G. De Kauwe, and Gab Abramowitz acknowledge support from the Australian Research Council (ARC) Centre of Excellence for Climate Extremes (CE170100023). Martin G. De Kauwe acknowledges support from the ARC Discovery Grant (DP190101823) and the NSW Research Attraction and Acceleration Program (RAAP). Anna M. Ukkola is supported by the ARC Discovery Early Career Researcher Award (DE200100086). We thank the National Computational Infrastructure at the Australian National University, an initiative of the Australian Government, for hosting the final dataset. This work used eddy covariance data acquired and shared by the FLUXNET community, including these networks: AmeriFlux, AfriFlux, AsiaFlux, CarboAfrica, CarboEuropeIP, CarboItaly, CarboMont, ChinaFlux, Fluxnet-Canada, GreenGrass, ICOS, KoFlux, LBA, NECC, OzFlux-TERN, TCOS-Siberia, and USCCC. The ERA-Interim reanalysis data are provided by ECMWF and processed by LSCE. The FLUXNET eddy covariance data processing and harmonization were carried out by the European Fluxes Database Cluster, the AmeriFlux Management Project, and the Fluxdata project of FLUXNET with the support of the CDIAC and ICOS Ecosystem Thematic Center and the OzFlux, ChinaFlux, and AsiaFlux offices.
This research has been supported by the Australian Research Council (grant nos. CE170100023, DP190101823, and DE200100086).
This paper was edited by Jens Klump and reviewed by David Lawrence and one anonymous referee.