The development and validation of hydroecological
land-surface models to simulate agricultural areas require extensive data
on weather, soil properties, agricultural management, and vegetation states
and fluxes. However, these comprehensive data are rarely available since
measurement, quality control, documentation, and compilation of the different
data types are costly in terms of time and money. Here, we present a
comprehensive dataset, which was collected at four agricultural sites within
the Rur catchment in western Germany in the framework of the Transregional
Collaborative Research Centre 32 (TR32) “Patterns in
Soil–Vegetation–Atmosphere Systems: Monitoring, Modeling and Data
Assimilation”. Vegetation-related data comprise fresh and dry
biomass (green and brown, predominantly per organ), plant height, green and
brown leaf area index, phenological development state, nitrogen and carbon
content (overall
System states and processes at the land surface are of major interest in the context of climate change and hydrological and biogeochemical research. In order to understand the processes in their spatial context and to provide information for larger areas, remote sensing and simulations are heavily applied methods. In this context, it is crucial to understand the fluxes mediated by the vegetation at the land surface. Dependencies of processes on vegetation states and properties and on environmental conditions are often investigated using models, while their spatial variability is inferred using remote-sensing techniques. In this context, well-documented and quality-controlled comprehensive field measurements of vegetation-related variables are essential for research tasks like model development, calibration, parameterization, and validation or as ground truth for remote-sensing products. These variables include biomass per organ differentiated between living (green) and senescent or diseased (brown) material, leaf area index (LAI), and the phenological state of the vegetation. For a simulation, additional information on site conditions such as vegetation composition, soil texture, weather, and, in the case of agroecosystem models, agricultural management is required (Kersebaum et al., 2015). However, there is a scarcity of such datasets (Jones et al., 2017). This is of special relevance since especially the crops grown and their properties differ between regions due to different soils and climate. Thus, detailed data on the named variables are required for different agricultural regions. With the publication of the data described in this article, we contribute a new coherent dataset on agroecosystems that includes all of the mentioned variables. The data were collected on conventionally managed fields cultivated by ordinary farmers working at the sites for many years. Thus, they represent conditions and usual practices representative of the intensively used agricultural region to the west of Cologne, in Germany. The dataset comprises data from four sites. It consists of almost 1500 records of vegetation parameters and more than 200 000 entries of weather data complemented by 15 flux datasets (eddy covariance), management information for 27 management periods, and soil information for all four sites. In contrast to the ancillary data often available with flux data from the Fluxnet or Ameriflux databases, vegetation and soil data in this dataset are also available for other fields in the region, enabling extrapolation of field-scale results to the region. Since collecting field data is very time consuming and expensive, there are not many datasets of this size.
The data were collected in the Rur catchment, located at the
Belgian–German–Dutch border, within the framework of the Transregional
Collaborative Research Centre 32 (TR32; Vereecken et al., 2010; Simmer et
al., 2015) “Patterns in Soil–Vegetation–Atmosphere Systems”, funded by the
German national science foundation (Deutsche Forschungsgemeinschaft, DFG).
TR32 ran from 2007 until 2018. The project's main focus was on the
combination of monitoring, modeling, and data assimilation to assess the
role of patterns in soil–vegetation–atmosphere systems across scales. The
monitoring efforts of TR32 were accompanied by the long-term research
program TERENO (Terrestrial Environmental Observatories) of the Helmholtz
Association (Bogena, 2016), which made additional instrumentation available
for TR32. The data presented in this paper are highly valuable for many
applications, such as those outlined in the publication list of TR32
(
Here, we describe the observation sites and the structure of the dataset and provide information on the observation and measurement methods. Furthermore, we illustrate the quality assurance procedures. With the provision of this dataset, we want to document our measurement and quality control strategy and provide the scientific community with a comprehensive dataset for further applications.
Left: locations of the observation sites in the Rur catchment in
Germany. Right: locations of the fields at the observation sites with
two-digit field IDs. At the Selhausen site (3), field 12 is a part of field 11. In the aerial photo of the Merken site (2), a part of field 01 is within the
area of an open-pit mine. At the time of field measurements, the mine was
about 2.5 km away from the field. Map data: GADM (
All observation sites are located within the Rur catchment located at the
Belgian–German–Dutch border (Fig. 1). The catchment is divided into a
fertile loess plain (“Jülicher Börde” and “Zülpicher
Börde”) in the north and the low mountain range of the Eifel in the
south. The fertile loess plain has a mean elevation of about 100 m a.s.l. The land use here is 47 % arable land, with the main crops being winter
wheat, sugar beet, and maize. The area has been inhabited since prehistoric
times. Since there are confirmed signs of agriculture from 2000 years ago
(Kalis, 1983), it can be assumed that the soil has been influenced by
anthropogenic activities for several thousand years. The warm temperate
midlatitude climate has an annual precipitation of about 700 mm and mean
annual air temperature of about 10
The low mountain range in the southern part of the catchment is characterized by a rolling topography. With a mean elevation of about 690 m a.s.l. and a mean annual precipitation of about 1400 mm, it is dominated by forest and grassland. The major soils are Fluvisols, Gleysols (along the Rur and its tributaries), Eutric Cambisols, and Stagnic Gleysols with a silty loamy texture.
The location and numbering of the sites and fields are shown in Fig. 1. Terrain properties of each field are given in Table 1. Permission to take samples from the fields were given by the respective farmers.
Terrain properties of the fields. Coordinates are for centroids; projection is UTM 32N (WGS 1984).
The intensively used cropping site Selhausen is located in the east of the
fertile loess plain (50
The climate exhibits an annual precipitation of 698 mm and a mean annual
temperature of 9.9
The Selhausen site has been equipped with eddy covariance stations and meteorological sensors since 2007. Because it is the main agricultural observation site of TR32, numerous ancillary data from the site are available and have been presented in the literature (e.g., Busch et al., 2014; Hoffmeister et al., 2016; Korres et al., 2010; Prolingheuer et al., 2014; Schiedung et al., 2017; von Hebel et al., 2018; Bornemann et al., 2011; Ney and Graf, 2018; Schmidt et al., 2012). Beginning in 1895, historical maps document agricultural land use for field F01. Based on this information and the general findings that there has been agriculture in the region for several thousand years, it can be assumed that conversion of the fields into agricultural area does not have persisting effects on current states or processes.
The Merken site (5
The Merzenhausen site (50
The observation site Hürtgenwald (50
Abbreviations for sites and land use types.
The vegetation data are structured in management periods, which are defined by a combination of the observation site, the field, the crop, and the year. A dataset identifier is assigned to each management period, such as, for example, “SEF05WW15”, which describes a management period at the site Selhausen (SE) on field 5 (F05) where winter wheat (WW) was harvested in the year 2015 (15). A management period can be either the growing period of a crop or the between-cropping period, where the field is fallow. The fallow period can be discontinuous and refers to the periods before planting and after the harvesting of a crop.
Data on fluxes and agricultural management can be matched to the management periods by the site, the field, and the year. Meteorological data are given per site. Soil parameters are available for several points at a site. All measurement locations are identified by their positions and are assigned to fields. Fields are defined by field boundaries with a specific land use and homogeneous agricultural management. In the dataset and throughout this text, sites and land use types are abbreviated as shown in Table 2, while the field numbering is shown in Fig. 1.
Additional conventions include the following:
For a crop, the given year is the year when the crop was
harvested. Time and date are in UTC. Coordinates are given in UTM (Zone 32N, WGS 1984).
Folder structure of the dataset. A slash (“/”) denotes a directory.
The dataset (Reichenau et al., 2020) is provided as a zip file containing text files in a separate folder for each site as shown in Fig. 2. Details on the data format are described below. An overview of management periods and available data is presented in Table 3.
Data availability for vegetation data, fluxes, and management data (“X” data available, “–” no data available). For an explanation of vegetation data categories, refer to Sect. 4. For crops, the year refers to harvest. Concerning vegetation data, the number of points gives the maximum number of points in the field measured on the same date. In the event of harvest residues, green sprouts, and other biomass, data are only marked available if at least one value unequal to zero is available.
Missing (or unknown) data are denoted by the symbol NA throughout the
dataset. There are three main causes of missing data.
Since the data described in this document are mostly results from field
measurements, some numbers are missing due to instrument failure or quality
issues (see sections on quality assurance). For data on agricultural
management, data availability depends on the willingness of the farmers to
report their activities. Consistency between sites and management periods: if a certain measured
variable is available for one site, the variable is also listed in the
respective data table of the other sites to keep the data format consistent.
If there are no data for the respective variable at a site, all data points in
that column are marked NA. Consistency with predefined file formats: for flux data, we used pre-existing
file formats, which define columns for variables that were not measured in
our case. This causes columns totally filled with NA.
The vegetation data contain information on fresh and dry biomass, development state, growth height, canopy density, row spacing, and tissue nitrogen and carbon content. Data on biomass are either differentiated by organ (brown and green leaves and stems, respectively, and fruit) or undifferentiated as overall aboveground biomass (named “biomass_undiff”). Furthermore, data on the undifferentiated biomass categories “harvest residues” or “green sprouts” may be included in a record. Harvest residues are understood as the aboveground residues after harvesting, which can be material lying on the ground or stubble left standing. Green sprouts are defined as plants growing between the harvest residues or on an otherwise fallow field. This can be weeds or regrowing crops (especially cereals). In addition, an undifferentiated biomass category named “biomass_other” may contain biomass of roots, weeds, or the like (specified in the database column “other_descr”).
Vegetation data were collected from 2007 to 2017 at different sites and
fields (see Table 3). Biomass and leaf area from at least three points in
the field were determined destructively. For row crops, the number of
plants within a certain distance of the row was also determined. For cereals, plants
were taken from 40 or 50 cm in three different rows. Triticale in
Hürtgenwald was not sown in rows. Thus, plants from an area of at least
After being transported to the lab in airtight bags, the fresh weight (FW) of the plant sample was determined. An aliquot of 150 g or at least one individual plant was further analyzed. In the event of a per-organ analysis, the sample was separated into fruit (understood as the harvested organ like ear, beet, etc.), green or brown stems (shoots), and green or brown leaves. A leaf or stem was classified as brown if 50 % of its surface was not green. A functional definition of a leaf was applied for cereal leaves where only the leaf blade was considered as a leaf, while the leaf sheath was assigned to the stem. Blossoms were defined as fruit. For Maize, the male blossoms on top of the plant were assigned to the stem, and only the female blossoms and the maize cobs that evolve from them were defined as fruit.
The leaf area was determined using either a LI-3000A area meter with a
LI-3050A belt conveyer (LI-COR Biosciences, Lincoln, NE, USA) or a flatbed
scanner (Epson GT-15000, Seiko Epson Corp., Suwa, Japan) together with the
public domain image analysis software ImageJ (
Additional information includes the following:
The frequency of data collection ranges from 1 to 3 weeks.
The number of measurements in each management period can be seen in Table 3. In the years 2015 until 2017, vegetation data were sampled on
overflight days of a radar satellite (Sentinel 1, Radarsat 2). Per-organ data of crops for fields at a particular site without
organ-specific measurements may be estimated from organ-specific biomass
measurements for fields of the same crop on this site assuming equal
proportions of the total aboveground biomass. The validity of this approach
depends on the similarity of soil and management conditions. Prior to 2011, harvest residues and green sprouts were not
sampled in the field. Therefore, these entries are always set undefined (NA)
in the years 2007 until 2010. LAI is undefined instead of zero where no LAI
was reported in the field protocol. During the management periods HWF04HR15 and SEF04HR16, the
fields were fallow. Therefore, all vegetation data are zero. These management
periods and other entries containing only zeroes are included in the dataset
to document dates where the field observations showed no biomass on the
field. Explicitly distinguishing no biomass from undefined or no data (NA)
provides important information for calibration or validation of remote-sensing products.
Figure 3 exemplarily shows dry weights and leaf area index of winter wheat from field F08 at the Selhausen site in 2009 (dataset identifier SEF08WW09). For this management period, three samples per field were collected at each of the 14 dates beginning in December 2008. The last samples were taken on 27 July 2009, 1 d before harvest on 28 July 2009. The graphs nicely show that the exponential growth phase in April comes along with higher variability between the points in the field in terms of green biomass and LAI. With the beginning of senescence in late May, brown biomass and LAI emerge, showing even higher variability. This is a result of small-scale spatial variability of soil and vegetation properties and terrain under field conditions, which is important information for model evaluation.
Dry weight (DW) and leaf area index (LAI) of winter wheat on field F08 at the Selhausen site in 2009. Dataset identifier is SEF08WW09.
Using the data in remote-sensing applications often results in scale
problems. Since only small patches of
Examples of the application of the vegetation data can be found in Ahrends et al. (2014), Brogi et al. (2020), Korres et al. (2013), Ney and Graf (2018), Reichenau et al. (2016), and Schmidt et al. (2012).
The first step of the quality assurance procedure for the vegetation data
was a rigorous documentation of the measuring process. In addition to
written documentation on any phenomena, which might have affected the
measurement (in the field and in the lab), a photographic documentation of
the samples in the field and in the lab enables a visual inspection and
provides independent evidence in case of any doubts. Transcribing the analog
protocols into a spreadsheet-based (MS Excel) digital field protocol
provides a first test of data consistency. Possible errors, inconsistencies,
or incomplete data are reported automatically, and the personnel entering the
data are prompted to check the entries. Transcribing the data from the analog
protocol to a digital dataset is done as soon as possible to be able to
trace possible errors. Keeping analog field protocols provides a double
documentation of the valuable measurements and observations. In the second
step, tests on consistency and plausibility were applied, which ensure that
coordinates are in UTM projection, and timestamps are in UTC; naming of crops, sites, and points follows conventions; values are in plausible ranges; missing values are set to unknown (NA); the right upscaling method is set for a crop throughout a management period; there are no duplicate coordinates for points in a field at the same date.
A third step comprises statistical tests, which result in a quality flag for each value in the dataset (see below). These tests were applied using an R script (R Core Team, 2017), which reads from the digital field protocols, assigns the quality flags, and finally writes the files provided in the dataset.
The quality flags can take the values 1 to 5:
high quality (all tests could be applied, and no problems were identified;
no problems were identified in the field) good quality (a test could not be applied; information is missing to
ensure high quality) unusual water content (a specific flag concerning the measured water
content of the sample, which may hint at problems with biomass measurements) suspicious (a test or a documented issue in the protocol showed possible
problems) low quality (a value is known to have problems but is of interest as evidence of the real conditions, e.g., root biomass)
The flags were set based on the criteria explained below. After evaluation of all tests, the flag with the highest value was assigned. Obviously erroneous data were removed from the dataset. There are no flags for the carbon and nitrogen content of the plant tissue.
Weight measurements below 1 g were generally flagged as good quality (2) instead of high quality (1) as it is quite likely to lose material from samples, which will have a larger relative effect than for high biomass. All harvest residues are generally flagged as suspicious (4). This is due to the fact that precise collection of only the aboveground material is rather difficult and error-prone. It is even more difficult to extract the belowground biomass. Therefore, root biomass (given as “biomass_other”) is generally flagged as low quality (5).
In most cases, a sample from the field had to be differentiated into fractions (organs, harvest residues, green sprouts). For larger samples, only a part (aliquot) was analyzed in the lab (see Sect. 4.1). For organ-specific analysis, this aliquot is the sum of all organs. In the event of undifferentiated biomass, the aliquot is the sum of the biomass categories biomass_undiff, harvest_residues, green_sprouts, and biomass_other. During the process of sample partitioning some material might get lost, causing a difference between the aliquot and the sum of its components (median 1 %). Differences of up to 5 % were accepted independent of their sign. Larger differences result in higher values of the quality flag (Table 4). Higher flags are set if the sum of their components exceeded the aliquot because this cannot be explained by losing material.
Quality flags set if the sum of their components differed from the biomass aliquot.
If an aliquot was available but the FW of one of its components was missing, this FW was recalculated from the difference of the aliquot and the sum of the available FWs. Due to the missing value, the loss of material during sample partitioning cannot be determined. Instead, it is contained in the recalculated value, which is therefore flagged as suspicious (4). In this case, the test against the aliquot is not applicable. Thus, the other FWs were flagged as good quality (2).
Flow chart of the decision process of quality assurance when fresh weight (FW) was found to be larger than dry weight (DW).
The comparison of FW and DW can reveal errors in the biomass data. In the first step, it was tested whether DW exceeds FW (Fig. 4). For brown leaves and stems, FW and DW were compared directly, while for the other biomass categories, the FW was reduced by 5 % for this test assuming that percentage of minimal water content. If DW exceeded the resulting FW, it was checked whether the sum of fresh weights was less than 95 % of the aliquot, which hints at a possible error in the FWs (see above). In that case, the error can be attributed to neither FW nor DW, and both were removed from the dataset. If the sum of fresh weights was more than 95 % of the aliquot, the error was attributed to the DW, which consequently was removed from the dataset, and the corresponding FW was flagged as good quality (2).
Temporal course of sugar beet leaves' relative water content (percentage of fresh weight, all available data). Black lines show the upper and lower limit of the “usual course” (definition in the text). Circles outside the limits of the usual course are assigned the unusual water content flag (3). These data may still be valid because of heterogeneous conditions in a field (e.g., because of earlier drying).
In the second step it was checked whether the relative water content
(FW Use all water content data for a respective crop and organ. Exclude outliers by removing all values outside of the 10 % and 90 %
percentiles in a running 21 d window. In each time window, determine the corridor of 2 standard deviations above
and below the mean.
Owing to the low number of data for some crops and organs and to their scattering, the upper and lower boundaries of the corridor show a lot of
scatter. Since there is tendency towards lower water content with
progressing phenological development, the limits of the usual course were
defined as follows (Fig. 5).
Lower limit: for each day in the direction of time, only include the lower
boundary of the corridor if it is lower than the value on the previous day.
Otherwise, keep the value of the previous day as the lower limit of the
usual course. Upper limit: for each day in reverse direction of time, only include the
upper boundary of the corridor if it is higher than the value on the
following day. Otherwise, keep the value of the following day as the upper
limit of the usual course.
For water content outside of the upper or lower limits, FWs and DWs were assigned the “unusual water content” flag (3). However, these data might also result from particularly dry or wet conditions at a point in a field in a certain year.
All issues observed in the field or in the lab which may have had an influence on the results were translated into flags. For samples reported as dirty, FW and DW were flagged as suspicious (4). For humid or wet samples, samples which might not have been completely dried, and samples which were not analyzed on the same day, only the FW was flagged as suspicious (4) since DW is not affected. If the number of plants per meter was required for upscaling (MA, SB, RA) but missing, this value was derived from other points or dates in the same management period and field. Since this propagates linearly to LAI and to all biomasses per square meter and since the germination rate is variable in space, all FWs and LAI were flagged as suspicious (4).
FW and DW are connected by the upscaling process from the aliquot to the sample (see Sect. 4.1) because the upscaling factor derived from FW is also applied to DW. Therefore, flags were propagated from FW to DW and in the event of leaves also to LAI.
To ensure the validity of the location coordinates it was ensured that reported coordinates of a given measurement are within the given field and that no duplicate coordinates are assigned to different measurements on the same date. If it was not possible to correct implausible coordinates, they were removed. In 2008, measurement locations within each field were predefined and marked with flags. Consequentially, coordinates were not recorded explicitly. Since destructive sampling employed in this study prevents repetitive sampling of the exact same location, the prescribed coordinates represent the sampling location less accurately than those recorded directly at the sampling points. Thus, coordinates for 2008 were flagged as good quality (2) instead of high quality (1).
Uncertainty of biomass data is difficult to estimate. Sources of error exist in all steps of sampling and analysis, including harvest of the samples in the field (incomplete harvest), loss of material and water during handling of the sample, and the unsystematic error of the scales. The error of incomplete harvest cannot be quantified based on the existing data. However, the relative error can be assumed to be rather small for high biomass. The error of handling the sample in the lab (separation of the sample) can be assessed by comparing the weight of the aliquot that was separated by organ with the sum of the organ weights. Of 1176 organ-specific records in the dataset, 229 have a valid aliquot. The other records either show missing values, only have a single organ, or were weighed in total without taking an aliquot. A total of 164 records show a loss of material during separation, while 20 show an increase. The mean loss is 2.6 % of the aliquot (max 15 %). The mean increase is 2.9 % (max 17 %). The average error for the (un-)packing steps associated with transport and drying cannot be quantified based on the available data. However, since activities are similar, it can be assumed to be of a similar relative magnitude. The maximum error of the scales used in the lab was 0.1 g. Since leaf area can be measured quite precisely, the relative uncertainty of LAI depends primarily on the accuracy of the leaf weight used for upscaling. Since these are connected linearly in the upscaling process, it equals the relative uncertainty of biomass. A further source of error is the upscaling from the sample taken in the field to a square meter. For row crops (see Sect. 4.1), the error of the measured row spacing or plant density within the rows propagates linearly into the upscaled result. In order to reduce this error, the median of all row distances measured on a field in a management period was used for upscaling. As the sowing machine settings do not change within a field, the resulting error is considered small. In the field, plant height was measured with a folding rule. The reading accuracy is assumed to be 1 cm, which is less than the natural variability of plant height.
The uncertainty of carbon and nitrogen content of the plant tissues was determined by analyzing differences of 1034 duplicate measurements (two aliquots of the same sample). For carbon content, the mean difference of the samples was 0.6 %. For nitrogen content, the mean difference was 1.1 %. The largest differences occurred for root tissues.
Concerning the uncertainty of phenological states in the BBCH system, principal growth stages (first digit) can be assumed to be correct, while secondary growth stages (second digit) may have an error. Since this depends on the observer, it cannot be generally quantified.
Vegetation data are supplied per site in a UTF-8-coded CSV file named “vegetation_” followed by the two-letter site abbreviation (Table 2). The column separator is the semicolon (;). A description of columns and units is presented in Table A1. The no-data symbol is NA. The files have two header lines, of which the first contains the variable names, while the second contains the units.
Phenological development (“bbch” column) may be given as a single number or as a range if the development state could not be exactly identified in the field. Before sowing and after harvest, the land use is set to harvest residues (HR) independent of the presence of residues on the surface of the field.
Locations, processing software, instrument heights, and temporal extent of eddy covariance measurements. Coordinates are UTM zone 32N (WGS 1984). For information on quality indicators see Sect. 5.2.
The dataset contains 15 time series of flux measurements (Table 5). Net
fluxes of carbon (net ecosystem exchange, NEE), water (latent energy, LE),
and energy (sensible heat flux, H) at the surface were measured at the sites
Selhausen, Merzenhausen, and Merken using state-of-the-art eddy covariance
systems. There were no flux measurements in Hürtgenwald. Wind components
and sonic temperature were measured with a three-dimensional sonic
anemometer (CSAT3, Campbell Scientific, Inc., Logan, UT, USA). Measurements
of water vapor (
Measurements were taken with a sampling rate of 20 Hz and were aggregated to intervals of 30 min. Processing of raw measurements was accomplished as shown in Fig. 1 of Mauder et al. (2013) using the processing software shown in Table 5. The number of decimal places in the data files was kept as they were in the output of the processing software. No gap filling was applied to the data.
Examples of the application of the flux data can be found in Ahrends et al. (2014), Eder et al. (2015), Klosterhalfen et al. (2017), Ney and Graf (2018), Schmidt et al. (2012), and Wienecke et al. (2018).
Quality control was accomplished according to the “TERENO” scheme for quality and uncertainty assessment presented by Mauder et al. (2013). Deviating from this description, before 2011 the software TK2 (Mauder and Foken, 2011) was applied following the process described in Sect. 2.3 of Schmidt et al. (2012). The software ECpack 2.5.20 (Van Dijk et al., 2004) was applied for the data from Merken (Table 5). The software TK uses flagging to indicate the quality of data. Flag values and their meanings are shown in Table 6.
Flag values set by the TK software and their meanings.
Acceptable value ranges of ECpack results and tolerance values at the lower and upper boundary. For the meaning of the “Group” column refer to the text.
Since flux data from Merken 2009 (MK09) was processed with the ECpack software, the concept of quality assurance differs from the other sites. ECpack provides tolerance values which can be used to rate the quality of data (Table 7). Values outside the lower and upper boundaries given in Table 7 are considered invalid. In addition, data can be filtered using the tolerance values. A tolerance is assigned to the lower and upper boundary of each variable, respectively. To evaluate the quality of the data in the valid ranges, tolerances have to be linearly interpolated between the boundaries. The most obvious tolerance violations have already been eliminated by a postprocessing scheme. Tolerance limits were set sufficiently wide to retain most of the values which still might be useful in the dataset. For some variables, considering a value to be invalid causes the whole record to be invalid. These variables are assigned to group A in Table 7. If any value of group B is considered invalid, only the values of group B are invalid.
Uncertainty information for fluxes per data point is available for sensible
heat flux, latent heat flux, NEE, and friction velocity. The kind of
uncertainty information differs between the different software tools used
for data processing (Table 5). For TK3, relative random errors and relative
noise errors for friction velocity, sensible and latent heat flux, and net ecosystem exchange are given in the respective columns (see Table A3) in
the data files. For datasets processed with TK2 this information is not
available. A rough estimate of the general uncertainty for these
measurements may be obtained from statistics of the errors included given in
the TK3-processed data. For other variables included in the TK output, the
uncertainty is quantified from the instrument errors given by the respective
manufacturers (Table 11). The uncertainties of
Flux data are provided in a UTF-8-coded CSV file per field and year. The file name consists of “fluxes_” followed by the two-letter site abbreviation (Table 2), the field ID (Fig. 1), “EC”, a station identifier, and the year. The elements of the file name are separated by underscores (e.g., fluxes_SEF01_EC_001_2016.csv). The column separator is the semicolon (;). A description of columns and units is presented in Tables A2 and A3 for the TK and ECpack software, respectively. The no-data symbol is NA. The files have two header lines, of which the first contains the variable names, while the second contains the units.
Availability of soil information per site. C and/or N: both carbon and nitrogen content is not always available. Due to the absence of carbonates, C content is expected to equal soil organic carbon (SOC).
Soil property data include particle size distribution of the fine soil
(
All particle sizes were analyzed following DIN ISO 11277. Therefore, they follow the definition of particle size classes of DIN 14688. Particles larger than 2 mm are considered gravel. To recalculate particle sizes to the USDA system, which is assumed for many pedotransfer functions, refer to, e.g., Nemes et al. (1999). All data on particle sizes and soil carbon or nitrogen content refer to fine soil after the removal of gravel. Therefore, percentages of sand, silt, and clay refer to fine soil, while the percentage of gravel refers to total soil mass. Bulk density was determined gravimetrically. Total C concentrations in soil samples were determined by elemental analysis. Based on previous analyses it can be assumed that all samples were free of carbonates. Hence, total C concentrations are in accordance with those of soil organic carbon (SOC).
Applications of the soil data can be found in Bornemann et al. (2011), Brogi et al. (2019, 2020), Jakobi et al. (2020), Korres et al. (2013), and Meyer et al (2017).
Soil data for Selhausen originate from different
sources. Particle sizes for three depths in field SE F08 were analyzed at
the Laboratory for Physical Geography, University of Cologne. For the
plowing horizon of field SE F00, particle sizes were analyzed at the
Institute of Crop Science and Resource Conservation, University of Bonn
(Bornemann et al., 2011). These data have a high spatial resolution that
enables analysis of small-scale heterogeneity. A third dataset consists of
horizon-specific particle size data from 100 randomly chosen points from a 1 km
From the 100 sampling points of the 1 km
Due to temporal and spatial variability, these data have to be understood as snapshots and cannot be transferred to other points in space or time.
Particle size data for the Merken site are only available from a composite sample based on samples from all fields. These data are assumed to be valid for all fields due to small spatial heterogeneity of the soil at the site. The analysis was carried out at the Soil Physical Laboratory of IBG-3, Jülich Research Centre. Field-specific carbon and nitrogen content for three soil depths was measured from composite samples as described for Selhausen.
For the field in Merzenhausen, soil texture, bulk
density, and content of carbon and nitrogen were determined for the Ap
horizon at a single point in the field following the methodology described
by Bornemann et al. (2011). Bulk density was quantified from three
independent 100 cm
For Hürtgenwald, particle sizes were analyzed at the Institute of Crop Science and Resource Conservation, University of Bonn, while bulk densities were determined at the Laboratory for Physical Geography, University of Cologne.
For the determination of particle sizes, bulk density, and carbon and
nitrogen content, at least two samples from each point were analyzed in
parallel. This was not the case for the 1 km
To quantify the uncertainty of particle size fractions, data of a repeatedly
analyzed sample were evaluated at the University of Bonn. The results show
coefficients of variation (CVs) of 2.0 % for sand, 2.4 % for silt,
2.5 % for clay, and 3.5 % for gravel. Since such repeated estimates
were not performed at the University of Cologne, it is assumed that the
uncertainty of their measurements is of the same magnitude. At Jülich
Research Centre, particle sizes were automatically analyzed with a Sedimat
(see above), which has uncertainties in the calculation of the particle size
fractions that are comparable to those obtained in the abovementioned analysis
performed in Bonn. For bulk density, a CV of 10 % was determined from the
analysis of multiple adjacent samples from the same horizon (University of Bonn). For the soil unit data from Selhausen, uncertainties for
particle size fractions and layer depth are given in the respective columns
(Table A6) in the data files. The CNS elemental analyzers used to determine
soil carbon and nitrogen content show uncertainties of
Soil data are provided in a UTF-8-coded CSV file per site named “soil_” followed by the two-letter site abbreviation (Table 2). The column separator is the semicolon (;). A description of columns and units is presented in Table A5. The no-data symbol is NA. Soil unit data for Selhausen are provided in a UTF-8-coded CSV file named “soil_units_SE.csv”. The column separator is the semicolon (;). A description of columns and units is presented in Table A6. The files have two header lines, of which the first contains the variable names, while the second contains the units.
Meteorological stations, their positions, available data, and data source.
The dataset was assembled with the aim of providing the data usually required
to run a hydroecological crop growth model. Therefore, the dataset includes
gap-filled hourly meteorological data of air pressure (AirPres; Pa), global radiation (Globrad; W m
The availability of meteorological field data varies between the sites as well as in time. The temporal availability of data increased significantly with time due to the setup of the meteorological stations of the TERENO Eifel/Lower Rhine Valley observatory in 2011. In the earlier years, only a few meteorological stations were run near the sites. Table 9 shows a list of all meteorological stations used in this study. Methods to fill gaps in the time series vary between years and stations. The gap-filling methods are explained in the following sections. The data sources for each year and site are presented in Table 10. In most cases, information on the measurement devices and raw data with gaps can be obtained from the data sources shown in the table.
Meteorological data for the site Hürtgenwald are provided for the years 2014 through 2016. Meteorological measurements started on 21 April 2015 (station 20 in Table 9). In 2016, additional stations were set up nearby (station 21 in Table 9). Data for earlier dates were generated using the regression gap-filling method (see Sect. 7.1.1) for all variables but AirPres, where gaps were filled using the barometric formula (Eq. 1). The first year for HW consists of reconstructed data only.
Data for the site Merzenhausen are provided for the years 2009 through 2017. Local measurements are available for the whole period (stations 1 and 15 in Table 9). For the years 2009 and 2010, gaps were filled using the regression method. From 2011 on, the EOF (empirical orthogonal function) method was used (see Sect. 7.1.1).
Data for Selhausen are provided for the years 2007 through 2017. Local measurements are available for the whole period, starting on 27 May 2007 (stations 10, 11, and 19 in Table 9). For the years 2007 until 2010, the regression method was used to fill gaps. From 2011 on, the EOF method was applied.
For the site Merken, no meteorological data are available. Since the distance to Selhausen is only about 4 km, and the difference in elevation is about 10 m, it can be assumed that the weather was very similar to that in Selhausen. Therefore, the use of Selhausen meteorological data when working with Merken vegetation or flux data is suggested.
Since cloudiness is not available for any of the sites but required in some ecohydrological models (e.g., the DANUBIA simulation system; for an application see, e.g., Korres et al., 2013), data on cloudiness from the Aachen station of the German Meteorological Service (distances to HW, ME, and SE are 37, 37, and 42 km, respectively) were used. Since there is no reliable method to adjust cloudiness data to remote stations, the data were used without modifications.
Information on the conditions at the locations of the meteorological stations, especially in the past, are not fully available. Therefore, precipitation data are given as measured at the stations. Since the data were not corrected for shielding effects, precipitation can be assumed to be slightly underestimated.
Excerpt of the gap-filled hourly meteorological data for the Selhausen site for the period from May until July 2011.
Figure 6 shows an excerpt of the meteorological data for the Selhausen site for the period May to July 2011. The graphs show a period where there are no breaks or shifts in the continuous curves, which is the usual case in the weather time series (for a discussion of inhomogeneities, see Sect. 7.1.3). In the middle of June, the example data show a noticeable period of 2 d with low radiation and temperatures together with rather high wind speed and high cloud cover. All variables show a reduced diurnal cycle, which confirms the consistency of the time series of the separate variables and is an important prerequisite for a good reproduction of real processes in a simulation.
Applications of the meteorological data can be found in Korres et al. (2013), Ney and Graf (2018), Sakai et al. (2016), and Schmidt et al. (2012).
In the course of the TR32 project, an increasing number of meteorological stations were set up in the Rur catchment. Therefore, different methods were chosen for different periods to fill gaps in the meteorological data.
For this simple approach (method 0 in Table 10), data of a nearby station were simply inserted into gaps of the reference station's time series. This method was applied in the beginning of the project when only a few stations were set up.
Source of meteorological data given as station IDs as defined in Table 9. Station IDs in parentheses are stations used for gap filling. The “Met” column shows the method used for gap filling as explained in the text.
This method (method 1 in Table 10) was applied to fill gaps in Hürtgenwald. This method was applied because Hürtgenwald was not included in the central TR32 gap-filling efforts with the EOF method (see below) since it was not an official site of TR32 stations. A simple linear regression was set up between the available data of the station with gaps and a nearby station for each variable, respectively. The slope of the regression was then applied to the data of the nearby station to fill the gap. In the event of a data gap at the nearby station, data from a further station were used. In the seldom cases where no data were available at any station, the gap was filled based on linear interpolation. No gaps longer than 4 h had to be filled this way.
For variant 2 of the regression method (method
2 in Table 10), which was applied for the year 2010 in Selhausen and for the
years 2009 and 2010 in Merzenhausen, the data of a reference station were
correlated with data of the closest remote station using a reduced major
axis regression (Webster, 1997). If the coefficient of determination was
higher than 0.9, the data of the remote station were inserted into the data
gap without further processing (same as insertion method). If
This method (method 3 in Table 10) was applied for the
sites Merzenhausen and Selhausen from 2011 on as soon as enough stations
were available in the TR32 set of meteorological stations. It utilizes
empirical orthogonal functions (EOFs) to describe the relation between
variables at several meteorological stations. The approach was originally
introduced by Beckers and Rixen (2003) and adapted for station time series
by Graf (2017); further information on EOF computation on similar data can
be found in Graf et al. (2012). Since the approach does not depend on the
regular spatial arrangement of the pixels, it can easily be transferred to a
network of stations. In contrast to the original approach, this method works
on the Prior to gap filling, remove all values rated “bad” or “suspicious”. Delete an additional 10 % (randomly selected) of the available data
per station, and set them aside for cross-validation purposes. Perform a Replace all missing values with zeroes. Compute the EOFs, and reconstruct the time series of each station and
variable using only the first EOF (“truncated reconstruction”). Fill all gaps with the reconstruction, and repeat step 4 with the filled
time series. Repeat the procedure until no data point is changed from one
iteration to the next by more than 1 %, if the change between
iterations starts to increase again in at least one data point, or if a
maximum of 1000 iterations is reached. Use the dataset with the new preliminary fillers to initialize at step 4
again but this time using the first two EOFs. Continue as in step 5. After
this has converged too, use the first 3 EOFs and so on until 10 EOFs are
used. Retransform results to absolute values (reverse step 2). Use the cross-validation dataset set aside in step 1 to determine the
number of EOFs at which the prediction is optimal (minimum RMSE between
validation data and prediction). Repeat the whole procedure up to this
number of EOFs starting with step 2 (i.e., without removing cross-validation
data).
An advantage of this approach is that the EOF method exploits the same underlying statistics as multiple linear regression would but does not need to be re-evaluated each time a predictor variable becomes unavailable. The method was applied to 10 min resolution data from stations 1 to 18 (Table 9). Results were aggregated to hourly resolution.
Gaps in cloudiness data were filled using the “na.approx” method in the R package “zoo” (Zeileis and Grothendieck, 2005).
For the sites and years where the EOF method was not applied, air pressure
data were transformed between stations by using the barometric formula
A closer look at the time series of meteorological data reveals differences in general characteristics between different years. This is mainly due to different instruments or different calibration of instruments. By these means, synthetic breaks in the time series are generated that can disturb the analysis of real phenomena. This is particularly a problem when using the data with models, which deterministically transform weather data into plant growth and into exchange fluxes of matter and energy.
Breaks in time series of meteorological measurements.
Several breaks can easily be identified from graphical visualizations of the data. Figure 7a shows a shift in air pressure measured in Selhausen from 2009 to 2010 using different instruments. A similar effect can be observed in the Merzenhausen data. Figure 7b illustrates different maxima of relative humidity in 2015 in Hürtgenwald, which are due to differences in instrument calibration. This effect can also be found in the data for Merzenhausen and Selhausen. Other obvious breaks refer to lower extrema of air temperature (SE and ME), maxima of global radiation (ME), maxima of wind speed (SE), and changing temporal variability of wind speed (HW). Often, these breaks coincide with a change in the main source station (Table 10). Other less noticeable breaks may be included in the time series.
The removal of such breaks in the time series is known in the literature as homogenization. Several methods have been developed to detect the breaks and correct for inconsistencies. However, most of these methods were designed for monthly or annual data (Venema et al., 2012) and are not applicable to subdaily data (Aguilar et al., 2003; Auer et al., 2005; Wijngaard et al., 2003). Since methods for data on higher temporal resolutions would involve dealing with nonlinear atmospheric processes (Della-Marta and Wanner, 2006), the World Meteorological Organization does not yet make any recommendations on how to homogenize these data. Nevertheless, the following literature might help in finding an appropriate homogenization method for the intended application of the data: Vincent et al. (2002), Brandsma and Können (2006), Della-Marta and Wanner (2006), Kuglitsch et al. (2009), Mestre et al. (2011), and Trewin (2013) for temperature; Beaulieu et al. (2008), and Beaulieu et al. (2009) for precipitation; and Domonkos and Coll (2017) for both.
Time series of meteorological data were checked for plausibility of the recorded data. Values outside of a plausible range were removed from the dataset. Periods of repeated identical (but plausible) values were removed. To ensure good quality of gap filling, the gap-filling methods were applied to periods with measurements of good quality.
Instrument uncertainties for meteorological measurements at stations 1 to 21 and 25 (for station IDs see Table 9).
Measurement uncertainties of weather variables are given as instrument errors in Table 11. It has to be mentioned that especially for precipitation, the instrument error is much smaller than systematic errors. For a discussion of such errors, see Dengel et al. (2018).
Additional uncertainty occurs when gaps in time series are filled based on data from other stations. Because different methods and data sources were used, uncertainty was determined separately for the different sites and years. For the years 2007 to 2010, uncertainty was estimated by deriving a fill value from remote stations for each available value at the reference station using the respective method shown in Table 10. Bias and root mean square error (RMSE) were calculated from the differences (Table 12). These results for the Selhausen site are assumed to be transferable to the other sites.
Uncertainty estimates for weather time series gap filling
expressed as pairs of bias (
For the EOF method, which was applied for the years from 2011 to 2017, an extra run on a dataset copy with artificial gaps was used to determine worst-case uncertainty estimates. These artificial gaps were inserted for the Merzenhausen site for the 2.5 consecutive days with the highest mean for the respective variable (relative humidity: lowest mean) for all sensors at the site. The artificial gaps were then filled, and the differences to the measured data were evaluated in terms of bias and RMSE (Table 12). By selecting an extreme situation for gap filling, uncertainties for the EOF method are a worst-case estimate. Inserting arbitrary gaps would probably give lower uncertainty values. In addition to this, when comparing uncertainty estimates between different periods, it has to be taken into account that the analysis for the EOF method was applied to raw 10 min data, while the evaluation for the years 2007 to 2010 is based on hourly data, which generally results in slightly lower RMSE values. Again, we assume that results can be transferred to the Selhausen site.
For precipitation in the period 2007 to 2010 and for global radiation in 2010, uncertainty estimates cannot be given since the raw data are no longer available. Data from the German Meteorological Service (DWD) were used for global radiation in 2007 to 2009 and for air pressure in 2010. Since these data were without gaps, there is no gap-filling uncertainty for these variables.
Weather data are provided in a UTF-8-coded CSV file per site named “meteo_” followed by the two-letter site abbreviation (Table 2) and the span of years available. The column separator is the semicolon (;). A description of columns and units is presented in Table A7. The no-data symbol is NA. The files have two header lines, of which the first contains the variable names, while the second contains the units.
Crop management data were obtained from the farmers by means of a questionnaire. However, the information given by the farmers
was only reported for 27 out of 58 management periods and is often
incomplete (Table 3). The following information was inquired:
sowing date sowing density, row spacing, seed spacing in a row, seed weight,
sowing depth, cultivar fertilization date, amount, and product cultivation date and type growth regulator application date, amount, and product fungicide/insecticide/herbicide application date, amount, and
product harvesting date dry weight of yield after harvest information on residues left on the field
All fertilization data were recalculated to kilograms of nitrogen per hectare.
Since for some products, nitrogen content is not explicitly stated, the
following assumptions were made: it was assumed that KAS (calcium ammonium
nitrate) contains 27 mass % nitrogen. Furthermore, it was assumed that
sulfane contains 24 mass % nitrogen. AHL (urea ammonium nitrate solution, UAN)
was assumed to have a density of 1.3 kg L
Applications of the management data can be found in Korres et al. (2013) or Schmidt et al. (2012).
Some of the fields were equipped with automatic camera systems, which took hourly photos. Management information gathered from the farmers was checked against these photos.
Accuracy of management data is based on the reliability of the information provided by the farmers. Since there is no way to check information on fertilizer or agrochemical types and amounts, an uncertainty cannot be assigned.
Management data are provided in a UTF-8-coded CSV file per management period. The file name starts with “management_” followed by the ID of the management period (e.g., management_SEF08WW09.txt). The file can contain data on management activities in the fallow period before or after harvest. If no management information is available, the file contains a comment only. There are no management files for management periods denominated “harvest residues” (HR).
Each record is structured in the same way: date; keyword; additional
information. The elements of the record are separated by a semicolon (;).
The record starts with the date in YYYY-MM-DD format, where day may be
replaced by “xx” if the exact date is unknown. In the second position, the
record contains a keyword that defines the management activity. Keywords
refer to basic crop-related activities (“Sowing”, “Harvesting”,
“Fertilizer”, “Cutting”), soil management (“Plow”, “Rotary harrow”,
“Harrow”, “Roller”, “Cultivator”, “Tire packer”), and application of
agrochemicals (“Herbicide”, “Growth control”, “Fungicide”,
“Insecticide”, “Coformulant”). After the keyword, one or more pieces of
additional information may follow in a semicolon-separated list.
Fertilizer: amount of fertilizer in kilograms of nitrogen per
hectare, information on the product and its contents (may also be a
semicolon-separated list) Application of agrochemicals: amount of agrochemical per area, information on the product and its contents (may also be a
semicolon-separated list) Sowing: sowing density, row spacing, seed spacing, weight of
seeds, sowing depth, cultivar
Unknown information is indicated by the no-data symbol NA. Units are given with the data. Comments start with “#”. Comments can contain additional information on yield, management of harvest residues, additional contents of agrochemicals, etc.
The dataset can be downloaded from the TR32
database (
The tables in the appendix describe the data files in terms of their column order, variables, units, and data types (Tables A1–A3, A5–A7) and specify soil particle size classes (Table A4).
Columns in the vegetation data files.
Continued.
Columns of flux data files processed with the software ECpack. With the exception of the timestamps, all data types are numeric.
Columns of flux data files processed with the software TK. With the exception of the timestamps, all data types are numeric.
Continued.
Size ranges of particle size classes (DIN 14688).
Columns of soil data files. For particle size classes see Table A4.
Columns of the soil unit data file. These data exist for the site Selhausen only. For particle size classes see Table A4.
Columns of weather data files. With the exception of the timestamps, all data types are numeric.
TR designed and compiled the dataset and did the quality control and processing of the vegetation data. AS and AG did the gap filling of meteorological data and developed the methods. MS and AG were responsible for the eddy covariance data. MS collected the management data. WK, TR, AS, and KS were responsible for the collection of the vegetation data. Meteorological data were acquired by AG, MS, WK, and KS. AG, WK, GW, NM, and CB took and analyzed soil samples. The manuscript was prepared by TR with contributions from all coauthors.
The authors declare that they have no conflict of interest.
This study was supported by the Deutsche Forschungsgemeinschaft through the Transregional Collaborative Research Center 32 – Patterns in Soil–Vegetation–Atmosphere Systems: Monitoring, Modeling and Data Assimilation. In addition, support was received through the “Terrestrial ENvironmental Observatories” (TERENO). We thank the numerous student helpers for their help with the field campaigns. Special thanks go to the farmers who granted access to their fields and to our student helpers Michael Holthausen (gap filling of meteorological data), Tobias Bothe (gap filling and soil sampling), and Nils Eingrüber (consistency check of soil data). We thank Florian Steininger for collecting the management data from the farmer in Hürtgenwald, Victor Venema for discussions and literature on the homogenization of meteorological data, Ulrike Schwedler for cartography, and Constanze Curdt for advice concerning all aspects of data management. Finally we want to thank the five reviewers for their helpful comments.
This research has been supported by the Deutsche Forschungsgemeinschaft (grant no. TRR 32/3).
This paper was edited by Yuyu Zhou and reviewed by four anonymous referees.