Quality control and correction method for air temperature data from a citizen science weather station network in Leuven, Belgium

Beele, Eva; Reyniers, Maarten; Aerts, Raf; Somers, Ben

doi:https://doi.org/10.5194/essd-14-4681-2022

Articles | Volume 14, issue 10


Data description paper

21 Oct 2022



Abstract

The growing trend toward urbanisation and the increasingly frequent occurrence of extreme weather events emphasise the need for further monitoring and understanding of weather in cities. In order to gain information on these intra-urban weather patterns, dense high-quality atmospheric measurements are needed. Crowdsourced weather stations (CWSs) could be a promising solution to realise such monitoring networks in a cost-efficient way. However, due to their nontraditional measuring equipment and installation settings, the quality of datasets from these networks remains an issue. This paper presents crowdsourced data from the “Leuven.cool” network, a citizen science network of around 100 low-cost weather stations (Fine Offset WH2600) distributed across Leuven, Belgium ($50^{\circ}52'$ N, $4^{\circ}42'$ E). The dataset is accompanied by a newly developed station-specific temperature quality control (QC) and correction procedure. The procedure consists of three levels that remove implausible measurements while also correcting for inter-station (between-station) and intra-station (station-specific) temperature biases by means of a random forest approach. The QC method is evaluated using data from four WH2600 stations installed next to official weather stations belonging to the Royal Meteorological Institute of Belgium (RMI). A positive temperature bias with a strong relation to the incoming solar radiation was found between the CWS data and the official data. The QC method is able to reduce this bias from 0.15 ± 0.56 to 0.00 ± 0.28 K. After evaluation, the QC method is applied to the data of the Leuven.cool network, making it a very suitable dataset to study local weather phenomena, such as the urban heat island (UHI) effect, in detail (https://doi.org/10.48804/SSRN3F, Beele et al., 2022).


Received: 01 Apr 2022 – Discussion started: 02 May 2022 – Revised: 23 Sep 2022 – Accepted: 26 Sep 2022 – Published: 21 Oct 2022

1 Introduction

More than 50 % of the world's population currently lives in urban areas, and this number is expected to grow to 70 % by 2050 (UN, 2019). Keeping this growing urbanisation trend in mind and knowing that both the frequency and intensity of extreme weather events will increase (IPCC, 2021), it becomes clear that both cities and their citizens are vulnerable to climate change. To plan efficient mitigation and adaptation measures and, hence, mitigate future risks, information on intra-urban weather patterns is needed (Kousis et al., 2021). Therefore, dense high-quality atmospheric measurements are becoming increasingly important to investigate the heterogeneous urban climate. However, due to their high installation and maintenance costs as well as their strict siting instructions (WMO, 2018), official weather station networks are sparse. As a result, most cities only have one official station, or they may even lack an official station (Muller et al., 2015). Belgium only counts around 30 official weather stations distributed across a surface area of 30 689 km²; a total of 18 of these weather stations (Sotelino et al., 2018) are owned and operated by the Royal Meteorological Institute of Belgium (RMI). These classical observation networks operate at a synoptic scale and are, thus, not suitable to observe city-specific or intra-urban weather phenomena such as the urban heat island (UHI) effect (Chapman et al., 2017).

The UHI can be measured using a number of methods. Fixed pairs of stations (e.g. Bassani et al., 2022; Oke, 1973) or mobile transect approaches (e.g. Kousis et al., 2021) have traditionally been used to quantify this phenomenon. However, neither method is ideal: pairs of stations lack detailed spatial information, whereas transects often miss a temporal component (Chapman et al., 2017; Heaviside et al., 2017). Other studies have quantified the UHI using remote sensing data derived from thermal sensors. Such methods can provide spatially continuous data over large geographical extents but are limited to land surface temperatures (LSTs) (Arnfield, 2003; Qian et al., 2018). As opposed to LSTs, the canopy air temperature (T_air) is more closely related to human health and comfort (Arnfield, 2003). Nevertheless, finding the relationship between LST and T_air is known to be rather difficult, and results have been inconsistent (Yang et al., 2021). Numerical simulation models (e.g. UrbClim, De Ridder et al., 2015; SURFEX, Masson et al., 2013) in which air temperature is continuously modelled over space and time could be a possible solution. However, these models still have some drawbacks: because of computational constraints, they take only a limited number of variables into account, making them less suitable for real-life applications (Rizwan et al., 2008); additionally, they often lack observational data to train and validate their simulations (Heaviside et al., 2017).

The rise of crowdsourced air temperature data, especially in urban areas, could be a promising solution to bridge this knowledge gap (Muller et al., 2015). Such data are obtained via a large number of nontraditional sensors, mostly set up by citizens (i.e. citizen science) (Muller et al., 2015; Bell et al., 2015). Crowdsourced datasets have already been successfully used for monitoring air temperature (Chapman et al., 2017; de Vos et al., 2020; Fenner et al., 2017; Napoly et al., 2018; Meier et al., 2017; Hammerberg et al., 2018; Feichtinger et al., 2020), rainfall (de Vos et al., 2019, 2020, 2017), wind speed (Chen et al., 2021; de Vos et al., 2020) and air pollution (EEA, 2019; Castell et al., 2017) within complex urban settings. However, due to their nontraditional measuring equipment and installation settings, the quality of datasets from these networks remains an issue (Bell et al., 2015; Napoly et al., 2018; Chapman et al., 2017; Meier et al., 2017; Muller et al., 2015; Cornes et al., 2020; Nipen et al., 2020). Quality uncertainty arises due to several issues: (1) calibration issues in which the sensor could be biased either before the installation or due to drift over time, (2) design flaws in which the design of the station makes it susceptible to inaccurate observations, (3) communication and software errors leading to incorrect or missing data, (4) incomplete metadata (Bell et al., 2015), and (5) unsuitable installation locations (Feichtinger et al., 2020; Cornes et al., 2020).

Recent studies have, therefore, highlighted the importance of performing data quality control in data processing applications (Båserud et al., 2020; Longman et al., 2018), especially before analysing crowdsourced air temperature data (Bell et al., 2015; Jenkins, 2014; Chapman et al., 2017; Meier et al., 2017; Napoly et al., 2018; Cornes et al., 2020; Nipen et al., 2020; Feichtinger et al., 2020). Jenkins (2014) and Bell et al. (2015) both conducted a field comparison in which multiple crowdsourced weather stations (CWSs) were compared with official (and thus professional) observation networks. Both found a pronounced positive instrument temperature bias during daytime with a strong relation to the incoming solar radiation. Thus, the use of crowdsourced data requires quality assurance and quality control (QA/QC) in order to both remove gross errors and correct station-specific instrument biases (Bell et al., 2015). Using the findings of Bell et al. (2015) as a basis, Cornes et al. (2020) corrected crowdsourced air temperature data across the Netherlands using radiation from satellite imagery and background temperature data from official stations belonging to the Royal Netherlands Meteorological Institute (KNMI). To investigate the UHI in London, UK, Chapman et al. (2017) used Netatmo weather stations and removed crowdsourced observations that deviated from the mean of all stations by more than 3 standard deviations. Meier et al. (2017) developed a detailed QC procedure for Netatmo stations using reference data from two official observation networks in Berlin, Germany. The QC consists of four steps, each identifying and removing suspicious temperature data. Their methods highlight the need for standard, calibrated and quality-checked sensors in order to assess the quality of crowdsourced data (Cornes et al., 2020; Chapman et al., 2017; Meier et al., 2017). Such official sensors are, however, not present in most cities, hindering the transferability of these QC methods.
To this end, Napoly et al. (2018) developed a statistically based QC method for Netatmo stations that was independent of official networks (the CrowdQC R package). The QC method was developed on data from Berlin (Germany) and Toulouse (France), and it was later applied to Paris (France) to demonstrate the transferability of this method. The procedure consists of four main and three optional QC levels, removing suspicious values, correcting for elevation differences and interpolating single missing values. As the CrowdQC-filtered dataset still contained some radiative errors, Feichtinger et al. (2020) combined the methods of Napoly et al. (2018) and Meier et al. (2017) to study a high-temperature period in Vienna in August 2018. Most recently, Fenner et al. (2021) presented the CrowdQC+ QC R package, which is a further development of the existing CrowdQC package developed by Napoly et al. (2018). The core enhancements deal with radiative errors and sensor response time issues (Fenner et al., 2021).
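The 3-standard-deviation filter attributed above to Chapman et al. (2017) can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `filter_three_sigma`, the dictionary layout, the station identifiers and the use of the sample standard deviation are assumptions made for illustration only.

```python
from statistics import mean, stdev


def filter_three_sigma(readings, n_sigma=3.0):
    """Keep readings within n_sigma standard deviations of the all-station mean.

    readings: dict mapping a (hypothetical) station id to the air
    temperature in deg C observed at a single time step.
    """
    values = list(readings.values())
    if len(values) < 2:
        # A spread cannot be estimated from fewer than two stations.
        return dict(readings)
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (assumption)
    if sigma == 0:
        # All stations agree exactly; nothing to remove.
        return dict(readings)
    return {
        station: temp
        for station, temp in readings.items()
        if abs(temp - mu) <= n_sigma * sigma
    }


# Illustrative values only: 19 stations near 20 deg C and one implausible reading.
example = {f"S{i:02d}": 20.0 + 0.1 * (i % 5) for i in range(19)}
example["S19"] = 35.0
kept = filter_three_sigma(example)
```

One property of this kind of filter is that the outlier itself inflates the mean and standard deviation. With the sample standard deviation, a single deviating station can only exceed the 3-standard-deviation limit when at least 11 stations report at that time step, so the check is only meaningful for reasonably dense networks.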

Current QC studies mostly identify and remove implausible temperature measurements (Chapman et al., 2017; Meier et al., 2017; Napoly et al., 2018), instead of correcting for known temperature biases (Cornes et al., 2020). We do, however, know that both the siting and the design of CWSs can introduce such a bias. By parameterising this bias, it can be learned and corrected for, thereby limiting the number of observations that is eliminated (Bell et al., 2015). Additionally, most QC procedures require data from official networks (Cornes et al., 2020; Chapman et al., 2017; Meier et al., 2017), although most cities do not have such measurements available (Muller et al., 2015). Lastly, previous research has also noted that biases can be station-specific; this is because the design of a CWS is an important uncertainty source (Bell et al., 2015), and it indicates the need for station-specific quality control methods. Thus, there is a need for station-specific quality control and correction methods, independent of official weather station networks.

Here, we report on a statistically based QC method for the crowdsourced air temperature data of the “Leuven.cool” network, a citizen science network of around 100 weather stations distributed across private gardens and (semi-) public locations in Leuven, Belgium. The Leuven.cool network is a uniform network in the sense that only one weather station type (Fine Offset WH2600) is used for the entire network. To our knowledge, no quality control method has been developed for this sensor type. The stations were installed following a strict protocol, extensive metadata are available, and both the dataflow and station siting are continuously controlled. This novel QC method removes implausible measurements, while also correcting for inter-station (between-station) and intra-station (station-specific) temperature biases. The QC method only needs an official network during its development and evaluation stage. Afterwards, the method can be applied independently of the official network that was used in the development phase. Transferring the method to other networks or regions would require the recalibration of the QC parameters. After applying this quality control and correction method, the crowdsourced Leuven.cool dataset becomes suitable to monitor local weather phenomena such as the urban heat island (UHI) effect.

The paper is organised as follows. Section 2 describes materials and methods, providing information on the study area, the crowdsourced (Leuven.cool) dataset and the official reference dataset. The development of the quality control method is explained in Sect. 3. In Sect. 4 the newly developed QC method is first tested on four crowdsourced stations installed next to three official stations from the Royal Meteorological Institute of Belgium (RMI). This allows us to quantify the data quality improvement after every QC level. In Sect. 5 the QC method is applied to a network of CWSs in Leuven, Belgium. Section 6 briefly focusses on the application potential of the dataset. Concluding remarks are summarised in Sect. 7.
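Quantifying the improvement after each QC level, as outlined for Sect. 4, amounts to comparing paired CWS and reference temperatures. The sketch below summarises such pairs as a mean difference and its spread. It assumes that the "±" values reported in the abstract denote the standard deviation of the differences, which this section does not state explicitly. The function name `bias_summary` and all numbers are hypothetical and are not taken from the dataset.

```python
from statistics import mean, stdev


def bias_summary(cws_temps, ref_temps):
    """Return (mean, standard deviation) of CWS minus reference temperature.

    Both inputs are equally long sequences of simultaneous observations.
    A difference in deg C is numerically equal to a difference in K.
    """
    if len(cws_temps) != len(ref_temps):
        raise ValueError("CWS and reference series must have the same length")
    diffs = [c - r for c, r in zip(cws_temps, ref_temps)]
    return mean(diffs), stdev(diffs)


# Made-up example values in deg C, for illustration only.
cws = [15.2, 18.9, 22.4, 12.1]
ref = [15.0, 18.1, 21.9, 12.2]
bias, spread = bias_summary(cws, ref)
```

Computing the same summary on the raw data and again after each QC level would show how much of the bias each level removes.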

2 Materials and methods

2.1 Study area

The QC method is developed for a citizen science weather station network, Leuven.cool, based in Leuven, Belgium ($50^{\circ}52'39''$ N, $4^{\circ}42'16''$ E; 65 m a.s.l.). The Leuven.cool project is a close collaboration between the University of Leuven (KU Leuven), the city of Leuven and the RMI that aims to measure the microclimate in Leuven and gain knowledge on the mitigating effects of green and blue infrastructures (Leuven.cool, 2020). Leuven has a warm temperate climate with no dry season and a warm summer (Cfb), no influence from mountains or seas, and overall weak topography (Kottek et al., 2006). It is the capital and largest city of the province of Flemish Brabant and is situated in the Flemish region of Belgium, 25 km east of Brussels, the capital of Belgium. The city comprises the districts of Leuven, Heverlee, Kessel-Lo, Wilsele and Wijgmaal, covering an area of 56.63 km². The main characteristics of the study area are summarised in Table 1.

Table 1. Main characteristics of the study area Leuven.


2.2 Leuven.cool dataset

Data from the Leuven.cool citizen science network are presented in this paper. The crowdsourced weather station network consists of 106 weather stations distributed across Leuven and surroundings. The meteorological variables are measured by low-cost WH2600 wireless digital consumer weather stations produced by the manufacturer Fine Offset (Fig. 1). The station specifications, as defined by the manufacturer, are summarised in Appendix A1. The weather station consists of an outdoor unit (sensor array) and a base station. The outdoor sensor array measures temperature (°C; add 273.15 to convert to K), humidity (%), precipitation (mm), wind speed (m s⁻¹), wind direction (°), solar radiation (W m⁻²) and UV (–) every 16 s. This outdoor sensor array transmits its measurements wirelessly, on the 868 MHz radio frequency, to the base station. This base station needs both power and internet (via a LAN connection) in order to send the data to a server. The data are forwarded to the Weather Observations Website (Kirk et al., 2020), a crowdsourcing platform initiated and managed by the UK Met Office. RMI participates in this initiative and operates its own WOW portal (Weather Observations Website – Belgium, 2022). The outdoor unit is powered by three rechargeable batteries which are recharged by a small built-in solar panel. A radiation shield protects both the temperature and humidity sensors from extreme weather conditions and direct exposure to solar radiation.


Figure 1. The WH2600 wireless digital weather station outdoor unit (a) at Mathieu de Layensplein in Leuven (LC-105) and (b) next to the official AWS equipment in Humain (LC-R05) (photograph credit: Maarten Reyniers).
