These authors contributed equally to this work.
Having a prediction model for air quality at a low computational cost can be useful for research, forecasting, regulatory, and monitoring applications. This is of particular importance for Latin America, where rapid urbanization has imposed increasing stress on the air quality of almost all cities. In recent years, machine learning techniques have been increasingly accepted as a useful tool for air quality forecasting. Out of these, random forest has proven to be an approach that is both well-performing and computationally efficient while still providing key components reflecting the nonlinear relationships among emissions, chemical reactions, and meteorological effects. In this work, we employed the random forest methodology to build and test a forecasting model for the city of Buenos Aires. We used this model to study the deep decline in most pollutants during the lockdown imposed by the COVID-19 (COronaVIrus Disease 2019) pandemic by analyzing the effects of the change in emissions, while taking into account the changes in the meteorology, using two different approaches. First, we built random forest models trained with the data from before the beginning of the lockdown periods. We used the data to make predictions of the business-as-usual scenario during the lockdown periods and estimated the changes in concentrations by comparing the model results with the observations. This allowed us to assess the combined effects of the particular weather conditions and the reduction in emissions during the period when restrictions were in place. Second, we used random forest with meteorological normalization to compare the observational data from the lockdown periods with the data from the same dates in 2019, thus decoupling the effects of the meteorology from short-term emission changes. 
This allowed us to analyze the general effect that restrictions similar to those imposed during the pandemic could have on pollutant concentrations, and this information could be useful to design mitigation strategies.
The results during testing showed that the model captured the observed hourly variations and the diurnal cycles of these pollutants with a normalized mean bias of less than 6 % and Pearson correlation coefficients of the diurnal variations between 0.64 and 0.91 for all the pollutants considered. Based on the random forest results, we estimated that the lockdown implied relative changes in concentration of up to
In recent times, machine learning has been proven to be an efficient approach for air quality prediction by relying on historical data to estimate the temporal variability in different pollutants for a specific site at a low computational cost. Also, this kind of model has the ability to unravel underlying patterns in data and deal with complex interactions among predictive variables
During the last decade, random forest (RF) arose as a new method for the prediction of mean values of atmospheric pollutants
The outbreak of the COVID-19 pandemic at the end of 2019, with its devastating consequences in terms of loss of life and the economic impact, has caused many governments around the world to impose different degrees of lockdown. For atmospheric scientists, it has also provided a unique opportunity to examine changes in air pollution under decreased emission levels, in what
An early approach to analyzing the changes in air quality due to the implementation of specific control measures was to comparatively assess the concentrations during the lockdown with concentrations from the same period during the previous year or the mean value of a period of 5 years using exclusively ground-based or satellite observations. However, the degree to which the COVID-19 lockdown influenced air quality is not only a function of emissions but also of both meteorology and physical and chemical atmospheric transformations
The goals of this study were (i) to provide novel air quality data for the metropolitan area of Buenos Aires (MABA), Argentina, including the first
The remainder of this paper is structured as follows. Section
The MABA comprises the autonomous city of Buenos Aires (ACBA) and 40 surrounding districts of greater Buenos Aires (GBA). Located along the western coast of the Río de la Plata estuary, on a flat plain, the MABA is the third-biggest megalopolis in Latin America and the Caribbean. It has a population of approximately
In terms of anthropogenic air pollutant emissions, road transportation is clearly the largest contributor of CO, VOCs, and PM in the area. The MABA is also affected by the emissions from residential, commercial, and institutional buildings, mainly based on natural gas consumption, and from three power plants located near the shoreline of the La Plata River, which mainly burn natural gas and, to a lesser extent, gas oil and fuel oil. Under these circumstances,
Argentina's national government established different lockdown phases for the duration of the pandemic
Considering the different degrees of the restrictions imposed, we evaluated the impact of the lockdown on air quality according to two distinct periods. The first period, from 20 March to 12 April 2020, corresponded to the most restrictive lockdown (LD). The second period, from 13 April to 25 May, was denoted a partial lockdown (PLD) because some restrictions were lifted. The period of 1–15 March 2020, before the start of the first lockdown, was defined as BLD (before LD) and was used to evaluate the model. As of 16 March, flexible restrictions started but were optional; therefore, the period 16–19 March was not considered in our research.
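The period definitions above can be written as a small lookup, which may help when tagging hourly records before model evaluation. This is a minimal sketch: the dates and the BLD, LD, and PLD labels come from the text, while the function name `label_period` and the use of the NU (not used) label for 16–19 March are illustrative assumptions.

```python
from datetime import date

# Period boundaries as stated in the text (both ends inclusive).
PERIODS = [
    ("BLD", date(2020, 3, 1), date(2020, 3, 15)),   # before lockdown, used for evaluation
    ("NU", date(2020, 3, 16), date(2020, 3, 19)),   # optional restrictions, not used
    ("LD", date(2020, 3, 20), date(2020, 4, 12)),   # most restrictive lockdown
    ("PLD", date(2020, 4, 13), date(2020, 5, 25)),  # partial lockdown
]


def label_period(day):
    """Return the period label for a calendar day, or None if it is outside all periods."""
    for label, start, end in PERIODS:
        if start <= day <= end:
            return label
    return None
```

Records labeled NU or None would simply be left out of the lockdown comparison.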
Because combustion is the main air pollution source in the area, the significant decrease in traffic flow imposed by the lockdown led necessarily to a decrease in the emissions of traffic-related pollutants
Description of the lockdown phases in MABA. NU is for not used (i.e., not included in the model).
The atmospheric general circulation in the MABA is controlled by the influence of the semi-permanent South Atlantic high-pressure system. This system influences the climate of the MABA throughout the year by bringing in moist winds from the northeast, which produce most of the precipitation in the area in the form of frontal systems, or storms produced by cyclogenesis, in autumn and winter
To identify similarities and differences between the meteorological conditions during the testing period and the lockdown phases (BLD, LD, and PLD) and those of the autumn of 2019 (March, April, and May – MAM2019), we carried out a meteorological analysis for all the periods. We used hourly and daily data from the Buenos Aires Central Observatory (OBS; lat 34
Average temperatures in the BLD (24.4
During MAM2019, the average calm value was 6.7 %, while during the BLD, the LD, and the PLD, the corresponding calm values were 3.6 %, 4.7 %, and 8.6 %. Average wind velocity, within the 7.5–8.6
Our analysis showed that there were meteorological differences in terms of temperature and precipitation between autumn 2019 and the periods analyzed in 2020 (BLD, LD, and PLD). This indicates that the influence of meteorological conditions must be taken into account when comparing the air quality of the different periods.
We employed air quality data from two monitoring sites, namely the Comisión Nacional de Energía Atómica (CNEA), operated by our research group, and Parque Centenario (PC), managed by the autonomous city of Buenos Aires (described below). Both sites are mostly influenced by the emissions from mobile and residential sources, and, to a lesser extent, by thermal power plants located at least at 6
Location of the MABA in Argentina
From 23 February 2019 to 26 May 2020, a monitoring campaign was carried out in an open area (
The main goal of this monitoring campaign was to assess the temporal variability of
Air pollutant concentrations were continuously acquired (Table
Data were recorded as 1 min averages. Unfortunately, from 26 May onwards, restrictions on entering the institute where the monitoring station was located forced us to suspend the monitoring campaign.
Description of the equipment used at the CNEA site.
To include aerosol variations in this analysis and complement the information of the CNEA site, we used
Relatively low concentration values for all the analyzed periods, with no exceedances for the short-term air quality standard for all the pollutants measured
CO and
With respect to the pollutants that were only measured in CNEA,
We used the random forest machine learning method to (i) estimate the relative changes during the LD and the PLD phases and (ii) develop a model for air quality forecasting for the MABA at a low computational cost. To this end, we implemented two different approaches using a random forest algorithm (Fig. 2). The first one estimates the pollutant concentrations that would have occurred in the MABA under regular emission conditions (business-as-usual, or BAU, scenario) with the particular meteorological conditions that occurred during the period analyzed. This model, named the random forest predictive model or simply RF, was applied to the LD and the PLD phases to estimate the concentrations that would have occurred if no lockdown measures had been imposed and to compare them with the observations during the lockdown phases. This tool could also be used to forecast the air quality situation in the city. The second approach, referred to as RF normalized or RFN, was designed to decouple the effects of the meteorology by normalizing the meteorological variables, allowing a generalized assessment of the effect of the changes in the emission patterns. This technique was applied to compare the concentrations of the different lockdown periods to those of the same time frames in 2019 in order to infer the effects of the sudden reduction in emissions during the COVID-19 mobility restriction period. A summarized schematic of the modeling approach can be seen in Fig.
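To illustrate the idea behind the RFN approach, the sketch below averages model predictions over randomly resampled meteorology while keeping the remaining predictors fixed. It assumes a resampling-style normalization, which is one common way to implement it; the text does not spell out the exact procedure, so this should not be read as the authors' implementation. The function `normalized_prediction`, its parameters, and the `predict` callable (a stand-in for any trained model) are hypothetical names.

```python
import random


def normalized_prediction(predict, row, met_pool, met_keys, n_samples=300, seed=0):
    """Average predictions for one time step over randomly resampled meteorology.

    predict  : callable taking a dict of predictors and returning a float
               (assumed stand-in for a trained random forest)
    row      : dict of predictors for the time step being normalized
    met_pool : list of dicts holding observed meteorology to sample from
    met_keys : names of the predictors treated as meteorological
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        sampled = rng.choice(met_pool)
        trial = dict(row)  # non-meteorological predictors stay fixed
        for key in met_keys:
            trial[key] = sampled[key]  # overwrite with resampled meteorology
        total += predict(trial)
    return total / n_samples
```

Because the meteorological predictors are overwritten before each prediction, two time steps that differ only in their weather receive the same normalized value, which is the sense in which the meteorology is decoupled from the emission signal.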
Schematic description of the model building and evaluation.
Observations from February 2019 to May 2020 were divided into different groups, following the methodology by
The target variables were the measured air pollutant concentrations in each monitoring site, namely CO, NO,
As predictive variables, we considered (i) data taken from the Argentine National Weather Service, namely wind speed, wind direction, surface temperature, sea level pressure, and relative humidity, (ii) boundary layer height and total cloud cover taken from ERA5
For the RF normalized, all variables were used, and only the meteorological variables were normalized, following the approach described in
Random forest model with the target variables, predictors, and hyperparameters for RF.
rh2 is the 2 m relative humidity, slp is the sea level pressure, t2 is the 2 m air temperature,
The RF model was tested for adequate performance, focusing on the reproduction of (i) the hourly concentrations, (ii) the mean diurnal cycles, and (iii) the mean value. For each pollutant, the normalized mean bias (NMB) and the Pearson correlation coefficient (
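The two evaluation statistics named above can be computed as in the following sketch. It assumes the commonly used definition of the normalized mean bias, NMB = Σ(M − O) / ΣO, with M the modeled and O the observed values; the exact formula used in the study is not reproduced in the text, so treat this as an assumption. The function names are illustrative.

```python
from math import sqrt


def normalized_mean_bias(modeled, observed):
    """NMB = sum(M_i - O_i) / sum(O_i); multiply by 100 to express it in percent."""
    return sum(m - o for m, o in zip(modeled, observed)) / sum(observed)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)
```

For the diurnal-cycle evaluation, the same functions can be applied to the 24 hourly means of the modeled and observed series instead of the full hourly record.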
To detect, locate, and characterize different pollution sources
Partial dependency plots were also built using the
In general, modeled CO, NO,
Summary of the evaluation statistics used in the random forest predictive model for the testing dataset (
Mean diurnal cycles for the testing dataset and for the evaluation period (BLD). The line represents the average diurnal cycle, and the shaded area represents the standard deviation.
Average daily concentrations for CNEA and PC sites. The line represents the 24 h average concentration, and the shaded area represents the daily levels between the 25th and 75th percentile.
The results for
Figure
One of the advantages of building a random forest model is that it could provide the key components that reflect the nonlinear relationship among the emissions, the chemistry, and the meteorology by analyzing variables such as the permutation difference (variable importance; Figs.
Variable importance plot (permutation difference) for CO variables (ppm) in CNEA
Partial dependency plots (Figs. S9 to S11) shed light on the relationships between pollutant concentrations and temperature. As an example, in CNEA, while CO, NO, and
We discuss here the relative changes in the (1) measured concentrations during the LD and the PLD periods, in comparison to the RF outputs for the same period, and (2) normalized measured concentrations during the LD and the PLD with normalized concentrations during the same periods but for 2019 (20 March to 12 April and 13 April to 25 May). The corresponding percent relative changes (RC
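A percent relative change between two period means can be sketched as follows. The formula assumed here, 100 × (observed − reference) / reference, is the usual definition and is not taken from the text; the reference would be the RF business-as-usual mean in the first comparison and the normalized 2019 mean in the second. The function name is illustrative.

```python
def relative_change_percent(observed_mean, reference_mean):
    """Percent change of the observed period mean relative to a reference period mean."""
    if reference_mean == 0:
        raise ValueError("reference mean must be non-zero")
    return 100.0 * (observed_mean - reference_mean) / reference_mean
```

With this sign convention, a negative value means that the observed concentrations were lower than the reference.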
Since both monitoring sites had been highly influenced by vehicular emissions, the traffic reduction of
On the other hand, observed
Summary of the average concentrations for the BLD, the LD, and the PLD and the relative changes for the LD and the PLD for PC and CNEA sites estimated by RF and RFN. In every case, the RCs were calculated by considering the mean value for each period. In every case, Obs refers to observed concentrations.
Mean diurnal cycle for the different pollutants for the LD (from 20 March to 13 April 2020) and the PLD (13 April to 25 May 2020) for both sites. The line represents the average diurnal cycle, and the shaded area represents the standard deviation.
Mean diurnal cycle for different pollutants for the LD, PLD, and MAM2019 periods, with meteorological normalization for both sites.
Bivariate polar plot for CNEA of hourly means for observations during MAM2019 and the lockdown periods versus the BAU scenario estimated with the RF model. The radial axis represents wind speed, the angular axis represents wind direction, and the color scale represents pollutant concentrations.
Figure
In what follows, the results are presented by species, highlighting the most relevant relative changes in concentrations. The results of the meteorological normalization are used to evaluate the effects of the changes in emissions of particular pollutants as a consequence of the restrictions previously discussed. Bivariate polar plots were used to distinguish potential sources that impact the monitoring sites (Figs.
Bivariate polar plot for PC of hourly means for observations during MAM2019 and the lockdown periods versus the BAU scenario estimated with the RF model. The radial axis represents wind speed, the angular axis represents wind direction, and the color scale represents pollutant concentrations.
As shown in Table
In PC, the recovery of traffic during the PLD (
The observed CO had lower concentration values and flatter diurnal patterns than our simulations of a BAU scenario (Fig.
As shown in Fig.
An equivalent analysis for PC (Fig.
The drastic reduction in vehicular emissions impacted positively on the NO and
At both sites, the relative changes in nitrogen oxide levels were larger than those of CO. Arguably, this indicates that the power plants did not contribute in any major way to the observed differences. This is probably due to a reduced circulation of diesel vehicles, which are the major nitrogen oxide emitters
We also see a flattening of the diurnal cycles of NO during the LD, both in the RF predictive model and in the analysis with normalized meteorology (Figs.
Figure
In the case of PC, as shown in Fig.
In contrast to the other pollutants considered, the
Recent studies of the lockdown effects on atmospheric composition have also reported large
It is well known that decreasing nitrogen oxides levels in a VOC-limited regime tend to increase
With respect to the role of aerosols in the
In this case, meteorological factors were clearly highly relevant, as can be seen by the fact that the relative change estimated with the RFN model is far smaller (27 % for
When the meteorology is normalized, the valleys at 07:00 and 20:00 are clearly less marked during 2020 than during 2019 and almost disappeared during the LD when compared with the normalized values for the same period in the previous year (Fig.
As expected, the bivariate polar plots (Fig.
From these results, we can also derive that the area in which the CNEA site is located behaves as a region with a VOC-limited chemical regime because the reduction in
During the LD, the
While all other species in this study are mostly controlled, directly or indirectly, by on-road traffic emissions, according to our findings,
Another possible reason for having a smaller relative change in
During the LD,
When winds are taken into account (Fig.
Although, as expected, most pollutants were noticeably reduced during the LD due to the restrictions imposed,
This highlights the importance of having comprehensive air quality policies rather than focusing on reducing individual pollutants.
Hourly concentrations of CO, NO,
In this study, we present novel air quality data for a residential site located in the metropolitan area of Buenos Aires that includes concentrations of CO, NO, and
The resulting set of explanatory variables for the different pollutants at each site provides evidence of the need for careful variable identification during the training period. Although, in principle, inexperienced users of random forest models could identify the best explanatory variables by trial and error with the support of variable importance plots, expert judgment is advisable for a meaningful and relatively fast selection. The RF model was able to reproduce air quality observations at two monitoring stations in the MABA when evaluated for a 15 d period prior to the outbreak of the COVID-19 pandemic. This approach allowed us to predict the pollutant hourly mean values with a mean bias of less than 10 % by using the data of air quality, emissions, and meteorology and to analyze the effect of wind direction and speed on pollutant concentrations, which is useful when characterizing pollution sources. During the lockdown, all primary pollutants had lower concentrations than what the RF framework would predict for a business-as-usual scenario. The relative change ranged from
RF estimations can be implemented at a low computational cost and can be used to assess the changes that occurred in a specific period if an anomalous situation happened. They can also be used to forecast air quality conditions in the short term at a lower cost than CTMs, which could be of use for local authorities, considering that the MABA has, thus far, only six long-term air quality monitoring stations. When, as in this case, detailed temporal information on different emission sources is lacking (for example, traffic information from on-road sensors), it is essential to use a set of data in which the emissions are similar to those that are expected to be simulated.
The model also allows the analysis of the relations between different pollutants, which is of particular interest for those that have very complex chemistry, such as
To assess the effectiveness of a particular measure on air quality (AQ) independently of the particular meteorological conditions of specific periods, a meteorological normalization technique based on random forest can be used. This approach is relatively simple to implement with already existing R packages. Although previous studies employed both techniques with similar aims, we postulate that the RF predictive model and the meteorological normalization serve different purposes and should be used accordingly. The predictive model can be used to analyze the changes for particular weather conditions or, combined with a meteorological forecast, to forecast pollutant concentrations. On the other hand, the meteorological normalization makes it possible to evaluate the general impact on concentrations due to changes in emissions, decoupling the effects of particular meteorological conditions from the short-term emission changes in the AQ datasets.
In this work we provide the first year-long in situ observational dataset on tropospheric
According to our measurements, the MABA seems to be in a VOC-limited regime. If VOC emissions are not carefully regulated, a
The supplement related to this article is available online at:
MDR and LD conceptualized the study. DA and MDO acquired the data for CNEA, and MDR retrieved the data from the Argentine National Weather Service and the environmental authority of the city of Buenos Aires. MDR validated the data. MDR and DA curated the data. MDR and PL analyzed the data, and CR contributed with the meteorological analysis. MDR, PL, DG, and LD performed the formal analysis. DG and LD supervised the project and acquired the funding. MDR, PL, DA, MDO, CR, DG, and LD wrote the original draft. MDR, PL, MDO, PC, DG, and LD reviewed the draft. MDR, PL, MDO, DG, and LD were part of the editing process.
The contact author has declared that none of the authors has any competing interests.
Publisher’s note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This article is part of the special issue “Benchmark datasets and machine learning algorithms for Earth system science data (ESSD/GMD inter-journal SI)”. It is not associated with a conference.
We want to acknowledge the participation of the entire Atmospheric Chemistry group of the National Atomic Energy Commission of Argentina (CNEA), who continued with the campaign even during the lockdown. The authors wish to thank the Environmental Protection Agency of Buenos Aires (APRA) and the Argentine National Weather Service (SMN) for sharing the air quality and meteorological data for this study. We also thank the editor and reviewers for their comments and recommendations, which helped to improve the paper.
This research has been supported by CNEA (grant no. CNEA-GQ-20192), the Agencia Nacional de Promoción Científica y Tecnológica, Argentina (grant nos. PICT-O 2016-4802 and PICT 2016-3590), and the EU Horizon 2020 Marie Skłodowska-Curie project PAPILA (grant no. 777544; MSCA action for research and innovation staff exchange).
This paper was edited by Nellie Elguindi and reviewed by two anonymous referees.