A machine learning approach to address air quality changes during the COVID-19 lockdown in Buenos Aires, Argentina

Having a prediction model for air quality at a low computational cost can be useful for research, forecasting, regulatory, and monitoring applications. This is of particular importance for Latin America, where rapid urbanization has imposed increasing stress on the air quality of almost all cities. In recent years, machine learning techniques have been increasingly accepted as a useful tool for air quality forecasting. Out of these, random forest has proven to be an approach that is both well-performing


Introduction
In recent times, machine learning has been proven to be an efficient approach for air quality prediction by relying on historical data to estimate the temporal variability in different pollutants for a specific site at a low computational cost. Also, this kind of model has the ability to unravel underlying patterns in data and deal with complex interactions among predictive variables (Stafoggia et al., 2020).
During the last decade, random forest (RF) arose as a new method for the prediction of mean values of atmospheric pollutants (Yu et al., 2016; Feng et al., 2019; Jiang and Riley, 2015). This is a supervised machine learning method, consisting of applying multiple tree classifiers created at random, using bagging (i.e., selecting samples stochastically to create new datasets from which every classification tree is created). RF requires a short training time and can provide reliable information on air quality, with a strong anti-overfitting ability. Many data science programming languages have libraries in which random forest is already efficiently implemented (e.g., scikit-learn in Python or randomForest in R). In terms of computational cost, random forest is faster and cheaper than other available models, such as regional chemical transport models (CTMs); it also needs fewer input variables and is a useful method when information on air pollutant concentrations at a particular site is needed. According to Masih (2019), machine learning techniques may even provide better forecasting than CTMs, and, out of the different existing algorithms, random forest seems to stand out due to its simplicity and the quality of its results, which can account for nonlinear relationships between emissions, chemical reactions, and meteorological effects. With respect to complex reactive species, the random forest method has also been successfully used to assess O 3 levels. For example, Zhan et al. (2018) satisfactorily applied the random forest method to predict spatiotemporal variability in daily O 3 concentrations across China using information on meteorology, elevation, and emission inventories. One of the most recent applications of machine learning methods has been aimed at elucidating the interconnection among the COVID-19 pandemic lockdown measures, human mobility, and air quality (Rahman et al., 2021; Velders et al., 2021; Yang et al., 2021).
The outbreak of the COVID-19 pandemic at the end of 2019, with its devastating consequences in terms of loss of life and the economic impact, has caused many governments around the world to impose different degrees of lockdown. For atmospheric scientists, it has also provided a unique opportunity to examine changes in air pollution under decreased emission levels, in what Gaubert et al. (2021) called an unintentional worldwide experiment. Many studies have, in general, identified significant decreases in most pollutants, except for O 3 , under the stay-at-home orders imposed in an attempt to curb the spread of COVID-19 (Muhammad et al., 2020; Faridi et al., 2021; Srivastava, 2021; Grange et al., 2021; Yang et al., 2021). These drastic changes in anthropogenic emissions are of major interest to enhance our understanding of the chemistry related to air quality, particularly when the behavior of secondary pollutants, like ozone (O 3 ) or components of particulate matter (PM), is explored. O 3 , in particular, has a complex behavior depending on multiple factors. Nitrogen monoxide (NO) and nitrogen dioxide (NO 2 ) together make up NO x , which, along with volatile organic compounds (VOCs), plays a vital role in the O 3 formation process; O 3 production can be either VOC limited or NO x limited (Shi and Brasseur, 2020; Liu et al., 2021; Li et al., 2019).
An early approach to analyzing the changes in air quality due to the implementation of specific control measures was to comparatively assess the concentrations during the lockdown with concentrations from the same period during the previous year or the mean value of a period of 5 years using exclusively ground-based or satellite observations. However, the degree to which the COVID-19 lockdown influenced air quality is not only a function of emissions but also of both meteorology and physical and chemical atmospheric transformations (Kroll et al., 2020; Le et al., 2020). Consequently, pure statistical tests or observational comparisons might be inadequate in providing a complete understanding of what influences pollutant concentrations, since weather conditions, particle persistence, transport, radiation, and seasonality affect concentrations by linear and nonlinear processes (Šimić et al., 2020). In this work, this challenge has been addressed using two different but complementary approaches. The first one consists of using a model to simulate a hypothetical scenario in which the restrictions were not implemented, which we did using the random forest (RF) algorithm, as previously done by Velders et al. (2021). The second one consists of a random-forest-based normalization of the meteorological variables, which makes it possible to decouple the emission changes (Grange and Carslaw, 2019; Vu et al., 2019).
The goals of this study were (i) to provide novel air quality data for the metropolitan area of Buenos Aires (MABA), Argentina, including the first O 3 and SO 2 observational datasets in a residential area in more than a decade, (ii) to explore the performance of the random forest method in predicting the air quality situation at two monitoring sites in the MABA, (iii) to apply this methodology to estimate the changes in air pollutant concentrations under the COVID-19 control measures, and (iv) to assess the effect of the reduction in emissions by normalizing the meteorological variables. We implemented the RF algorithm to estimate the concentrations of CO, NO 2 , NO, sulfur dioxide (SO 2 ), O 3 , and particles with an aerodynamic diameter less than or equal to 10 µm (PM 10 ) using meteorological and air quality observations in addition to the local diurnal variation in emissions as explanatory variables. Trained with data acquired in 2019 and 2020 before the start of the pandemic with the variables available for this city, the RF method can only predict concentrations under a business-as-usual (BAU) scenario. We then compared these BAU estimations with the observations during two distinct lockdown phases. We also used a random forest normalization (RFN) technique to decouple the effects of the meteorology on the concentration of the pollutants by normalizing the meteorological variables based on Shi et al. (2021). We compared the normalized observations with those for the same period in the previous year, allowing us to assess the effect of reducing the emissions independently of the particular meteorological situation that occurred during the specific periods analyzed. In addition, we studied the responses of O 3 to the reduction in emissions of its precursors (NO x and VOCs) because of its relevance regarding emission control and health effects.
The remainder of this paper is structured as follows. Section 2 provides a description of the studied area, the different lockdown phases, the air quality and meteorological data, and the structure of the random forest models used to estimate the relative changes (RCs) during the lockdown. The analysis of the model performance and the analysis of the impact due to the emission reductions are given in Sect. 3. Section 4 provides a description of the data and code availability. Finally, Sect. 5 presents a summary and the main conclusions of this work.

Description of the studied area
The MABA comprises the autonomous city of Buenos Aires (ACBA) and 40 surrounding districts of greater Buenos Aires (GBA). Located along the western coast of the Río de la Plata estuary, on a flat plain, the MABA is the third-biggest megalopolis in Latin America and the Caribbean. It has a population of approximately 13 × 10 6 , with a heterogeneous population density in the range of 14 000-20 000 inhabitants per kilometer squared. Its active fleet reached 5.4 × 10 6 vehicles by 2019 (Anapolsky, 2020).
In terms of anthropogenic air pollutant emissions, road transportation is clearly the largest contributor of CO, VOCs, and PM in the area. The MABA is also affected by the emissions from residential, commercial, and institutional buildings, mainly based on natural gas consumption, and from three power plants located near the shoreline of the La Plata River, which mainly burn natural gas and, to a lesser extent, gas oil and fuel oil. Under these circumstances, NO x is emitted by stationary and mobile sources in similar amounts. Since most of Buenos Aires' vehicle fleet uses low-sulfur fuel, the majority of the SO 2 emissions are due to heavy-duty diesel engines used by ships, trucks, and, occasionally, small electricity generators.

Description of the lockdown for the MABA
Argentina's national government established different lockdown phases for the duration of the pandemic (Decree 297/2020, 2020). Since 80 % of Argentina's COVID-19 cases were concentrated in the MABA, some policies that were applied to the MABA region differed from those applied to the rest of the country. Starting on 20 March 2020, strict measures were imposed to avoid a sharp increase in COVID-19 cases, emphasizing that the population should stay at home and avoid any social contact. All non-essential stores, including toy, furniture, and clothing stores, were closed until 11 May. Table 1 provides a summary of the restrictions set for the MABA during each phase. Under severely restricted mobility, public transport and passenger car circulation decreased drastically. Local mobility dropped by 80 % during the intense lockdown phase and 65 % during the flexible lockdown phase until the end of May (Aktay et al., 2020). It is worth noting that, before the COVID-19 pandemic, 1 × 10 6 vehicles entered the city of Buenos Aires from the suburbs per day.
Considering the different degrees of the restrictions imposed, we evaluated the impact of the lockdown on air quality according to two distinct periods. The first period, from 20 March to 12 April 2020, corresponded to the most restrictive lockdown (LD). The second period, from 13 April to 25 May, was denoted a partial lockdown (PLD) because some restrictions were lifted. The period of 1-15 March 2020, before the start of the first lockdown, was defined as BLD (before LD) and was used to evaluate the model. As of 16 March, flexible restrictions started but were optional; therefore, the period 16-19 March was not considered in our research.
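The period definitions above translate directly into a date lookup. The following is a minimal sketch of such a lookup, not the code used in this study; the names `PERIODS` and `label_period` are hypothetical, and the inclusive treatment of the end dates is an assumption based on the wording of the text.

```python
from datetime import date

# Period boundaries as stated in the text (year 2020, both ends inclusive).
PERIODS = {
    "BLD": (date(2020, 3, 1), date(2020, 3, 15)),   # before lockdown, used for evaluation
    "LD": (date(2020, 3, 20), date(2020, 4, 12)),   # most restrictive lockdown
    "PLD": (date(2020, 4, 13), date(2020, 5, 25)),  # partial lockdown
}


def label_period(day):
    """Return 'BLD', 'LD', 'PLD', or None for a date outside all three periods."""
    for name, (start, end) in PERIODS.items():
        if start <= day <= end:
            return name
    return None  # e.g., 16-19 March 2020, which is excluded from the analysis


# Number of hourly values in the 15 d BLD period.
bld_start, bld_end = PERIODS["BLD"]
bld_hours = ((bld_end - bld_start).days + 1) * 24
```

With hourly data, the 15 d BLD period contains 15 × 24 = 360 values, which is the number of BLD data points used to evaluate the model.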
Because combustion is the main air pollution source in the area, the significant decrease in traffic flow imposed by the lockdown necessarily led to a decrease in the emissions of traffic-related pollutants (D'Angiola et al., 2010; Puliafito et al., 2017; Diaz Resquin et al., 2018; Castesana et al., 2022).

Meteorological description
The atmospheric general circulation in the MABA is controlled by the influence of the semi-permanent South Atlantic high-pressure system. This system influences the climate of the MABA throughout the year by bringing in moist winds from the northeast, which produce most of the precipitation in the area in the form of frontal systems, or storms produced by cyclogenesis, in autumn and winter (Barros et al., 2006). In terms of the climate conditions of the MABA, temperatures at the beginning of autumn range from warm to hot in the afternoon, but they are mild during the nights and in the mornings. Later on in the season, conditions are cooler, featuring mild afternoons and cold nights and mornings.
To identify similarities and differences between the meteorological conditions during the lockdown phases and the testing period (BLD, LD, and PLD) and those of the autumn of 2019 (March, April, and May; MAM2019), we carried out a meteorological analysis for all the periods. We used hourly and daily data from the Buenos Aires Central Observatory (OBS; lat 34°35′ S, long 58°29′ W). This site of the Argentine National Weather Service is located in a residential area. It is representative of the meteorology affecting the air quality conditions under study.
Average temperatures in the BLD (24.4 °C) and in the LD (21.1 °C) were higher than those in MAM2019 (18 °C), while the average temperature in the PLD (16.8 °C) was lower than that in MAM2019 but close to the corresponding value in May 2019 (16 °C). Precipitation in March and April 2020 exceeded the accumulated values of the same months in 2019 (+60 % and +90 %, respectively). On the contrary, precipitation in May 2020 exhibited significantly lower values than those of 2019 (−75 %).
During MAM2019, the average calm value was 6.7 %, while during the BLD, the LD, and the PLD, the corresponding calm values were 3.6 %, 4.7 %, and 8.6 %. Average wind velocity, within the 7.5-8.6 km h −1 range, was similar in all periods. In autumn 2020, the prevailing wind was from the NW-N sector, with an average contribution of 34 % compared to 26.5 % in 2019. The LD and the PLD periods had a similar direction of prevailing winds as in autumn 2019. In contrast, 45 % of winds during the BLD were from the NE-E sector.
Figure caption: In yellow, the location of OBS, the Argentine National Weather Service monitoring site referred to in this study, and, in red, the air quality monitoring sites (shapefiles from IGN, 2021) are shown.
Our analysis showed that there were meteorological differences in terms of temperature and precipitation between autumn 2019 and the periods analyzed in 2020 (BLD, LD, and PLD). This is indicative of the need to take into account the influence of meteorological conditions for comparative purposes of air quality conditions that occurred in the different periods.

Air quality data
We employed air quality data from two monitoring sites, namely the Comisión Nacional de Energía Atómica (CNEA), operated by our research group, and Parque Centenario (PC), managed by the autonomous city of Buenos Aires (described below). Both sites are mostly influenced by the emissions from mobile and residential sources, and, to a lesser extent, by thermal power plants located at least 6 km from them (Diaz Resquin et al., 2018; Pineda Rojas et al., 2020).
Table 1 (fragment). District-differentiated lockdown II (PLD): All non-essential stores, including toy, furniture, and clothing stores, were permitted to open with specific protocols. Children were allowed to go for a walk with an accompanying adult at the weekends but no further than 500 m from home.

NO, NO 2 , SO 2 , and O 3 ) and their temporal variability in a residential area of the MABA. The main goal of this monitoring campaign was to assess the temporal variability of SO 2 and O 3 in the area for an entire year. Although it may seem surprising, especially for a megacity like the MABA, there is scarce and fragmentary information on the concentrations of SO 2 and O 3 . Presently, O 3 is routinely monitored in one site of the MABA, which is located in an industrial area. Past data for the region are only available from a few short-time campaigns carried out in the early 2000s (Reich et al., 2006). Similarly, there is a lack of monitored SO 2 concentrations because historical measurements carried out in the 1990s reported very low values, and therefore, the decision-makers decided not to measure this pollutant on a regular basis. However, it has now become a pollutant of concern for local authorities, who have recently decided to start monitoring SO 2 in two of the four air quality stations of the ACBA in the near future.
Air pollutant concentrations were continuously acquired (Table 2). Monitors were placed at an approximate height of 10 m and approximately 100 m east of a main traffic artery with a high density of buses, light-duty trucks, and passenger cars. Another main artery is located 500 m north, carrying vehicles, including trucks and buses, in a low-speed, stop-and-go pattern. The Jorge Newbery City Airport, two thermal power plants, the La Plata River, and the port are located within a 19 km radius of the monitoring station.
Data were registered as 1 min averages. Unfortunately, from 26 May onwards, restrictions on entering our institute, where the monitoring station was located, forced us to suspend the monitoring campaign.

Parque Centenario
To include aerosol variations in this analysis and complement the information of the CNEA site, we used PM 10 , CO, NO, and NO 2 data from the PC station (34.61° S, 58.44° W), which is one of the surface air quality sites of the Environmental Protection Agency of Buenos Aires city (APRA). This site is located in a residential-commercial area, with medium vehicular flow and a relatively low incidence of stationary sources. A monthly technical report of the hourly averaged concentrations registered in PC is available at the APRA website (https://data.buenosaires.gob.ar/dataset/calidad-aire; APRA, 2021). Although the city has three other monitoring stations, at least one of the essential periods needed for this study was missing in each of their datasets. Therefore, they were not taken into account for this study.
Table 2 (fragment). Cross-flow modulation type with reduced chemiluminescence detector. 0.099 ppmv ± 1.4 % diluted (NO).
* The calibration of the ambient air gas detectors was performed by following U.S. Environmental Protection Agency (EPA) regulations and HORIBA standard procedures (see U.S. EPA CFR 40 Part 50, Appendix A1, C, D and F, and the corresponding user manual for the HORIBA AP devices). The APMA-370, APSA-370, and APNA-370 devices were calibrated using EPA certified calibration gases and diluted with an Environics 6103, a National Institute of Standards and Technology (NIST) traceable mass flow controller dilutor, when needed. The APOA-370 device was calibrated with ozone generated with the Environics 6103 NIST traceable internal UV-based ozone generator. Note: ppmv is parts per million by volume.

Summary of the datasets
Relatively low concentration values for all the analyzed periods, with no exceedances for the short-term air quality standard for all the pollutants measured (Decree 1074/18, 2018; Act 1356, 2004), were registered in both sites. Air pollutants, except for SO 2 , exhibited well-defined diurnal cycles (see Fig. S2 in the Supplement).
CO and NO x patterns were governed by traffic emissions (Figs. S1 and S2), with the maximum values occurring in winter. Annual mean values of NO x were ∼ 37 ppb (parts per billion) for both CNEA and PC. Relevant differences in CO were identified, with annual mean levels in PC doubling those measured in CNEA (0.51 ppm, parts per million, versus 0.26 ppm).
PM 10 , which was only measured in PC, had a mean value of 21 µg m −3 , with the maximum values at noon.
With respect to the pollutants that were only measured in CNEA, SO 2 maximum concentrations were registered during autumn (April), with monthly averages in the 2-2.9 ppb range. In terms of O 3 concentrations, maximum daylight levels were registered during summer. The diurnal cycle presented higher levels during the afternoon and was the opposite to those of NO and NO 2 .

Modeling approach
We used the machine learning random forest method to (i) estimate the relative changes during the LD and the PLD phases and (ii) develop a model for the air quality forecast for the MABA at a low computational cost. To this end, two different approaches have been implemented using a random forest algorithm (Fig. 2). The first one estimates the hypothetical pollutant concentrations that would have occurred in the MABA under the regular emissions conditions (BAU scenario) with the particular meteorological conditions that occurred during the period analyzed. This model, named the random forest predictive model or simply RF, has been applied to the LD and the PLD phases to estimate the concentrations if no lockdown measures had been imposed and compare them with the observations during the lockdown phases. This tool could also be used to forecast the air quality situation in the city. The second approach, referred to as RF normalized or RFN, has been designed to decouple the effects of the meteorology by normalizing the meteorological variables, allowing a generalized assessment of the effect of the changes in the emission patterns. This technique has been applied to compare the concentrations of the different lockdown periods to those of the same time frames in 2019 in order to infer the effects of the sudden reduction in emissions during the COVID-19 mobility restriction period. A summarized schematic of the modeling approach can be seen in Fig. 2.
Observations from February 2019 to May 2020 were divided into different groups, following the methodology by Grange et al. (2021), using 8710 total data points for CNEA and 9198 for PC. The training of the models was conducted using a random sample of 80 % of the input data from February 2019 to February 2020. The remaining 20 % was used for testing (t) to choose the model configuration with the best statistical metrics.
The BLD period (360 data points; see Table 1) was established as a different evaluation period in order to check the adequate performance of the model 2 weeks before the lockdown periods. Data collected from 20 March to 25 May and the RF estimates were used to quantify and interpret the changes during the LD and the PLD.
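The random 80 %/20 % partition of the February 2019 to February 2020 data can be illustrated with a short sketch. This is only an illustration under stated assumptions: the function name `split_train_test`, the fixed seed, and the placeholder list of hourly records are hypothetical and are not taken from the study's code.

```python
import random


def split_train_test(records, train_fraction=0.8, seed=0):
    """Randomly split records into a training and a testing subset."""
    shuffled = list(records)               # copy so the input order is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility (assumption)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder for the hourly records between February 2019 and February 2020.
hourly_records = list(range(1000))
train, test = split_train_test(hourly_records)
```

The BLD, LD, and PLD data are kept out of this split entirely, so the BLD period remains an independent check on a model that was tuned on the testing subset.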
The target variables were the measured air pollutant concentrations in each monitoring site, namely CO, NO, NO 2 , O 3 , and SO 2 (CNEA) and CO, NO, NO 2 , and PM 10 (PC).
As predictive variables, we considered (i) data taken from the Argentine Meteorological Weather Service, namely wind speed, wind direction, surface temperature, sea level pressure, and relative humidity, (ii) boundary layer height and total cloud cover taken from ERA5 (Hersbach et al., 2018, 2020), (iii) pollutant concentrations measured in each of the sites (APRA, 2021; Diaz Resquin et al., 2021), (iv) time variables such as month, hour, and weekday, and (v) diurnal and weekly emission cycles for pollutants associated with gasoline and diesel emissions (Freitas et al., 2011). For the predictive model, all of these variables were tested as explanatory variables for each pollutant, and those performing the best for the testing dataset were selected. Table 3 presents the final set of predictive variables used in the RF model in addition to the hyperparameters that were employed.
For the RF normalized, all variables were used, and only the meteorological variables were normalized, following the approach described in Shi et al. (2021), which consists of resampling only the weather data over the whole study period and is considered adequate for studying emission changes. We employed the randomForest package of the R programming language.
Table 3 (fragment). Predictive variables: ws, gas_emcycle, aer_emcycle, CO, NO, NO 2 , month, weekday, hour.
Table 3 notes: rh2 is the 2 m relative humidity, slp is the sea level pressure, t2 is the 2 m air temperature, U is the 10 m U component of wind, V is the 10 m V component of wind, wd is the 10 m wind direction, ws is the 10 m wind speed, gas_emcycle is the gasoline-related emission cycle, and aer_emcycle is the diesel-related emission cycle. The hyperparameters are ntree (number of trees to grow), set to 300, and mtry (number of variables randomly sampled as candidates at each split), set to the rounded-down square root of the number of variables.
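The weather-resampling idea behind the RF normalized model can be sketched as follows. This is a simplified illustration and not the implementation used here: `predict` stands for any fitted model wrapped as a function, the list of resampled columns in `MET_VARS` and the number of resamples `n_samples` are assumptions, and the function name is hypothetical.

```python
import random

# Weather columns to resample; this particular selection is an assumption.
MET_VARS = ("ws", "wd", "t2", "slp", "rh2")


def normalize_meteorology(predict, rows, met_vars=MET_VARS, n_samples=300, seed=0):
    """Average each hour's prediction over weather drawn from the whole period.

    predict: callable mapping a dict of explanatory variables to a concentration.
    rows: list of dicts, one per hour, covering the whole study period.
    """
    rng = random.Random(seed)
    normalized = []
    for row in rows:
        total = 0.0
        for _ in range(n_samples):
            donor = rng.choice(rows)   # a random hour supplies the weather
            sample = dict(row)         # time and pollutant variables are kept
            for var in met_vars:
                sample[var] = donor[var]
            total += predict(sample)
        normalized.append(total / n_samples)
    return normalized
```

Because only the weather columns are replaced, the averaged prediction for a given hour still reflects that hour's emission-related variables, while the influence of its particular weather is averaged out.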

Random forest model evaluation and assessment tools
The RF model was tested for adequate performance, focusing on the reproduction of (i) the hourly concentrations, (ii) the mean diurnal cycles, and (iii) the mean value. For each pollutant, the normalized mean bias (NMB) and the Pearson correlation coefficient (r) for the hourly concentrations were calculated. The diurnal cycles were comparatively assessed by a graphical inspection of the temporal series of the mean values and spreads of the modeled and observed concentrations of each pollutant.
The NMB is useful for comparing pollutants that cover different concentration scales, and it is defined as the difference between modeled and observed mean concentrations, normalized by dividing by the mean observed concentration for that period. The r coefficient is useful to measure the linear relationship between two variables.
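Both metrics follow directly from these definitions. The sketch below is a plain illustration with hypothetical function names and made-up example values; it assumes paired, gap-free series of modeled and observed hourly concentrations.

```python
from math import sqrt


def normalized_mean_bias(modeled, observed):
    """(mean modeled - mean observed) / mean observed, as a fraction."""
    mean_mod = sum(modeled) / len(modeled)
    mean_obs = sum(observed) / len(observed)
    return (mean_mod - mean_obs) / mean_obs


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long series."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Made-up example: the model overestimates every observation by 10 %.
observed = [1.0, 2.0, 3.0, 4.0]
modeled = [1.1, 2.2, 3.3, 4.4]
nmb = normalized_mean_bias(modeled, observed)
r = pearson_r(modeled, observed)
```

In this example the NMB is 10 % while r is 1, which shows why the two metrics are reported together: r is insensitive to a constant multiplicative bias, and the NMB says nothing about how well the hourly variability is reproduced.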
To detect, locate, and characterize different pollution sources (Carslaw and Beevers, 2013;Grange et al., 2016), bivariate polar plots were built considering observations and RF results, using the openair library of the R programming language (Carslaw and Ropkins, 2012;R Core Team, 2019). These plots provided a graphical support to analyze the air pollutant concentrations together with wind speed and wind direction with and without COVID-19 restrictions. We also calculated them for March, April, and May 2019 (MAM2019), so as to have a baseline to identify the sources of the different pollutants.
Partial dependency plots were also built using the rmweather library of R (Grange et al., 2018; Grange and Carslaw, 2019) to highlight the relationships between pollutant concentrations and all explanatory variables presented in Table 3; they can be seen in the Supplement (Figs. S9 to S11). By obtaining the prediction from the random forest model for each unique value of a specific explanatory variable, these plots allow us to analyze how this dependency varies for different values of the explanatory variable and therefore help us to detect nonlinear relationships, which are highly relevant in air quality.
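The calculation behind a partial dependency curve can be sketched in a few lines. This is a generic illustration, not the rmweather implementation: `predict` stands for any fitted model wrapped as a function, and the function name and toy model are hypothetical.

```python
def partial_dependence(predict, rows, variable, grid):
    """Mean prediction when `variable` is forced to each grid value in turn."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)        # leave the original data untouched
            modified[variable] = value  # override one explanatory variable
            total += predict(modified)
        curve.append((value, total / len(rows)))
    return curve


# Toy model, quadratic in temperature and linear in wind speed.
def toy_model(sample):
    return sample["t2"] ** 2 + sample["ws"]


data = [{"t2": 10.0, "ws": 1.0}, {"t2": 25.0, "ws": 3.0}]
curve = partial_dependence(toy_model, data, "t2", [0.0, 2.0])
```

Plotting the resulting pairs for a fine grid of values reveals whether the response to a variable such as temperature is linear, saturating, or non-monotonic.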

Analysis of the results of the random forest models
In general, modeled CO, NO, NO 2 , and PM 10 concentrations at both sites were in good agreement with the corresponding observations (see Table 4), with NMB < 6 % for both sites for the testing dataset. The Pearson correlation coefficient during testing (r t ) was above 0.7 for all pollutants, except PM 10 . That was probably due to both a complex chemistry, with primary and secondary processes being highly relevant, and the effect of a few regional events during the period having a large impact on particulate matter. In addition, calculations of diurnal cycles utilizing RF outcomes adequately reproduced the clear bimodal behavior of CO, NO, and NO 2 (Fig. 3). Nevertheless, biases during the BLD period are moderately larger than during the testing period. This is to be expected, given that the model was optimized to reproduce the testing period.
The results for O 3 were also satisfactory, particularly considering its secondary nature with complex dynamics, which depend on multiple factors such as radiation energies, VOCs, and NO x concentrations and their ratios (Seinfeld and Pandis, 1998). Model performance indicators were NMB t = 2.2 % and r t = 0.85. Other processes involved in O 3 chemistry (like the O 3 /VOCs and O 3 /NO x ratios) in the MABA were analyzed as a further way to test the RF model performance. The O 3 -CO ratio was used as a proxy for VOCs because direct VOC observations were unavailable in the MABA, and traffic-borne VOCs are intimately linked to CO (Bon et al., 2011; Cazorla et al., 2020). Overall, above 75 % of O 3 -CO and O 3 -NO x hourly ratios from RF were within a factor of 2 of those resulting from the observations (Fig. S4). The Pearson correlation coefficients between observed and estimated O 3 -CO and O 3 -NO x hourly ratios were found to be 0.85 and 0.9, respectively. In this context, this model was suitable for reproducing not only the levels of primary contaminants in the two analyzed sites but also the formation of O 3 at the CNEA site. The diurnal cycle of SO 2 (Fig. 3) during the BLD period had a sharp peak between 18:00 and 20:00 local time (LT is given here and elsewhere in the paper, unless indicated otherwise) that could not be entirely captured by the model, but it was linked to a day of particularly high concentrations during that time period. Concentrations from 12:00 to 17:00 were also overestimated during the BLD. Figure 3 shows that, during the BLD, the diurnal cycles of O 3 and SO 2 estimated using RFN are noticeably different from those calculated using RF and the observations. This is further evidence that the atmospheric conditions can affect the concentrations of pollutants in a relevant way under certain weather conditions.
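The factor-of-2 agreement quoted above for the O 3 -CO and O 3 -NO x ratios can be computed as sketched below. The function name is hypothetical, the example values are made up, and skipping non-positive observed ratios is an assumption made here to avoid division by zero.

```python
def fraction_within_factor_of_two(modeled, observed):
    """Share of pairs with 0.5 <= modeled / observed <= 2."""
    pairs = [(m, o) for m, o in zip(modeled, observed) if o > 0]
    hits = sum(1 for m, o in pairs if 0.5 <= m / o <= 2.0)
    return hits / len(pairs)


# Made-up hourly ratios: two of the four modeled values fall within a factor of 2.
modeled_ratio = [1.0, 2.0, 5.0, 0.4]
observed_ratio = [1.0, 1.0, 2.0, 1.0]
fac2 = fraction_within_factor_of_two(modeled_ratio, observed_ratio)
```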
One of the advantages of building a random forest model is that it can provide the key components that reflect the nonlinear relationship among the emissions, the chemistry, and the meteorology by analyzing metrics such as the permutation difference (variable importance; Figs. 5 and S6 to S8) and the partial dependencies. The analysis of the variable importance plots shows that the boundary layer height and the wind speed were important variables to predict CO concentrations at both sites for both the normalized and the non-normalized models. This result is consistent with the fact that, at the temporal scale studied here, CO can be considered to be a passive tracer (Saide et al., 2011). For NO 2 and NO, the most important variables were the other pollutants included in the models and the surface temperature (Table 3), which was also expected because temperature has an influence on NO x chemistry. In the case of O 3 , the model was dominated by the concentrations of NO x and CO, with NO 2 being the most relevant, which is consistent with O 3 chemistry (see Sect. 3.2.2 and 3.2.3). The variable importance plot for the PM 10 model shows that CO and NO 2 are the most important variables for predicting PM 10 concentrations. This was also expected because in Buenos Aires around 65 % of the PM 10 is PM 2.5 , and the latter is highly correlated with CO (Arkouli et al., 2010).
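The principle of permutation-based variable importance can be illustrated with a simplified sketch. Random forest packages typically compute this measure per tree on out-of-bag samples; the version below instead shuffles one variable over a single dataset and is therefore only an approximation of the idea. The function name, the toy model, and the variable name `blh` for boundary layer height are hypothetical.

```python
import random


def permutation_importance(predict, rows, targets, variable, seed=0):
    """Increase in mean squared error after shuffling one explanatory variable."""
    def mse(data):
        return sum((predict(r) - t) ** 2 for r, t in zip(data, targets)) / len(targets)

    shuffled_values = [r[variable] for r in rows]
    random.Random(seed).shuffle(shuffled_values)  # break the link to the target
    permuted = [dict(r, **{variable: v}) for r, v in zip(rows, shuffled_values)]
    return mse(permuted) - mse(rows)


# Toy model in which the concentration depends only on boundary layer height.
def toy_model(sample):
    return 1000.0 / sample["blh"]


data = [{"blh": 100.0 * (i + 1), "hour": i} for i in range(10)]
targets = [toy_model(r) for r in data]
imp_blh = permutation_importance(toy_model, data, targets, "blh")
imp_hour = permutation_importance(toy_model, data, targets, "hour")
```

A variable that the model does not use, such as `hour` in this toy case, yields an importance of zero, whereas shuffling a variable the model relies on can only increase the error.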
Partial dependency plots (Figs. S9 to S11) shed light on the relationships between pollutant concentrations and temperature. As an example, in CNEA, while CO, NO, and NO 2 concentrations were inversely related to temperature, SO 2 presented the opposite behavior. As described by Grange and Carslaw (2019), this relationship of SO 2 with temperature could be associated with shipping emissions. This is also consistent with the fact that there is a high partial dependence on wind directions from 0 to 100° (Figs. S3 and S11), which is the range of wind that brings air masses from the La Plata River.

Quantifying and analyzing the changes in concentrations during the lockdown periods
We discuss here the relative changes in the (1) measured concentrations during the LD and the PLD periods, in comparison to the RF outputs for the same period, and (2) normalized measured concentrations during the LD and the PLD with normalized concentrations during the same periods but for 2019 (20 March to 12 April and 13 April to 25 May). The corresponding percent relative changes (RC RF and RC RFN ) were estimated using the expressions presented in Eqs. (3) and (4). We make use of RC RF to quantify the magnitude of the changes with respect to a BAU scenario for the particular meteorological conditions that happened during the two lockdown periods, and RC RFN is used to quantify the effects of the changes in the emissions from the sources of these pollutants rather than the meteorological or environmental effects of particular atmospheric conditions. In these expressions, the observations for the LD and the PLD are compared with RF LD,PLD , the corresponding predictive RF for the same period. RFN LD,PLD refers to the data for the LD and the PLD with the normalization of the meteorological variables, which was compared with the meteorologically normalized data of the same period in 2019.
Since both monitoring sites had been highly influenced by vehicular emissions, the traffic reduction of ∼ 80 % that was registered during the LD period led to a significant air quality improvement in primary pollutants (Fig. 6). In almost all cases, except for CO in PC and for SO 2 , the meteorological conditions amplified the change, as shown by the fact that RC RFN is smaller than RC RF . This is consistent with the results obtained by Shi et al. (2021).
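The two relative changes used throughout this section can be written out as follows. This is a reconstruction from the verbal definitions given here rather than a verbatim copy of Eqs. (3) and (4); the symbol Obs for the mean observed concentration and the subscript 2019 for the normalized data of the previous year are notational assumptions.

```latex
\begin{align}
\mathrm{RC}_{\mathrm{RF}} &=
  \frac{\mathrm{Obs}_{\mathrm{LD,PLD}} - \mathrm{RF}_{\mathrm{LD,PLD}}}
       {\mathrm{RF}_{\mathrm{LD,PLD}}} \times 100\,\% \\
\mathrm{RC}_{\mathrm{RFN}} &=
  \frac{\mathrm{RFN}_{\mathrm{LD,PLD}} - \mathrm{RFN}_{2019}}
       {\mathrm{RFN}_{2019}} \times 100\,\%
\end{align}
```

With this sign convention, negative values indicate concentrations below the reference (the BAU estimate or the normalized 2019 value), and a positive value such as +80 % corresponds to observations that are 1.8 times the RF estimate.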
On the other hand, observed O 3 levels were 80 % and 57 % higher than the RF estimations for the LD and the PLD, respectively. However, the fact that this increase was considerably smaller when the meteorology was normalized indicates that the change was strongly enhanced by the meteorological conditions that occurred during that period. Figure 4 displays the differences in daily concentrations between observations and RF estimates for the three considered periods (BLD, LD, and PLD). For CNEA, CO and NO x observations and predictions for the BLD period showed NMB < 10 %. Noticeably, most of the changes were observed from the day after the lockdown began. Pollutant levels had almost fully recovered by the last week of the PLD period.
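The bias check mentioned above can be sketched as follows, assuming that NMB denotes the normalized mean bias in its common definition, NMB = 100 × Σ(P − O) / ΣO, with P the RF predictions and O the observations. The function name and the values are hypothetical.

```python
def normalized_mean_bias(predicted, observed):
    """Normalized mean bias in percent (assumed definition: 100 * sum(P - O) / sum(O))."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    total_bias = sum(p - o for p, o in zip(predicted, observed))
    return 100.0 * total_bias / sum(observed)


# Hypothetical daily NOx values (ppb) for a pre-lockdown evaluation period.
obs = [10.0, 20.0, 30.0, 40.0]
pred = [11.0, 19.0, 33.0, 42.0]

nmb = normalized_mean_bias(pred, obs)
print(f"NMB = {nmb:.1f} %")
```

A small NMB before the lockdown is what allows later differences between observations and predictions to be attributed to the restrictions rather than to model bias.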
In what follows, the results are presented by species, highlighting the most relevant relative changes in concentrations. The results of the meteorological normalization are used to evaluate the effects of the changes in emissions of particular pollutants as a consequence of the restrictions previously discussed. Bivariate polar plots were used to distinguish potential sources that impact the monitoring sites (Figs. 8 and 9).
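One common way to perform a meteorological normalization with a trained model is to predict each time step many times, each time with meteorological inputs drawn at random from the observed record, and then to average the predictions. The sketch below illustrates only this resampling step, under the assumption that the RFN follows this general idea. Here, `toy_model` is a hypothetical stand-in for a trained random forest, and all names and values are illustrative.

```python
import random
from statistics import mean


def normalize_meteorology(predict, time_rows, met_records, n_samples=500, seed=0):
    """Average predictions over randomly resampled meteorology.

    predict     : callable(time_row, met_row) -> float, e.g. a trained model
    time_rows   : list of dicts with time variables (hour, weekday, ...)
    met_records : list of dicts with observed meteorology to sample from
    """
    rng = random.Random(seed)
    normalized = []
    for time_row in time_rows:
        draws = [predict(time_row, rng.choice(met_records)) for _ in range(n_samples)]
        normalized.append(mean(draws))
    return normalized


def toy_model(time_row, met_row):
    """Hypothetical stand-in for a trained model: rush-hour peak minus wind dilution."""
    base = 1.0 if time_row["hour"] in (8, 19) else 0.5
    return base - 0.1 * met_row["wind_speed"]


times = [{"hour": 8}, {"hour": 3}]
weather = [{"wind_speed": w} for w in (1.0, 2.0, 3.0, 4.0, 5.0)]
print(normalize_meteorology(toy_model, times, weather))
```

Because every time step is evaluated under the same distribution of weather, the remaining differences between periods reflect the time variables (and hence emissions) rather than the particular meteorology of each period.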

Carbon monoxide
As shown in Table 5 and discussed below, there was a reduction in CO levels when the highest restrictions were in place (LD). However, the behavior of this pollutant when the restrictions were partially lifted (PLD) differed, depending on the measuring site.
In PC, the recovery of traffic during the PLD (RC PLD RF = −19 %) did not result in a smaller relative change with respect to a scenario with higher restrictions (RC LD RF = −20 %). Nevertheless, as shown by RC RFN , when decoupling the effects of the meteorology, the relative change was −20 % in the LD but only −7 % in the PLD with respect to the normalized values for the same periods in 2019. These results show the influence that the particular meteorological conditions had on CO concentrations in PC. On the other hand, in CNEA, the partial lifting of restrictions during the PLD resulted in a smaller relative change in CO concentrations, which is clear both for the particular meteorological conditions of the two periods (−45 % for RC LD RF vs. −26 % for RC PLD RF ) and for the normalized model (−26 % for RC LD RFN vs. −11 % for RC PLD RFN ). The observed CO had lower concentration values and flatter diurnal patterns than our simulations of a BAU scenario (Fig. 6). This reduction far surpasses any bias detected in RF simulations, particularly during rush hours, when RF showed close to no bias (Fig. 3). This is particularly true in CNEA, where the general reduction in CO was larger. For this pollutant, there are no large differences between the changes in the normalized diurnal cycle and those obtained by comparing the RF predictive model with the observations (Figs. 6 and 7).
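The notion of a flatter diurnal pattern can be made concrete with a short sketch that averages hourly values by hour of day and compares the spread of the resulting cycles. The indicator `relative_amplitude` is a hypothetical choice made for illustration and is not a metric used in the study. The data are invented.

```python
from collections import defaultdict
from statistics import mean


def diurnal_cycle(hours, values):
    """Mean concentration for each hour of the day."""
    by_hour = defaultdict(list)
    for hour, value in zip(hours, values):
        by_hour[hour].append(value)
    return {hour: mean(vals) for hour, vals in sorted(by_hour.items())}


def relative_amplitude(cycle):
    """Peak-to-trough range of the cycle divided by its mean (illustrative indicator)."""
    vals = list(cycle.values())
    return (max(vals) - min(vals)) / mean(vals)


# Hypothetical CO values (ppm) at four hours of the day over two days.
hours = [3, 8, 13, 19, 3, 8, 13, 19]
bau = [0.3, 0.9, 0.5, 0.8, 0.3, 0.9, 0.5, 0.8]        # bimodal, rush-hour peaks
lockdown = [0.3, 0.4, 0.3, 0.4, 0.3, 0.4, 0.3, 0.4]   # lower and flatter

print(relative_amplitude(diurnal_cycle(hours, bau)))
print(relative_amplitude(diurnal_cycle(hours, lockdown)))
```

A smaller value for the lockdown series indicates that the rush-hour peaks contribute less to the daily cycle.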
As shown in Fig. 8, for the CNEA site during MAM2019, concentrations were similar for all wind directions and speeds (up to 8 m s −1 ). The largest relative changes between the 2020 observations and the RF simulations occurred when winds were coming from the east and southeast (both for the LD and the PLD). These were probably due to a reduction in [...] (2018), is one of the principal sources of fuel combustion emissions. An equivalent analysis for PC (Fig. 9) yielded similar results during MAM2019, although concentrations seemed to be largest when winds were from the west. However, relative changes during the LD and the PLD did not seem to have a clear dominant wind direction. During the PLD, sources from the west reappeared.
Table 5. Summary of the average concentrations for the BLD, the LD, and the PLD and the relative changes for the LD and the PLD for PC and CNEA sites estimated by RF and RFN. In every case, the RCs were calculated by considering the mean value for each period. In every case, Obs refers to observed concentrations.

Nitrogen oxide levels
The drastic reduction in vehicular emissions led to lower NO and NO 2 levels. As shown in Table 5, during the LD period, NO levels were one-third and one-fourth of the estimated value for a BAU scenario in PC and CNEA, respectively. The relative change for NO 2 was ∼ −45 %. During the PLD, the relative change was smaller, with −37 % for both sites for NO and −20 % and −30 % for NO 2 in PC and CNEA, respectively.
At both sites, the relative changes in nitrogen oxide levels were larger than those of CO. Arguably, this indicates that the power plants did not contribute in any major way to the observed differences. The larger change in nitrogen oxides is probably due to a reduced circulation of diesel vehicles, which are the major nitrogen oxide emitters (D'Angiola et al., 2010;Ghaffarpasand et al., 2020).
RC RFN shows that these changes were consistently enhanced by the meteorological conditions during that period, so that the changes with a meteorological normalization are between half and two-thirds as large as those without.
We also see a flattening of the diurnal cycles of NO during the LD, both in the RF predictive model and in the analysis with normalized meteorology (Figs. 6 and 7). The bimodal curve is partially recovered during the PLD. This indicates, once again, the strong role of traffic emissions in NO concentrations. NO 2 , however, preserves most of its bimodal nature, albeit somewhat diminished. Although a clear explanation of this fact is hard to find, it should be noted that, while NO is predominantly a primary pollutant, NO 2 is partially secondary in origin and is largely influenced by NO, O 3 , and HO x concentrations, as well as radiation and other meteorological parameters (Han et al., 2011;Brasseur and Jacob, 2017). NO is converted to NO 2 by reacting with O 3 during the morning, but NO 2 is converted back to NO by photolysis during the daytime, generating an O atom that regenerates the O 3 . At night, O 3 and NO 2 react with each other in a chain of reactions that ends up generating HNO 3 in the aqueous phase of aerosols. The diurnal cycle of these photochemical processes should be largely regulated by solar radiation and therefore unaffected by the restrictions. This remains true, even if NO emissions are flattened and the total concentrations of NO 2 are also clearly lower, particularly during daytime.
Figure 8 shows the bivariate polar plots of the NO x concentrations at the CNEA site. The bivariate polar plot in MAM2019 provides evidence for two main contributing sources. One source was due to air masses from E-SE directions at low wind speeds, and the second source was associated with higher wind speeds from the N-NW direction. The source to the E-SE could be dominated by ground-level road traffic emissions that are closer to the site because high concentrations under low wind speeds are indicative of surface emissions released with little or no buoyancy (Uria-Tellaetxe and Carslaw, 2014). Also, the wind direction in which this source was dominant corresponds to the highway previously described in Sect. 2.4.1. The source to the N-NW was associated with high concentrations at high wind speeds, which is indicative of emissions at a greater distance. It is plausible to attribute these NO x levels to the main access avenue that connects the city with the suburbs and is located in this direction, due to the presence of heavy-duty diesel vehicles and buses and the number of stops in the traffic flow. During the LD and the PLD, the highest RC RF values were present when winds were coming from the highway. This serves as further evidence that the observed effects were mainly due to changes in traffic and not to changes in residential emission patterns caused by lifestyle changes during the lockdown.
In the case of PC, as shown in Fig. 9, during MAM2019, the main sources seemed to be located to the west and southwest of the station. These two directions entailed the largest changes due to restrictions during the LD period. During the PLD period, in a similar manner to CO, the sources to the west were partially restored (although concentrations from the southwest remained low).
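The idea behind the bivariate polar plots used above can be illustrated with a simplified sketch that averages concentrations in bins of wind direction and wind speed. This is only an assumption-laden illustration: plotting packages commonly fit a smooth surface over such bins, which is not attempted here, and the function name, bin widths, and data are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def polar_bin_means(wind_dir_deg, wind_speed, conc, dir_step=45.0, speed_step=2.0):
    """Mean concentration per (direction sector, speed bin).

    Keys are the lower edges of each bin: (degrees, m/s).
    """
    bins = defaultdict(list)
    for direction, speed, value in zip(wind_dir_deg, wind_speed, conc):
        sector = int((direction % 360.0) // dir_step) * dir_step
        speed_bin = int(speed // speed_step) * speed_step
        bins[(sector, speed_bin)].append(value)
    return {key: mean(values) for key, values in bins.items()}


# Hypothetical hourly NOx data: two calm hours with E-SE wind, one windy hour from the NW.
directions = [100.0, 110.0, 300.0]   # degrees from north
speeds = [1.0, 1.5, 7.0]             # m/s
nox = [40.0, 60.0, 20.0]             # ppb

for (sector, speed_bin), value in sorted(polar_bin_means(directions, speeds, nox).items()):
    print(f"dir >= {sector:.0f} deg, speed >= {speed_bin:.0f} m/s: {value:.1f} ppb")
```

High means in low-speed bins point to nearby surface sources, whereas high means in high-speed bins suggest sources at a greater distance, which is the reasoning applied to the two NO x sources at CNEA.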

Ozone
In contrast to the other pollutants considered, O 3 concentrations were higher than those of a no-restrictions scenario. The relative changes estimated using the RF predictive model were 80 % and 57 % during the LD and the PLD periods, respectively.
Recent studies of the lockdown effects on atmospheric composition have also reported large O 3 increases at urban sites and indicated the need to analyze changes in precursor emissions and meteorological parameters in light of their role in the nonlinear response in the O 3 concentrations (Ordóñez et al., 2020;Tobías et al., 2020;Nakada and Urban, 2020;Shi and Brasseur, 2020). Hence, considering the joint effects of the changes in precursors and meteorology is of great value for understanding the differences between the relative changes estimated using RF concentrations. Based on Figs. 4 and 6, we provide plausible explanations for these discrepancies.
It is well known that decreasing nitrogen oxide levels in a VOC-limited regime tends to increase O 3 . It is most likely that the lower concentrations of freshly emitted NO registered during the LD and the PLD in CNEA provoked a decline in the local scavenging of O 3 , leading to higher O 3 concentrations, particularly in the morning (Tobías et al., 2020;Nakada and Urban, 2020). Even though NO is the pollutant that had the highest relative decrease during the LD and the PLD, its reduction is not enough to explain the overall relative increase in O 3 , and therefore, NO 2 might have played a role as well. Lower NO 2 levels could also have resulted in more OH to initiate O 3 production because the inhibition of a termination reaction favors a faster O 3 accumulation (Seguel et al., 2012).
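For reference, the mechanisms invoked above can be summarized with the following standard textbook reactions. They are given here as an illustrative summary and are not reproduced from the cited works. The first reaction is the scavenging (titration) of O 3 by freshly emitted NO, the second and third regenerate O 3 through NO 2 photolysis, and the fourth is the termination reaction that removes OH and NO 2 .

```latex
\begin{align}
\mathrm{NO} + \mathrm{O_3} &\rightarrow \mathrm{NO_2} + \mathrm{O_2} \\
\mathrm{NO_2} + h\nu &\rightarrow \mathrm{NO} + \mathrm{O(^3P)} \\
\mathrm{O(^3P)} + \mathrm{O_2} + \mathrm{M} &\rightarrow \mathrm{O_3} + \mathrm{M} \\
\mathrm{OH} + \mathrm{NO_2} + \mathrm{M} &\rightarrow \mathrm{HNO_3} + \mathrm{M}
\end{align}
```

With less NO, the first reaction removes less O 3 , and with less NO 2 , the fourth reaction removes less OH, so more OH remains available to oxidize VOCs and sustain O 3 production.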
With respect to the role of aerosols in the O 3 formation, it is worth noting that a significant decrease in PM 10 was registered in PC. This likely implied a consequent reduction not only in the mass concentrations of PM 2.5 and PM 1 but also especially in the number concentrations of fine and ultrafine particles (Arkouli et al., 2010;Gelman Constantin et al., 2021). A similar situation most likely occurred in CNEA. The decrease in the emissions of fine particles resulting from the vehicular restrictions imposed during the lockdown periods could have led to greater photolysis, which in turn could have led to higher O 3 concentrations.
In this case, meteorological factors were clearly highly relevant, as can be seen by the fact that the relative change estimated with the RFN model is far smaller (27 % for RC LD RFN and only 5 % for RC PLD RFN ). The effects of meteorology can be rather complex, since the O 3 precursor concentrations and reaction rates are affected in multiple ways (Wang et al., 2017). Although meteorological variables such as the temperature and relative humidity are highly relevant for ozone production and chemistry, they were tested as explanatory variables and, in this case, led to model degradation. However, we submit that their effects are indirectly taken into account by the chemical species that were employed (CO, NO, NO 2 , and SO 2 ). Solar radiation, which is highly relevant for O 3 chemistry, is also linked to the variable daynight. In this particular case, during the LD, elevated O 3 concentrations occurred on days with high temperatures and low winds, which favor the photochemical production of O 3 and the accumulation of ozone and its precursors.
When the meteorology is normalized, the valleys at 07:00 and 20:00 are clearly less marked during 2020 than during 2019 and almost disappeared during the LD when compared with the normalized values for the same period in the previous year (Fig. 7). This is probably due to the lower nitrogen oxide concentrations, which are therefore less efficient at titrating O 3 (Brasseur and Jacob, 2017).
As expected, the bivariate polar plots (Fig. 8) show that O 3 behaved in an opposite manner to that of NO x and had the largest increases when winds came from the east and southeast during the LD and also when they came from the east and northwest during the PLD.
From these results, we can also infer that the area in which the CNEA site is located behaves as a region with a VOC-limited chemical regime because the reduction in NO x emissions caused an increase in ozone concentrations (Blanchard and Fairley, 2001;Heuss et al., 2003;Yarwood et al., 2003;Blanchard and Tanenbaum, 2006). We identified a similar behavior, with O 3 concentrations increasing under decreasing NO x levels, when analyzing the 2019 data for weekends (Fig. S5). This is related to the so-called weekend effect in a VOC-limited regime (Koo et al., 2012).
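A weekend effect of the kind described above can be checked with a short sketch that compares weekend and weekday means. The function name and the values are hypothetical and are not taken from Fig. S5. In a VOC-limited regime, one would expect a negative change for NO x together with a positive change for O 3 .

```python
from datetime import datetime
from statistics import mean


def weekend_change(timestamps, values):
    """Percent change of the weekend mean relative to the weekday mean."""
    weekday_vals = [v for t, v in zip(timestamps, values) if t.weekday() < 5]
    weekend_vals = [v for t, v in zip(timestamps, values) if t.weekday() >= 5]
    weekday_mean = mean(weekday_vals)
    return 100.0 * (mean(weekend_vals) - weekday_mean) / weekday_mean


# Hypothetical daily means for Saturday to Tuesday, 2 to 5 March 2019.
days = [datetime(2019, 3, d) for d in (2, 3, 4, 5)]
nox = [15.0, 15.0, 30.0, 30.0]   # ppb, lower on the weekend
o3 = [25.0, 25.0, 20.0, 20.0]    # ppb, higher on the weekend

print(f"NOx: {weekend_change(days, nox):+.0f} %, O3: {weekend_change(days, o3):+.0f} %")
```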

Sulfur dioxide
During the LD, the SO 2 concentrations were slightly lower than those of the simulated BAU scenario (RC RF of −12 %). Although this change is not as large as in the other species for the particular meteorological conditions that occurred during the period, if we consider a normalized meteorology, then we observe a relative change of −20 %, which is about as large as the change observed in, for example, CO. There was a smaller relative change during the PLD, which was similar for RF and RFN.
While all other species in this study are mostly controlled, directly or indirectly, by on-road traffic emissions, according to our findings, SO 2 concentrations are largely influenced by shipping emissions (see Sect. 3.1). This might be the reason why SO 2 shows a larger change after the meteorology is normalized.
Another possible reason for having a smaller relative change in SO 2 concentrations is that the vehicle emissions of heavy-duty diesel trucks are another relevant source in Buenos Aires. These are mainly associated with essential activities and might not have been affected as much by the restrictions. However, the partial flattening of the normalized diurnal cycle (Fig. 7) is still probably related to changes in this particular sort of traffic.

Particulate matter 10 µm
During the LD, PM 10 had a relative change of −33 % compared to what would be expected for that specific period under previous emissions. This effect was once again enhanced by the meteorological factors, considering that RC RFN was only −20 %. During the PLD, similar to what happened with other pollutants, the concentrations had a relative change that was only about half as large (−14 % for the RF predictive model and −7 % for the RFN).
When winds are taken into account (Fig. 9), we observe a general reduction from all directions during the LD. Two sources account for this, namely (i) the anthropogenic PM 10 emissions close to the monitoring site that were mostly from vehicle diesel combustion and soot resuspension and (ii) natural sources, such as dust emissions, from the nearest large open area. In a similar fashion to CO and NO x , sources from the west were re-established during the PLD.

Vehicle emission reduction strategies and air pollution in the MABA
Although, as expected, most pollutants were noticeably reduced during the LD due to the restrictions imposed, O 3 was an exception. Strategies for controlling pollution from vehicular emissions in the MABA must take into account the relative reductions in NO x and VOCs to avoid an unintended increase in O 3 concentrations. The atmosphere in the MABA is usually cleaned up during the night, due to a flat topography and the city's wind dynamics. Therefore, criteria pollutants rarely surpass air quality norms. Even though no specific policies to reduce them have been implemented, recently announced greenhouse gas emission mitigation policies affecting on-road mobile emissions may have a major impact. These include (i) technological advances in diesel buses that should reduce NO x and PM 10 , without a major impact on VOCs, and (ii) an increase in the fraction of electric cars, which should reduce NO x and VOC concentrations. Thus, if NO x emissions decrease as they did during the COVID-19 lockdown, then this will likely result in an increase in tropospheric O 3 in the MABA if no additional measures regarding VOC emissions are included, which could be of particular importance for some weather conditions. In fact, under the VOC-limited regime identified for the MABA, control of VOC emissions would be more efficient to reduce local peaks in O 3 . This highlights the importance of having comprehensive air quality policies rather than focusing on reducing individual pollutants.

Code and data availability
Hourly concentrations of CO, NO, NO 2 , SO 2 , and O 3 in CNEA are available in *.csv format at https://doi.org/10.17632/h9y4hb8sf8.1 (Diaz Resquin et al., 2021). We also provide an introductory R notebook with some baseline simulations for the predictive model. For PC, regulatory averages are publicly available and can be accessed through https://data.buenosaires.gob.ar/dataset/calidad-aire (APrA, 2021). Nevertheless, hourly data are not regularly reported but can be requested from the Environmental Protection Agency of Buenos Aires. To enable a machine learning quick start to reproduce the baseline experiments, we also added the meteorological data used to run the simulations to the dataset. These data are publicly available on the website of the Argentine National Weather Service (https://www.smn.gob.ar/descarga-de-datos, Servicio Meteorológico Nacional, 2023).

Summary and conclusions
In this study, we present novel air quality data for a residential site located in the metropolitan area of Buenos Aires that includes concentrations of CO, NO, and NO 2 and, of particular importance for the city, SO 2 and O 3 . One year of these data, together with data from a public monitoring station, were used to train random forest models. The performance of the models was tested on the basis of observations registered both with a separate testing set during the training period and with data before the outbreak of the COVID-19 pandemic. Observations in the first two phases of the lockdown measures imposed were compared with the business-as-usual RF concentrations to assess the change with respect to the air pollutant concentrations that would have occurred without the lockdown. Simultaneously, a meteorological normalization using random forest was performed (RFN), and the normalized concentrations during these lockdown phases were compared with the normalized concentrations for the same periods during 2019. The main conclusions are listed below.
i. The resulting set of explanatory variables for the different pollutants at each site provides evidence of the need for careful variable identification during the training period. Although ideally the best explanatory variables could be identified by trial and error by inexperienced users of random forest models with the support of variable importance plots, expert judgment is advisable for a meaningful and relatively fast selection.
ii. The RF model was able to reproduce air quality observations at two monitoring stations in the MABA when evaluated for a 15 d period prior to the outbreak of the COVID-19 pandemic. This approach allowed predicting the pollutant hourly mean values with a mean bias of less than 10 % by using the data of air quality, emissions, and meteorology and analyzing the effect of wind direction and speed in pollutant concentration, which is useful when characterizing pollution sources.
iii. During the lockdown, all primary pollutants had lower concentrations than what the RF framework would predict for a business-as-usual scenario. The relative change ranged from −12 % (SO 2 ) to −75 % (NO in the monitoring site of CNEA). In the case of all pollutants except SO 2 , the relative changes were enhanced by the meteorology, as shown by the fact that, in absolute terms, RC RF was generally larger than RC RFN . This difference was particularly large for O 3 , probably due to its secondary nature and its complex chemical and photochemical production and destruction mechanisms. The exception observed in the case of SO 2 is likely due to the importance of the wind direction, due to the relevance of the shipping emissions. The relative changes in pollutant concentrations are closely linked to both the traffic and the particular meteorological conditions. The use of bivariate polar plots is also helpful for identifying potential sources, while remaining relatively easy to implement.
iv. RF estimations can be implemented at a low computational cost and can be used to assess the changes that occurred in a specific period if an anomalous situation happened. They can also be used to forecast air quality conditions in the short term at a lower cost than CTMs, which could be of use for local authorities, considering that the MABA has, thus far, only six long-term air quality monitoring stations. When, as in this case, detailed temporal information on different emission sources is lacking (for example, traffic information from on-road sensors), it is essential to use a set of data in which the emissions are similar to those that are expected to be simulated. The model also allows the analysis of the relations between different pollutants, which is of particular interest for those that have very complex chemistry, such as O 3 . The observational input data needed for future RF simulations can be readily updated. The modeling framework developed in this study is user-friendly, rather straightforward to implement, and does not require a large computational capacity. The methodology is capable of being adapted to different time periods and sites and implemented by the technical staff of regulatory agencies. Expert advice may be needed during the selection of the predictive variables and model optimization.
v. To assess the effectiveness of a particular measure in air quality (AQ) independently of particular meteorological conditions of specific periods, a meteorological normalization technique based on random forest can be used. This approach is relatively simple to implement with already existing R packages.
vi. Although previous studies employed both techniques with similar aims, we postulate that the use of the RF predictive model and the meteorological normalization serve different purposes and should be used accordingly. The predictive model can be used to analyze the changes for particular weather conditions or, combined with a meteorological forecast, to forecast pollutant concentrations. On the other hand, the meteorological normalization makes it possible to evaluate the general impact on concentrations due to changes in emissions, decoupling the effects of particular meteorological conditions from the short-term emission changes from the AQ datasets.
vii. In this work we provide the first year-long in situ observational dataset on tropospheric O 3 and SO 2 outside of an industrial area in the MABA in the last decade. We also provide concentrations of CO, NO, and NO 2 determined by colocated instruments.
viii. According to our measurements, the MABA seems to be in a VOC-limited regime. If VOC emissions are not carefully regulated, a NO x reduction would imply an increase in the tropospheric O 3 . Knowing how the concentrations of O 3 in the troposphere respond to reducing the emissions of their precursors is relevant when planning appropriate strategies to reduce CO, non-methane volatile organic compounds (NMVOCs), and NO x emissions. Even though this classification is limited due to the fact that we only have single-point measurements, this could be a useful starting point for a more thorough characterization of the ozone regime in this urban area.
Author contributions. MDR and LD conceived the conceptualization. DA and MDO acquired the data for CNEA, and MDR retrieved the data from the Argentine National Weather Service and