AQ-Bench: A Benchmark Dataset for Machine Learning on Global Air Quality Metrics

Abstract. With the AQ-Bench dataset, we contribute to recent developments towards shared data usage and machine learning methods in the field of environmental science. The dataset presented here enables researchers to relate global air quality metrics to easy-access metadata and to explore different machine learning methods for obtaining estimates of air quality based on this metadata. AQ-Bench contains a unique collection of aggregated air quality data from the years 2010-2014 and metadata at more than 5500 air quality monitoring stations all over the world, provided by the first Tropospheric Ozone Assessment Report (TOAR). It focuses in particular on metrics of tropospheric ozone, which has a detrimental effect on climate, human morbidity and mortality, as well as crop yields. The purpose of this dataset is to produce estimates of various long-term ozone metrics based on time-independent local site conditions. We combine this task with a suitable evaluation metric. Baseline scores obtained with linear regression, a fully connected neural network and a random forest are provided for reference and validation. AQ-Bench offers a low-threshold entrance for all machine learners with an interest in environmental science and for atmospheric scientists who are interested in applying machine learning techniques. It enables them to start with a real-world problem relevant to humans and nature. The dataset and introductory machine learning code are available at https://doi.org/10.23728/b2share.30d42b5a87344e82855a486bf2123e9f (Betancourt et al., 2020) and https://gitlab.version.fz-juelich.de/esde/machine-learning/aq-bench.


Introduction
In recent years, machine learning has achieved remarkable success in areas such as pattern, image and speech recognition through increasing computing power, innovative algorithms and high data availability (Krizhevsky et al., 2012; Amodei et al., 2016; Silver et al., 2016). This aroused the interest of environmental scientists to explore the application of machine learning and data-driven methods in their fields. The strength to be exploited is the ability of machine learning algorithms to find complex relationships in large multivariate, inhomogeneous datasets (as described, e.g., in Wise and Comrie, 2005; Porter et al., 2015).
In air quality research, there is one pollutant which is especially challenging to track: tropospheric ozone, a toxic trace gas which harms human health and vegetation and also impacts the climate (Cooper et al., 2014; Monks et al., 2015). Tropospheric ozone is difficult to track because it has no direct emission sources, but is produced as a secondary airborne pollutant by several chemical reaction chains involving a large variety of precursors and photochemistry. With a lifetime of days to weeks (Wallace and Hobbs, 2006), the ozone concentration is affected by various physical and chemical processes which produce and destroy ozone. Therefore, ozone is a scientifically interesting candidate for machine learning applications: it is influenced by many interconnected environmental factors, and it is interesting to see if machine learning algorithms can learn these.
Data-driven atmospheric chemistry research has been combined with machine learning since the late 1990s, to model and predict surface ozone concentrations as an alternative to multivariate regression (Yi and Prybutok, 1996; Comrie, 1997; Elkamel et al., 2001; Caselli et al., 2009). These data-driven approaches take ground-based measurements as input and predict the pollutant concentrations for the next days at individual locations. The principle behind recent machine learning applications in ozone research is often similar to the one Schultz et al. (2020) described for weather data: the input data are directly mapped to a specific data product, e.g. from meteorological and past ozone measurements to the next day's maximum ozone value. In recent studies, Sayeed et al. (2020) and Kleinert et al. (2020) predicted regional ozone time series with convolutional neural networks and meteorological input data. Furthermore, Silva et al. (2019) trained a feed-forward neural network to output ozone dry deposition at two forest measurement sites. Moreover, within computationally complex components of atmospheric chemistry models, machine learning techniques are used as emulators or surrogate models. They replace, for example, costly atmospheric chemistry and micro-physical calculations to improve the computational performance of the models (Kelp et al., 2020). In addition, machine learning is applied in the calibration of low-cost sensors for air quality measurements in order to account for the diverse sources of interference with these measurements (Schmitz et al., 2021; Wang et al., 2020). Nevertheless, to our knowledge there are currently no machine learning projects that attempt to analyze and predict ozone on the global scale, for longer time periods and with many kinds of metadata.
Developments in machine learning are accelerated by the existence of precompiled benchmark datasets, which allow machine learners to try out specific tasks, exchange solutions and compete with each other (LeCun et al., 2010; Deng et al., 2009; Rasp et al., 2020). Benchmarks can also be used for the development of explainable artificial intelligence approaches (Kierdorf et al., 2020; Roscher et al., 2020). So far, few such benchmark datasets exist in the field of environmental science, especially related to air quality. While air quality data are in principle easily accessible from a variety of archives, there is often incomplete information and insufficient metadata to develop useful machine learning applications from these data. Furthermore, the harmonization of such data from different sources, which is needed to achieve a global picture of ozone air pollution, is a difficult and time-consuming task.
With the AQ-Bench dataset, we aim to fill this gap and provide a dataset of global long-term air quality metrics and metadata compiled from the TOAR database (Tropospheric Ozone Assessment Report, Schultz et al. (2017)). To make these data usable for machine learning developments, this paper also describes the specific task of mapping between the metadata and the air quality metrics. Our ready-to-use, fully documented dataset is freely available under the DOI https://doi.org/10.23728/b2share.30d42b5a87344e82855a486bf2123e9f (Betancourt et al., 2020). We also provide our baseline machine learning code at https://gitlab.version.fz-juelich.de/esde/machine-learning/aq-bench, offering a low-threshold entrance to machine learning in environmental science within a relevant research topic. In Sect. 2 of this paper we present the main factors affecting tropospheric ozone as the scientific background for the design of the AQ-Bench dataset. Section 3 introduces the TOAR data products from which AQ-Bench was constructed. In Sect. 4, we describe the dataset itself. Section 5 contains the machine learning task for AQ-Bench and three baseline experiments to evaluate the applicability of these data in the machine learning context. We discuss opportunities and challenges of AQ-Bench, and give problem-related expected difficulties, in Sect. 6. Information on data and code availability is given at the end of the paper.


Factors influencing tropospheric ozone
Ozone is a secondary pollutant that is formed from emissions of precursor substances and undergoes a variety of physical and chemical processes during its atmospheric lifetime. Figure 1 summarizes these processes, which are further elaborated in the following subsections. How the described processes translate into the data in AQ-Bench is described in the dataset description (Sect. 4).

Figure 1. Processes influencing tropospheric ozone, after Jacob (2000). See text for elaboration.

Ozone precursor emissions
The most important ozone precursors are nitrogen oxides, carbon monoxide and volatile organic compounds (denoted as NO x , CO and VOCs in Figure 1; note that NO x = NO 2 + NO). Many of these precursors are emitted by human activities, e.g. from traffic, industry and agriculture (Benkovitz et al., 1996;Field et al., 1992). NO x concentrations resulting primarily from combustion processes are especially high at very heavily polluted sites such as in city centers or near power plants. Industrial and traffic pollution are closely related to energy consumption depending on population density and economical activities.

Agricultural machinery emits similar trace gases as traffic or industry. Moreover, agricultural plants are often fertilized, which adds more trace gas emissions (Veldkamp and Keller, 1997). In addition to emissions from human activities, several processes in nature also lead to emissions, especially of VOC compounds. For example, plants emit VOCs which are often more reactive (and could therefore produce more ozone) than VOCs emitted from human activities. The exact emission patterns vary among the types of plants and are thus related to land cover. Agricultural fields, forests and grasslands therefore yield different magnitudes and seasonal cycles of VOC emissions (Simpson et al., 1999).
Emissions can also occur from oceans, barren land and snow or ice covered surfaces. For example, the latter emit substantial quantities of NO x in Arctic regions (Wang et al., 2007).

Ozone chemistry
The daily average ozone volume mixing ratio varies between about 10 and 100 ppbv (parts per billion by volume), and ozone has a lifetime of days to weeks (Wallace and Hobbs, 2006). Ozone has practically no direct emissions but is exclusively formed through atmospheric chemical reactions. The chemical processes leading to ozone formation are driven by ultraviolet radiation (denoted with hν in Fig. 1). At wavelengths < 0.43 µm, photons convey enough energy to break chemical bonds in nitrogen dioxide (NO2) molecules. This process (photodissociation) leads to the formation of nitric oxide (NO) and a free oxygen radical (O). NO is also a radical and thus recombines quickly, while O collides with high probability with O2 and forms O3. The produced O3 is removed rapidly when it reacts with NO to form NO2 + O2. These reactions form a null cycle, because O3 is both created and destroyed. The cycle stabilizes at a certain O3 concentration, depending on the available NO2, the ultraviolet light intensity and the temperature. Up to a certain point, the ozone concentration rises with increasing NO2 concentrations.
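The null cycle described above can be summarized by three reactions (a standard textbook formulation, cf. Jacob (2000), the source of Fig. 1; M denotes an inert collision partner that carries away excess energy):

```latex
\begin{align}
\mathrm{NO_2} + h\nu &\longrightarrow \mathrm{NO} + \mathrm{O} \\
\mathrm{O} + \mathrm{O_2} + \mathrm{M} &\longrightarrow \mathrm{O_3} + \mathrm{M} \\
\mathrm{O_3} + \mathrm{NO} &\longrightarrow \mathrm{NO_2} + \mathrm{O_2}
\end{align}
```

The first and third reactions make the cycle closed: no odd oxygen is gained or lost, which is why the equilibrium concentration of O3 is set by the NO2 supply, light intensity and temperature.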
The dynamic equilibrium of this cycle can be altered by the presence of VOCs and CO (denoted as primary emissions in Fig. 1), which provide chemical pathways to convert NO to NO2 by oxidation without the destruction of O3 (the oxidized pollutants are denoted as HO2 and RO2 in Fig. 1). This leads to a non-linear system, where O3 concentrations depend on the ratio of VOC + CO to NOx (= NO + NO2) concentrations. During daytime, O3 can photodissociate and recombine with water vapor (H2O in Fig. 1), thereby forming hydroxyl radicals (OH in Fig. 1) which fuel a large share of atmospheric oxidation.
There are several thousand chemical reactions occurring in the atmosphere which need to be considered for an adequate description of ozone formation and loss processes, and Fig. 1 only provides a very small glimpse of this rather complex system.

For more details on ozone chemistry we refer to Brasseur et al. (1999).

Transport and loss processes
During its atmospheric lifetime, O3 can be transported on spatial scales of hundreds or even thousands of kilometers (Schultz et al., 1999), until it is removed via atmospheric chemical reactions and deposition (indicated with downward-pointing arrows in Fig. 1). Primary chemical loss of O3 is rather indirect, via removal of NO2 in polluted regimes and via radical-radical reactions in clean environments with low NO2 concentrations. Besides chemical loss, O3 can be removed by deposition on surfaces, especially on the leaves of natural or agricultural plants (Emberson et al., 2000). Ozone irreversibly damages plant tissue when the plant leaves take it up (Schraudner et al., 1997), leading to reduced crop yields. Ozone deposition on water surfaces is relatively slow, but due to their large extent, this process also matters in the context of the global ozone budget.

Interconnected factors
In the following, we describe by example how the influences of ozone precursor emissions, chemistry, transport and loss (described in Sects. 2.1-2.3) can come together. The combination of chemistry and transport of air pollutants favors ozone formation downwind of sites with high precursor emissions. A typical example is summertime rural areas downwind of larger city centers, where peak ozone values can often be observed (Xu et al., 2011). In the close vicinity of power plants or in city centers, NOx is often very high and low ozone levels are observed (Sillman, 1999).
There are several geographical factors which determine the rates of chemical formation and loss of ozone. These factors can result in different mixes of ozone precursor emissions, varying reaction rates and varying rates of deposition. For example, the climate in a certain location determines the vegetation cover and the local weather. Since temperatures near the equator are high and more intense sunlight is available, ozone levels are generally higher there than near the poles. Moreover, at higher altitudes the air is generally cooler and drier, which leads to changes in reaction rates. Local flow patterns can also influence the ozone concentration, for example through the transport of air masses from valleys to mountain tops (Kaiser et al., 2007).
Besides natural geographic factors, political decisions can also influence ozone formation. Many governments and decision makers worldwide strive to reduce air pollution by emission regulation, but these regulations differ between countries and may be implemented with more or less rigor. Because of the chemical cycles described in Sect. 2.2, regulating ozone is more difficult than regulating primary air pollutants, as both VOC and NOx emissions have to be limited in order to control ozone.
Although ozone has a rather long lifetime, the local ozone concentration can change substantially in a matter of minutes and on scales of meters (e.g. in a street canyon), but it can also remain stable across hundreds of kilometers and for several weeks (e.g. at higher altitudes over the oceans). The "radius of influence" within which ozone is determined by nearby precursor emissions and deposition surfaces is typically about 25 km in mid-latitude areas (European Union, 2008). All in all, ozone concentrations measured at a station are determined by many interconnected influences from precursor emissions, land use / land cover and the local weather conditions. Many of these factors are poorly quantified, and often the interconnections are not well understood yet (Schultz et al., 2017). With AQ-Bench and the machine learning task described below, we want to explore a novel way of using a multitude of geographical features to predict ground-level ozone around the world. The details of data selection are described in Sect. 4, while the machine learning task is provided in Sect. 5.1.

TOAR data products
The TOAR database (Schultz et al., 2017) was created in the context of the Tropospheric Ozone Assessment Report (TOAR).

It contains one of the world's largest collections of near-surface ozone measurements, gathered from public bodies, research institutions and air quality networks all over the world. TOAR data products enabled the first comprehensive global assessment of the tropospheric ozone distribution and trends (Schultz et al., 2017; Fleming et al., 2018; Gaudel et al., 2018; Lefohn et al., 2018; Chang et al., 2017; Young et al., 2018; Mills et al., 2018; Tarasick et al., 2019; Xu et al., 2020). In the spirit of FAIR data usage (Wilkinson et al., 2016), these data products are openly available via the JOIN graphical interface (https://join.fz-juelich.de/), a REST interface (https://join.fz-juelich.de/services/rest/surfacedata/), and through the PANGAEA repository (https://doi.org/10.1594/PANGAEA.876108). For the AQ-Bench dataset we selected and harmonized air quality metrics and metadata from TOAR (see Sect. 4 and Appendix C). This section therefore describes these selected data products, introducing the concepts of metrics and metadata.

Air quality metrics
The TOAR database contains hourly ozone measurements, transmitted from air quality observation sites. The data providers conduct quality control on these data by calibrating the measurement devices and setting suitable instrument parameters. In a second step of data curation, the TOAR database administrators conduct a statistical analysis of the data to identify and remove low-quality data (Schultz et al., 2017). These hourly data are usually aggregated into statistics or "metrics" for further analysis. Ozone metrics consolidate the air quality properties of longer time series (e.g. a season or a year) in a single figure, which can then be directly used for scientific assessment and in decision making. Longer aggregation periods also average out short-term weather fluctuations. There are specific metrics for different areas of ozone impact assessment and control (respiratory and cardiovascular disease, vegetation damage, climate impacts).
The JOIN web service is connected to the TOAR database and provides more than 30 of the most frequently used metrics as data products, calculated on demand from the hourly data. Besides these specialized metrics, basic statistics such as the average, median and percentiles are also available in JOIN. In the context of evaluating air quality, the validity of reported ozone metrics hinges on the data capture. Typically, statistical aggregations (i.e. metrics) of air quality data can only be used for decisions on attainment or non-attainment of air quality standards if at least 75 % of the (hourly) samples in a dataset were reported. In this sense, the validity of ozone metrics is tied to data completeness, and we will use the term "valid data" to indicate samples with sufficient coverage of accurate data. All metrics which are part of AQ-Bench are listed in Table 2 of the next section. Documentation and further information on all available metrics, including data capture criteria, are available in Schultz et al. (2017) and Lefohn et al. (2018).
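The 75 % data capture criterion can be sketched as follows. This is an illustrative simplification with a placeholder function name; the TOAR database applies metric-specific capture criteria, as noted above.

```python
import numpy as np


def yearly_metric(hourly_values, capture_threshold=0.75):
    """Aggregate one year of hourly ozone values (NaN = missing sample)
    into a yearly mean. Returns None if the fraction of reported samples
    falls below the capture threshold, i.e. the metric is not "valid".
    Illustrative sketch, not the exact TOAR implementation."""
    hourly_values = np.asarray(hourly_values, dtype=float)
    capture = np.count_nonzero(~np.isnan(hourly_values)) / len(hourly_values)
    if capture < capture_threshold:
        return None  # insufficient data capture: metric not reportable
    return float(np.nanmean(hourly_values))
```

For example, a year with only 70 % of hourly samples reported yields no valid metric, while a year with 90 % coverage yields the mean of the reported values.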

Station metadata
The TOAR database also contains geographical information on the air quality measurement station locations, i.e. station metadata. Metadata give background information on the measurement site where the data were retrieved, and thus enable a characterization of the location. These metadata are collected from different sources. Some, for instance the station coordinates and altitude, are given by the data providers and quality controlled by TOAR. Others were derived from data sources with individual quality control, such as satellite earth observations. For a complete list of the available metadata attributes, see Schultz et al. (2017) and the REST interface.
For the AQ-Bench dataset described in this paper, we selected metadata from the TOAR database which characterize measurement locations and their surroundings with respect to pollution-relevant properties as introduced in Sect. 2. They are listed in Table 1 of the next section.

AQ-Bench dataset description
The AQ-Bench dataset consists of metadata and aggregated ozone metrics from the years 2010-2014 at 5577 measurement stations all over the world, compiled from the TOAR database. The point of interest is to determine the resulting ozone metrics (see Sect. 3.1) given all environmental influences (Sect. 2) represented by metadata (Sect. 3.2). Our contribution in data preparation is to select metadata with expert knowledge, relate them to processes, and aggregate air quality data to metrics in a way that is representative for long time periods and meaningful in a machine learning context.
Three key points in the conception of this benchmark dataset are: 1) As targets, we use air quality metrics aggregated over five years. These are not influenced by short-term weather and emission forcings, but by site conditions on the climatological time scale. 2) Many known environmental influences on ozone act on short time scales (see Sect. 2), but we aim to predict long-term air quality conditions at the sites. Thus, we have identified which station metadata are the climatological representations of these short-term forcings. 3) We use a, to our knowledge, unprecedented variety of metadata that contains diverse information about environmental influences on the climatological scale. These metadata are sometimes not directly descriptive of the influences, but rather proxies for them. Machine learning must be leveraged to relate these proxies to air quality metrics.

This aggregated, climatological approach makes it possible to cover air quality data over a long period of time on the global scale with a relatively small and compact dataset. The aggregated data account for the long-term air quality conditions at a site, while daily or hourly influences on ozone variations are not considered. Figure 2 gives an overview of all TOAR air quality monitoring stations included in AQ-Bench.

Station metadata
A summary of the metadata in AQ-Bench is given in Table 1. The data originate from the TOAR database (Sect. 3), yet some data were harmonized; see Appendices A and C for details on the data sources and their harmonization for machine learning purposes. The metadata contain proxies for environmental influences on ozone on the climatological scale. In the following, we give two examples.
As mentioned in Sect. 2, ozone is influenced by weather. Likewise, ozone on longer time scales is influenced by climate.

One variable in the AQ-Bench dataset is the climatic zone in which the site is located. The climatic zone provides simplified information about the climatic conditions at a location, for example whether it is hot or cold, humid or dry, or tropical.
A second example is ozone precursor emissions. In Sect. 2.1 we outlined that they are emitted, for example, by traffic and other human activities. This means that the population density at a site is a good proxy for these activities. A second, more subtle, proxy is the stable nightlight at a location. It is the average intensity of light during night as seen from space, an indicator of industrial activity. In Sect. 2.2, we pointed out that ozone is often formed downwind of sites with high human and industrial activity. Therefore, in the AQ-Bench dataset, we give not only the population density and stable nightlights at a site, but also related statistics of the closer surroundings. One example is the maximum population density in a radius of 5 km around the station.
All variables of the AQ-Bench dataset can be related to environmental influences on the climatological time scale. We indicate the proxies in the right column of Table 1. Machine learning can make use of these proxies, even if they are not directly related to ozone concentrations.

Ozone metrics
There are therefore two steps involved in obtaining the metrics: 1) computing up to five yearly metrics for 2010-2014 from hourly measurements, including data capture criteria to validate the metrics, and 2) averaging over these five years. If fewer than two yearly values are available, the value is considered missing. Missing values are denoted with -999 in the dataset.
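The second step, averaging the valid yearly values, can be sketched as follows (an illustrative implementation of the rule just described; the function name is a placeholder):

```python
import numpy as np

MISSING = -999.0  # missing-value marker used in AQ-Bench


def aggregate_metric(yearly_values):
    """Average up to five yearly metric values (2010-2014).
    Years failing the data capture criteria are passed as None.
    If fewer than two valid yearly values remain, the aggregated
    value is marked missing, following the rule described in the text."""
    valid = [v for v in yearly_values if v is not None]
    if len(valid) < 2:
        return MISSING
    return float(np.mean(valid))
```

For instance, a station with valid yearly means for only 2010 and 2012 still receives an aggregated value, while a station with a single valid year is marked as missing.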
Some suspiciously high values were sorted out, as documented in Appendix C. A summary of all metrics and their data capture criteria is given in Table 2. More details on the process of ensuring robustness through data capture are given in Appendix B. A yearly value is marked as missing if fewer than 75 % of the days contain data.

Validating AQ-Bench via machine learning
In this section, we introduce the AQ-Bench dataset as a machine learning benchmark dataset. This means we combine the data documentation from the previous section (Sect. 4) with the machine learning task for this dataset. We also provide an evaluation metric, a data split and baseline experiments.

Task description and evaluation metric
The task proposed for the AQ-Bench dataset is to train a machine learning model that maps from the metadata in Table 1 to the ozone metric values in Table 2. This can be achieved with individual machine learning algorithms for each metric or with one multi-output algorithm.

The evaluation metric for our baselines is R^2, the coefficient of determination,

R^2 = 1 - \frac{\sum_{m=1}^{M} (y_m - \hat{y}_m)^2}{\sum_{m=1}^{M} (y_m - \bar{y})^2},

where m denotes a sample index, M the total number of samples, \hat{y}_m a predicted output value, y_m a reference target value and \bar{y} the mean of the reference values.
R^2 measures the proportion of variance in the output values that the model predicts from the input values. A larger R^2 thus denotes a better model, and the largest possible value is 1, or 100 %. We choose R^2 as it is comparable between all different targets, even if they cover different value ranges. The overall score of a solution is the mean of the scores achieved on the test set over all ozone metrics. For further evaluation of machine learning results, cross validation can be applied. We would like to challenge machine learning and air pollution researchers to use this rather small dataset as efficiently as possible and extract all inherent information to accurately map onto the ozone metrics.
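The score computation can be sketched as follows (function names are illustrative):

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot


def overall_score(r2_per_metric):
    """Overall benchmark score: mean test-set R^2 over all ozone metrics."""
    return float(np.mean(r2_per_metric))
```

A perfect prediction yields R^2 = 1, while a model that always predicts the mean of the reference values yields R^2 = 0; models worse than that produce negative scores, which indeed occur for some baselines below.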

Data split
We provide a fixed data split within the AQ-Bench dataset to enable a comparison of our baseline results with future solutions, and to provide a suitable data setup for learning (see below). As is good practice in machine learning, the dataset is split into three subsets for training, validation (hyperparameter tuning) and testing. The three data subsets are required to be independent, while having a similar statistical distribution, to prevent the concealment of possible overfitting and an overestimation of accuracy. Because the dataset is relatively small, the split was chosen to be 60/20/20 %, as is commonly used for datasets of this size. It is indicated in the dataset whether an example belongs to the training, validation or test set.
In order to guarantee the spatial independence of the subsets, the data are divided into several spatial zones. The zones were created by spatial clustering, where stations are assigned to the same cluster if they are closer than 50 km to each other (European Union, 2008). We chose 50 km because at this distance the defined areas of influence of the stations do not overlap. Large station clusters were split again into smaller ones to ensure similar statistical distributions of the training, validation and test datasets. The final clusters were randomly assigned to the three datasets. This way, all stations within a spatially dependent cluster are allocated to the same dataset.
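The clustering step can be sketched with a simple single-linkage procedure over great-circle distances. This is an illustrative simplification, assuming a brute-force pairwise comparison; the subsequent splitting of large clusters and the random assignment to the three subsets are omitted here.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometers."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def spatial_clusters(lats, lons, threshold_km=50.0):
    """Single-linkage clustering: stations closer than threshold_km end up
    with the same cluster label. O(n^2) sketch of the split procedure."""
    n = len(lats)
    labels = list(range(n))  # every station starts in its own cluster
    for i in range(n):
        for j in range(i + 1, n):
            if haversine_km(lats[i], lons[i], lats[j], lons[j]) < threshold_km:
                old, new = labels[j], labels[i]
                # merge cluster j into cluster i
                labels = [new if lab == old else lab for lab in labels]
    return labels
```

Whole clusters, rather than individual stations, would then be assigned randomly to the training, validation and test subsets, so that spatially dependent stations never end up in different subsets.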

Baseline experiments
As baselines for machine learning approaches on the AQ-Bench dataset, we present results obtained with three standard machine learning algorithms. For preprocessing, rows with missing values are dropped. Continuous metadata are scaled, each by the quantile range from 25 % to 75 %, to reduce the influence of outliers. Categorical metadata are one-hot encoded, resulting in 135 input features in total. We drop the longitude for our baseline experiments, since it is a circular variable and cannot be used without additional feature engineering. The preprocessed metadata are called input data in the following. The ozone metrics, which are the targets, are not scaled.
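The preprocessing steps above can be sketched with scikit-learn; the column names here are placeholders, and this is an illustrative sketch rather than the exact baseline code from the repository:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, RobustScaler


def preprocess(df, continuous_cols, categorical_cols, missing=-999.0):
    """Baseline preprocessing sketch: drop rows with missing values, scale
    continuous metadata by their 25-75 % quantile range (robust to outliers)
    and one-hot encode categorical metadata."""
    df = df.replace(missing, np.nan).dropna()
    scaler = RobustScaler(quantile_range=(25.0, 75.0))
    x_cont = scaler.fit_transform(df[continuous_cols])
    encoder = OneHotEncoder(handle_unknown="ignore")
    x_cat = encoder.fit_transform(df[categorical_cols]).toarray()
    return np.hstack([x_cont, x_cat])
```

Applied to the full AQ-Bench metadata (minus longitude), this kind of pipeline yields the 135 input features mentioned above.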

Methods:
- Linear regression. Linear regression models the simplest relationship between input and target values. It maps an input example x_m to a prediction via

\hat{y}_m = \mathbf{w}^T \mathbf{x}_m + b,

where w and b are the regression parameters (weights and bias). The weight vector w = [w_1, w_2, ..., w_N]^T has the same dimension N as the input vector x_m = [x_1, x_2, ..., x_N]^T.
- Neural network. We train a shallow fully connected neural network with two hidden layers of 20 and 5 neurons, respectively. We use the Adam optimizer with an MSE (mean squared error) loss function, L2 regularization and ReLU (rectified linear unit) activation functions (Goodfellow et al., 2016). Training is performed independently for each ozone metric. We optimized the learning rate and regularization parameter by empirical studies and random search.
The hyperparameters for the individual targets are summarized in Appendix B. The model is written in Tensorflow/Keras (Chollet et al., 2015).
- Random forest. Our random forest model (Breiman, 2001) is built with 100 trees for each target, based on empirical studies. As in the case of the neural network, we use the MSE as the optimization criterion. We use the RandomForestRegressor of scikit-learn (Pedregosa et al., 2011).
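As an illustration, the two nonlinear baselines can be sketched with scikit-learn. The MLPRegressor here is a lightweight stand-in for the paper's Tensorflow/Keras network with the same architecture, and the hyperparameter values shown are placeholders, not the tuned ones from Appendix B:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor


def make_baseline_nn(learning_rate=1e-3, l2=1e-4, seed=0):
    """Fully connected network with two hidden layers (20 and 5 neurons),
    ReLU activations, Adam optimizer, L2 regularization and an MSE loss,
    mirroring the architecture described in the text."""
    return MLPRegressor(hidden_layer_sizes=(20, 5), activation="relu",
                        solver="adam", alpha=l2,
                        learning_rate_init=learning_rate,
                        max_iter=2000, random_state=seed)


def make_baseline_rf(n_trees=100, seed=0):
    """Random forest with 100 trees per target; the default split criterion
    of RandomForestRegressor is the MSE ("squared_error")."""
    return RandomForestRegressor(n_estimators=n_trees, random_state=seed)
```

Each model is then fitted independently per ozone metric, e.g. `make_baseline_rf().fit(x_train, y_train)`, and evaluated with the R^2 score on the test split.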
The baseline results are summarized in Table 3. Comparing the different models, the random forest yields the best results for all targets except the nvgt metrics, where the neural network performs best. The linear regression is the worst for all targets except, e.g., the 75th percentile, where it is second best after the random forest. For some targets, e.g. average values, the random forest is only slightly better than the neural network. However, there are targets, e.g. AOT40, where the gap between the two methods is almost 10 %. The results for nvgt100 drop in comparison to other targets, with partly negative R^2 scores, and nvgt070 achieves the second-lowest scores. These two targets count exceedances of a certain threshold, so that many values equal zero, which might be difficult for standard machine learning algorithms to capture. Apart from these, R^2 is higher than 50 % for at least one of the three models per target.

This shows that there is a quantitative relationship between input data and targets. Nevertheless, for our baseline experiments we used rather simple models in order to prove the concept. Ozone, as a secondary pollutant with levels highly dependent on the environment and the available precursors, is not captured perfectly by these simple baselines.


Opportunities and challenges of AQ-Bench
With the AQ-Bench dataset, we used our knowledge of environmental influences on ozone, a toxic greenhouse gas, to bundle air quality data and metadata for machine learning approaches. By doing this, we enable a quick entry into machine learning in air quality research on a global scale with reduced overhead. Our approach enables the use of data from various sources that would otherwise be time-consuming to acquire and prepare. We provide a ready-to-use dataset for the machine learning community, to support research on meaningful real-world applications (motivated by Wagstaff, 2012).

One great advantage of using machine learning for air quality research is the possibility to use data from various sources, especially data which are not directly connected to air pollution via physical or biogeochemical models (e.g. stable nightlights). To explore this opportunity for ozone, we gathered an unprecedented variety of metadata to allow machine learning approaches to pick up hints on the many interconnected, nonlinear influences which determine ozone concentrations (see Sect. 2). As the results from our baseline experiments show, the AQ-Bench dataset bears some potential to exploit these relations with machine learning methods.
Currently, not many air pollution researchers use purely data-driven approaches in their studies. With AQ-Bench we offer a first data-driven machine learning view on global tropospheric ozone. To achieve the global view, we use the JOIN web interface of the TOAR data center, which provides customized data products from the TOAR database and thereby makes them accessible for machine learning. Further applications of AQ-Bench could be developed, such as a classification of ozone sites into 'healthy' or 'unhealthy'. Our dataset fits with the vision for benchmark datasets described by Ebert-Uphoff et al. (2017).

Limitations of AQ-Bench
AQ-Bench includes ozone metrics and metadata from 5577 stations and spans a time period of five years. The stations included in AQ-Bench are not distributed equally around the globe. The spatial coverage in most regions is low, except in the USA, European countries and some regions of East Asia (Japan and South Korea). This raises the question of whether it is possible to generalize machine learning results to regions that are not included in the training data, even if they have similar input metadata. It may be necessary to use a combination of observational data and numerical models to achieve full global coverage (cf. Chang et al., 2017).
Measurement errors, interannual changes and drift result in noisy ozone metrics. Conversely, at least in the current version of AQ-Bench, the input metadata is fixed and has no temporal evolution, an assumption we can make because we average over five years of ozone metrics. It cannot be ruled out that major environmental changes happened within this time; settlements, for example, could grow or shrink. This means that metadata as given in AQ-Bench might not be valid for the whole five-year period: the population density might have increased, the climate zone might have changed, and if a forest was cleared, the land cover would have changed as well. We note that some uncertainty in the metadata is therefore unavoidable.

Another topic is the complexity of the problem compared to the dataset size. It is doubtful whether simple machine learning models are intricate enough to grasp all complex relationships between ozone and environmental factors. On the other hand, very deep neural networks, which may be capable of learning such patterns, cannot be trained on a dataset with only 5577 samples. In Sect. 5.3 we gave some basic machine learning approaches to find a mapping between the metadata and the target ozone metrics. Since the complexity of the machine learning approach has to match the data complexity, we assume that the inaccuracies in our baselines partly arise from the complex relationships of ozone with the environment compared to the input dataset size and the complexity of these basic machine learning approaches. Furthermore, through a longer aggregation period, we emphasize robust, static features. This aggregation reduces the size of the dataset and makes global coverage possible. Due to our focus on spatial relationships we consciously ignore time-resolved patterns.
Through aggregation, we simplify the problem and make machine learning on the dataset easy, but this simplification also comes at the cost of introducing noise and uncertainties. A diversity of geographical data is available, so additional variables could be identified to capture this complexity. Two examples are the distance to the next coastline or major roads in the close vicinity of a station. Furthermore, for a more complete description of ozone processes, additional input variables and time-resolved data could be used.

Machine learning challenges arising from AQ-Bench
In order to provide some guidance on how the machine learning results could be improved compared to the standard machine learning methods applied in our baselines (Sect. 5.3), we briefly discuss some techniques here. One aspect to explore is feature engineering. Currently, AQ-Bench includes, for example, the circular variable longitude, which cannot be used meaningfully by a machine learning algorithm without further feature engineering. Other variables could be combined or transformed to improve machine learning results. See e.g. Duboue (2020) for an introduction to the topic. We hope that the research community will be creative in feature engineering.
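A standard way to make a circular variable such as longitude usable is a sin/cos encoding. The sketch below is one possible treatment, not a prescription from the dataset authors:

```python
import numpy as np

def encode_circular(values_deg):
    """Encode a circular variable (e.g. longitude in degrees) as a
    sin/cos pair, so that -180 and +180 degrees map to the same point."""
    rad = np.deg2rad(np.asarray(values_deg, dtype=float))
    return np.sin(rad), np.cos(rad)

# -180 deg and +180 deg describe the same meridian; after encoding they
# receive the same two feature values, which a raw longitude column lacks.
lon_sin, lon_cos = encode_circular([-180.0, 0.0, 180.0])
```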

Another aspect is multi-task learning. The baseline methods were performed independently for each ozone metric, but there may be connections between them, as they all describe ozone pollution. Therefore, multi-task learning is a promising direction to exploit these connections; see Zhang and Yang (2017) for a review on this topic. To enable a machine learning quick start on the AQ-Bench dataset with reproduction of the baseline experiments, we also provide an introductory Jupyter notebook at https://gitlab.version.fz-juelich.de/esde/machine-learning/aq-bench. To start it directly in your browser, click the button "launch on binder" in the readme of this repository.
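A minimal form of multi-task learning can be sketched with scikit-learn, whose random forest accepts a two-dimensional target and fits all columns jointly; deeper multi-task models would instead share hidden layers across tasks. The features and the two correlated targets below are synthetic stand-ins for AQ-Bench metadata and ozone metrics.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic features and two correlated targets standing in for two
# ozone metrics (illustrative data, not from AQ-Bench).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
Y = np.column_stack([X[:, 0] + X[:, 1], X[:, 1] - X[:, 2]])

# A 2-D target makes the forest predict both metrics with one model,
# exploiting shared structure between the tasks.
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, Y)
predictions = model.predict(X)  # one column per ozone metric
```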

Conclusions
In this paper, we introduced AQ-Bench as a benchmark dataset for machine learning on global air quality metrics. It allows the exploration of different machine learning methods on the real-world problem of air quality analysis. Specifically, the machine learning task is to map station metadata to air quality metrics at 5577 measurement stations around the globe and to optimize the results with hyperparameter tuning and data engineering. The usability of the dataset is documented through the results from our three baseline machine learning solutions. These methods show robust relations between the input data (geospatial features) and the targets (ozone metrics), and these relations are understandable from an atmospheric chemistry point of view. As data-driven techniques for air quality research are emerging, we present a first benchmark dataset on the global scale. The purpose and significance of AQ-Bench is twofold: first, to our knowledge it has not been tried before to exploit a rich collection of geospatial datasets to find out which fraction of ozone pollution can be attributed to such more or less static geographical features. Second, this problem definition makes some low-level air quality analysis easily accessible to data scientists with little or no background in atmospheric chemistry. Following the vision of Ebert-Uphoff et al. (2017) to design benchmarks that bridge geoscience and data science, the key features of AQ-Bench are: -Active research area: Ozone is a highly relevant and active field of research, as it harms living beings and the ecosystem.
Ozone research benefits from making data available and developing data driven methods for ozone assessment.
-Understandable context: We introduced the complex mechanisms behind ozone formation as well as physical and chemical processes in Sect. 2, to make the scientific context of this dataset understandable to everyone, even without prior knowledge.
-Impact on data science: Since AQ-Bench is relatively small and thus easy to handle, it is suitable for beginners in programming. Models can be trained on AQ-Bench in less than a minute on a common personal computer without GPUs, so one can quickly iterate through different algorithms and configurations. Yet noise, the small size of the dataset and the complicated underlying processes make it challenging to achieve satisfactory machine learning results on this dataset.

-A means to evaluate success: We propose R², the coefficient of determination, as an evaluation metric for AQ-Bench.
It is a suitable metric because it measures the proportion of variance in the output values that the model predicts from the input values. It is comparable between all targets.
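The metric can be written out in a few lines; the helper name below is ours, and the formula is the standard definition of the coefficient of determination:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total variance
    return 1.0 - ss_res / ss_tot

# A perfect model scores 1; always predicting the mean scores 0; models
# worse than the mean go negative, as observed for nvgt100 above.
```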
-Quick start: To start machine learning on AQ-Bench in a common browser, launch the "binder" in the following Git repository: https://gitlab.version.fz-juelich.de/esde/machine-learning/aq-bench. Running the introductory notebook on binder enables users to try out different training algorithms and hyperparameters directly in the browser.
-Citability and reproducibility: The dataset has a DOI, and the baseline experiments can be reproduced with the code that is openly available on GitHub (see Sect. 7).
We hope that the AQ-Bench dataset will help to advance data-driven techniques in the field of air quality research, and form the basis for future experiments and research.

Data capture criteria and data cleaning
The data capture criteria applied in this work ensure robustness of the ozone metrics. Data capture criteria for hourly to annual metrics are applied through the JOIN web service (https://join.fz-juelich.de/), as described in Schultz et al. (2017). The 5-year mean and its data capture criterion were applied in this work. One exception is the average values metric, which does not have a data capture criterion in JOIN. Here we have verified that more than 2200 hourly values are processed to calculate the metric, and that the average hourly data capture of all stations is above 50 %. The flowchart below shows an example data capture criterion as applied in the AQ-Bench dataset. All data capture criteria are summarized in Table 2.
-The variable type was harmonized, as some types appear only once or twice. These types were replaced with the category they fit best:
-"agricultural", "commercial", "other-agricultural", "other-marine" were replaced with "other"
-"rural" was replaced with "background"
-"urban" was replaced with "unknown".
-The station with id 4587 was sorted out because it was a remote background station in Romania which reported one of the highest o3_average values of all stations (65.5899 ppb) and had a low data coverage. We suspect these values are faulty.
-The station with id 4589 was sorted out because it reported a max_population_density_5km of ca. 1 million per square kilometer, which we suspect is faulty.
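The harmonization and cleaning steps above can be sketched as follows. The type mapping and the removed station ids are taken from the text; the toy station table and its column names are illustrative.

```python
import pandas as pd

# Mapping of rare station types to their harmonized categories, as listed above.
type_map = {
    "agricultural": "other",
    "commercial": "other",
    "other-agricultural": "other",
    "other-marine": "other",
    "rural": "background",
    "urban": "unknown",
}

# Toy station table for illustration.
stations = pd.DataFrame({
    "id": [1001, 4587, 4589, 1002],
    "type": ["rural", "background", "urban", "commercial"],
})

# Harmonize the type variable, then drop the two stations with suspect values.
stations["type"] = stations["type"].replace(type_map)
stations = stations[~stations["id"].isin([4587, 4589])]
```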