the Creative Commons Attribution 4.0 License.
AERO-MAP: A data compilation and modelling approach to understand the fine and coarse mode aerosol composition
Abstract. Aerosol particles are an important part of the Earth system, but their concentrations are spatially and temporally heterogeneous, as well as variable in size and composition. Aerosol particles can interact with incoming solar radiation and outgoing long wave radiation, change cloud properties, affect photochemistry, impact surface air quality, and when deposited impact surface albedo of snow and ice, and modulate carbon dioxide uptake by the land and ocean. High aerosol concentrations at the surface represent an important public health hazard. There are substantial datasets describing aerosol particles in the literature or in public health databases, but they have not been compiled for easy use by the climate and air quality modelling community. Here we present a new compilation of PM2.5 and PM10 aerosol observations including composition, a methodology for comparing the datasets to model output, and show the implications of these results using one model. Overall, most of the planet or even the land fraction does not have sufficient observations of surface concentrations, and especially particle composition to understand the current distribution of aerosol particles. Most climate models exclude 10–30 % of the aerosol particles in both PM2.5 and PM10 size fractions across large swaths of the globe in their current configurations, with ammonium nitrate and agricultural dust aerosol being the most important omitted aerosol types. The dataset is available on Zenodo (https://zenodo.org/records/10459654, Mahowald et al., 2024).
Withdrawal notice
This preprint has been withdrawn.
Interactive discussion
Status: closed
RC1: 'Referee Comment on essd-2024-1', Anonymous Referee #1, 20 Mar 2024
General comments
"AERO-MAP: A data compilation and modelling approach to understand the fine and coarse mode aerosol composition" aims to describe a new compilation of global aerosol observations including PM2.5, PM10 and PM components (BC, OC, SO42-, Al, NO3-, NH4+, Na, and Cl). The measurement data apparently consist of over 20 million observations taken at over 15,000 stations spanning 1986-2023. Such a dataset would be of high value to the climate and air quality modeling communities. However, the data under review have been temporally averaged in a way that severely limits their use for climate or air quality model evaluation or characterization of global aerosols. I discuss this further in my Specific Comments. In addition, the paper is lacking in details related to the "planning, instrumentation, and execution of collection of data" which should be included for an ESSD data description paper. Despite the large volume of observations being presented, the observation data section consists of only 4 paragraphs of the entire manuscript. The authors provide references for the observations in one of their data files, but there should be a more in-depth discussion of how instrument or sampling differences across the different data sources and across the years of the study period (which spans decades) could affect interpretation of the observed values. It would also be beneficial to state more explicitly how this data compilation differs from other data sources. The datasets themselves have several technical issues, which I discuss below.
A large portion of the paper is spent describing a global modeling study for 2013-2015, including detailed model specifications and an evaluation that compares the PM observed data to the model results. The authors state this component of the paper is included as "a methodology for comparing the datasets to model output, and show the implications of these results using one model." However, the observed data they use are not suitable for evaluating the global model output due to the incommensurability between the temporal averaging of the measurements and the 3 years of model simulations. Beyond that concern, the lengthy diagnostic evaluation section describes model biases and provides possible explanations for the differences between the model and observed data. This type of model application appears to deviate from the ESSD aim and scope for data articles and may be better suited for a technical science article. Specifically, the results section is counter to the ESSD policy: "Any interpretation of data is outside the scope of regular articles." and "Any comparison to other methods is beyond the scope of regular articles."
The model code and output are included as part of the data package, but no reason is given why the model output itself would be of high value to the modeling community. I recommend the new compilation of observed data be shared at higher temporal resolution, ideally the native temporal resolution of the measurements if possible, e.g., hourly or daily averages. I also recommend that the modeling component of the paper be removed and submitted to a different journal.
The authors state that they plan to provide more temporally resolved observed data in a future study but this is counter to the ESSD policy to "enable the reviewer and the reader to review and use the data, respectively, with the least amount of effort. To this end, all necessary information should be presented through the article text and references in a concise manner and each article should publish as much data as possible. The aim is to minimize the overall workload of reviewers (e.g. by reviewing one instead of many articles)".
Specific Comments
My largest concern is that the aerosol measurement data being shared have been averaged across all years of available data, which span multiple decades, 1986-2023. The only temporal information we are provided are variables called "PM_year_min" and "PM_year_max" and the number of observations taken between those two years. I mapped what this looked like for the global measurements and found that measurements taken in the US represent data taken across very different time intervals, with some sites starting in the 1980s and others starting in 2023. With the large changes in PM and its constituents across these decades, the years that go into the temporal average will provide a very different picture of US air quality. In addition, there is no information on the sampling frequency between these years. Although a total number of observations is given, this is not sufficient information for important questions such as whether the instrument at a given site had measurements throughout the year or only during certain parts of the year, or whether the measurements were daily, 1 in 3 days, 1 in 6 days, etc. Also, there is no start and end year provided for species other than PM. Looking globally we also see substantial differences in the temporal sampling. Almost all of the data in China begins after 2010, and a large portion of the data in India begins in 2023.
As a result, any spatial maps of these '1986-2023 average' aerosol measurements are essentially uninterpretable. The observed averages cannot be compared to model simulation data in any meaningful way, although the authors include such a comparison against a global climate model.
In addition to the .csv file with average measurements at each station, the authors have provided the temporally averaged observed data as spatially averaged gridded data over the model grid (~ 2° x 2°). I think the spatial averaging decreases the value of the dataset, and these spatial fields do not need to be included with the data package for ESSD. For example, in the PM25 gridded obs file there is a grid cell with 261 different stations and 725,414 observations and another grid cell with 1 station and 3 observations. Yet these two 'observed' values are treated equally in the evaluation of the model output, although they are vastly different in their representation of the 'true' air quality for their respective locations.
In summary, why not provide the data at the original temporal and spatial resolution? This would allow the individual researcher to decide how to best spatially and temporally aggregate the original data for their application.
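If the records were released at their native temporal and spatial resolution, the subsetting and ~2° gridding used in the paper could be reproduced by any user in a few lines; a minimal sketch with pandas, where the column names and values are hypothetical stand-ins for the real station records:

```python
import pandas as pd

# Hypothetical per-observation records (one row per station measurement);
# a native-resolution release would carry station IDs, dates and species.
obs = pd.DataFrame({
    "lat":  [40.1, 40.3, -3.2, 40.2],
    "lon":  [-75.0, -75.4, 37.9, -75.1],
    "pm25": [12.0, 18.0, 7.0, 15.0],
    "year": [2013, 2014, 2015, 2013],
})

# Restrict to the period of interest, then bin onto a ~2 degree grid.
sub = obs[obs["year"].between(2013, 2015)]
sub = sub.assign(glat=(sub["lat"] // 2) * 2, glon=(sub["lon"] // 2) * 2)
gridded = sub.groupby(["glat", "glon"])["pm25"].agg(["mean", "count"])
```

The `count` column would also make the representativeness of each grid cell explicit, addressing the 261-station vs 1-station imbalance noted above.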
Review of the data files
- The modelfiles.zip file contains a modelfiles.tar as well as the untarred files. Why the duplication?
- AEROMAPdata.csv
- All of the column headings need to be defined with units. The last columns in the file are missing column headings altogether.
- The AEROMAPdata.csv only has PM min and max year. Temporal information should also be included for the other species. When the PM measurement is missing there is no min and max year, so there is literally no temporal record for those obs. Even when there is a PM measurement available, I assume the temporal coverage can be quite different from species to species, so only including a min and max year is insufficient.
- The .csv includes negative longitude values as well as values up to 346. The measurement location information should be harmonized so that values either range from 0-360 or fall within +/- 180.
- There are records in the AEROMAPdata.csv where the PM min year is 0 and one record where the PM max year is 2167.
- There are locations in Brazil and the islands off the coast of Morocco in the PM2.5 gridded obs file that are not in the AEROMAPdata.csv file.
- The .nc files
- There needs to be more metadata than what is provided in the netcdf 'header' information or the README text file. All variables should be clearly defined: gw, dep, source, conc, depmonth, concmonth, finconcmonth, coarseconcmonth.
- The README file should also include the temporal averaging, i.e., 1986-2023.
- The README includes this statement "Fine is PM2.5 and coarse is Pm10-Pm2.5, and if not designated it is total PM10.". The reference to 'not designated' is too vague. I assume you are saying the variable 'conc' is PM10? Are there other variable names in the other files other than 'conc' that also refer to PM10?
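The longitude harmonization suggested above (either 0-360 or +/- 180) is a one-line transform in each direction; a minimal sketch (function names are illustrative):

```python
def to_pm180(lon):
    """Map any longitude value (e.g. -75 or 346) to the [-180, 180) convention."""
    return (lon + 180.0) % 360.0 - 180.0

def to_0360(lon):
    """Map any longitude value to the [0, 360) convention."""
    return lon % 360.0
```

Applying either function uniformly to the station table would remove the mixed-convention values such as 346.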
Technical Corrections
Main paper
- Title should include the temporal coverage (1986-2023)
- Title: Remove 'the'
- Line 15: Should be "Raleigh"
- Line 76: To keep this sentence parallel and avoid using 'impact' twice in your list, recommend changing this to "...impact surface air quality, change surface albedo of snow and ice when deposited, and modulate..."
- Line 82: I disagree that the paper provides "a methodology". The example evaluation has a lot of choices that are very specific to your model application.
- Line 90: Recommend rewording with a more active voice. Also, the qualifier 'Recent' is probably not needed before "IPCC reports".
- Line 103: Please add what "aerosol effects" is referring to in this sentence.
- Line 157: formatting issue
- Line 160: subscript needed
- Line 182: Please state explicitly where the time period information is included. I had to hunt for it before finding it in the AEROdata.csv file. The temporal information is also inadequate in the .csv file as discussed above.
- Line 201: formatting issue
- Line 203: Include the range of dates of the observations.
- Line 220: Using 2010 emissions will certainly impact your model's ability to accurately estimate air quality in 2013-2015. There are substantial changes in global air quality from 2010-2015, for example decreasing emissions from industrial sources in China beginning in 2013.
- Line 227: formatting issue
- Line 228: What is meant by "present day" in this context? The previous paragraph refers to 2010 emissions for 2013-2015 simulations.
- Line 240: What is meant by 'its importance can be isolated'?
- Line 241: O< should be OC
- Line 247: Sentence beginning with 'Assumptions…' is awkwardly worded, particularly the parenthetical statement, and could use some wordsmithing.
- Line 262: Awkwardly worded. How do particles 'interact with photochemistry'?
- Line 282: Sentence beginning with 'We do not choose..' should be edited for clarity, for example breaking it into 2 or more sentences.
- Line 289: one-to-one
- Line 303: What differences is this referring to, i.e., different between what two things?
- Line 312: Why is the lat/long data location so imprecise in some cases? In what geographic region does this occur?
- Line 314: Please reword sentence beginning 'Because of..' for clarity.
- Line 323: What is meant by 'most' of the observations and 'most' of the stations? Compared to what baseline?
- Line 324: Please restate the temporal coverage (1986-2023)
- Line 324: You are not providing annual averages, which would imply separate averages for each year. Rather, you are providing an average over 1986-2023 of all available data. At some sites you may not even have data for a single full year.
- Line 327: Missing parentheses.
- Line 330: Because these are temporal averages I disagree "this dataset presents a huge increase in the amount of data available to the aerosol modelling community"
- Line 333: These are not annual averages. The observations span decades and there is no information on whether some of the sampling is only done during certain times of year.
- Line 337: 'there could be differences in the model-data comparison because of the time period discrepancies.' Since you are not matching the model values in time to the observations, it is not that there 'could' be differences, there most certainly ARE differences. It is hard for me to understand what can be learned from the evaluation study when the emissions and observations are so mismatched.
- Line 346: Suggest rewording to avoid the repeat of 'highest modelled values' and 'high concentration values'
- Line 348: "These discrepancies over India and China could be due to errors in the input emissions datasets or the aerosol transport modelling, or to differences in the time periods covered: the observations are more recent while the assumptions for the emissions are for the year 2010." Yes this is a huge discrepancy given the trends in PM in these areas between 2010 and 2023. It is hard to see what other conclusions can be drawn from this comparison because of this mismatch
- Line 354: This is a valuable comparison across observation sources. It would be valuable to spend more of the paper summarizing and validating the different sources of observed data.
- Line 369: An r value of 0.25 (r^2 = 0.06) does not support the statement that the model is 'roughly able' to capture spatial patterns.
- Line 383: Please reword sentence beginning "As a proxy for dust" for clarity
- Line 400: "NO3- aerosol particles compared against available observations show that over 2 orders of magnitude, the model results are able to simulate the spatial variability (Fig. 4k and l). Note that here, we have multiplied the simulations by a factor 0.5 in order to achieve a good mean comparison, as indicated by Vira et al. (2022). " I strongly disagree with the approach of scaling a model result to 'achieve a good mean comparison' and then claiming that the model is able to simulate observations. The Vira et al. paper is documenting a model bias, not suggesting that model values should be scaled by a single factor across space and time.
- Line 408: The use of PM6.9 notation is non-standard and could create unnecessary confusion. Other air quality models calculate a cutoff for every data point based on the composition of the particles at that time and place.
- For an example from the Community Multiscale Air Quality (CMAQ) model, here is the equation to go from an aerodynamic diameter cutoff (i.e. PM10) to a Stokes diameter cutoff: https://github.com/USEPA/CMAQ_Dev/blob/24c0840315978f94541d8b6288163e7a54c8694d/CCTM/src/aero/aero6/aero_subs.F#L2737
- D_stokes = D_aerodynamic * (Rho0 / RhoP)^(1/2) * (Cc(D_aerodynamic) / Cc(D_stokes))^(1/2), where Cc is the Cunningham slip correction factor
- Rho0 is assumed to be 1.0 g cm-3
- RhoP is the particle density (i.e., there are different RhoP values for different sources such as dust and sea spray)
- CMAQ output is reported as PM10 (aerodynamic diameter space) rather than in Stokes diameter space and PM10 is the standard nomenclature for air quality modeling and evaluation against observations.
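The equation above is implicit, since Cc depends on the Stokes diameter being solved for, so in practice it is evaluated iteratively. A sketch of one way to do that (the slip-correction constants and mean free path are standard textbook values at room conditions, not taken from the CMAQ source):

```python
import math

MFP_AIR_UM = 0.0665  # mean free path of air (~1 atm, 20 C), in micrometres

def slip_correction(d_um):
    """Cunningham slip correction Cc for a particle of diameter d (micrometres)."""
    kn = 2.0 * MFP_AIR_UM / d_um  # Knudsen number
    return 1.0 + kn * (1.257 + 0.4 * math.exp(-1.1 / kn))

def stokes_from_aerodynamic(d_aero_um, rho_p, rho_0=1.0, n_iter=20):
    """Fixed-point solve of D_st = D_ae*sqrt(rho0/rhop)*sqrt(Cc(D_ae)/Cc(D_st))."""
    d_st = d_aero_um * math.sqrt(rho_0 / rho_p)  # first guess: ignore slip terms
    for _ in range(n_iter):
        d_st = d_aero_um * math.sqrt(rho_0 / rho_p) * math.sqrt(
            slip_correction(d_aero_um) / slip_correction(d_st))
    return d_st
```

For a dust-like density of ~2.6 g cm-3 this maps a 10 µm aerodynamic cutoff to roughly a 6.2 µm Stokes diameter, consistent with the PM6.9 figure discussed above.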
- Line 417: Please reword sentence beginning "Again we are including new.." for clarity.
- Line 431: Please reword sentence beginning "Similarly,"
- Line 437: "For the NO3- in aerosol particles, similar to the PM2.5 size, the particles were multiplied by 0.5 to better match the observations following Vira et al (2022)." Same comment as above. It is not appropriate for a model evaluation study to scale model results and then proceed to describe model 'performance' of the scaled values. Rather than this approach, you could summarize the evaluation in Vira et al (2022) and compare it to the biases you see in your study.
- Line 488: "The fidelity of the annual means provided by this study will depend upon the ability of the measurement networks to capture the observed multi-decadal increases and decreases of emission that vary between source regions and sectors (Quaas et al., 2022)."
- Again, you are not providing annual means but a multi-decadal mean. There is no way for a single average to capture trends in the data.
Figures
- All figures (including the supplement) should include the time period of the averaging in the caption, i.e., 1986-2023.
- Figure 1: The sources of the data should not be included in the figure caption but in the main text or as supplemental information.
- Figure 2a: The colors of the gridded observations do not show up due to the outline of the colored dots. It would be helpful if you could add a column that is just the obs data (and remove the black outline of the dots) and then have the middle column be just the model values.
- Figure 2a: You include the comment "see Supplemental dataset for more details". Since you have multiple supplemental datasets, please be explicit about where this information is located, i.e., name of file, where to find it in the file.
- Figure 2d: It is difficult to glean much from a scatter plot with over 7,000 points. It would be more helpful to see separate density scatter plots for each of the 7 regions where the color of the plot indicates the density of the points. This would allow the reader to more easily see consistent biases and correlation in each region.
- Figure 4: Same comment as Figure 2 for the observation overlays on the model maps.
- All figures: Using the log scale for model vs obs emphasizes smaller values and condenses the largest values. However, from a health and ecosystem standpoint the largest values are of greatest interest, and there is not much need to know the difference in model performance for 0.1 ug/m3 vs 0.01 ug/m3. I strongly suggest not plotting on a log scale. I also recommend a density scatter plot when the number of points exceeds several hundred.
- Figure 7. Include in the caption information the time period you are considering for the % coverage.
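The density-scatter recommendation above boils down to 2-D binning of the paired values before plotting; a minimal sketch with numpy (the data here are synthetic stand-ins for observed and modelled PM2.5):

```python
import numpy as np

# Synthetic paired values standing in for observed and modelled PM2.5.
rng = np.random.default_rng(0)
obs = rng.lognormal(mean=2.0, sigma=0.8, size=5000)
mod = obs * rng.lognormal(mean=0.0, sigma=0.4, size=5000)

# Bin the pairs on a linear scale; counts[i, j] is the number of points per bin.
counts, xedges, yedges = np.histogram2d(obs, mod, bins=50)
# Rendering is then e.g. plt.pcolormesh(xedges, yedges, counts.T), or use
# plt.hexbin(obs, mod) directly for hexagonal bins.
```

Binning on a linear axis, as suggested, keeps the high-concentration tail visible instead of compressing it as a log scale would.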
Citation: https://doi.org/10.5194/essd-2024-1-RC1
RC2: 'Comment on essd-2024-1', Anonymous Referee #2, 21 Mar 2024
General comments
This manuscript includes a comprehensive collection of aerosol composition data and provides a global overview to be compared with an earth system model. This is certainly an important effort, since there is little work from a global perspective that includes a complete analysis of aerosol composition; there is especially a lack of global information on the spatial distribution of coarse aerosols. A synthesis of data from a lot of different sources is a great effort by the authors.
However, I find the lack of transparency of the data sources and the choice of data for the model comparison concerning. It seems a bit arbitrary which data have been selected, and no reasoning is given for why those particular data were chosen. There is a mixture of data from shorter campaign periods and compilations of data from regional programmes with long time series, and these are not necessarily comparable when everything is aggregated. There are no references to or acknowledgement of the data from the large regional and global data repositories in the manuscript, from which I presume quite a lot of the data are collected, e.g. CAPMoN, CASTNET, EANET, EMEP, IMPROVE, INDAAF and GAW. I find it quite troublesome that the primary data repositories are not acknowledged or referred to better.
Further, it is not clear why (and maybe how) the observations have been gridded/aggregated. There can be huge differences in concentrations between sites and between years; thus, how representative are the average concentrations for the grid? E.g., what if you have 10 urban sites and one regional site within the same grid, and some data for 1 year and others for 20 years? The authors should try to choose representative sites for the grids.
I also find the selected time periods for the comparison between model and observations troublesome. The model results are for the period 2013-2015 with emissions from 2010, and this is compared to data from a very large time period, 1986-2023. I don't understand the usefulness of comparing these. The temporal changes in atmospheric chemistry have been large during this time period, with different trajectories for the various regions. Further, some years probably have much more data than others. I think you should only use observational data for the same period as the model ±2 years (e.g. 2010-2015), and not use data from all the way back to 1986.
Specific comments
As mentioned, the references to the data used need to be improved. An example of poor referencing is Alastuey 2016. This is a European campaign of dust measurements (2012-2013), but in the AEROMAP datafile this reference is used for data from Montseny only, and for a longer period (2012-2017). Data from EMEP is referred to as Tørseth et al., 2012, which is ok, though newer reviews have been done. More importantly, the data are collected from the data repository EBAS, and not from the paper itself. The reference to EBAS is lacking in the paper. It is stated in the csv file, but this is very much hidden. Further, quite a lot of data from EMEP is not included in the study for some reason. The Asian EANET data are available from the EANET data repository (https://www.eanet.asia/) and should not be referenced to Tørseth et al., 2012. Another example is from Canada, where the link to the data is not working: https://data.ec.gc.ca/data/air/monitor/national-air-pollution-surveillance-naps-program/Data-Donnees. These data are from Environment Canada, I assume? If so, https://open.canada.ca/en/open-data is probably more appropriate, or maybe not - it seems like you lack a lot of chemistry data from Canada? I only did some tests of references and links. There is a need for more checks by the authors.
- line 135-136: I certainly agree with the open data policy, i.e. that monitoring networks and those using the data should follow the FAIR data policy. However, I don't see that this paper contributes so much to that. FAIRness of data includes proper links (i.e. with DOIs) to the data providers, not compilations in files where there is poor tracking of the origin.
- Line 167. Following up on the comment above: the AEROMAP database should not be defined as the global database without looking into the data policy for the original data repositories. Duplicating repositories is not a good way forward. Certainly, it is great that you made the data you have used available, but the way it is written suggests the reader should go to AEROMAP to collect aerosol data in the future. This is not sustainable.
- 146-147. Do you constrain the model by observations? It is rather a comparison, not an adjustment of the model output. I don't find any text on how the model has been constrained by the measurements. In the next sentence the constraint is rather referred to as getting a full aerosol budget using tracers to estimate sea salt (Na) and dust (Al), but constraining the PM budget is something different? It would have been nice to show a spatial distribution plot with the relative contribution of e.g. SIA, organics, sea salt, and dust to PM10 and PM2.5, if possible.
- 179-181, 324-325 and Fig 1: It is not clear what is defined as the number of observations. E.g., for a 10-year time series of PM10, is it 1, 10, or 520 if there are weekly samples? Had it been hourly data, it would have been 87,600 data points. And how can you have 10^6 observations of OC? In Europe (the EMEP programme) there are ca 15 sites measuring OC and 10 measuring Al in aerosols, depending on the year, and most of these measurements are weekly; this does not sum up to millions of observations. It would make more sense to illustrate the number of annual datasets you use, or the number of comparison points (grids) with the model, which I assume is what is given in Table S4.
I find it very strange that you have almost the same number of sites for EC, OC and Al as for SO4, which is certainly measured at many more sites globally. Though you are maybe picking those sites which have all the constituents and not only one of them? The number of sites looks reasonable in Fig 1b and c, but I don't understand the number in Fig 1e. How can the number of sites "for each 2x2 grid box" be "shown as a dotted line"? It must be a different number for each grid box depending on the spatial distribution of the sites.
- 307-308 and repeated at 342: I don't understand the reasoning that the high number of sites will make the figure unreadable. The reason for gridding is rather to better compare with the model output?
Technical corrections
- Figure caption fig 1: the link https://app.cpcbccr.com/ccr/#/caaqm-dashboard-all/caaqmlanding/data does not work
- Line 227: Missing space between BC and follow
- 248: Is P phosphorus? The sentence in the parentheses is a bit strange.
- Line 241: O< is written instead of OC or OM
Citation: https://doi.org/10.5194/essd-2024-1-RC2
AC1: 'Comment on essd-2024-1', Natalie Mahowald, 02 Apr 2024
Thanks very much to the reviewers for their careful reading of the paper. As noted by one of the reviewers, and after consultation with the editors, we realized the paper is not appropriate for ESSD since it includes interpretation of the data as well as a model/data comparison. We will withdraw the paper, revise to incorporate the comments by the reviewers and submit to a more appropriate journal.
Citation: https://doi.org/10.5194/essd-2024-1-AC1
Interactive discussion
Status: closed
-
RC1: 'Referee Comment on essd-2024-1', Anonymous Referee #1, 20 Mar 2024
General comments
"AERO-MAP: A data compilation and modelling approach to understand the fine and coarse mode aerosol composition" aims to describe a new compilation of global aerosol observations including PM2.5, PM10 and PM components (BC, OC, SO42-, Al, NO3-, NH4+, Na, and Cl). The measurement data apparently consists of over 20 million observations taken over 15,000 stations spanning 1986-2023. Such a dataset would be of high value to the climate and air quality modeling communities. However the data that is under review has been temporally averaged in such a way that severely limits its use for climate or air quality model evaluation or characterization of global aerosols. I discuss this further in my Specific Comments. In addition, the paper is lacking in details related to the "planning, instrumentation, and execution of collection of data" which should be included for an ESSD data description paper. Despite the large volume of observations being presented, the observation data section consists of 4 paragraphs of the entire manuscript. The authors provide references for the observations in one of their data files but there should be a more in depth discussion on how instrument or sampling differences across the different data sources and across years of the study period (which spans decades) could affect interpretation of the observed values. It would also be beneficial to more explicitly state how this data compilation differs from other data sources. The datasets themselves have several technical issues which I discuss below.
The large portion of the paper is spent describing a global modeling study for 2013-2015 including detailed model specifications and an evaluation that compares the PM observed data to the model results. The authors state this component of the paper is included as "a methodology for comparing the datasets to model output, and show the implications of these results using one model." However the observed data they use is not suitable for evaluating the global model output due to the incommensurability between the temporal averaging of the measurements and the 3 years of model simulations. Beyond that concern, the lengthy diagnostic evaluation section describes model biases and provides possible explanations for the differences between the model and observed data. This type of model application appears to deviate from the ESSD aim and scope for data articles and may be better suited for a technical science article. Specifically the results section is counter to the ESSD policy: "Any interpretation of data is outside the scope of regular articles." and "Any comparison to other methods is beyond the scope of regular articles."
The model code and output is included as part of the data package but there is not a reason given why the model output itself would be of high value to the modeling community. I recommend the new compilation of observed data be shared with higher temporal resolution, ideally the native temporal resolution of the measurements if possible, e.g., hourly or daily averages. I also recommend that the modeling component of the paper be removed and submitted to a different journal.
The authors state that they plan to provide more temporally resolved observed data in a future study but this is counter to the ESSD policy to "enable the reviewer and the reader to review and use the data, respectively, with the least amount of effort. To this end, all necessary information should be presented through the article text and references in a concise manner and each article should publish as much data as possible. The aim is to minimize the overall workload of reviewers (e.g. by reviewing one instead of many articles)".
Specific Comments
My largest concern is that the aerosol measurement data being shared has been averaged across all years of available data which spans multiple decades, 1986-2023. The only temporal information we are provided are variables called "PM_year_min" and "PM_year_max" and the number of observations taken between those two years. I mapped what this looked like for the global measurements and found that measurements taken in the US represent data taken across very different time intervals with some sites starting years in the 1980s and others starting in 2023. With the large changes in PM and its constituents across these decades the years that go into the temporal average will provide a very different picture of US air quality. In addition, there is no information on the sampling frequency between these years. Although a total number of observations is given this is not sufficient information for important questions such as whether the instrument at a given site had measurements throughout the year or only certain parts of the year, or whether the measurements were daily, 1 in 3 days, 1 in 6 days, etc. Also there is no start and end year provided for species other than PM. Looking globally we also see substantial differences in the temporal sampling. Almost all of the data in China begins after 2010 and a large portion of the data in India begin in 2023.
As a result, any spatial maps of these '1986-2023 average' aerosol measurements is essentially uninterpretable. The observed averages cannot be compared to model simulation data in any meaningful way although the authors include such a comparison against a global climate model.
In addition to the .csv file with average measurements at each station, the authors have provided the temporally averaged observed data as spatially averaged gridded data over the model grid (~2° x 2°). I think the spatial averaging decreases the value of the dataset, and these spatial fields do not need to be included with the data package for ESSD. For example, in the PM25 gridded obs file there is a grid cell with 261 different stations and 725,414 observations and another grid cell with 1 station and 3 observations. Yet these two 'observed' values are treated equally in the evaluation of the model output, although they differ vastly in how well they represent the 'true' air quality of their respective locations.
In summary, why not provide the data at the original temporal and spatial resolution? This would allow individual researchers to decide how best to spatially and temporally aggregate the original data for their application.
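If the station-level data were released at their original resolution, the gridding could still be offered as a convenience, carrying the per-cell station and observation counts through so that sparsely sampled cells (such as the 1-station / 3-observation cell noted above) can be screened out. A minimal pandas sketch of such an aggregation, assuming a ~2° model grid (1.9° x 2.5°) and hypothetical column names `lat`, `lon`, `pm25`, `n_obs`:

```python
import numpy as np
import pandas as pd

def grid_stations(df, dlat=1.9, dlon=2.5):
    """Aggregate station-level means onto a coarse grid, keeping per-cell
    station and observation counts so poorly sampled cells can be screened."""
    df = df.copy()
    # Integer grid-cell indices from latitude (-90..90) and longitude (0..360)
    df["ilat"] = np.floor((df["lat"] + 90.0) / dlat).astype(int)
    df["ilon"] = np.floor((df["lon"] % 360.0) / dlon).astype(int)
    return (df.groupby(["ilat", "ilon"])
              .agg(pm25_mean=("pm25", "mean"),    # cell-average concentration
                   n_stations=("pm25", "size"),   # stations contributing
                   n_obs=("n_obs", "sum"))        # total raw observations
              .reset_index())
```

A user could then keep only cells with, say, `n_obs >= 100` before comparing against model output, rather than weighting every cell equally.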
Review of the data files
- The modelfiles.zip file contains a modelfiles.tar as well as the untarred files. Why the duplication?
- AEROMAPdata.csv
- All of the column headings need to be defined with units. The last columns in the file are missing column headings altogether.
- The AEROMAPdata.csv only has PM min and max years. Temporal information should also be included for the other species. When the PM measurement is missing, there are no min and max years, so there is literally no temporal record for those observations. Even when a PM measurement is available, I assume the temporal coverage can be quite different from species to species, so including only a single min and max year is insufficient.
- The .csv includes negative longitude values as well as values up to 346. The measurement location information should be harmonized so that longitudes either range from 0-360 or fall within ±180.
- There are records in the AEROMAPdata.csv where the PM min year is 0 and one record where the PM max year is 2167.
- There are locations in Brazil and the islands off the coast of Morocco in the PM2.5 gridded obs file that are not in the AEROMAPdata.csv file.
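The longitude harmonization suggested above is a one-line fix; a minimal sketch mapping any longitude into (-180, 180] (function name is my own):

```python
def wrap_lon_180(lon):
    """Map a longitude in degrees (e.g. 346 or -14) to the interval (-180, 180]."""
    wrapped = ((lon + 180.0) % 360.0) - 180.0
    # The modulo maps 180 to -180; keep the conventional +180 instead.
    return 180.0 if wrapped == -180.0 else wrapped
```

Applied to the file's value of 346, this returns -14 (i.e., 14° W), so mixed conventions collapse to one.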
- The .nc files
- There needs to be more metadata than what is provided in the netCDF 'header' information or the README text file. All variables should be clearly defined: gw, dep, source, conc, depmonth, concmonth, finconcmonth, coarseconcmonth.
- The README file should also include the temporal averaging, i.e., 1986-2023.
- The README includes this statement "Fine is PM2.5 and coarse is Pm10-Pm2.5, and if not designated it is total PM10.". The reference to 'not designated' is too vague. I assume you are saying the variable 'conc' is PM10? Are there other variable names in the other files other than 'conc' that also refer to PM10?
Technical Corrections
Main paper
- Title should include the temporal coverage (1986-2023)
- Title: Remove 'the'
- Line 15: Should be "Raleigh"
- Line 76: To keep this sentence parallel and avoid using 'impact' twice in your list, recommend changing this to "...impact surface air quality, change surface albedo of snow and ice when deposited, and modulate..."
- Line 82: I disagree that the paper provides "a methodology". The example evaluation has a lot of choices that are very specific to your model application.
- Line 90: Recommend rewording with a more active voice. Also, the qualifier 'Recent' is probably not needed before "IPCC reports".
- Line 103: Please add what "aerosol effects" is referring to in this sentence.
- Line 157: formatting issue
- Line 160: subscript needed
- Line 182: Please state explicitly where the time period information is included. I had to hunt for it before finding it in the AEROMAPdata.csv file. The temporal information in the .csv file is also inadequate, as discussed above.
- Line 201: formatting issue
- Line 203: Include the range of dates of the observations.
- Line 220: Using 2010 emissions will certainly impact your model's ability to accurately estimate air quality in 2013-2015. There were substantial changes in global air quality from 2010-2015, for example decreasing emissions from industrial sources in China beginning in 2013.
- Line 227: formatting issue
- Line 228: What is meant by "present day" in this context? The previous paragraph refers to 2010 emissions for 2013-2015 simulations.
- Line 240: What is meant by 'its importance can be isolated'?
- Lined 241: O< should be OC
- Line 247: The sentence beginning with 'Assumptions…' is awkwardly worded, particularly the parenthetical statement, and could use some wordsmithing.
- Line 262: Awkwardly worded. How do particles 'interact with photochemistry'?
- Line 282: Sentence beginning with 'We do not choose..' should be edited for clarity, for example breaking it into 2 or more sentences.
- Line 289: one-to-one
- Line 303: What differences is this referring to, i.e., different between what two things?
- Line 312: Why is the lat/long data location so imprecise in some cases? In what geographic region does this occur?
- Line 314: Please reword sentence beginning 'Because of..' for clarity.
- Line 323: What is meant by 'most' of the observations and 'most' of the stations? Compared to what baseline?
- Line 324: Please restate the temporal coverage (1986-2023)
- Line 324: You are not providing annual averages, which would imply separate averages for each year. Rather, you are providing an average over 1986-2023 of all available data. At some sites you may not even have data for a single full year.
- Line 327: Missing parentheses.
- Line 330: Because these are temporal averages I disagree "this dataset presents a huge increase in the amount of data available to the aerosol modelling community"
- Line 333: These are not annual averages. The observations span decades and there is no information on whether some of the sampling is only done during certain times of year.
- Line 337: 'there could be differences in the model-data comparison because of the time period discrepancies.' Since you are not matching the model values in time to the observations, it is not that there 'could' be differences, there most certainly ARE differences. It is hard for me to understand what can be learned from the evaluation study when the emissions and observations are so mismatched.
- Line 346: Suggest rewording to avoid the repeat of 'highest modelled values' and 'high concentration values'
- Line 348: "These discrepancies over India and China could be due to errors in the input emissions datasets or the aerosol transport modelling, or to differences in the time periods covered: the observations are more recent while the assumptions for the emissions are for the year 2010." Yes this is a huge discrepancy given the trends in PM in these areas between 2010 and 2023. It is hard to see what other conclusions can be drawn from this comparison because of this mismatch
- Line 354: This is a valuable comparison across observation sources. It would be valuable to spend more of the paper summarizing and validating the different sources of observed data.
- Line 369: An r value of 0.25 (r^2 = 0.06) does not support the statement that the model is 'roughly able' to capture spatial patterns.
- Line 383: Please reword sentence beginning "As a proxy for dust" for clarity
- Line 400: "NO3- aerosol particles compared against available observations show that over 2 orders of magnitude, the model results are able to simulate the spatial variability (Fig. 4k and l). Note that here, we have multiplied the simulations by a factor 0.5 in order to achieve a good mean comparison, as indicated by Vira et al. (2022). " I strongly disagree with the approach of scaling a model result to 'achieve a good mean comparison' and then claiming that the model is able to simulate observations. The Vira et al. paper is documenting a model bias, not suggesting that model values should be scaled by a single factor across space and time.
- Line 408: The use of PM6.9 notation is non-standard and could create unnecessary confusion. Other air quality models calculate a cutoff for every data point based on the composition of the particles at that time and place.
- For an example from the Community Multiscale Air Quality (CMAQ) model, here is the equation to go from an aerodynamic diameter cutoff (i.e., PM10) to a Stokes diameter cutoff: https://github.com/USEPA/CMAQ_Dev/blob/24c0840315978f94541d8b6288163e7a54c8694d/CCTM/src/aero/aero6/aero_subs.F#L2737
- D_stokes = D_aerodynamic * (Rho0 / RhoP)^(1/2) * (Cc(D_aerodynamic)/Cc(D_stokes))^(1/2)
- Rho0 is assumed to be 1.0 g cm-3
- RhoP is the particle density (i.e., there are different RhoP values for different sources such as dust and sea spray)
- CMAQ output is reported as PM10 (aerodynamic diameter space) rather than in Stokes diameter space and PM10 is the standard nomenclature for air quality modeling and evaluation against observations.
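To illustrate the conversion above: because Cc depends on the (unknown) Stokes diameter, the equation must be solved iteratively. A sketch of that fixed-point solution, assuming a standard Cunningham slip correction and a mean free path of 0.0651 µm for air near surface conditions (these values and function names are my own, not CMAQ's implementation):

```python
import math

LAMBDA_UM = 0.0651  # assumed mean free path of air (µm) at ~1 atm, 23 °C

def cunningham(d_um):
    """Cunningham slip correction factor Cc for a particle of diameter d (µm)."""
    kn = 2.0 * LAMBDA_UM / d_um  # Knudsen number
    return 1.0 + kn * (1.257 + 0.4 * math.exp(-1.1 / kn))

def stokes_from_aero(d_aero_um, rho_p, rho0=1.0, tol=1e-9, max_iter=100):
    """Solve D_stokes = D_aero * sqrt(rho0/rho_p) * sqrt(Cc(D_aero)/Cc(D_stokes))
    by fixed-point iteration, since Cc depends on the unknown D_stokes."""
    base = d_aero_um * math.sqrt(rho0 / rho_p)
    d = base  # first guess: ignore the slip-correction ratio
    for _ in range(max_iter):
        d_new = base * math.sqrt(cunningham(d_aero_um) / cunningham(d))
        if abs(d_new - d) < tol:
            return d_new
        d = d_new
    return d
```

For a dust-like density of ~2.6 g cm-3, a 10 µm aerodynamic cutoff maps to a Stokes diameter of roughly 6.2 µm, which is the kind of density-dependent cutoff behind the paper's "PM6.9" notation.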
- Line 417: Please reword sentence beginning "Again we are including new.." for clarity.
- Line 431: Please reword sentence beginning "Similarly,"
- Line 437: "For the NO3- in aerosol particles, similar to the PM2.5 size, the particles were multiplied by 0.5 to better match the observations following Vira et al (2022)." Same comment as above. It is not appropriate for a model evaluation study to scale model results and then proceed to describe model 'performance' of the scaled values. Rather than this approach, you could summarize the evaluation in Vira et al (2022) and compare it to the biases you see in your study.
- Line 488: "The fidelity of the annual means provided by this study will depend upon the ability of the measurement networks to capture the observed multi-decadal increases and decreases of emission that vary between source regions and sectors (Quaas et al., 2022)."
- Again, you are not providing annual means but a multi-decadal mean. There is no way for a single average to capture trends in the data.
Figures
- All figures (including the supplement) should include the time period of the averaging in the caption, i.e., 1986-2023.
- Figure 1: The sources of the data should not be included in the figure caption but in the main text or as supplemental information.
- Figure 2a: The colors of the gridded observations do not show up due to the outlines of the colored dots. It would be helpful if you could add a column that is just the obs data (with the black outlines of the dots removed) and then have the middle column be just the model values.
- Figure 2a: You include the comment "see Supplemental dataset for more details". Since you have multiple supplemental datasets, please be explicit about where this information is located, i.e., name of file, where to find it in the file.
- Figure 2d: It is difficult to glean much from a scatter plot with over 7,000 points. It would be more helpful to see separate density scatter plots for each of the 7 regions where the color of the plot indicates the density of the points. This would allow the reader to more easily see consistent biases and correlation in each region.
- Figure 4: Same comment as Figure 2 for the observation overlays on the model maps.
- All figures: Using a log scale for model vs obs emphasizes smaller values and compresses the largest values. However, from a health and ecosystem standpoint the largest values are of greatest interest, and there is not much need to know the difference in model performance for 0.1 ug/m3 vs 0.01 ug/m3. Strongly suggest not plotting on a log scale. Also recommend a density scatter plot when the number of points exceeds several hundred.
- Figure 7: Include in the caption the time period you are considering for the % coverage.
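The density scatter plots recommended above (Figures 2d and the "All figures" comment) are straightforward with matplotlib's hexbin; a minimal sketch on a linear scale with a 1:1 line (names and figure settings are illustrative only):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt

def density_scatter(obs, model, fname="density_scatter.png"):
    """Hexbin density plot of model vs observed values with a 1:1 line."""
    fig, ax = plt.subplots(figsize=(4, 4))
    # Color encodes the number of points per hexagonal bin
    hb = ax.hexbin(obs, model, gridsize=40, mincnt=1, cmap="viridis")
    lim = float(max(np.max(obs), np.max(model)))
    ax.plot([0, lim], [0, lim], "k--", lw=1)  # 1:1 reference line
    fig.colorbar(hb, ax=ax, label="points per bin")
    ax.set_xlabel("Observed PM2.5 (ug/m3)")
    ax.set_ylabel("Modelled PM2.5 (ug/m3)")
    fig.savefig(fname, dpi=100)
    plt.close(fig)
    return fname
```

With 7,000+ points per panel, the bin density makes regional biases visible where an ordinary scatter plot saturates into a blob.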
Citation: https://doi.org/10.5194/essd-2024-1-RC1
RC2: 'Comment on essd-2024-1', Anonymous Referee #2, 21 Mar 2024
General comments
This manuscript includes a comprehensive collection of aerosol composition data and provides a global overview to be compared with an Earth system model. This is certainly an important effort, since there is little work from a global perspective that includes a complete analysis of aerosol composition; in particular, there is a lack of global information on the spatial distribution of coarse aerosols. The synthesis of data from so many different sources is a great effort by the authors.
However, I find the lack of transparency about the data sources and the choice of data for the model comparison concerning. It seems somewhat arbitrary which data have been selected, and a rationale for the selection is missing. There is a mixture of data from shorter campaign periods and compilations of data from regional programmes with long time series, and these are not necessarily comparable when everything is aggregated. There are no references to or acknowledgements of the large regional and global data repositories in the manuscript, from which I presume quite a lot of the data were collected, e.g., CAPMoN, CASTNET, EANET, EMEP, IMPROVE, INDAAF and GAW. I find it quite troublesome that the primary data repositories are not acknowledged or referred to better.
Further, it is not clear why (and maybe how) the observations have been gridded/aggregated. There can be huge differences in concentrations between sites and between years, so how representative are the average concentrations for a grid cell, e.g., if you have 10 urban sites and one regional site within the same grid cell, with data for 1 year at some and 20 years at others? The authors should try to choose representative sites for the grid cells.
I also find the selected time periods for the comparison between model and observations troublesome. The model results are for the period 2013-2015 with emissions from 2010, and this is compared to data from a very large time period, 1986-2023. I don't understand the usefulness of this comparison. The temporal changes in atmospheric chemistry have been large during this period, with different trajectories for the various regions. Further, some years probably have much more data than others. I think you should only use observational data from the same period as the model ±2 years (e.g., 2010-2015), and not use data from all the way back to 1986.
Specific comments
As mentioned, the references to the data used need to be improved. An example of poor referencing is Alastuey 2016. This is a European campaign of dust measurements (2012-2013), but in the AEROMAP datafile this reference is used for data from Montseny only and for a longer period (2012-2017). Data from EMEP are referred to as Tørseth et al., 2012, which is OK, though newer reviews have been done. More importantly, these data are collected from the data repository EBAS, not from the paper itself. The reference to EBAS is lacking in the paper; it is stated in the csv file, but this is very much hidden. Further, quite a lot of the EMEP data are not included in the study for some reason. The Asian EANET data are available from the EANET data repository (https://www.eanet.asia/) and should not be referenced to Tørseth et al., 2012. Another example is from Canada, where the link to the data is not working: https://data.ec.gc.ca/data/air/monitor/national-air-pollution-surveillance-naps-program/Data-Donnees. These data are from Environment Canada, I assume? If so, https://open.canada.ca/en/open-data is probably more appropriate, or maybe not; it seems like you lack a lot of chemistry data from Canada. I only did some tests of references and links; there is a need for more checks by the authors.
- Lines 135-136: I certainly agree with the open data policy, i.e., that monitoring networks and those using the data follow the FAIR data principles. However, I don't see that this paper contributes much to that. FAIRness of data includes proper links (i.e., with DOIs) to the data providers, not compilations in files where there is poor tracking of the origin.
- Line 167: Following up on the comment above, the AEROMAP database should not be defined as the global database without looking into the data policies of the original data repositories. Duplicating repositories is not a good way forward. Certainly, it is great that you made the data you have used available, but the way it is written you suggest the reader should go to AEROMAP to collect aerosol data in the future. This is not sustainable.
- Lines 146-147: Do you constrain the model by observations? It is rather a comparison, not an adjustment of the model output; I don't find any text on how the model has been constrained by the measurements. In the next sentence, the constraint is instead referred to as obtaining a full aerosol budget using tracers to estimate sea salt (Na) and dust (Al), but constraining the PM budget is something different. It would have been nice to show a spatial distribution plot with the relative contributions of, e.g., SIA, organics, sea salt, and dust to PM10 and PM2.5, if possible.
- Lines 179-181, 324-325 and Fig. 1: It is not clear what is defined as the number of observations. E.g., for a 10-year time series of PM10, is it 1, 10, or 520 if weekly samples? Had it been hourly data, it would have been 87,600 data points. And how can you have 10^6 observations of OC? In Europe (EMEP programme) there are ca. 15 sites measuring OC and 10 measuring Al in aerosols, depending on the year, and most of these measurements are weekly, which does not sum up to millions of observations. It would make more sense to report the number of annual datasets you use, or the number of comparison points (grid cells) with the model, which I assume is what is given in Table S4.
I find it very strange that you have almost the same number of sites for EC, OC and Al as for SO4, which is certainly measured at many more sites globally. Perhaps you are picking those sites that have all the constituents and not only one of them? The number of sites looks reasonable in Fig. 1b and c, but I don't understand the number in Fig. 1e. How can the number of sites "for each 2x2 grid box" be "shown as a dotted line"? It must be a different number for each grid box depending on the spatial distribution of the sites.
- Lines 307-308 (repeated at 342): I don't understand the reasoning that the high number of sites will make the figure unreadable. The reason for gridding is rather to better compare with the model output?
Technical corrections
- Figure caption, Fig. 1: the link https://app.cpcbccr.com/ccr/#/caaqm-dashboard-all/caaqmlanding/data does not work
- Line 227: Missing space between BC and follow
- Line 248: P is phosphorus? The sentence in the parentheses is a bit strange.
- Line 241: written O< instead of OC or OM
Citation: https://doi.org/10.5194/essd-2024-1-RC2
AC1: 'Comment on essd-2024-1', Natalie Mahowald, 02 Apr 2024
Thanks very much to the reviewers for their careful reading of the paper. As noted by one of the reviewers, and after consultation with the editors, we realized the paper is not appropriate for ESSD, since it includes interpretation of the data as well as a model/data comparison. We will withdraw the paper, revise it to incorporate the reviewers' comments, and submit it to a more appropriate journal.
Citation: https://doi.org/10.5194/essd-2024-1-AC1
Data sets
Datasets for: AERO-MAP: A data compilation and modelling approach to understand the fine and coarse mode aerosol composition Natalie M. Mahowald, Longlei Li, Julius Vira, Marje Prank, Douglas Hamilton, Hitoshi Matsui, Ron L. Miller, Louis Lu, Ezgi Akyuz, Daphne Meidan, Peter Hess, Heikki Lihavainen, Christine Wiedinmyer, Jenny Hand, Maria Grazia Alaimo, Célia Alves, Andres Alastuey, Paulo Artaxo, Africa Barreto, Francisco Barraza, Silvia Becagli, Giulia Calzolai, Shankararaman Chellam, Ying Chen, Patrick Chuang, David D. Cohen, Cristina Colombi, Evangelia Diapouli, Gaetano Dongarra, Konstantinos Eleftheriadis, Corinne Galy-Lacaux, Cassandra Gaston, Dario Gomez, Yenny González Ramos, Hannele Hakola, R.M. Harrison, Chris Heyes, Barak Herut, Philip Hopke, Christoph Hüglin, Maria Kanakidou, Zsofia Kertesz, Zbigniew Klimont, Katriina Kyllönen, Fabrice Lambert, Xiaohong Liu, Remi Losno, Franco Lucarelli, Willy Maenhaut, Beatrice Marticorena, Randall V. Martin, Nikolaos Mihalopoulos, Yasser Morera-Gomez, Adina Paytan, Joseph Prospero, Sergio Rodríguez, Patricia Smichowski, Daniela Varrica, Brenna Walsh, Crystal Weagle, and Xi Zhao https://zenodo.org/records/10459654