A Pan-European, High-Resolution, Daily Total, Fine-Mode and Coarse-Mode Aerosol Optical Depth dataset based on Quantile Machine Learning

Chen, Zhao-Yue; Méndez, Raul; Petetin, Hervé; Lacima, Aleksander; Pérez García-Pando, Carlos; Ballester, Joan

doi:https://doi.org/10.5194/essd-2023-104

Preprints

27 Mar 2023

Status: this preprint has been withdrawn by the authors.

Abstract. Ambient particulate matter (PM) is a widespread air pollutant, consisting of a mixture of different particle species suspended in the air, that negatively affects human health. Given the generally sparse distribution of in-situ PM measurement networks, spatially resolved PM estimates are typically derived from Aerosol Optical Depth (AOD) obtained from satellites. However, satellite AOD data over land are subject to several limitations (e.g., data gaps, coarse resolution, high uncertainty, and unavailable or unreliable size-fraction information), which weaken the relationship between AOD and PM. We have developed a daily AOD dataset over Europe at 0.1 degree resolution for the period 2003–2020, based on new Quantile Machine Learning (QML) models. The dataset provides full-coverage AOD along with Fine-mode AOD (fAOD) and Coarse-mode AOD (cAOD), based on AERONET (AErosol RObotic NETwork) site observations and climate and air quality reanalyses. The three QML AOD products achieve an out-of-sample R² of 0.68 for AOD, 0.66 for fAOD and 0.65 for cAOD, which is 23–92 %, 11–13 % and 115–132 % higher than the corresponding satellite or reanalysis products, respectively. Over 88.8 %, 80.5 % and 88.6 % of QML AOD, fAOD and cAOD predictions fall within the ± 20 % Expected Error (EE) envelopes, respectively. Previous studies reported that Europe is one of the regions with the poorest satellite AOD-PM correlation (Pearson correlation coefficient (PCC) around 0.1). The three QML products are more strongly correlated with ground-level PM, especially when each is paired with the PM size fraction it corresponds to: AOD with PM10, fAOD with PM2.5 and cAOD with PM coarse (R = 0.41, 0.45 and 0.26, respectively). This suggests that different PM size fractions may be better predicted using different AOD size fractions, instead of total AOD.
The long-term QML aerosol dataset (and associated models) not only addresses some problems of existing AOD data, but also provides better tools to monitor and analyse fine-mode and coarse-mode aerosols in space and time, and to further investigate their impacts on human health, climate, visibility, and biogeochemical cycling. The QML datasets can be downloaded from https://doi.org/10.5281/zenodo.7756570 (Chen et al., 2023).
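The abstract does not define what "quantile" means in QML. If it refers to quantile regression, as the name suggests, the models would be fitted by minimising the pinball (quantile) loss rather than squared error. The sketch below is a generic illustration of that loss under this assumption; the function name and the toy values are hypothetical and are not taken from the manuscript.

```python
def pinball_loss(observed, predicted, quantile):
    """Mean pinball (quantile) loss for a target quantile in (0, 1).

    Generic textbook definition; not the authors' implementation.
    """
    if not 0.0 < quantile < 1.0:
        raise ValueError("quantile must lie strictly between 0 and 1")
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for obs, pred in zip(observed, predicted):
        error = obs - pred
        # Under-prediction is weighted by q, over-prediction by (1 - q).
        total += quantile * error if error >= 0 else (quantile - 1.0) * error
    return total / len(observed)


# Toy AOD-like sample with one high outlier (hypothetical numbers).
sample = [0.1, 0.2, 0.3, 0.4, 1.0]
loss_at_median = pinball_loss(sample, [0.3] * 5, 0.5)  # constant = median
loss_at_mean = pinball_loss(sample, [0.4] * 5, 0.5)    # constant = mean
```

At quantile 0.5 the constant prediction that minimises this loss is the sample median, not the mean, which is one reason quantile objectives are less sensitive to outliers such as occasional very high AOD values.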

Received: 22 Mar 2023 – Discussion started: 27 Mar 2023

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Download & links

Preprint (PDF, 3897 KB)

Withdrawal notice

Supplement (4037 KB)

Interactive discussion

Status: closed

RC1:
'Comment on essd-2023-104', Anonymous Referee #1, 25 Apr 2023
The manuscript describes the application of a supervised machine learning algorithm (lightGBM) for the retrieval of AOD, fAOD, and cAOD over Europe. However, the method presented for aerosol retrieval is not new, and I have some main concerns about this study. Firstly, the claimed high-resolution (0.1 degree) aerosol product is questionable. Secondly, the validation of the proposed model shows severe overfitting.
Major concerns:
The study claims that its AOD products were generated at a spatial resolution of 0.1 degrees. However, the key input variable, MAIAC AOD, which has a spatial resolution of 1 km, was eventually excluded from the models, and the other variables used in the study have a coarser spatial resolution than 0.1 degrees. It is therefore questionable whether the resulting product is truly a 0.1 degree product. Additionally, Figure 11 shows that the developed AOD (B1) does not provide better detail than the CAMS AOD (0.75 degrees) and MERRA-2 AOD (0.625 × 0.5 degrees).

Based on the input variables listed in Table S2, it appears that only the CAMS reanalysis data provide information related to aerosol size. The study seems to have simply used the lightGBM algorithm to correct the CAMS-based fAOD and cAOD using meteorological data.

It is unclear how well the developed fAOD and cAOD models perform at locations where no AERONET data are available. It is also unclear whether the study used completely independent ground-based data to test the results, such as a test site that was not used in the training process. If Table S3 is intended to show this validation, the R² of fAOD decreased significantly from 0.68 to 0.56 in M3, suggesting that the model may have a severe overfitting issue.

During the lightGBM-based training for fAOD and cAOD, AERONET only provides fAOD and cAOD at 500 nm. However, it is unclear how the model was trained to calculate fAOD and cAOD at 550 nm, which is a crucial issue that the paper does not address.
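For context on the 500 nm versus 550 nm question, one common way to move an AOD value between nearby wavelengths is the Ångström power law, AOD(λ) = AOD(λ₀) · (λ/λ₀)^(−α). The sketch below is illustrative only: the manuscript's actual procedure is not described on this page, the function names and numbers are hypothetical, and applying an exponent derived from total AOD to the fine or coarse mode separately would be a further assumption.

```python
import math


def angstrom_exponent(aod_1, wavelength_1, aod_2, wavelength_2):
    """Angstrom exponent from AOD at two wavelengths (same units, e.g. nm)."""
    return -math.log(aod_1 / aod_2) / math.log(wavelength_1 / wavelength_2)


def aod_at_wavelength(aod_ref, wavelength_ref, wavelength_target, alpha):
    """Scale a reference AOD to a target wavelength with the power law."""
    return aod_ref * (wavelength_target / wavelength_ref) ** (-alpha)


# Hypothetical values: AOD of 0.30 at 440 nm and 0.15 at 870 nm.
alpha = angstrom_exponent(0.30, 440.0, 0.15, 870.0)

# Shift a hypothetical 500 nm AOD of 0.25 to 550 nm.
aod_550 = aod_at_wavelength(0.25, 500.0, 550.0, alpha)
```

For a positive exponent the AOD decreases with wavelength, so the 550 nm value comes out slightly below the 500 nm value.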

Specific concerns:
In Figure 1, it is not clear how to use Boruta to select the variables.

The caption of Figure 3 reads “Spatial and temporal distribution of the median value of AERONet (a) AOD, (b) fAOD and (c) cAOD data”. It is unclear to me how panels (a), (b) and (c) convey the temporal information.

AERONET is written “AERONet” in the figure caption but “AERONET” in the text.

Typographical error: P10, L285, (Levy et al., 2010; Xiao et al., 2016; Yan et al., 2022)).
Citation: https://doi.org/10.5194/essd-2023-104-RC1
- AC1: 'Reply on RC1', Zhaoyue Chen, 23 May 2023
  
  Thank you for your efforts on our manuscript. We have revised the manuscript accordingly. Please find attached our point-by-point responses to each comment.
  
  Citation: https://doi.org/10.5194/essd-2023-104-AC1
RC2:
'Comment on essd-2023-104', Anonymous Referee #2, 25 Apr 2023
This study presents a daily AOD dataset over Europe for the period 2003–2020, derived by post-processing current satellite and reanalysis products with a machine learning method. The accuracy of the total AOD in this dataset is greatly improved. The dataset also provides additional fine and coarse AOD data, which are relatively reliable and will be very helpful for particulate matter (PM) prediction. The dataset will be of interest to the scientific community. I have some comments that should be addressed before it can be accepted for publication.
Major comments:
For the route without satellite data, all input reanalysis AOD products (e.g. MERRA-2, CAMS) have a spatial resolution coarser than 0.1 degrees. It is not appropriate to increase the spatial resolution of the final AOD product to 0.1 degrees through interpolation, as simple interpolation cannot add spatial detail to the AOD field. I think the spatial resolution of the final AOD product should not be finer than that of the finest-resolution input reanalysis.

For the correction of total AOD, it is understandable that the AOD information comes mainly from the reanalysis AOD products. For fine and coarse AOD, however, the study should clarify which input data play the dominant role.

I'm also curious, what would happen for QML AOD if two reanalysis datasets MERRA-2 and CAMS were not used as input data simultaneously?

Minor comments:
In Section 2, the manuscript should introduce basic information about the PM data, as they are used in subsequent experiments.

Line 105, how were fAOD and cAOD at 550 nm interpolated?

Line 108, I believe the MODIS MAIAC data used in the manuscript are Collection 6 (C6), not v6.1, as production of the C6.1 product (MCD19A2) has not yet been completed.

Line 155, how is the MODIS 1 km AOD product converted to 0.1 degrees?

Line 269, the description of the “Sat scenario” and “Non-Sat scenario” is not clear. What do these two terms mean, and how are they distinguished?

Line 391, how was EE=±0.025 ±20 %/40 % determined? I think most literature uses 0.05 instead of 0.025.
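To make the question about the EE constants concrete, the share of predictions inside an Expected Error envelope can be computed as below. This is a minimal sketch that assumes the envelope has the form observed ± (absolute term + relative term × observed), a form widely used in AOD validation; the exact definition used in the manuscript is not reproduced on this page, and the function name and toy numbers are hypothetical.

```python
def fraction_within_ee(observed, predicted, abs_term=0.025, rel_term=0.20):
    """Share of predictions with |pred - obs| <= abs_term + rel_term * obs.

    The envelope form and default constants are assumptions for illustration.
    """
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    inside = 0
    for obs, pred in zip(observed, predicted):
        half_width = abs_term + rel_term * obs
        if abs(pred - obs) <= half_width:
            inside += 1
    return inside / len(observed)


# Hypothetical reference and model AOD values.
reference = [0.10, 0.20, 0.40, 0.80]
modelled = [0.12, 0.30, 0.42, 0.60]

share_narrow = fraction_within_ee(reference, modelled, abs_term=0.025)
share_wide = fraction_within_ee(reference, modelled, abs_term=0.05)
```

Because the absolute term widens the envelope for every sample, a larger constant can only keep or raise the reported share, so shares computed with 0.025 and with 0.05 are not directly comparable.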
Citation: https://doi.org/10.5194/essd-2023-104-RC2
- AC2: 'Reply on RC2', Zhaoyue Chen, 23 May 2023
  
  Thank you for your constructive feedback and valuable advice on our manuscript. We have revised the manuscript accordingly. Please find attached our point-by-point responses to each comment.
  
  Citation: https://doi.org/10.5194/essd-2023-104-AC2
RC3:
'Comment on essd-2023-104', Anonymous Referee #3, 07 Jun 2023

The dataset of AOD, fAOD and cAOD over Europe has practical value for environmental analysis. A machine learning method was used to produce daily AODs. The manuscript should be revised before being considered for publication.
General comments:
1 The spatial and temporal resolution of all input and output data for the machine learning should be listed. Because the datasets have different resolutions, the method of spatio-temporal matching should be clarified.
2 As the satellite AOD was dropped, I think all the inputs are reanalysis data, so the temporal resolution of AOD, fAOD and cAOD is not necessarily daily. Which time or times of day were selected to produce the daily AOD, fAOD and cAOD, and why?
3 Why was LightGBM chosen among the many machine learning methods? Decision-tree-based machine learning methods adopt fixed thresholds, which may create systematic "boundaries" in the product. For example, if latitude is included in the input data, a systematic AOD boundary may appear along a line of latitude. Other parameters have similar effects.
4 Regarding the spatial validation, I am not sure whether data from some AERONET sites were withheld from training and used only for testing. If so, that is a truly spatially independent validation. If not, the accuracy at locations without an AERONET site cannot be assessed.
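The spatially independent validation raised in comment 4 amounts to splitting the data by site rather than by sample, so that every test observation comes from a station the model never saw during training. The following is a minimal sketch of such a split, assuming each sample carries a site identifier; the function name and the toy site labels are hypothetical, and it does not describe how the authors actually partitioned their data.

```python
import random


def site_holdout_folds(site_ids, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs in which no site spans both sets."""
    unique_sites = sorted(set(site_ids))
    if n_folds < 2 or n_folds > len(unique_sites):
        raise ValueError("n_folds must be between 2 and the number of sites")
    rng = random.Random(seed)
    rng.shuffle(unique_sites)
    # Assign whole sites, not individual samples, to folds.
    fold_of_site = {site: i % n_folds for i, site in enumerate(unique_sites)}
    for fold in range(n_folds):
        test_idx = [i for i, s in enumerate(site_ids) if fold_of_site[s] == fold]
        train_idx = [i for i, s in enumerate(site_ids) if fold_of_site[s] != fold]
        yield train_idx, test_idx


# Ten hypothetical samples from six hypothetical stations.
sites = ["A", "A", "B", "B", "B", "C", "D", "D", "E", "F"]
folds = list(site_holdout_folds(sites, n_folds=3))
```

A skill score computed only on the held-out sites is the one that speaks to accuracy at locations without a station; a random sample-level split lets the same site appear on both sides and tends to look more optimistic.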
Minor comments:
1 Abbreviations should be explained at their first appearance, such as "NMB" in the supplement.
2 The section numbers are wrong in chapter 4.

Citation: https://doi.org/10.5194/essd-2023-104-RC3
- AC3: 'Reply on RC3', Zhaoyue Chen, 17 Jun 2023
  
  Thank you for your constructive feedback and valuable advice on our manuscript. We have revised the manuscript accordingly. Please find attached our point-by-point responses to each comment.
  
  Citation: https://doi.org/10.5194/essd-2023-104-AC3

Supplement

https://doi.org/10.5194/essd-2023-104-supplement

Data sets

A Pan-European, Quantile Machine learning (QML) based, Total, Fine-Mode and Coarse-Mode Aerosol Optical Depth dataset (QML AOD). Zhao-yue Chen, Raul Méndez, Hervé Petetin, Aleksander Lacima, Carlos Pérez García-Pando, and Joan Ballester. https://doi.org/10.5281/zenodo.7756570

Viewed

Total article views: 2,105 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	Supplement	BibTeX	EndNote
1,670	359	76	2,105	140	92	102

Views and downloads (calculated since 27 Mar 2023)

Month	HTML	PDF	XML	Total
Mar 2023	106	26	4	136
Apr 2023	138	35	6	179
May 2023	58	15	5	78
Jun 2023	84	12	6	102
Jul 2023	71	11	0	82
Aug 2023	119	9	0	128
Sep 2023	104	20	2	126
Oct 2023	44	11	2	57
Nov 2023	131	10	0	141
Dec 2023	59	11	0	70
Jan 2024	20	2	1	23
Feb 2024	15	11	5	31
Mar 2024	26	11	5	42
Apr 2024	25	8	6	39
May 2024	17	9	5	31
Jun 2024	15	3	2	20
Jul 2024	12	4	5	21
Aug 2024	21	4	3	28
Sep 2024	7	3	0	10
Oct 2024	4	5	0	9
Nov 2024	10	2	0	12
Dec 2024	12	7	0	19
Jan 2025	9	8	4	21
Feb 2025	13	4	2	19
Mar 2025	21	2	1	24
Apr 2025	33	15	3	51
May 2025	26	18	2	46
Jun 2025	24	26	0	50
Jul 2025	25	17	2	44
Aug 2025	69	15	1	85
Sep 2025	318	14	3	335
Oct 2025	33	10	1	44
Nov 2025	1	1	0	2

Viewed (geographical distribution)

Total article views: 2,035 (including HTML, PDF, and XML), of which 2,035 have a defined geography and 0 are of unknown origin.

Latest update: 02 Nov 2025

Short summary

Given the limitations of existing AOD data and their size-fraction information, a new 18-year daily Aerosol Optical Depth (AOD) dataset over Europe has been developed based on quantile machine learning (QML) models. The dataset improves the ability to monitor and analyse fine-mode and coarse-mode aerosols, and provides better tools to investigate their negative effects on human health and their impacts on climate, visibility, and biogeochemical cycling.

