Sequential spatiotemporal distribution of PM₂.₅, SO₂ and Ozone in China from 2015 to 2020

Chi, Yufeng; Zhan, Yu; Wang, Kai; Ye, Hong

doi:10.5194/essd-2023-76

Preprints

https://doi.org/10.5194/essd-2023-76

08 Mar 2023

Status: this discussion paper is a preprint. It has been under review for the journal Earth System Science Data (ESSD). The manuscript was not accepted for further review after discussion.

Abstract. In current modeling of atmospheric pollutants, the simulation of individual trace gases (SO₂ and O₃) is constrained by the insufficient resolution of key remote sensing products, which limits simulation reliability. In this study, spatial sampling and parameter convolution are combined to optimize LightGBM using ground observations, remote sensing products, meteorological data, auxiliary data, and random ID. With these techniques and a sequential simulation of air pollutants, we produce seamless daily 1-km-resolution products of PM₂.₅, SO₂ and O₃ for most parts of China from 2015 to 2020. Through random sampling, random site sampling, area-specific validation, comparisons of different models, and a cross-sectional comparison of different studies, we verified that our simulations of the spatial distribution of multiple atmospheric pollutants are reliable and effective. The cross-validation (CV) of the random sample yielded an R² of 0.88 and an RMSE of 9.91 µg/m³ for PM₂.₅, an R² of 0.89 and an RMSE of 4.62 µg/m³ for SO₂, and an R² of 0.91 and an RMSE of 6.88 µg/m³ for O₃. Combined with the SHapley Additive exPlanations (SHAP) approach, the roles of different parameters in the simulation process were clarified, and the positive role of parameter convolution was confirmed. Our dataset was used to assess the changes in the Air Pollution Index (API) in China before and after the outbreak of COVID-19, and the results indicate that these changes were relatively large, suggesting that the epidemic control measures in 2020 were effective. The study demonstrates that the multipollutant datasets produced with the proposed models are of great value for long-term, large-scale, and regional-scale air pollution monitoring and prediction, as well as population health evaluation. The datasets are available at https://doi.org/10.5281/zenodo.7533813 (Chi et al. 2023a), https://doi.org/10.5281/zenodo.7547774 (Chi et al. 2023b), https://doi.org/10.5281/zenodo.7312179 (Chi et al. 2023c), https://doi.org/10.5281/zenodo.7580714 (Chi et al. 2023d), https://doi.org/10.5281/zenodo.7580720 (Chi et al. 2023e), https://doi.org/10.5281/zenodo.7580726 (Chi et al. 2023f).
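
For readers unfamiliar with the two validation metrics quoted in the abstract, the following is a minimal sketch of how R² and RMSE can be computed from paired observations and predictions. It assumes R² is the coefficient of determination (1 - SS_res/SS_tot); this page does not state which definition the authors used, and the function names and sample values are hypothetical.

```python
import math

def rmse(obs, pred):
    """Root-mean-square error between observations and predictions."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))

def r2(obs, pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot

# Hypothetical daily PM2.5 values in ug/m3 (illustrative only, not from the dataset).
observed = [35.0, 50.0, 42.0, 80.0, 20.0]
predicted = [38.0, 47.0, 45.0, 74.0, 23.0]

print(f"R2 = {r2(observed, predicted):.3f}, RMSE = {rmse(observed, predicted):.2f}")
```

RMSE carries the units of the pollutant concentration, whereas R² is unitless, which is why the abstract reports both for each pollutant.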

Received: 01 Mar 2023 – Discussion started: 08 Mar 2023

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Download & links

Preprint (PDF, 1198 KB)

Supplement (354 KB)

Status: closed

RC1:
'Comment on essd-2023-76', Anonymous Referee #1, 22 Mar 2023
Chi et al. developed a method to estimate 1-km PM₂.₅, SO₂, and O₃ across China during 2015-2020. They claim that the newly developed dataset shows satisfactory performance. Overall, the topic is very interesting, and a high-resolution air quality dataset is very useful for health effect assessment. Unfortunately, the method suffers from serious flaws because no strong 1-km proxy (variable) was used to train the models for SO₂ and O₃. The robustness of the 1-km SO₂ and O₃ datasets therefore remains highly uncertain. Moreover, the novelty of this dataset compared with CHAP and TAP remains unclear. Therefore, I do not recommend the manuscript for publication in ESSD in its current form. However, I can support publication if the authors make a significant revision and provide sufficient proof. The detailed comments are as follows:

Many studies have constructed gap-free 1-km PM₂.₅ datasets in China, such as CHAP and TAP, and both of these datasets perform well. The authors should compare the performance of their dataset with CHAP and TAP and confirm that it is more accurate than both.

To date, nearly all current studies have used TROPOMI to estimate 1-km SO₂ and O₃ levels across China. However, the TROPOMI dataset has only been available since 2018. For 2015-2018, we generally cannot find a strong proxy to estimate 1-km SO₂ and O₃. Although some variables, such as land use types, have a higher spatial resolution, these variables often have a low temporal resolution and are not closely linked with air pollutants, especially O₃. The method developed in this manuscript cannot ensure the robustness of the 1-km SO₂ and O₃ estimates.

MEIC provides an emission inventory at 1-km resolution across China. I suggest that the authors integrate WRF-Chem output based on the 1-km emission inventory with the TROPOMI satellite product to simulate 1-km SO₂ and O₃ concentrations across China.

Some figures are not clear (e.g., lines 274 and 291), and the authors must make them clearer.

The authors should compare the performance of different methods at regional scales because the 1-km resolution might be the major novelty of this study.

The introduction should be reorganized because important datasets such as CHAP and TAP are not introduced in this part. The major novelty of this study compared with previous studies is also missing. I cannot tell the novelty of this study from the introduction alone.

Line 207: The authors should introduce the importance of random ID in this part.

Figure 11: Please give sufficient proof to explain the higher O₃ concentrations in South Tibet and Northwest Inner Mongolia.

Figure 12: I cannot distinguish the differences between these figures.

The English language throughout the manuscript should be significantly revised.
Citation: https://doi.org/10.5194/essd-2023-76-RC1
RC2:
'Comment on essd-2023-76', Anonymous Referee #2, 27 Mar 2023
In this study, LightGBM was optimized using ground observations, remote sensing products, meteorological data, auxiliary data and random ID, combined with spatial sampling and parameter convolution, and a sequential simulation of air pollutants was conducted to generate daily PM2.5, SO2 and O3 products at 1-km resolution from 2015 to 2020. However, the current manuscript quality is not good enough.

The authors mention that there is an interaction between PM2.5, SO2 and O3, as well as a firm synergy between spatial and temporal trends, which requires introducing different pollutants into the forecast model. Why was the prediction sequence PM2.5-SO2-O3 selected? Is there any basis for this? In particular, how does SO2 affect O3? Please provide details.

Please indicate in the text data introduction what variable Ps represents in the RF-Ps model.

The predicted spatial air pollutants are used as model inputs. As the number of parameters increases, the R2 values of PM2.5, SO2 and O3 increase successively. If the order of predictions is changed, will the results be similar?

As shown in Figure 4, in some areas with dense sample size of eastern stations, R2 of two adjacent stations is quite different. What causes it?

The results of NCP and YRD were lower than those of PRD and SB. The authors explain that the reason for these differences may be related to the amount of training data and validation data used. This reason does not seem to support the conclusion. The sample size of NCP and YRD is higher than that of PRD and SB, and in general, the accuracy of model training is proportional to the sample size.

What are the specific parameters of LSTM, RF-Ps and LightGBM models respectively?

As for the distribution of PM2.5, SO2 and O3 in Figure 8, can the authors overlay the site observations on the figure to increase credibility, or add a comparison with other studies or products? After all, I have noticed that the R2 and RMSE of the LSTM, RF-Ps and LightGBM models all reach good levels, so such a large difference in distribution is unlikely.

What causes the high ozone concentration in the southern part of the Qinghai-Tibet Plateau? What is the physical quantity in Figure S1? What are the reasons for the high values on the Tibetan Plateau?

What are the reasons for selecting these auxiliary variables for pollutant estimation?

Why is the resolution of latitude in the data processing grid 0.008° instead of 0.01°?

Why use random ID instead of other ways of representing spatial locations? What are the benefits of this?

Some pictures are fuzzy, and the color recognition is not high.

In Figure 10, the author uses SHAP method to output the feature importance of the model. I noticed that in the O3 model, the temperature feature importance score is not high. Studies have shown that temperature affects the photochemical reactions that produce ozone and has an important effect on ozone concentration. So explain why the model temperature features are less important.

It is recommended to add the advantages, disadvantages, and prospects of the article. Many people use machine learning methods to obtain pollutant data through MAIAC AOD data, and this article does not have enough innovation in such a high scoring journal.
Citation: https://doi.org/10.5194/essd-2023-76-RC2
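
As background to Referee #2's question about a 0.008° versus 0.01° latitude step, the sketch below converts angular grid steps to approximate ground distances. It is only an arithmetic illustration under a spherical-Earth assumption (mean radius 6371 km) and does not represent the authors' reasoning for their grid; the function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: mean radius of a spherical Earth

def lat_step_km(step_deg):
    """North-south ground distance covered by a latitude step of step_deg degrees."""
    return math.radians(step_deg) * EARTH_RADIUS_KM

def lon_step_km(step_deg, latitude_deg):
    """East-west ground distance of a longitude step, which shrinks with cos(latitude)."""
    return lat_step_km(step_deg) * math.cos(math.radians(latitude_deg))

for step in (0.008, 0.01):
    print(f"{step} deg latitude ~ {lat_step_km(step):.3f} km")
for lat in (20, 35, 50):
    print(f"0.01 deg longitude at {lat} deg N ~ {lon_step_km(0.01, lat):.3f} km")
```

Under this approximation, a 0.01° latitude step is slightly longer than 1 km and a 0.008° step slightly shorter, while the east-west extent of a fixed longitude step depends on latitude.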
RC3:
'Comment on essd-2023-76', Anonymous Referee #3, 17 Apr 2023
In this paper, the spatiotemporal distribution of PM2.5, SO2, and Ozone in China from 2015 to 2020 was obtained using the LightGBM model based on ground observations, remote sensing products, meteorological data, and assistance data. High-resolution mapping of air pollutants is useful for China. From this point of view, this study has some value and may attract a broad readership. However, there are numerous issues with this paper that must be resolved.

Major issues:
Many papers have used remote sensing data to estimate long-term air pollution datasets in China, while this article uses the LightGBM model to estimate PM2.5, SO2, and Ozone in China from 2015 to 2020. The authors should clearly explain the contribution of this article, and compare the advantages and disadvantages of their dataset with other published datasets. Such comparisons must be discussed in the article.

In this article, 1-km PM2.5 is estimated based on MAIAC AOD, while for ozone and SO2, the spatial resolution of remote sensing data is 0.25°. How are ozone and SO2 estimated to obtain a resolution of 1km? How reliable is this estimation?

In the Methods section, spatial sampling has been used in many papers, while Random ID and Parameter convolution are confusing. It is recommended to explain them clearly. When these techniques are introduced into the LightGBM model, how much do they improve the accuracy of the model? This part needs to be supplemented and discussed.

From the results of the article, the contribution of DOY and Year is large. How can this phenomenon be explained? Two questions need to be considered. First, the large contribution of these two variables indicates that these two variables can already estimate PM2.5, SO2 and ozone well. Is the result reliable? Second, is this model feasible for predicting time series? Because the prediction of other time periods is independent of the training set, will the inclusion of these two variables affect the prediction results?

From the estimation results (Figure 11), the spatial distribution of O3 is significantly different from published papers and datasets, and is also inconsistent with our common sense of precursor emissions of ozone. How did the authors consider this issue?

The English writing of this article requires significant improvement.

Minor issues:
Line 22: What is “random ID”? It is confusing here. “sequentialsimulation” -> “sequential simulation”.

Lines 52-54: Why was only the full name of ozone given, while PM2.5 and SO2 were given their abbreviations directly?

Line 66: Please provide the full names of SCIAMACHY and ENVISAT.

Line 101: LightGBM is a machine learning model, why use “machine learning-based”?

Line 179: What does “RF-Ps” mean?

Line 206: Is the center pixel not included?

Lines 525-526: Isn't it inappropriate to say 'physical variables'? These variables are not derived from a physical perspective.
Citation: https://doi.org/10.5194/essd-2023-76-RC3
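
Referee #3's question about predicting periods outside the training set, and the abstract's distinction between random sampling and random site sampling, both come down to how validation rows are held out. The following is a minimal sketch of a grouped hold-out split; the record layout, field names, and values are hypothetical and do not describe the authors' pipeline.

```python
def grouped_holdout(records, key, held_out):
    """Send every record whose `key` value is in `held_out` to validation, the rest to training."""
    train = [r for r in records if r[key] not in held_out]
    valid = [r for r in records if r[key] in held_out]
    return train, valid

# Hypothetical rows: site ID, year, day of year, observed O3 (illustrative values only).
records = [
    {"site": "A", "year": 2018, "doy": 10, "o3": 61.0},
    {"site": "A", "year": 2020, "doy": 10, "o3": 55.0},
    {"site": "B", "year": 2018, "doy": 200, "o3": 98.0},
    {"site": "B", "year": 2020, "doy": 200, "o3": 91.0},
    {"site": "C", "year": 2019, "doy": 150, "o3": 84.0},
]

# Temporal hold-out: no 2020 row is seen in training.
year_train, year_valid = grouped_holdout(records, "year", {2020})

# Site hold-out: no row from site B is seen in training.
site_train, site_valid = grouped_holdout(records, "site", {"B"})

print(len(year_train), len(year_valid), len(site_train), len(site_valid))
```

Unlike a purely random row split, a grouped split keeps all rows of a held-out year or site out of training, so the validation score reflects performance on unseen periods or locations.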

Supplement

https://doi.org/10.5194/essd-2023-76-supplement

Data sets

Spatial distribution of various air pollutants in China at 1 km(SO2 2018-03-21:2020-12-31) Yufeng Chi https://doi.org/10.5281/zenodo.7580714

Viewed

Total article views: 2,142 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	Supplement	BibTeX	EndNote
1,649	418	75	2,142	175	83	145

Views and downloads (calculated since 08 Mar 2023)

Month	HTML	PDF	XML	Total
Mar 2023	299	77	12	388
Apr 2023	109	27	3	139
May 2023	48	6	2	56
Jun 2023	48	9	1	58
Jul 2023	45	16	0	61
Aug 2023	36	20	0	56
Sep 2023	53	17	1	71
Oct 2023	35	12	2	49
Nov 2023	20	5	1	26
Dec 2023	24	8	1	33
Jan 2024	31	8	1	40
Feb 2024	23	8	2	33
Mar 2024	31	10	3	44
Apr 2024	24	4	5	33
May 2024	24	7	8	39
Jun 2024	19	2	1	22
Jul 2024	11	2	1	14
Aug 2024	19	3	2	24
Sep 2024	20	2	0	22
Oct 2024	11	1	0	12
Nov 2024	24	2	0	26
Dec 2024	14	2	0	16
Jan 2025	27	2	4	33
Feb 2025	10	2	1	13
Mar 2025	21	6	2	29
Apr 2025	13	10	2	25
May 2025	18	8	3	29
Jun 2025	28	13	0	41
Jul 2025	20	6	2	28
Aug 2025	61	13	1	75
Sep 2025	293	8	1	302
Oct 2025	42	19	0	61
Nov 2025	35	19	3	57
Dec 2025	29	22	1	52
Jan 2026	47	14	5	66
Feb 2026	25	22	4	51
Mar 2026	12	6	0	18

Viewed (geographical distribution)

Total article views: 2,119 (including HTML, PDF, and XML) Thereof 2,119 with geography defined and 0 with unknown origin.

Latest update: 08 Mar 2026

Short summary

A dataset of the regional spatial distribution of PM₂.₅, SO₂ and ozone in China for the 6 years from 2015 to 2020 is provided. The temporal resolution is 1 day, the spatial resolution is about 1 km, and the cross-validation R² is about 0.9. The data are shared on the Zenodo platform. They can be used directly to visualize the distribution of regional air pollutants, as well as for data analysis, ecological applications, etc.

