the Creative Commons Attribution 4.0 License.
Sequential spatiotemporal distribution of PM2.5, SO2 and Ozone in China from 2015 to 2020
Abstract. In current modeling of atmospheric pollutants, the simulation of individual trace gases (SO2 and O3) is constrained by the insufficient resolution of key remote sensing products, which limits the reliability of the simulations. In this study, spatial sampling and parameter convolution are combined to optimize LightGBM, using ground observations, remote sensing products, meteorological data, auxiliary data, and a random ID. With these techniques and a sequential simulation of air pollutants, we produce seamless daily 1-km-resolution products of PM2.5, SO2 and O3 for most parts of China from 2015 to 2020. Through random sampling, random site sampling, area-specific validation, comparisons of different models, and a cross-sectional comparison with other studies, we verified that our simulations of the spatial distribution of multiple atmospheric pollutants are reliable and effective. Random-sample cross-validation (CV) yielded an R2 of 0.88 and an RMSE of 9.91 µg/m3 for PM2.5, an R2 of 0.89 and an RMSE of 4.62 µg/m3 for SO2, and an R2 of 0.91 and an RMSE of 6.88 µg/m3 for O3. Using the SHapley Additive exPlanations (SHAP) approach, the roles of different parameters in the simulation process were clarified, and the positive role of parameter convolution was confirmed. Our dataset was used to assess the changes in the Air Pollution Index (API) in China before and after the outbreak of COVID-19, and the results indicate that these changes were large, suggesting that the epidemic control measures in 2020 were effective. The study demonstrates that the multipollutant datasets produced with the proposed models are of great value for long-term, large-scale, and regional-scale air pollution monitoring and prediction, as well as population health evaluation. The datasets are available at https://doi.org/10.5281/zenodo.7533813 (Chi et al. 2023a), https://doi.org/10.5281/zenodo.7547774 (Chi et al. 2023b), https://doi.org/10.5281/zenodo.7312179 (Chi et al. 2023c), https://doi.org/10.5281/zenodo.7580714 (Chi et al. 2023d), https://doi.org/10.5281/zenodo.7580720 (Chi et al. 2023e), https://doi.org/10.5281/zenodo.7580726 (Chi et al. 2023f).
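The abstract reports R2 and RMSE under random-sample and random-site validation. The following is a minimal sketch of how such metrics and splits are commonly computed; it is not the authors' code. The function names are hypothetical, and R2 is assumed here to be the coefficient of determination (some studies report the squared correlation instead).

```python
import numpy as np

def r2_score(obs, pred):
    """Coefficient of determination between observations and predictions."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    ss_res = np.sum((obs - pred) ** 2)
    ss_tot = np.sum((obs - obs.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def rmse(obs, pred):
    """Root-mean-square error, in the units of the observations."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return float(np.sqrt(np.mean((obs - pred) ** 2)))

def sample_folds(n_samples, k, rng):
    """Random-sample CV: shuffle individual records into k test folds."""
    return np.array_split(rng.permutation(n_samples), k)

def site_folds(site_ids, k, rng):
    """Random-site CV: hold out whole stations, so that no station
    contributes records to both the training and the test side of a fold."""
    site_ids = np.asarray(site_ids)
    sites = rng.permutation(np.unique(site_ids))
    return [np.flatnonzero(np.isin(site_ids, chunk))
            for chunk in np.array_split(sites, k)]
```

Site-based folds are the stricter test of spatial generalization, because the model is scored only at stations it never saw during training.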
Status: closed
RC1: 'Comment on essd-2023-76', Anonymous Referee #1, 22 Mar 2023
Chi et al. developed a method to estimate 1-km PM2.5, SO2, and O3 across China during 2015-2020. They claim that the newly developed dataset shows satisfactory performance. Overall, the topic is very interesting, and a high-resolution air quality dataset is very useful for health effect assessment. Unfortunately, the method suffers from serious flaws because no strong 1-km proxy (variable) was used to train the models for SO2 and O3, so the robustness of the 1-km SO2 and O3 datasets remains highly uncertain. Moreover, the novelty of this dataset compared with CHAP and TAP is unclear. Therefore, I do not recommend the manuscript for publication in ESSD in its current form. However, I can support publication if the authors make a significant revision and provide sufficient proof. The detailed comments are as follows:
- Many studies have constructed gap-free 1-km PM2.5 datasets in China, such as CHAP and TAP, and both of these datasets showed better performance. The authors should compare the performance of their dataset with CHAP and TAP and confirm that it achieves better accuracy than both.
- To date, nearly all studies have used TROPOMI to estimate 1-km SO2 and O3 levels across China. However, TROPOMI data have only been available since 2018. For 2015-2018, there is generally no strong proxy for estimating 1-km SO2 and O3. Although some variables, such as land use type, have higher spatial resolution, they often have low temporal resolution and are not closely linked with air pollutants, especially O3. The method developed in this manuscript cannot ensure the robustness of the 1-km SO2 and O3 estimates.
- MEIC provides an emission inventory at 1-km resolution across China. I suggest that the authors integrate WRF-Chem output based on the 1-km emission inventory with the TROPOMI satellite product to simulate 1-km SO2 and O3 concentrations across China.
- Some figures are not clear (e.g., lines 274 and 291), and the authors must make them clearer.
- The authors should compare the performance of different methods at regional scales because the 1-km resolution might be the major novelty of this study.
- The introduction should be reorganized because important datasets such as CHAP and TAP are not introduced there. The major novelty of this study compared with previous studies is also missing. I cannot identify the novelty of this study from the introduction alone.
- Line 207: The authors should explain the importance of the random ID in this part.
- Figure 11: Please give sufficient proof to explain the higher O3 concentrations in southern Tibet and northwestern Inner Mongolia.
- Figure 12: I cannot distinguish the differences among these figures.
- The English language throughout the manuscript should be significantly revised.
Citation: https://doi.org/10.5194/essd-2023-76-RC1
RC2: 'Comment on essd-2023-76', Anonymous Referee #2, 27 Mar 2023
In this study, LightGBM was optimized by using ground observations, remote sensing products, meteorological data, auxiliary data and a random ID, combined with spatial sampling and parameter convolution, and a sequential simulation of air pollutants was conducted to generate daily PM2.5, SO2 and O3 products at 1-km resolution from 2015 to 2020. However, the quality of the current manuscript is not good enough.
- The authors mention that there is an interaction between PM2.5, SO2 and O3, as well as a firm synergy between spatial and temporal trends, requiring the introduction of different pollutants into the forecast model. Why was the prediction sequence PM2.5-SO2-O3 selected? Is there any basis for this? In particular, how does SO2 affect O3? Please provide details.
- Please state in the data description what the variable Ps represents in the RF-Ps model.
- The predicted spatial air pollutants are used as model inputs. As the number of parameters increases, the R2 values of PM2.5, SO2 and O3 increase successively. If the order of predictions is changed, will the results be similar?
- As shown in Figure 4, in some eastern areas with a dense station network, the R2 values of two adjacent stations differ considerably. What causes this?
- The results for NCP and YRD were lower than those for PRD and SB. The authors explain that these differences may be related to the amount of training and validation data used. This reason does not seem to support the conclusion. The sample sizes of NCP and YRD are larger than those of PRD and SB, and in general, the accuracy of model training increases with sample size.
- What are the specific parameters of the LSTM, RF-Ps and LightGBM models, respectively?
- As for the distribution of PM2.5, SO2 and O3 in Figure 8, can the authors overlay the site observations on the figure to increase credibility, or add a comparison with other studies or products? After all, I have noticed that the R2 and RMSE of the LSTM, RF-Ps and LightGBM models can also reach good levels, so such a large difference in distribution is unlikely.
- What causes the high ozone concentration in the southern part of the Qinghai-Tibet Plateau? What is the physical quantity in Figure S1? What are the reasons for the high values on the Tibetan Plateau?
- What are the reasons for selecting these auxiliary variables for pollutant estimation?
- Why is the resolution of latitude in the data processing grid 0.008° instead of 0.01°?
- Why use random ID instead of other ways of representing spatial locations? What are the benefits of this?
- Some figures are blurry, and the colors are hard to distinguish.
- In Figure 10, the authors use the SHAP method to output the feature importance of the model. I noticed that in the O3 model, the temperature feature importance score is not high. Studies have shown that temperature affects the photochemical reactions that produce ozone and has an important effect on ozone concentration. Please explain why the temperature features are less important in the model.
- It is recommended to add a discussion of the advantages, disadvantages, and prospects of the work. Many people use machine learning methods to obtain pollutant data from MAIAC AOD data, and this article is not sufficiently innovative for such a high-ranking journal.
Citation: https://doi.org/10.5194/essd-2023-76-RC2
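Referee #2's question about a 0.008° versus a 0.01° latitude step comes down to how many kilometres one grid step spans. As an illustration only, the arithmetic can be sketched on a spherical Earth. The radius constant and the helper names below are assumptions of this sketch and are not taken from the manuscript.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # assumed mean radius, spherical approximation

def lat_step_km(step_deg):
    """North-south length of a grid step given in degrees of latitude."""
    return math.radians(step_deg) * EARTH_RADIUS_KM

def lon_step_km(step_deg, lat_deg):
    """East-west length of a grid step in degrees of longitude at a latitude;
    it shrinks with the cosine of the latitude."""
    return lat_step_km(step_deg) * math.cos(math.radians(lat_deg))

if __name__ == "__main__":
    for step in (0.008, 0.01):
        print(f"{step} deg latitude  = {lat_step_km(step):.3f} km")
    for lat in (20.0, 35.0, 50.0):
        print(f"0.01 deg longitude at {lat} N = {lon_step_km(0.01, lat):.3f} km")
```

Under this approximation a 0.01° step in latitude is about 1.11 km and a 0.008° step about 0.89 km. The sketch shows the arithmetic only and does not state the authors' reason for their grid choice.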
RC3: 'Comment on essd-2023-76', Anonymous Referee #3, 17 Apr 2023
In this paper, the spatiotemporal distribution of PM2.5, SO2, and ozone in China from 2015 to 2020 was obtained using the LightGBM model based on ground observations, remote sensing products, meteorological data, and auxiliary data. High-resolution mapping of air pollutants is useful for China; from this point of view, this study has some value and may attract a broad readership. However, there are numerous issues with this paper that must be resolved.
Major issues:
- Many papers have used remote sensing data to estimate long-term air pollution datasets in China, while this article uses the LightGBM model to estimate PM2.5, SO2, and Ozone in China from 2015 to 2020. The authors should clearly explain the contribution of this article, and compare the advantages and disadvantages of their dataset with other published datasets. Such comparisons must be discussed in the article.
- In this article, 1-km PM2.5 is estimated based on MAIAC AOD, while for ozone and SO2, the spatial resolution of the remote sensing data is 0.25°. How are ozone and SO2 estimated at a resolution of 1 km? How reliable is this estimation?
- In the Methods section, spatial sampling has been used in many papers, whereas random ID and parameter convolution are confusing. It is recommended to explain them clearly. These techniques are introduced into the LightGBM model; how much do they improve the accuracy of the model? This part needs to be supplemented and discussed.
- From the results of the article, the contribution of DOY and Year is large. How can this phenomenon be explained? Two questions need to be considered. First, the large contribution of these two variables indicates that they alone can already estimate PM2.5, SO2 and ozone well. Is the result reliable? Second, is this model feasible for predicting time series? Because predictions for other time periods are independent of the training set, will the inclusion of these two variables affect the prediction results?
- From the estimation results (Figure 11), the spatial distribution of O3 is significantly different from that in published papers and datasets, and is also inconsistent with the general understanding of ozone precursor emissions. How do the authors address this issue?
- The English writing of this article requires significant improvement.
Minor issues:
- Line 22: What is “random ID”? It is confusing here. “sequentialsimulation” → “sequential simulation”.
- Lines 52-54: Why was only the full name of ozone given, while PM2.5 and SO2 were given their abbreviations directly?
- Line 66: Please provide the full names of SCIAMACHY and ENVISAT.
- Line 101: LightGBM is a machine learning model; why use “machine learning-based”?
- Line 179: What does “RF-Ps” mean?
- Line 206: Is the center pixel not included?
- Lines 525-526: Isn't it inappropriate to say 'physical variables'? These variables are not derived from a physical perspective.
Citation: https://doi.org/10.5194/essd-2023-76-RC3
Data sets
Spatial distribution of various air pollutants in China at 1 km (SO2 2018-03-21:2020-12-31), Yufeng Chi, https://doi.org/10.5281/zenodo.7580714
Viewed
HTML | PDF | XML | Total | Supplement | BibTeX | EndNote |
---|---|---|---|---|---|---|
908 | 241 | 46 | 1,195 | 75 | 43 | 54 |