the Creative Commons Attribution 4.0 License.
Reconstructing long-term (1980–2022) daily ground particulate matter datasets in India (LongPMInd)
Abstract. Severe airborne particulate matter (PM, including PM2.5 and PM10) pollution in India has caused widespread concern. Accurate PM datasets are fundamental for scientific policymaking and health impact assessment, but surface observations in India are limited because monitoring sites are scarce and unevenly distributed. In this work, a simply structured, efficient, and robust model based on the Light Gradient Boosting Machine (LightGBM) was developed to fuse multi-source data and estimate long-term (1980–2022) historical daily ground PM datasets in India (LongPMInd). The LightGBM model shows good accuracy, with out-of-sample, out-of-site, and out-of-year cross-validation (CV) test R² of 0.77, 0.70, and 0.66, respectively. Small performance gaps between PM2.5 training and testing (delta RMSE of 1.06, 3.83, and 7.74 μg m⁻³) indicate a low risk of overfitting. Given this generalization ability, openly accessible, long-term, and high-quality daily PM2.5 and PM10 products were then reconstructed (10 km, 1980–2022). The products show that India has experienced severe PM pollution in the Indo-Gangetic Plain (IGP), especially in winter. PM concentrations have increased significantly (p<0.05) in most regions since 2000 (0.34 μg m⁻³ year⁻¹). The turning point occurred in 2018, when the Indian government launched the National Clean Air Program; PM2.5 concentrations declined in most regions (−0.78 μg m⁻³ year⁻¹) during 2018–2022. Severe PM2.5 pollution caused a continuous increase in attributable premature mortality, from 0.73 (95 % CI: 0.65–0.80) million in 2000 to 1.22 (95 % CI: 1.03–1.41) million in 2019, particularly in the IGP, where attributable mortality increased from 0.36 to 0.60 million. The LongPMInd datasets have the potential to support multiple applications in air quality management, public health, and climate change. The daily and monthly PM2.5 and PM10 datasets are publicly accessible at https://doi.org/10.5281/zenodo.10073944 (Wang et al., 2023a).
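The three validation schemes named in the abstract differ only in how records are held out: individually (out-of-sample), by whole monitoring site (out-of-site), or by whole year (out-of-year). The following is a minimal sketch of that distinction, not the authors' code. The record fields `site` and `year`, the toy site labels, the fold count, and the helper `make_folds` are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def make_folds(records, key=None, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) index lists for k-fold cross-validation.

    key=None  : out-of-sample, each record is assigned to a fold on its own.
    key=func  : grouped CV, all records sharing key(record) stay in one fold
                (group by site for out-of-site, by year for out-of-year).
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for i, rec in enumerate(records):
        group_id = i if key is None else key(rec)
        groups[group_id].append(i)

    group_ids = sorted(groups)   # deterministic order before shuffling
    rng.shuffle(group_ids)

    for k in range(n_folds):
        # Every n_folds-th group goes to the test side of fold k.
        test_idx = sorted(i for g in group_ids[k::n_folds] for i in groups[g])
        held_out = set(test_idx)
        train_idx = [i for i in range(len(records)) if i not in held_out]
        yield train_idx, test_idx


# Hypothetical toy records: 6 sites x 5 years, one record per site-year.
records = [{"site": s, "year": y} for s in "ABCDEF" for y in range(2018, 2023)]

schemes = {
    "out-of-sample": None,
    "out-of-site": lambda r: r["site"],
    "out-of-year": lambda r: r["year"],
}

for name, key in schemes.items():
    train_idx, test_idx = next(make_folds(records, key=key))
    test_sites = sorted({records[i]["site"] for i in test_idx})
    test_years = sorted({records[i]["year"] for i in test_idx})
    print(name, "first fold holds out sites", test_sites, "years", test_years)
```

In the grouped schemes no held-out site (or year) ever appears in the training indices, which is why those scores are a stricter test of spatial or temporal generalization than the out-of-sample score.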
Status: open (extended)
RC1: 'Comment on essd-2024-34', Anonymous Referee #1, 06 Mar 2024
Review of the paper entitled “Reconstructing long-term (1980-2022) daily ground particulate matter datasets in India (LongPMInd)” by Wang et al.
Based on the Light Gradient Boosting Machine (LightGBM), this paper constructs a model for fusing multi-source data and estimating the long-term (1980-2022) historical daily ground PM datasets in India (LongPMInd). This study supplements data for regions in India lacking observation sites based on available data, providing data support for future research in areas such as air quality, public health, and climate. In light of these considerations, the manuscript could be suitable for publication after addressing the following minor comments.
Comments:
- L41-51: There are various methods for estimating ground PM2.5. Based on the analysis of the data obtained from these methods, are there differences in the PM2.5 in India?
- References need to be carefully checked:
L56-58: References should be cited in the following format: “Wei et al. (2021)”.
L216: Subscript.
L298-299: Format of the references.
Some entries in the References are incomplete, such as L396 and L435.
- L62-63: What exactly is the insufficient model robustness and implementation capacity? Is it caused by a lack of training data or is it a flaw in the model itself?
- The article mentions many machine learning methods, but the choice of LightGBM is rather abrupt: its claimed features, such as simple structure, high efficiency, and robustness, are not supported by data or literature.
- L79: Why should data larger than 99.99% be excluded?
- Is there any standard for the choice of meteorological factors? For example, why was evaporation considered, and why was humidity not used directly? L92 mentions that features were filtered by relative importance, so which features were candidates before selection, and which were filtered out in this step?
- Figure 1: The panel below should be (b) PM10.
Figure 2: Missing labels in the second column.
Table S2: Column names should also be capitalized to match the content of the article.
Figure 5: Numbers in (b) should be kept to two decimal places, and the axes should be adjusted to the range of CVD_IHD to ensure a complete presentation of the data.
- L109: GBD appears here for the first time and should be given with its full name.
- L119: Does it mean that people will experience health effects related to PM2.5 when the concentration of PM2.5 is in this range?
- L140: “RMSE (35.35 and 60.65 μg m⁻³) and MAE (21.54 and 40.74 μg m⁻³)” I don't think they can be described as small.
- L241: PM“1”? Was it a mistake?
- The references in the introduction can be reinforced. For example, random forest and LightGBM have also been used to construct PM2.5 and ozone data in China (e.g., Li et al., 2021; Ni et al., 2024).
References:
Li, H., Yang, Y., Wang, H., Li, B., Wang, P., Li, J., and Liao, H., Constructing a spatiotemporally coherent long-term PM2.5 concentration dataset over China during 1980–2019 using a machine learning approach, Sci. Total Environ., 765, 144263, https://doi.org/10.1016/j.scitotenv.2020.144263, 2021.
Ni, Y., Yang, Y., Wang, H., Li, H., Li, M., Wang, P., Li, K., and Liao, H., Contrasting changes in ozone during 2019–2021 between eastern and the other regions of China attributed to anthropogenic emissions and meteorological conditions, Sci. Total Environ., 908, 168272, https://doi.org/10.1016/j.scitotenv.2023.168272, 2024.
Citation: https://doi.org/10.5194/essd-2024-34-RC1
Data sets
LongPMInd: long-term (1980-2022) daily ground particulate matter datasets in India Shuai Wang, Sri Harsha Kota, and Hongliang Zhang https://zenodo.org/records/10073944
Viewed
HTML | PDF | XML | Total | Supplement | BibTeX | EndNote |
---|---|---|---|---|---|---|
319 | 48 | 17 | 384 | 23 | 19 | 14 |