the Creative Commons Attribution 4.0 License.
Tracking County-level Cooking Emissions and Their Drivers in China from 1990 to 2021 by Ensemble Machine Learning
Abstract. Cooking emissions are a significant source of PM2.5, posing considerable public health risks due to their high toxicity and proximity to densely populated areas. Despite their importance, there is currently a lack of an accurate, long-term, high-resolution national cooking emission inventory in China, primarily due to the challenges in obtaining high-quality activity level data over extended periods and at fine spatial scales. Here, we address these limitations by leveraging advanced machine learning techniques to predict activity levels and further estimate emissions.
Specifically, we develop an ensemble of Random Forest (RF), eXtreme Gradient Boosting (XGBoost), Multilayer Perceptron Neural Network (MLP), and Deep Neural Network (DNN) models to predict cooking activity levels across Chinese counties from statistical indicators related to population, economy, and the catering industry. The ensemble model shows strong generalization and transferability (R² = 0.892–0.989), outperforming traditional statistical models and individual machine learning models. Unlike previous inventories that rely on simplistic proxy data such as population for calculation and downscaling, our inventory directly calculates county-level cooking emissions, providing more accurate emission estimates and spatial distributions. Furthermore, we incorporate critical but previously missing toxic pollutants, such as ultrafine particles (UFPs) and polycyclic aromatic hydrocarbons (PAHs), into the national cooking emission inventory. The result is China's first county-level cooking emission inventory, spanning 1990 to 2021, with high spatial resolution and wide pollutant coverage.
According to our inventory, in 2021, China's total cooking emissions of organics in the full volatility range, PM2.5, UFPs, and PAHs are 997 kt, 408 kt, 6.50 × 10²⁵ particles, and 15.8 kt, respectively. From 1990 to 2021, emissions of these pollutants increased by over 65 %, and their spatiotemporal trends were affected to varying degrees by external factors, such as population migration, economic development, pollution control policies, and the pandemic, in different periods. Using the SHapley Additive exPlanations (SHAP) algorithm, we further analyze the contribution patterns of key driving factors, such as urbanization rate, population, and local emission factors, to emission changes. Notably, driver analysis reveals that existing control measures are insufficient to curb the rapid growth of emissions, necessitating enhanced controls. Regarding control strategies, our county-level inventory finds that 62.3 % of China's organic emissions are concentrated in 30 % of the counties, which are densely populated and occupy only 14.4 % of the national land area. Therefore, prioritizing control of these areas will be an efficient and targeted strategy. Our research provides crucial data and insights for understanding the impact of cooking emissions on air pollution and health, aiding in policy development. Our long-term, high-resolution emission datasets are publicly available at https://doi.org/10.6084/m9.figshare.26085487 (Li et al., 2025).
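The concentration figure quoted above (a share of emissions falling in a fixed fraction of counties, alongside the land area those counties occupy) is a ranked cumulative share. The following is a minimal sketch of that kind of calculation, not the authors' procedure: it assumes counties are ranked by total emission, which the abstract does not state, and the function name `concentration` and the toy numbers are hypothetical.

```python
def concentration(counties, top_fraction=0.30):
    """Emission share and area share of the highest-emitting counties.

    counties: list of (emission, land_area) tuples, one per county.
    top_fraction: fraction of counties to keep after ranking by emission
                  (ranking criterion is an assumption of this sketch).
    """
    if not 0 < top_fraction <= 1:
        raise ValueError("top_fraction must be in (0, 1]")
    # Rank counties from highest to lowest emission.
    ranked = sorted(counties, key=lambda c: c[0], reverse=True)
    n_top = max(1, round(top_fraction * len(ranked)))
    top = ranked[:n_top]
    total_emission = sum(e for e, _ in ranked)
    total_area = sum(a for _, a in ranked)
    return (sum(e for e, _ in top) / total_emission,
            sum(a for _, a in top) / total_area)


# Toy data only: (emission, land area) in arbitrary units.
toy_counties = [(50, 1), (20, 2), (10, 2), (5, 5), (5, 10),
                (4, 10), (3, 20), (1, 20), (1, 15), (1, 15)]
emission_share, area_share = concentration(toy_counties)
print(f"top 30% of counties: {emission_share:.1%} of emissions, "
      f"{area_share:.1%} of land area")
```

Ranking by emission density (emission per unit area) instead of total emission would only require changing the sort key, and could give a different set of priority counties.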
Status: open (until 01 Jun 2025)
RC1: 'Comment on essd-2025-104', Anonymous Referee #1, 21 Apr 2025
This study presents a comprehensive county-level cooking emission inventory for China (1990-2021) using ensemble machine learning, with inclusion of UFPs and PAHs. The work is methodologically sound and provides valuable datasets for air quality research. The manuscript is well organized, and I recommend publication after minor revisions.
Specific comments:
Introduction: Consider briefly introducing the unique characteristics of Chinese cooking and the special requirements they impose on emission inventory construction, which will help international readers better understand the importance of the research.
Lines 165-166: Clarify how "variables of lower importance" were determined (e.g., specific threshold for RF feature importance scores).
Lines 215-217: "directly calculates county-level cooking emissions" is inaccurate. The emissions in this study are still estimated through machine learning predictions rather than calculated directly, so the statement needs to be revised.
Figure 3: Provide sector-specific spatial distributions (commercial/residential/canteen) for a representative year in the supplement.
Lines 332-333: Provide percentage contributions of key regions (Beijing-Tianjin-Hebei, Yangtze Delta, etc.) to national total emissions.
Citation: https://doi.org/10.5194/essd-2025-104-RC1
RC2: 'Comment on essd-2025-104', Anonymous Referee #2, 22 May 2025
This study employed an ensemble machine learning approach to develop, for the first time, a county-level cooking emission inventory for China spanning from 1990 to 2021. The inventory included key pollutants such as PM2.5, ultrafine particles (UFPs), and polycyclic aromatic hydrocarbons (PAHs), thereby filling a critical gap in high-resolution, long-term, and multi-pollutant emission datasets. By significantly improving the accuracy and interpretability of cooking emission estimates, the study also revealed the long-term influences of demographic shifts, urbanization, and catering industry development. The results provided essential scientific evidence and a robust data foundation for enhancing air pollution control and informing targeted emission reduction strategies. However, several issues need to be addressed before publication, and I recommend major revision. My detailed concerns and suggestions follow.
- In the Introduction, it is recommended to more systematically review existing domestic and international methodologies for constructing cooking emission inventories, along with their limitations. The study’s innovations in spatiotemporal resolution, pollutant coverage, and emission estimation accuracy should be quantitatively emphasized. In addition, the discussion of health risks associated with cooking emissions would benefit from stronger empirical evidence. The roles of ensemble machine learning and SHAP in addressing key scientific challenges should also be clarified, thereby strengthening the logical connection between methodological choices and research objectives.
- In Section 2.1, the high-resolution cooking activity data used for model training only cover the period from 2015 to 2021. It remains unclear whether these data are sufficient to support reliable backcasting of activity levels for earlier periods such as the 1990s and early 2000s. The authors are advised to explicitly discuss the potential temporal bias introduced by this limited training window and to clarify whether any sensitivity analysis or validation was conducted to assess its impact on historical emission estimates and associated uncertainties.
- In Section 2.1, between 1990 and 2020, numerous counties in China underwent administrative changes, including mergers, splits, and renaming. The authors should provide a detailed explanation of how historical data were spatially mapped to the standardized 2020 county boundary system. Furthermore, they should address whether this harmonization process might introduce artificial discontinuities or aggregation errors, and whether any validation was performed to assess the accuracy and consistency of the spatial mapping.
- In Section 2.2, while the ensemble model integrates Random Forest, XGBoost, MLP, and DNN, the final fusion is implemented via ridge regression. This linear and relatively simple approach may not fully capture the complementary strengths of the base models. The authors are encouraged to consider alternative fusion strategies, such as weighted averaging, stacking ensembles, or dynamic weight allocation based on validation performance, and to provide comparative results to justify the choice of ridge regression.
- In Section 2.3, the test set includes data from 2015-2016 and 2020-2021, which are temporally adjacent to the training period. This design may not adequately evaluate the model's extrapolation performance for earlier years, such as the 1990s. The authors are advised to consider using forward validation or rolling window techniques to assess the model’s robustness over longer temporal horizons. It is also recommended to discuss potential structural changes in the data over time and their implications for prediction reliability.
- In Section 2.4, although the SHAP method provides valuable insights into feature contributions, it is fundamentally a model interpretability tool rather than a causal inference technique. Using SHAP to identify the “driving forces” behind emission trends may risk confounding, especially in the absence of controls for latent variables. The authors should acknowledge this limitation and consider supplementing the analysis with causal inference methods to strengthen the validity of the claimed relationships.
- In Section 3.2, although Figure 3 clearly illustrates the spatial expansion of high-emission areas, the underlying socioeconomic drivers such as population migration, urbanization dynamics, and regional policy changes are not explicitly addressed. The authors are encouraged to analyze these factors in more depth and clarify whether such dynamics were incorporated into the modeling framework or considered in the interpretation of the results.
- In Section 3.3, the one-factor-at-a-time method assumes that variables change independently. However, in real-world socioeconomic systems, variables are often highly correlated. It is recommended that the authors assess the potential impact of multicollinearity on the sensitivity analysis results. Additionally, it may be beneficial to explore multivariable perturbation approaches or incorporate causal structural models to validate the robustness of the findings.
- In Section 4.2, the authors highlight that the dominant drivers of cooking emission growth vary significantly across different time periods, including factors such as population growth, urbanization, industrial restructuring, and the rapid expansion of the catering industry. However, the analysis lacks a quantitative evaluation of the emission reduction effects of policy interventions implemented during these periods. The current discussion on policy impacts remains largely qualitative. It is recommended that the authors incorporate systematic quantitative approaches, such as the development of policy stringency indicators or counterfactual analyses, to comprehensively assess the actual contribution of policy implementation to emission reductions over time.
- In Section 6, the authors note that current pollution control measures are insufficient to curb the continued growth of cooking emissions, particularly in high-emission counties. However, the study does not provide specific, actionable technological pathways or policy instruments. Given that the research has already identified the spatial distribution and socioeconomic characteristics of these high-emission areas, it is recommended to explore differentiated mitigation strategies that take into account emission intensity, population exposure risk, and local resource capacity.
- It is recommended to further modify the format of the manuscript and correct some grammatical errors.
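To make the Section 2.2 comment concrete: fusing base-model predictions with ridge regression amounts to solving a small regularized least-squares problem for one weight per base model. The sketch below illustrates that fusion step under stated assumptions and is not the authors' implementation: it uses no intercept, a hypothetical penalty `lam`, toy numbers, and only the Python standard library. The names `ridge_fusion_weights` and `fuse` are illustrative.

```python
def ridge_fusion_weights(base_preds, targets, lam=1.0):
    """Return ridge weights w minimising ||P w - y||^2 + lam * ||w||^2.

    base_preds: list of rows, one row per sample, one column per base model.
    targets:    observed values, one per sample.
    lam:        ridge penalty (hypothetical value; no intercept is fitted).
    """
    k = len(base_preds[0])
    # Normal equations: (P^T P + lam * I) w = P^T y
    A = [[sum(r[i] * r[j] for r in base_preds) + (lam if i == j else 0.0)
          for j in range(k)] for i in range(k)]
    b = [sum(r[i] * t for r, t in zip(base_preds, targets)) for i in range(k)]
    # Gaussian elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p], b[c], b[p] = A[p], A[c], b[p], b[c]
        for r in range(c + 1, k):
            f = A[r][c] / A[c][c]
            A[r] = [x - f * y for x, y in zip(A[r], A[c])]
            b[r] -= f * b[c]
    # Back substitution.
    w = [0.0] * k
    for i in reversed(range(k)):
        w[i] = (b[i] - sum(A[i][j] * w[j] for j in range(i + 1, k))) / A[i][i]
    return w


def fuse(row, weights):
    """Combine one sample's base-model predictions into a single estimate."""
    return sum(p * w for p, w in zip(row, weights))


# Toy example: two base models; the target is the mean of their predictions.
toy_preds = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]]
toy_targets = [0.5 * a + 0.5 * b for a, b in toy_preds]
weights = ridge_fusion_weights(toy_preds, toy_targets, lam=0.0)
print("fusion weights:", weights)
print("fused estimate for first sample:", fuse(toy_preds[0], weights))
```

In a stacking setup the rows of `base_preds` would normally be out-of-fold predictions, so that the fusion weights are not fitted on values the base models have already seen. Comparing this against the alternatives the referee lists (for example validation-weighted averaging) would need the same held-out data for every strategy.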
Citation: https://doi.org/10.5194/essd-2025-104-RC2
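The forward validation or rolling window design raised in the Section 2.3 comment can be illustrated with a rolling-origin split, in which every test year lies strictly outside the range of the training years. The sketch below is one possible reading of that suggestion, not something taken from the manuscript: the function name, the `min_train` and `horizon` parameters, and the `backcast` flag (which reverses time so that later years predict earlier ones, mimicking hindcasting) are all hypothetical.

```python
def rolling_origin_splits(years, min_train=3, horizon=1, backcast=False):
    """Yield (train_years, test_years) pairs with an expanding training window.

    years:     iterable of calendar years with observations.
    min_train: number of years in the first training window.
    horizon:   number of consecutive years held out after each window.
    backcast:  if True, iterate from the latest year backwards, so each
               test block is earlier than every training year.
    """
    ordered = sorted(set(years), reverse=backcast)
    for end in range(min_train, len(ordered) - horizon + 1):
        yield ordered[:end], ordered[end:end + horizon]


# The referee notes that the training data cover 2015 to 2021.
forward = list(rolling_origin_splits(range(2015, 2022)))
backward = list(rolling_origin_splits(range(2015, 2022), backcast=True))
for train, test in backward:
    print("train on", train, "-> test on", test)
```

With only seven years of observations, even the backcast variant tests extrapolation over a few years at most, so it can indicate how quickly skill degrades with temporal distance but cannot by itself validate estimates for the 1990s.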
Data sets
High-resolution emission inventory of full-volatility organic from cooking source in China during 2015-2021. Zeqi Li, Bin Zhao, Shengyue Li, Zhezhe Shi, Dejia Yin, Qingru Wu, Fenfen Zhang, Xiao Yun, Guanghan Huang, Yun Zhu, and Shuxiao Wang. https://doi.org/10.6084/m9.figshare.26085487