the Creative Commons Attribution 4.0 License.
Global Scenario Reference Datasets for Climate Change Integrated Assessment with Machine Learning
Abstract. The deepening of global climate change research and increasingly complex integrated assessment methods generate large amounts of heterogeneous data. The rapid development of artificial intelligence (AI) models, particularly large language models (LLMs) and deep learning techniques, has enhanced the ability to handle vast data, providing new approaches and perspectives for climate analysis. To address the demand for multi-dimensional and comparable scenario design in climate change prediction and policy simulation, this study employs hybrid machine learning techniques to collect and process scenario data from existing literature, developing the Global Climate Scenario Reference datasets (GCSR). The GCSR incorporates data from approximately 90,000 articles across multiple temporal and spatial scales and extracts approximately 53,185 scenarios. With its large scale, extensive coverage, and detailed classification, the GCSR provides a robust foundation for climate change prediction, risk assessment, mitigation policy, and adaptation strategy planning, supporting scenario design in related fields.
Status: open (until 21 Jul 2025)
RC1: 'Comment on essd-2025-299', Yang Ou, 15 Jun 2025
Wei et al. present a valuable effort to develop a Global Climate Scenario Reference (GCSR) dataset using hybrid machine learning and large language model techniques. The authors have done an impressive job in collecting a vast amount of literature, extracting scenario-relevant information, and building a searchable database that could support the climate modeling and policy communities. The technical description of methods, from scenario extraction to semantic cleaning, keyword recognition, and topic classification, is detailed and appears rigorous. However, several aspects, especially those related to the practical value and interpretability of the dataset, would benefit from further clarification.
One major concern relates to the nature of the keywords extracted by the ML models. As shown in Figure 4, many of the top keywords, such as “carbon,” “emissions,” “future,” “scenario,” “represents,” or even “high,” appear overly generic or linguistic in nature rather than offering deep insight into the content of a scenario. While these extracted terms may reflect high-frequency usage, they do not always help differentiate scenario narratives in a policy-relevant or disciplinary sense. Compared to author-provided keywords, which may be more targeted (e.g., “carbon pricing,” “renewable deployment,” “bioenergy with CCS”), the ML-extracted terms risk being semantically shallow. It would strengthen the contribution of the paper to more clearly demonstrate how the ML-based keyword extraction adds value beyond the original metadata—perhaps through examples where the system uncovers meaningful connections that conventional indexing would miss.
The results section, while methodologically rich, could be significantly enhanced by including concrete use cases. For example, it would be useful to see how a researcher interested in "carbon tax" scenarios could use the GCSR to find the top 10 most relevant articles or narratives. At present, the paper primarily demonstrates labeling and classification capabilities, but it falls short of showing how the system operates in practice. A few scenario search walkthroughs would help illustrate how the system assists users in navigating complex and fragmented literature. Such applications are important, especially if the database aims to support scenario design or policy planning, as claimed.
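One way to picture the kind of walkthrough requested above is a small BM25 ranking over scenario narratives. The sketch below is a minimal, self-contained illustration and not the GCSR implementation: the `narratives` list is invented toy text, `tokenize` and `bm25_rank` are hypothetical helper names, and the parameters `k1=1.5` and `b=0.75` are common defaults assumed here rather than values taken from the manuscript.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real pipelines do more)."""
    return text.lower().split()


def bm25_rank(query, docs, k1=1.5, b=0.75, top_k=10):
    """Return up to top_k (doc_index, score) pairs, best match first."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: in how many narratives does each term occur?
    doc_freq = Counter()
    for toks in tokenized:
        doc_freq.update(set(toks))

    scored = []
    for idx, toks in enumerate(tokenized):
        term_freq = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # Non-negative IDF variant: rare terms weigh more than common ones.
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Length normalization: long narratives need more hits to score high.
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        if score > 0:
            scored.append((idx, score))

    # Sort by descending score, breaking ties by original position.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_k]


# Hypothetical toy narratives standing in for extracted scenario text.
narratives = [
    "a carbon tax rises steadily to 2050 and revenue funds renewable deployment",
    "high emissions scenario with no new climate policy",
    "regional adaptation scenario focused on coastal flood risk",
    "carbon tax combined with bioenergy and carbon capture",
]

results = bm25_rank("carbon tax", narratives, top_k=10)
for idx, score in results:
    print(f"{score:.3f}  {narratives[idx]}")
```

Intuitively, BM25 rewards narratives that mention the query terms often, discounts terms that appear in many narratives, and normalizes for length so that long texts do not win on size alone.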
Additionally, the spatial scale of the collected scenarios is not explicitly discussed. Since climate policy and scenario relevance often depend on spatial context—global, national, or subnational—it is important to clarify whether the GCSR provides regional tags or metadata. Can users search for mitigation strategies in China versus Sub-Saharan Africa? Do scenarios specify context-sensitive assumptions or are they abstracted from place? These questions are critical for users who work at the interface of regional policy and global climate modeling.
The positioning of the GCSR relative to established IAM scenario databases like the IPCC AR6 Scenario Explorer could also be articulated more clearly. While the paper mentions that existing datasets emphasize quantitative results and that GCSR focuses on narratives, the potential complementarity is not fully developed. The IPCC, for example, provides structured scenario classification systems like C1–C9 (based on climate outcomes) and P1–P4 (based on mitigation strategies), and it would be valuable to discuss whether similar crosswalks could be created between GCSR classifications and those IPCC categories. Doing so would not only help validate the extracted scenario dimensions but also offer users a richer, multidimensional perspective on scenario content that bridges qualitative and quantitative insights.
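As a rough illustration of what such a crosswalk could look like, the sketch below counts how often a GCSR-style topic label co-occurs with an IPCC climate category. Everything in it is assumed for illustration: the `records` pairs are invented toy data, the topic labels are hypothetical, and `crosswalk` is not a function of any existing tool.

```python
from collections import Counter

# Hypothetical records: (GCSR topic label, IPCC climate category).
# Toy values only; they do not come from the GCSR or the AR6 database.
records = [
    ("carbon pricing", "C1"),
    ("carbon pricing", "C3"),
    ("renewable deployment", "C1"),
    ("carbon pricing", "C1"),
]


def crosswalk(pairs):
    """Build a nested dict: topic -> {category: number of scenarios}."""
    table = {}
    for (topic, category), count in Counter(pairs).items():
        table.setdefault(topic, {})[category] = count
    return table


table = crosswalk(records)
for topic, row in sorted(table.items()):
    print(topic, dict(sorted(row.items())))
```

A table of this shape would show at a glance which narrative themes cluster under which climate outcomes, and where a theme has no quantitative counterpart at all.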
Lastly, the abstract and conclusions hint at broader applications, such as supporting prediction, risk assessment, and policy development, but these remain vague. Beyond indexing literature, what can the GCSR enable in terms of scenario co-design, participatory workshops, or identifying blind spots in existing scenario narratives? There is potential for this work to support innovative scenario generation or narrative-based model input creation, but that vision should be more clearly described. A fuller articulation of future use cases would help readers grasp the transformative potential of the GCSR and distinguish it from existing bibliometric or scenario archives.
A minor point: the technical sections (especially on BM25 and BERTopic) are sound but could be lightened with intuitive explanations for general ESSD readers.
Citation: https://doi.org/10.5194/essd-2025-299-RC1