Global Scenario Reference Datasets for Climate Change Integrated Assessment with Machine Learning
Abstract. As global climate change research deepens and integrated assessment methods grow more complex, they generate large amounts of heterogeneous data. The rapid development of artificial intelligence (AI) models, particularly large language models (LLMs) and deep learning techniques, has enhanced the ability to handle vast amounts of data, providing new approaches and perspectives for climate analysis. To address the demand for multi-dimensional and comparable scenario design in climate change prediction and policy simulation, this study employs hybrid machine learning techniques to collect and process scenario data from the existing literature, developing the Global Climate Scenario Reference datasets (GCSR). The GCSR incorporates data from approximately 90,000 articles spanning multiple temporal and spatial scales, from which 53,185 scenarios are extracted. With its large scale, extensive coverage, and detailed classification, the GCSR provides a robust foundation for climate change prediction, risk assessment, mitigation policy, and adaptation strategy planning, supporting scenario design in related fields.
Status: open (extended)
RC1: 'Comment on essd-2025-299', Yang Ou, 15 Jun 2025
Wei et al. present a valuable effort to develop a Global Climate Scenario Reference (GCSR) dataset using hybrid machine learning and large language model techniques. The authors have done an impressive job in collecting a vast amount of literature, extracting scenario-relevant information, and building a searchable database that could support the climate modeling and policy communities. The technical description of methods (from scenario extraction to semantic cleaning, keyword recognition, and topic classification) is detailed and appears rigorous. However, several aspects, especially related to the practical value and interpretability of the dataset, would benefit from further clarification.
One major concern relates to the nature of the keywords extracted by the ML models. As shown in Figure 4, many of the top keywords, such as “carbon,” “emissions,” “future,” “scenario,” “represents,” or even “high,” appear overly generic or linguistic in nature rather than offering deep insight into the content of a scenario. While these extracted terms may reflect high-frequency usage, they do not always help differentiate scenario narratives in a policy-relevant or disciplinary sense. Compared to author-provided keywords, which may be more targeted (e.g., “carbon pricing,” “renewable deployment,” “bioenergy with CCS”), the ML-extracted terms risk being semantically shallow. It would strengthen the contribution of the paper to more clearly demonstrate how the ML-based keyword extraction adds value beyond the original metadata—perhaps through examples where the system uncovers meaningful connections that conventional indexing would miss.
The results section, while methodologically rich, could be significantly enhanced by including concrete use cases. For example, it would be useful to see how a researcher interested in "carbon tax" scenarios could use the GCSR to find the top 10 most relevant articles or narratives. At present, the paper primarily demonstrates labeling and classification capabilities, but it falls short in showing how this system operates in practice. A few scenario search walkthroughs would help illustrate how the system assists users in navigating complex and fragmented literature. These applications are important, especially if the database aims to support scenario design or policy planning, as claimed.
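To make this concrete, here is a minimal sketch of what such a top-10 retrieval could look like, assuming the scenario narratives are stored as plain-text strings and using the open-source rank_bm25 package; the toy narratives are invented and this is not the GCSR's actual interface:

```python
# Hypothetical "carbon tax" search over scenario narratives with BM25.
# Assumes: narratives as plain strings; rank_bm25 (pip install rank-bm25).
from rank_bm25 import BM25Okapi

narratives = [
    "A carbon tax of 50 USD/tCO2 is introduced in 2030 and rises annually.",
    "Renewable deployment accelerates under an extended feed-in tariff.",
    "Bioenergy with CCS scales up after 2040 along a net-zero pathway.",
    # ... in practice, all ~53,185 GCSR scenario narratives
]

tokenized = [n.lower().split() for n in narratives]
bm25 = BM25Okapi(tokenized)

query = "carbon tax".lower().split()
# Score every narrative against the query and keep the ten best matches.
top_10 = bm25.get_top_n(query, narratives, n=10)
for rank, hit in enumerate(top_10, start=1):
    print(rank, hit)
```

A walkthrough of this kind, run against the real database, would demonstrate the practical retrieval value the abstract claims.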
Additionally, the spatial scale of the collected scenarios is not explicitly discussed. Since climate policy and scenario relevance often depend on spatial context—global, national, or subnational—it is important to clarify whether the GCSR provides regional tags or metadata. Can users search for mitigation strategies in China versus Sub-Saharan Africa? Do scenarios specify context-sensitive assumptions or are they abstracted from place? These questions are critical for users who work at the interface of regional policy and global climate modeling.
The positioning of the GCSR relative to established IAM scenario databases like the IPCC AR6 Scenario Explorer could also be articulated more clearly. While the paper mentions that existing datasets emphasize quantitative results and that GCSR focuses on narratives, the potential complementarity is not fully developed. The IPCC, for example, provides structured scenario classification systems like C1–C9 (based on climate outcomes) and P1–P4 (based on mitigation strategies), and it would be valuable to discuss whether similar crosswalks could be created between GCSR classifications and those IPCC categories. Doing so would not only help validate the extracted scenario dimensions but also offer users a richer, multidimensional perspective on scenario content that bridges qualitative and quantitative insights.
Lastly, the abstract and conclusions hint at broader applications, such as supporting prediction, risk assessment, and policy development, but these remain vague. Beyond indexing literature, what can the GCSR enable in terms of scenario co-design, participatory workshops, or identifying blind spots in existing scenario narratives? There is potential for this work to support innovative scenario generation or narrative-based model input creation, but that vision should be more clearly described. A fuller articulation of future use cases would help readers grasp the transformative potential of the GCSR and distinguish it from existing bibliometric or scenario archives.
A minor point: the technical sections (especially on BM25 and BERTopic) are sound but could be lightened with intuitive explanation for general ESSD readers.
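For instance, an intuitive gloss of BM25 could start from the standard Okapi scoring function (reproduced here for reference; this is the textbook form, not necessarily the exact variant used in the paper):

$$\mathrm{score}(D,Q)=\sum_{q_i\in Q}\mathrm{IDF}(q_i)\cdot\frac{f(q_i,D)\,(k_1+1)}{f(q_i,D)+k_1\left(1-b+b\,\frac{|D|}{\mathrm{avgdl}}\right)}$$

where $f(q_i,D)$ is the frequency of query term $q_i$ in document $D$, $|D|$ is the document's length, $\mathrm{avgdl}$ is the average document length in the corpus, and $k_1$ and $b$ are tuning parameters (commonly $k_1 \approx 1.2$ to $2.0$ and $b = 0.75$). In plain language: a document ranks highly when it mentions the query terms often relative to its length, and when those terms are rare in the corpus as a whole. A restatement along these lines would serve general ESSD readers.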
Citation: https://doi.org/10.5194/essd-2025-299-RC1
RC2: 'Comment on essd-2025-299', Anonymous Referee #2, 05 Aug 2025
Review of “Global Scenario Reference Datasets for Climate Change Integrated Assessment with Machine Learning”
Summary and recommendation: In this paper, the authors generate a dataset of climate change integrated assessment studies using a machine-learning-based approach. Their approach combines an LLM with clustering methods and quality control to generate a dataset that classifies studies across different characteristics. While the paper is well written, and I am generally supportive of using LLMs and other machine learning (ML) methods to understand datasets, I found the paper generally lacking a strong justification for publication in ESSD. I therefore recommend rejection of the paper in its current state. I have added detailed comments below that hopefully explain my decision. My major concerns are as follows:
- Novelty and utility relative to current studies: First, the authors have used existing ML methods to “classify” rather than analyze existing papers on climate change integrated assessment. While this is somewhat useful, I believe it is no more than a classification exercise rather than actual data development, and the utility of such a dataset to the community is overstated. Several papers have used ML methods to understand drivers of climate effects (e.g., https://www.nature.com/articles/s44168-025-00251-4) or used LLMs to evaluate claims related to climate change (e.g., https://www.nature.com/articles/s44168-025-00215-8). Relative to this existing body of research, the classification exercise the authors have performed, while interesting, does not justify publication as a dataset-style paper.
- Treatment of the outcome variable as a discrete variable: One especially problematic aspect of this paper is that scenarios are classified as discrete, i.e., each scenario can belong to one group or another but not to several. This largely ignores the highly multi-disciplinary efforts that go into integrated assessment modelling studies; many integrated assessment studies address causes and impacts across several dimensions at once. In fact, I would expect ML techniques to capture such heterogeneity inherent in the scenarios (a multi-label sketch of what this could look like appears as the first block after this list). Why would the results from the methods presented here be any different from a classification and regression tree (CART)?
- Comparison to other methods: Building on point 2, how would this method compare to a simple classification algorithm, given that the end product is a classified dataset? If text classification is the most important part of this analysis, then a simple tf-idf vectorizer would have provided the results the authors were looking for; in fact, a tf-idf vectorizer provides a “score” as opposed to a simple classification (see the second sketch after this list). Features such as “duplication removal”, “text cleaning”, and “high-frequency word statistics” (all mentioned in the paper) are available in standard Python text-processing packages alongside the tf-idf vectorizer. This is an important point to address: if the same dataset can be constructed using simple classification, it calls into question the need for such complexity.
- Utilization of an existing LLM: A key part of this paper is the use of an LLM (DeepSeek) to analyze the current body of scenarios. While this is not a problem in itself, since the use of an existing tool is such a large and prominent part of the paper, I do not see much value added beyond it. I acknowledge that the authors have tried to describe DeepSeek's usage in detail, but there is no way of evaluating the effect of the current weights in DeepSeek's model on the search results, which makes the results presented here questionable, or at the very least unreproducible.
- Evaluation of results (lack of out-of-sample testing): A very important part of any analysis that uses ML-based methods is out-of-sample testing, to ensure that no overfitting is involved. I could not find any mention of out-of-sample testing to evaluate these methods. One way this could be conducted is to give the method a sample it was not trained on and check whether it reproduces the classification (see the third sketch after this list).
- Lack of emphasis on the BERTopic values: One part of the manuscript I did find intriguing was the BERTopic values. On examination of the final dataset, these appear to be continuous values. The interpretation of this variable should be explained in more detail. If the authors consider resubmitting this paper, they should focus on this variable rather than the simple text-based classification shown here.
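On point 2, the multi-group membership described above could be represented with a multi-label encoding rather than one discrete class per scenario. A minimal sketch follows; the dimension names are invented for illustration:

```python
# Multi-label encoding: a scenario may belong to several dimensions at
# once, instead of exactly one discrete class.
from sklearn.preprocessing import MultiLabelBinarizer

scenario_dims = [
    {"mitigation", "energy"},              # touches two dimensions
    {"adaptation"},
    {"mitigation", "land-use", "policy"},  # touches three
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(scenario_dims)
print(mlb.classes_)  # one indicator column per dimension
print(Y)             # rows may contain several 1s, not a single label
```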
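On point 3, here is a sketch of the tf-idf alternative using scikit-learn's TfidfVectorizer; the toy corpus is invented, and text cleaning and high-frequency term statistics are shown via the vectorizer's built-in options:

```python
# tf-idf yields per-term "scores" rather than a hard class label.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "carbon tax scenario with a rising price path",
    "renewable deployment scenario under subsidy reform",
    "bioenergy with CCS in a net-zero pathway",
]

# Built-in text cleaning: lowercasing, English stop-word removal, and
# max_df to drop terms so frequent they carry little information.
vec = TfidfVectorizer(lowercase=True, stop_words="english", max_df=0.9)
X = vec.fit_transform(corpus)

# Corpus-level high-frequency term statistics from the weighted matrix.
terms = vec.get_feature_names_out()
weights = np.asarray(X.sum(axis=0)).ravel()
for term, w in sorted(zip(terms, weights), key=lambda p: -p[1])[:5]:
    print(term, round(float(w), 3))
```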
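On point 5, here is one form the suggested out-of-sample check could take, assuming a set of scenario texts with manually assigned labels; both are invented here:

```python
# Hold-out evaluation: fit on one split, score on data never seen in fitting.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

texts = [
    "carbon tax rises to 100 USD/tCO2 by 2050",
    "renewable subsidies double wind and solar capacity",
    "carbon pricing covers all sectors after 2035",
    "feed-in tariffs drive rapid solar deployment",
    "an economy-wide carbon tax is recycled as dividends",
    "offshore wind auctions expand renewable supply",
    "a border carbon adjustment complements the tax",
    "grid storage supports high renewable shares",
]
labels = ["pricing", "renewables"] * 4  # hypothetical topic classes

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

# Agreement on the held-out sample is evidence the labels generalize
# beyond the training corpus rather than overfitting it.
print(classification_report(y_test, clf.predict(X_test)))
```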
Citation: https://doi.org/10.5194/essd-2025-299-RC2
Viewed

| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 430 | 63 | 24 | 517 | 11 | 18 |