A Global Drought Dataset from Clustering-Based Event Identification with Integrated Population, and GDP Exposure and Socioeconomic Impacts

Samantaray, Alok Kumar; Messori, Gabriele

doi:10.5194/essd-2025-646

Preprints

https://doi.org/10.5194/essd-2025-646


03 Feb 2026


Status: this preprint is currently under review for the journal ESSD.


Abstract. Drought events pose significant challenges to both ecosystems and human societies, requiring precise methodologies for their detection and impact assessment. A key challenge is linking physical drought indicators to socioeconomic consequences, such as the number of people affected or economic losses. This study introduces a robust two-step framework that integrates drought detection with impact analysis. In the first step, a clustering algorithm is used to identify coherent drought events and extract key characteristics such as severity and spatial extent. These events are tracked as spatially and temporally evolving objects. In the second step, the drought events are linked to population and GDP exposure, as well as to impact data from global disaster databases.

To characterize droughts, the study employs two widely used drought indices: the Standardized Precipitation Index (SPI) and the Standardized Precipitation Evapotranspiration Index (SPEI). Precipitation and temperature data from the ERA5 reanalysis are used to compute these indices at four timescales (1, 3, 6, and 12 months). Drought events are identified at three severity thresholds (-1, -1.5, and -2). The study also incorporates high-resolution gridded datasets of global population and economic activity, alongside disaster impact data on affected populations and economic losses. The resulting drought dataset provides valuable information on the association between drought characteristics, exposure, and recorded impacts.
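
As a minimal sketch of the thresholding step described above, the following Python snippet flags drought months in a single grid-cell index series at the three severity levels. The function names and the sample values are illustrative assumptions and are not taken from the dataset; the convention that a month counts as drought when the index is at or below the threshold is also an assumption.

```python
from typing import Dict, List, Sequence

# Severity thresholds named in the abstract.
THRESHOLDS = (-1.0, -1.5, -2.0)


def drought_months(index_values: Sequence[float], threshold: float) -> List[int]:
    """Return positions of months whose index is at or below the threshold."""
    return [i for i, v in enumerate(index_values) if v <= threshold]


def count_by_threshold(index_values: Sequence[float]) -> Dict[float, int]:
    """Count drought months for each severity threshold."""
    return {t: len(drought_months(index_values, t)) for t in THRESHOLDS}


# HYPOTHETICAL 12-month SPI series for one grid cell (made-up values).
spi = [0.3, -0.8, -1.2, -1.7, -2.1, -1.4, -0.5, 0.1, -1.0, -2.3, -0.9, 0.6]
counts = count_by_threshold(spi)
```

Because each stricter threshold selects a subset of the months selected by the looser one, the counts can only stay equal or shrink as the threshold moves from -1 to -2.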

The analysis shows that a relatively large buffer distance is needed to match the identified drought events to impacts from disaster databases, and that more severe drought thresholds isolate fewer but higher-impact events. Population exposure is found to be highest in Asia, while GDP exposure is largest in North America. This integrated framework (https://doi.org/10.5281/zenodo.17251815; Samantaray & Messori, 2025) bridges the gap between physical drought characteristics, exposure, and documented impacts, supporting vulnerability analyses, improved climate adaptation planning and disaster risk management.

Received: 27 Oct 2025 – Discussion started: 03 Feb 2026

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.


Status: final response (author comments only)

RC1:
'Comment on essd-2025-646', Nir Krakauer, 06 Feb 2026

The idea behind this work, of organizing a gridded meteorological drought index into clusters describing spatially widespread dry periods and correlating these with reports of drought impacts, is meritorious, but details of the execution need more justification and improvement:
How PET is calculated for determining SPEI doesn't seem to be explained.
It appears that only point data from GDIS are used to localize drought reports. However, according to its documentation, GDIS provides polygons of the affected provinces in the "geometry" field. Comparing this areal information to the ERA5 meteorological drought extent should show much clearer correspondences compared to only centroids or other individual points from GDIS.
At l. 267, "Frequency" seems like the wrong word for the number of months of drought; "Duration" would be more appropriate. Also, the definition of "Severity" is not clear.
Figure 3: The caption fails to state what the blue areas in the maps are.
Since "detection percentages are consistently higher for SPEI than for SPI", I recommend that the SPEI-based drought definition be used as the primary one for reporting the results, with SPI-based results given as secondary; at present it is mostly the opposite.
No clear conclusions are drawn as to which SP(E)I timescale should be used to define drought. Most of the figures arbitrarily show only the 1-month timescale, which admittedly can capture a "flash drought" but seems too short to correspond to impactful drought in most cases. I suggest first analyzing which SP(E)I timescale best matches the drought disaster dataset, and then reporting findings primarily for that timescale.
As well as population and GDP, measures of agricultural intensity may be helpful in predicting the impacts of drought, since agriculture is by far the most water-intensive major economic sector. Oddly, agriculture is not mentioned at all except in the literature review.
Seasonality is also never mentioned. It might be hypothesized that droughts occurring during the growing season have much bigger impacts than those at other times of year.
The mostly 3-D figures (e.g. 5-12) are not interpretable. I strongly recommend finding a different way to show the results and adjusting the discussion in the text accordingly.
2 × 10⁷ USD seems like a small amount of average damage for a large-scale drought in North America, since the USA has experienced quite a number of droughts that caused multiple billions in damage. It would be helpful to show more statistical information about the set of EM-DAT droughts included in the analysis, including their minimum, maximum, mean, and median damages, and to compare this to information from other databases.
Additionally, since this is for publication in a data description journal, the paper should say more about the format of the generated dataset, including what fields it contains and what are some anticipated use cases.

Citation: https://doi.org/10.5194/essd-2025-646-RC1
- AC1: 'Reply on RC1', Alok Samantaray, 06 Apr 2026
  
  We have provided a detailed, point-by-point response to the reviewer's feedback in the attached document.
  
  Citation: https://doi.org/10.5194/essd-2025-646-AC1
RC2:
'Comment on essd-2025-646', Anonymous Referee #2, 10 May 2026
This manuscript presents a global drought dataset generated through clustering-based event identification and links the resulting drought clusters with population exposure, GDP exposure, and socioeconomic impact information from EM-DAT and GDIS. The topic is important, and the attempt to connect physical drought characteristics with socioeconomic consequences is relevant for drought risk assessment. The manuscript also provides a useful workflow combining SPI/SPEI, ERA5, GDIS, EM-DAT, gridded population, and gridded GDP data. However, in my view, the manuscript in its current form does not yet provide sufficient methodological innovation or analytical rigor to justify publication as a robust global drought-impact dataset. The study largely combines existing drought indices, existing reanalysis data, existing disaster databases, and existing exposure datasets. Although the authors claim to develop a “robust clustering algorithm” and a globally consistent framework, the novelty relative to previous clustering, Lagrangian drought tracking, GDIS-based drought impact studies, and exposure-based risk analyses is not clearly demonstrated. The manuscript itself acknowledges that previous studies have already used GDIS, ERA5/ERA5-Land, drought indices, and spatiotemporal drought-cluster tracking to examine drought impacts and propagation.

The manuscript states that the study addresses the lack of a globally consistent framework combining spatiotemporal drought clustering with high-resolution socioeconomic exposure and impact data. The stated objectives are to identify drought “objects,” assign population and GDP exposure, and relate drought characteristics to EM-DAT impacts. However, these objectives are mostly achieved by assembling already available datasets and applying relatively standard drought-index thresholding and connected-component clustering. The authors should make much clearer what is genuinely new. Is the novelty the clustering algorithm? The event-matching strategy? The derived dataset? The weighted exposure metric? Or the joint linkage between physical drought and disaster impacts? At present, the manuscript reads more like an integration exercise than a methodological or data-product advance.

The drought event identification method depends on several user-defined parameters, including target-point grouping, maximum distance threshold, buffer distance, bounding-box construction, drought-index threshold, eight-neighbor connectivity, and cluster-merging distance. The authors describe these steps, but the physical and statistical justification for these parameter choices remains weak. For example, using a rectangular bounding box around GDIS target points may artificially include regions unrelated to the reported drought event, particularly for large countries or multi-location disasters. Similarly, the use of eight-neighbor connected components is computationally convenient but not necessarily sufficient to represent real drought propagation, hydroclimatic coherence, or regional drought dynamics.
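
To make the eight-neighbor connected-component step concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: the function name, the breadth-first search, and the sample field are assumptions for illustration, and the sketch ignores longitude wrap-around and any cluster-merging distance.

```python
from collections import deque
from typing import List, Set, Tuple

Cell = Tuple[int, int]

# Offsets of the eight neighbours (edges and corners) of a grid cell.
NEIGHBOURS = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]


def label_clusters(grid: List[List[float]], threshold: float) -> List[Set[Cell]]:
    """Group cells with index <= threshold into 8-connected clusters."""
    n_rows = len(grid)
    n_cols = len(grid[0]) if grid else 0
    seen: Set[Cell] = set()
    clusters: List[Set[Cell]] = []
    for r in range(n_rows):
        for c in range(n_cols):
            if (r, c) in seen or grid[r][c] > threshold:
                continue
            # Breadth-first search starting from an unvisited drought cell.
            cluster = {(r, c)}
            seen.add((r, c))
            queue = deque([(r, c)])
            while queue:
                cr, cc = queue.popleft()
                for dr, dc in NEIGHBOURS:
                    nr, nc = cr + dr, cc + dc
                    if (0 <= nr < n_rows and 0 <= nc < n_cols
                            and (nr, nc) not in seen
                            and grid[nr][nc] <= threshold):
                        seen.add((nr, nc))
                        cluster.add((nr, nc))
                        queue.append((nr, nc))
            clusters.append(cluster)
    return clusters


# HYPOTHETICAL 4 x 5 index field for one month (made-up values).
field = [
    [-1.2, 0.1, 0.4, 0.2, -1.6],
    [0.3, -1.1, 0.0, 0.5, -1.3],
    [0.2, 0.6, 0.1, 0.3, 0.4],
    [0.0, 0.2, -2.2, 0.1, 0.7],
]
clusters = label_clusters(field, threshold=-1.0)
```

In this example the cells (0, 0) and (1, 1) touch only at a corner, so they form one cluster under eight-neighbor connectivity but would be separate under four-neighbor connectivity. The choice of connectivity therefore changes the event count.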

The manuscript reports that larger buffer distances and less severe thresholds lead to higher matching percentages, with match rates reaching around 90% for a 500 km buffer and SPI threshold of -1, but falling below 60% under stricter parameter settings. This is an important result, but it also raises a concern: high match rates may be achieved simply by expanding the search area rather than by improving physical event identification. The authors should not interpret a higher match percentage as evidence of a better drought-event detection method without independent validation. A 500 km buffer is very large and may capture many drought-affected pixels that are only loosely related to the recorded disaster location. The manuscript should include additional validation using well-documented historical drought events, independent drought-monitoring products, national drought reports, or regional case studies.
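
The sensitivity to buffer distance can be illustrated with a short sketch of point-to-cluster matching. This assumes a simple rule, namely that a reported location is "matched" when at least one cluster cell lies within the buffer, measured as great-circle distance; the manuscript's actual matching rule may differ. The function names, coordinates, and the 100 km buffer are hypothetical.

```python
import math
from typing import Iterable, Tuple

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

LatLon = Tuple[float, float]


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_matched(target: LatLon, cluster_cells: Iterable[LatLon], buffer_km: float) -> bool:
    """ASSUMED rule: matched if any cluster cell centre is within the buffer."""
    return any(
        haversine_km(target[0], target[1], lat, lon) <= buffer_km
        for lat, lon in cluster_cells
    )


# HYPOTHETICAL reported location and cluster cell centres (lat, lon in degrees).
target_point = (10.0, 20.0)
cluster = [(10.0, 22.0), (10.25, 22.25)]
matches = {b: is_matched(target_point, cluster, b) for b in (100.0, 250.0, 500.0)}
```

Here the nearest cluster cell is roughly 219 km from the reported location, so the event is unmatched at 100 km and matched at 250 km and 500 km. A match rate can only grow or stay constant as the buffer widens, which is why a high rate at a large buffer is not by itself evidence of better detection.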

The manuscript correctly notes that EM-DAT reports disasters mainly at the country level and that GDIS provides georeferenced information only for a subset of EM-DAT events. It also identifies inconsistencies between EM-DAT and GDIS records, including cases where the same disaster number is assigned to different countries, and removes seven discrepant events using a 250 km distance threshold. However, these inconsistencies suggest a deeper uncertainty problem that is not fully addressed. The manuscript should quantify how many drought events are retained, excluded, unmatched, or only partially matched at each processing step. The authors should provide a clear event-accounting table showing the number of original EM-DAT drought events, GDIS drought entries, harmonized events, excluded mismatches, successfully clustered events, and events with valid population/GDP/impact data.

The manuscript calculates total population and GDP exposure by summing values over pixels that fall within drought clusters for at least one month, and then introduces a weighted exposure metric based on frequency and severity. While this is a useful first step, the current formulation is too simple to support strong conclusions about drought impacts. Exposure is not equivalent to vulnerability or realized impact. Population and GDP located within a drought cluster may not necessarily experience damage, while areas outside the cluster may still suffer indirect impacts through food systems, water infrastructure, migration, or economic networks. The manuscript should avoid implying that exposure estimates directly explain EM-DAT impacts unless vulnerability, adaptive capacity, reporting practices, and sectoral sensitivity are considered.
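
The difference between total and weighted exposure can be sketched as follows. The total follows the description above (sum over pixels in drought for at least one month). The weighting formula is an ASSUMPTION made only for illustration, since the manuscript's exact formulation is not given here: each pixel's population is scaled by the fraction of months in drought and by the mean absolute index value in those months. All names and values are hypothetical.

```python
from typing import Dict, List, Tuple

Cell = Tuple[int, int]


def total_exposure(population: Dict[Cell, float],
                   drought_index: Dict[Cell, List[float]],
                   threshold: float) -> float:
    """Sum population over cells in drought for at least one month."""
    return sum(
        pop for cell, pop in population.items()
        if any(v <= threshold for v in drought_index.get(cell, []))
    )


def weighted_exposure(population: Dict[Cell, float],
                      drought_index: Dict[Cell, List[float]],
                      threshold: float) -> float:
    """ASSUMED weighting: population x drought-month fraction x mean |index|."""
    total = 0.0
    for cell, pop in population.items():
        series = drought_index.get(cell, [])
        dry = [abs(v) for v in series if v <= threshold]
        if not dry:
            continue  # cell never in drought, contributes nothing
        frequency = len(dry) / len(series)
        severity = sum(dry) / len(dry)
        total += pop * frequency * severity
    return total


# HYPOTHETICAL population counts and 4-month index series per cell.
population = {(0, 0): 1000.0, (0, 1): 500.0, (1, 0): 200.0}
index = {
    (0, 0): [-1.5, -2.5, 0.0, 0.2],
    (0, 1): [-1.0, 0.5, 0.3, 0.1],
    (1, 0): [0.2, 0.1, -0.5, 0.4],
}
total = total_exposure(population, index, threshold=-1.0)
weighted = weighted_exposure(population, index, threshold=-1.0)
```

Both quantities describe people located under drought conditions. Neither measures realized impact, which also depends on vulnerability and on how losses are reported.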
Citation: https://doi.org/10.5194/essd-2025-646-RC2
- AC2: 'Reply on RC2', Alok Samantaray, 15 Jun 2026
  
  Please find attached the response to Reviewer 2.
  
  Citation: https://doi.org/10.5194/essd-2025-646-AC2
RC3:
'Comment on essd-2025-646', Anonymous Referee #3, 10 May 2026
I think this is a promising and timely work. The idea of linking SPI/SPEI-based drought information with population and GDP exposure, and reported impacts from EM-DAT/GDIS is valuable. I also like the aim of moving beyond meteorological drought indicators and towards a dataset that is more relevant for exposure and impact studies. However, I do not believe the current version is ready for publication. The main issue is not about the idea, but about the way the dataset is presented and documented. For an ESSD paper, the data product per se needs to be front and centre. Users should be able to understand what the dataset contains, how it was produced, what its limitations are, and how they can reproduce or reuse it. The manuscript now still reads partly like a research paper, while the actual dataset description, uncertainty discussion, validation, and user guidance are not yet strong enough. I would therefore recommend a major revision. My detailed comments are as follows:
My first major concern is that the current title and abstract give the impression of a general global drought dataset. However, after reviewing the manuscript, IMHO, the workflow is anchored to drought disaster records in EM-DAT/GDIS and searches for drought clusters around those events. This means that the product is not a global catalogue of meteorological drought events. It is more like a dataset of drought clusters associated with reported drought disasters. This should be made very clear in the title, abstract, introduction, and conclusion.

Abstract: You state that events are tracked as spatially and temporally evolving objects. From the current manuscript, if I understand correctly, it looks more like monthly clusters are identified within a target year, but it is not clear whether there is a formal object-tracking algorithm across months. If real temporal tracking is done, please describe it clearly. If not, please rephrase this expression.

Section 2: ERA5 is at 0.25°, the population data are at 1 km, and the GDP data are at 5 arcmin. Please clearly elaborate on the spatiotemporal integration of these datasets.
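
One common way to bring a finer exposure grid onto a coarser drought grid is block summation, which conserves totals for extensive quantities such as population or GDP. The sketch below is an assumption about how such integration could be done, not a description of the manuscript's procedure. It uses the fact that 0.25° equals 15 arcmin, so one 0.25° cell spans 3 × 3 cells of a 5-arcmin grid, and it assumes the two grids are perfectly aligned, which may not hold in practice.

```python
from typing import List


def block_sum(fine: List[List[float]], factor: int) -> List[List[float]]:
    """Aggregate a fine grid to a coarser one by summing factor x factor blocks."""
    n_rows, n_cols = len(fine), len(fine[0])
    if n_rows % factor or n_cols % factor:
        raise ValueError("grid dimensions must be multiples of the factor")
    coarse = []
    for r0 in range(0, n_rows, factor):
        row = []
        for c0 in range(0, n_cols, factor):
            row.append(sum(
                fine[r][c]
                for r in range(r0, r0 + factor)
                for c in range(c0, c0 + factor)
            ))
        coarse.append(row)
    return coarse


# 0.25 degrees = 15 arcmin, so the aggregation factor from 5 arcmin is 3.
FACTOR = 3

# HYPOTHETICAL 3 x 6 GDP grid at 5 arcmin (made-up values).
gdp_5arcmin = [[1.0] * 6 for _ in range(3)]
gdp_5arcmin[0][0] = 10.0
gdp_quarter_degree = block_sum(gdp_5arcmin, FACTOR)
```

Summation is appropriate only for extensive variables. An intensive variable such as GDP per capita would need an area-weighted or population-weighted mean instead.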

Similarly, please clarify the actual study period and address the temporal mismatch between datasets. While the meteorological data and GDIS events cover 1960–2018, the population and GDP datasets only begin in 1975 and 1990, respectively.

Section 3.2: You state that ERA5 precipitation and temperature were used to calculate SPI and SPEI at diverse timescales, but several key details in determining the indices are still missing.

Section 3.3: The description of maximum distance threshold, spatial buffer distance, merging distance, bounding boxes, target-point grouping, and cluster merging is hard to follow. Please define each parameter clearly and explain how they were chosen. In addition, you treat the GDIS locations as target points, but it is not clear whether these are centroids of affected areas, administrative locations, reported place names, or something else. Please describe the nature and limitations of the event coordinates recorded in GDIS more explicitly.

Section 3.4: You count all population or GDP pixels that fall in a drought cluster for at least one month during the target year. This is understandable, but it can include short-lived events such as flash droughts, or relatively weak drought conditions, in the total exposure. The weighted exposure partly addresses this, but the weighting scheme is still a methodological choice rather than a standard measure of impact. Please make the distinction between exposure, weighted exposure, and actual impact clearer, and avoid implying that exposure is the same as impact or risk.

Section 3: as another major issue, EM-DAT reports impacts mostly at national scale, while GDIS provides geocoded locations for reported events. Linking national-level impacts to subnational drought clusters can introduce uncertainty. Please clearly explain how duplicate allocation is avoided and how users should interpret the matched impact values.

Section 4: You spend a lot of space interpreting continental exposure and impact patterns. I would suggest condensing some of the interpretation and using that space to explain the data.

Section 4: various terms have been used throughout the results, but their exact definitions are not always clear. For a multi-month event, for example, is area calculated as a monthly mean, a maximum, or a union footprint over the year? Is severity averaged over drought pixels only, over months, or over the full bounding box? Please further elaborate on this.
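
The three candidate definitions of event area mentioned above can give quite different numbers for the same event, as this small sketch shows. The function name, the monthly cell sets, and the constant cell area are hypothetical; on a real latitude-longitude grid the cell area varies with latitude.

```python
from typing import Dict, List, Set, Tuple

Cell = Tuple[int, int]


def area_summaries(monthly_cells: List[Set[Cell]], cell_area_km2: float) -> Dict[str, float]:
    """Compute mean monthly, maximum monthly, and union-footprint area."""
    if not monthly_cells:
        raise ValueError("at least one month is required")
    monthly_areas = [len(cells) * cell_area_km2 for cells in monthly_cells]
    footprint = set().union(*monthly_cells)  # every cell hit in any month
    return {
        "mean_monthly": sum(monthly_areas) / len(monthly_areas),
        "max_monthly": max(monthly_areas),
        "union_footprint": len(footprint) * cell_area_km2,
    }


# HYPOTHETICAL three-month event; each set holds the drought cells of one month.
months = [
    {(0, 0), (0, 1)},
    {(0, 1), (0, 2), (1, 1)},
    {(1, 1)},
]
summary = area_summaries(months, cell_area_km2=100.0)
```

For this event the mean monthly area is 200 km², the maximum monthly area is 300 km², and the union footprint is 400 km², so a dataset field labelled simply "area" is ambiguous unless the definition is stated.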

Section 4: The reported economic damages appear quite low for major drought disasters, for example values around 2.5×10⁷ USD, as the highest total damages, at a 500 km buffer distance. Please double-check this. Also, GDP exposure is reported in constant 2015 USD, so it is not directly comparable with unadjusted historical damage values unless the damage data are also harmonised. This should also be clarified carefully.
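
Harmonising nominal damages with exposure in constant 2015 USD amounts to rescaling by a price index. The sketch below shows the arithmetic only. The deflator values are HYPOTHETICAL placeholders, not real statistics, and whether the damage figures used in the manuscript were already adjusted is not stated here.

```python
from typing import Dict


def to_constant_usd(nominal: float, year: int,
                    deflator: Dict[int, float], base_year: int = 2015) -> float:
    """Rescale a nominal value to base-year prices using a price index."""
    if year not in deflator or base_year not in deflator:
        raise KeyError("deflator missing for event year or base year")
    return nominal * deflator[base_year] / deflator[year]


# HYPOTHETICAL price index with base 2015 = 100 (made-up values).
DEFLATOR = {1990: 60.0, 2005: 80.0, 2015: 100.0}

# A nominal damage of 2.5e7 USD recorded in 1990, expressed in 2015 prices.
damage_2015_usd = to_constant_usd(2.5e7, 1990, DEFLATOR)
```

With these placeholder values the 1990 figure grows by a factor of 100/60 when expressed in 2015 prices, which shows why unadjusted damages and constant-price exposure should not be compared directly.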

Section 4: The detection percentage across thresholds, timescales, and buffer distances is a good start. However, it mainly shows parameter sensitivity, not whether the drought clusters are reliable. Please add a more explicit quality assessment, for example, comparison with well-known drought events, independent products, or regional case studies. At least, you should provide clearer guidance on which settings are recommended for which use cases.

You refer to many supplementary figures and tables. Without these materials, it is not possible to assess several key claims in this study. To make this manuscript self-contained, key statistics should be moved into the main text, rather than leaving the main conclusions dependent on supplementary material.

Fig.3: Please add longitude and latitude labels, and other necessary elements.

Figs.4-12: These figures are difficult to interpret in the current format. The 3D surface and scatter plots do not work well, and it is hard to read actual values or see the main relationships. I do not think these 3D figures are indeed necessary, and I would suggest replacing most of them with clearer 2D figures.

Please standardise the spelling/abbreviation of the Emergency Events database throughout the manuscript. Currently, it appears inconsistently as EMDAT, EM-DAT, and EM DAT.

Line 102: Please double-check the reference (Kageyama and Sawada 2024). It is cited as 2024 here, but the reference list gives it as 2022 (Line 682).

Lines 146 and 568: A similar issue occurs with (Kummu et al. 2023). The main text says 2023, but the reference list gives 2025 (Line 682).

Line 147: Please pick either USD or US dollars and stick with it throughout the manuscript.

Line 484-486: I am a bit lost on how you pulled that conclusion directly from Figure 1.

Line 570-571: Please double-check this link here.

A few spelling errors remain; see examples in Lines 386, 430, 469, 515.
Citation: https://doi.org/10.5194/essd-2025-646-RC3
- AC3: 'Reply on RC3', Alok Samantaray, 15 Jun 2026
  
  Please find attached the response to Reviewer 3.
  
  Citation: https://doi.org/10.5194/essd-2025-646-AC3
RC4:
'Comment on essd-2025-646', Anonymous Referee #4, 11 May 2026

This is an interesting work in which the authors explore drought exposure and impact data from two datasets, viz. EM-DAT and GDIS, and link these to drought characteristics. They also assess how well drought impacts link to the choice of drought indicator, threshold and timescale. The finding that SPEI relates to reported drought impacts more than SPI is particularly interesting. However, the paper in its current form needs improvement on several fronts to be considered for publication in this journal and to match the journal's standards. The findings have the potential to be published in this journal, but only after major revisions. The authors could have given more thought to how their figures demonstrate their findings. The current 3-D plots are difficult to interpret. The writing is a bit vague in many places. The introduction section, in particular, is not used appropriately to set the scene for the present study and is filled with basic information on droughts that is already well known. The introduction needs to be thoroughly revised to tailor it to the purpose of the work carried out and to match the standards of the journal. The authors also do not clearly state what new dataset this work is centred on. My detailed comments are as follows:
Section 1: A large part of the Introduction is basic background on droughts that is already well known; repeating it is a redundant use of the introduction. The authors should streamline the introduction around their overarching aim, the gaps that motivated it, and the advantage of addressing them. The current introduction is poorly written, lacking structure and a central message in each paragraph. The authors fail to articulate the gaps in existing studies that motivate the current work.
L 68-70: India is a big region. Please specify the regions of India that were affected by the drought.
L119: The phrase ‘resulting dataset’ is a bit vague. Here and throughout the paper, I found it difficult to understand what exactly the dataset is that the authors have produced. Please dedicate a few sentences wherever appropriate to state this clearly.
L 210: but SPI is known to not do a good job for some climate types, such as arid regions.
Section 3.3: Please cite literature that has used the methods described in this section.
L 267-268: But the drought may extend beyond a year. This is also why I am not sure it makes sense to have a 'target year' as you describe in line 265.
L 272-273: what is the reference for this? The widely accepted formula for risk is risk=hazard x exposure x vulnerability.
Figure 3 caption: what do the ‘different clusters’ mean, please specify
L 292: You mentioned 'connecting drought events to impacts data' in your previous heading too. I would recommend better organisation of the subsections.
L 323-303: reason is not discussed
L 310-311: but did you not in Figure 1 show Australia as reported to have higher number of drought events? Doesn’t this mean you have information on events for Australia?
L 311-312: vague. meaning is not clear.
Figure 5-12: I think it might have helped to use the same axis ranges in all the figures so that they could be compared visually.
L 375-378: difficult to understand what the authors are trying to say.
L 406-408: It might be worthwhile adding a sentence here concluding what advantage the weighting of exposure offers in giving a new, different, or similar perspective on the links between exposure and drought characteristics. This would give good closure to this section.
L 469: s not S in shorter timescale
L 481: ‘small sample size’ : vague. please elaborate on what you mean by 'sample'
L 538: add a sentence on what this means
L 539-540: here and everywhere else where you mention correlation, I think it might be good to quote the average value of the correlation
L 545-546: here and even in the previous section, when you conclude that SPEI corresponds to impacts more than SPI, I think it will benefit the paper a lot if you include a figure that clearly shows this, as this does seem like an important conclusion stemming from this work.
Section 6: since this is a data paper, it would be useful to include a sentence here on what new dataset you have produced in this study. I felt that clarity on this aspect was a bit lacking in the paper.
L 589: when you say ‘scale’, please be specific. I think you mean timescale here.

L 593-594: if this is a key conclusion, I would urge you to put a result figure corresponding to this in the main manuscript rather than in the SI. You don't need to put all the figures, but maybe one that clearly shows this point.

Citation: https://doi.org/10.5194/essd-2025-646-RC4
- AC4: 'Reply on RC4', Alok Samantaray, 15 Jun 2026
  
  Please find attached the response to Reviewer 4.
  
  Citation: https://doi.org/10.5194/essd-2025-646-AC4


Data sets

Reproducible Workflows for Global Drought Clustering with Socioeconomic Exposure, Alok Kumar Samantaray and Gabriele Messori, https://zenodo.org/records/17251815?token=eyJhbGciOiJIUzUxMiJ9.eyJpZCI6ImMxZjRkNTk5LWRjZTQtNDc0MS1iOTUxLTRjNTc3NGEwNmUxYiIsImRhdGEiOnt9LCJyYW5kb20iOiJlMTUxNzQ5OGFhOGU2OTQ4MzE3ZDViM2ViMDM3MTQwZCJ9.fhEgvQggJwkVyDVUKEFrQcC5aspzBert5potIVNbLG9SO0FtzuH09SBN9ba3e9DCCGpjwlrRdRgXylKVYJeTIw



Short summary

We present a global dataset that links drought events to human and economic exposure and impacts. Using a two-step approach, we first cluster drought events from precipitation and temperature records to track their severity and extent, and then connect them to population and Gross Domestic Product exposure and to impacts reported in disaster databases. The dataset supports risk planning and mitigation efforts.

