This work is distributed under the Creative Commons Attribution 4.0 License.
Geospatial micro-estimates of slum populations in 129 Global South countries using machine learning and public data
Abstract. Slums are a visible manifestation of poverty in Global South countries. Reliable estimation of slum population is crucial for urban planning, humanitarian aid provision, and improving well-being. However, large-scale and fine-grained mapping is still lacking due to inconsistent methodologies and definitions across countries. Existing datasets often rely on government statistics, lacking spatial continuity or underestimating slum population due to factors such as city image and privacy concerns. Here, we develop a standardized bottom-up approach to estimate slum population at the neighborhood level (~6.72 km resolution at the equator) for 129 Global South countries in 2018. Leveraging the Sustainable Development Goals 11.1 framework and machine learning, our estimation integrates household-based surveys, satellite imagery, and gridded population data. Our models explain 82 % to 96 % of the variation in ground-truth surveys, with a root mean squared error of 4.85 % to 10.47 %, outperforming previous benchmarks. Cross-validation with independent data confirms the reliability of our estimates. To our knowledge, this is the first comprehensive geospatial inventory of slum populations across Global South countries, offering valuable insights for advancing urban sustainability and supporting further research on vulnerable populations. The dataset is available at https://doi.org/10.5281/zenodo.13779003 (Li et al., 2025).
Status: final response (author comments only)
RC1: 'Comment on essd-2025-260', Anonymous Referee #1, 13 Jun 2025
This study presents a standardized, machine learning-based approach to estimate slum populations at the neighborhood level across 129 Global South countries, using satellite imagery, household surveys, and gridded population data. The methodology aligns with the UN SDG 11.1 framework and demonstrates strong performance (R² = 0.82–0.96; RMSE = 4.85–10.47%), outperforming previous benchmarks. By addressing the limitations of government-reported data, this work provides the first comprehensive geospatial inventory of slum populations, offering critical insights for urban planning and humanitarian efforts.
This is a pioneering study that leverages satellite imagery to estimate slum populations at a regional scale, with significant potential for future applications in urban policy, research, and humanitarian aid. The manuscript is well-written, and the methodology is rigorous and clearly presented. I have only a few minor suggestions for the authors to consider before publication.
#1 The study employs a fine-tuned CNN model (ResNet) and XGBoost to classify slum households, which is innovative. However, the necessity of fine-tuning the CNN and performing extensive feature extraction is not entirely clear. It appears that classification might be achievable using the original satellite image bands without expanding the RGB inputs into 500+ features. Could the authors elaborate on the specific benefits of fine-tuning and high-dimensional feature extraction in this context? Additionally, did you evaluate the performance improvement compared to models using only raw image bands? Such a comparison would help clarify the added value of the proposed approach.
#2 Another concern relates to the quality of the ground-truth labels used for training and evaluating the model. In particular, for the demographic and health surveys utilized, did the authors perform any quality assessment or validation of the labeling process? Additional details on how label reliability was ensured would strengthen the credibility of the results.
Specific comments:
Figure 2. The numbers seem confusing. Shouldn't it be from f(x)1 to f(x)7 and tree1 to tree7, layer 1 to layer 7?
Citation: https://doi.org/10.5194/essd-2025-260-RC1 -
RC2: 'Comment on essd-2025-260', Anonymous Referee #2, 25 Jun 2025
The authors take advantage of machine learning, satellite imagery, local surveys, and global population data to produce a gridded estimate of population in slum-like conditions for 129 Global South countries. The authors propose a generalized, regional modeling effort as country-specific data is varied in quantity and quality. There is certainly a need for this information. Humanitarian efforts, public health, and emergency response are among the countless uses that could benefit from these data.
I believe this paper and the associated slum population maps have potential to be impactful for the community. However, I believe the paper has several major issues that should be addressed.
I have details below, but in summary, the paper could benefit from 1) Clearer and more detailed descriptions of the methods, particularly regarding combining and overlaying raster datasets with different resolutions. 2) Consistent definition and use of terms (e.g., clusters, settlement slices, neighborhoods). 3) Tables and figures, along with their captions, should be stand-alone and require limited information from the text. Likewise, the text should not rely on content only described in a caption. Figures 1 and 2 are vital to the paper and care must be taken to make sure all details within them are consistent and clear. 4) Parts of the text should be reorganized into more appropriate sections.
I hope the authors take these notes in the constructive way they are intended. I believe the paper will be more impactful and useful with these major adjustments.
RE: Resolution and scale
The use of “cluster” and “cluster-level” throughout is confusing. Clusters are not defined in the paper, but I assume the authors are referring to DHS enumeration areas as “clusters”. I believe the authors are stating a pixel of size ~6.72 km is roughly equivalent to the ‘average’ size of a DHS enumeration unit (neighborhood or village). It should be clear when the authors are referring to irregularly shaped vector boundaries (e.g., DHS clusters, villages) or a pixel with 6.72 km dimensions. For example, on line 156, you collect Landsat imagery centered on each cluster. Are you collecting imagery for each DHS cluster or is the imagery collected for every 6.72 km grid cell?
In the abstract, the authors refer to these pixels as neighborhood level (line 27) and as cluster-level in the rest of the paper.
The authors go back and forth between cluster-level, 6.72km and 3.63 arc-minutes. A consistent approach would be clearer.
The authors also refer to “settlement slices”. Are these also 6.72km grid cells or are they something else?
The authors say on line 521 that the resolution was carefully chosen based on several factors, including the characteristics of satellite imagery, algorithm architecture, and the ground-truth survey data. On line 673 the resolution was to protect privacy. On line 530 the resolution was set because the model required 224 × 224 input images and 30 m × 224 = 6.72 km. My hunch is the explanation on line 530 is the primary reason for the output resolution. It is appropriate to discuss the pros and cons of this decision but be clear and consistent about your reasoning.
There are several vector and raster datasets used throughout with different resolutions. It would be helpful to consistently and clearly describe when and how these are converted or resampled (mean, mode, natural neighbor, bicubic, etc.)
- For example, I resampled the GHS_POP Population map by converting the raster to points and then aggregating (summing) those points to a new raster that aligned with your slum population map. For the 6.2 km pixel with centroid of (77.11267606, 23.16936109) I obtained a total population of ~16,324. The slum population value for this pixel is ~38,762. Did I resample differently or is it possible that your method can produce slum population estimates > GHS_POP estimates?
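To make the comparison reproducible, here is a minimal sketch of the block-sum aggregation I used (the function name is mine, and it assumes the fine population grid nests evenly inside the coarse 6.72 km cells):

```python
import numpy as np

def aggregate_population(fine_pop, factor):
    """Sum a fine-resolution population grid into coarse cells.

    fine_pop : 2-D array of per-pixel population counts (e.g., GHS-POP)
    factor   : number of fine pixels per coarse pixel along each axis
               (assumes the fine grid nests evenly in the coarse grid)
    """
    h, w = fine_pop.shape
    return (fine_pop
            .reshape(h // factor, factor, w // factor, factor)
            .sum(axis=(1, 3)))

# Toy check: aggregation conserves total population, so a slum
# population estimate larger than the aggregated total in the same
# cell is suspicious.
fine = np.ones((8, 8))                     # 64 people, 1 per fine pixel
coarse = aggregate_population(fine, 4)     # 2 x 2 coarse grid
assert coarse.sum() == fine.sum()
```

Any centroid-based point aggregation should give the same totals when the grids align; misalignment or a different resampling rule (e.g., bilinear) would not conserve population mass.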
Organization.
The order of 3.2 and 3.3 seems off. On line 296 you produce the ‘final model’ from the entire dataset after determining the optimal hyperparameters. In 3.3 there are 9 models with 5-fold cross validation. Is the cross-validation part of the grid search? It's not clear how you use the cross-validation results to get to the final 9 models mentioned in line 325. Do you retrain on the entire dataset after the cross-validation? Reorganizing these sections would make this clearer.
Section 3.4 This is the final part of the methods and should be expanded on. “We apply satellite imagery and nighttime light data to grid cells” is vague. Same for “we integrate the GHS-POP dataset with the map of cluster-level slum indicator.”
There is no section 3.5
The methods, results, and discussion bleed into each other at various parts. There are more than these examples. Please check throughout.
- 363-368 – Move to methods.
- 380-386 – Move to discussion.
- 401-402 – Move to methods. Refine how? Which land cover types do you mask? It is not clear what you are doing with the land cover data.
- 431-437 Move to methods. Also how do you implement the hierarchical classification from very low to very high. Thresholds?
- 457-461 Move to discussion.
- Section 5.3 is a mixture of methods, results, and discussion
Accuracy and Robustness
How did you calculate RMSE? On line 367 the slum indicator (i.e., percentage of households in slum-like conditions) is the ground truth and your method (i.e., total population living in slum-like conditions) is the predicted value. How did you get these two values on the same scale? Did you assume percentage of households was the same as the percentage of population? Did you scale population based on average household size?
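To make the question concrete, one way to put the two quantities on the same scale is to convert the predicted slum population back to a percentage of each cell's total population before computing RMSE; this treats the household share as a population share, i.e., it ignores household-size differences. A sketch (function and variable names are mine):

```python
import numpy as np

def rmse_against_survey(slum_pop, total_pop, survey_pct):
    """RMSE between predicted slum share and the DHS slum indicator.

    slum_pop   : predicted slum population per cluster/cell
    total_pop  : total population per cluster/cell (e.g., GHS-POP)
    survey_pct : ground-truth percentage of households in slum-like
                 conditions (0-100), treated here as a population
                 share -- an assumption the authors should confirm.
    """
    pred_pct = 100.0 * np.asarray(slum_pop, float) / np.asarray(total_pop, float)
    return float(np.sqrt(np.mean((pred_pct - np.asarray(survey_pct, float)) ** 2)))

# Toy example: predictions of 30% and 10% vs. surveys of 25% and 15%
# give errors of +5 and -5, hence RMSE = 5.0.
print(rmse_against_survey([30, 10], [100, 100], [25.0, 15.0]))
```

If instead the authors scaled population by average household size before comparing, the stated RMSE values would mean something slightly different, which is why the calculation should be spelled out.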
The paragraph starting at 349 can use more explanation. It is confusing to introduce the new weighting scenarios and then in the next sentence say “we also compare the performance of regional models with country-specific models”. I assume the country-specific models for the comparison follow the “equal weight” (0.33, 0.33, 0.33) scenario, but the preceding sentence leaves this ambiguous. This should be stated.
How did you calculate variation? Is it pixel-by-pixel and then aggregated to country, percent difference between the two models, or something else?
Section 5.3 – A table with a list of assessment cities/regions, the number of estimates for that city/region, and citations would be useful. This relates to my comment below on Figure 7. It would clarify how many estimates you have for Rio de Janeiro and why it is a range in 7a and a single value in 7b. Are these from different sources, etc.
Tables
Table 1 – Nice
Table 2 – Nice
Table 3 – Consider adding more information to title (e.g., to optimize an XGBoost model, etc.). What are best_num_boost_rounds and Param_grid? These descriptions should either be expanded or explained in the table caption.
Table 4 – Make each line a unique Region/Income Group (e.g., East Africa – Low Income, West Africa – Low Income). You can leave how you defined these in Supplemental, but this layout makes it appear as though you made 2 models for one combination (Africa/Low income) as opposed to 1 model for 2 different regions.
Figures
Fig 1 – It is difficult to follow the progression of satellite imagery through the ResNet Algorithm, into the XGBoost algorithm, and ultimately to the Slum population map. It is also difficult to tell if the slum indicator is the response variable for the ResNet algorithm, the XGBoost Algorithm, or both. Consider reorganizing to make this clearer. It looks like only 80% of the landsat images go to the ResNet Algorithm. Is that accurate? The “Results Analysis” panel is not very informative. It could likely be removed or consolidated to a single box to make room for clearer descriptions in the other sections.
Fig 2 – I interpret the “weight copy” arrows as using all the weights from ResNet, but you only used those for RGB. The others were averaged from the RGB (Line 256). Then, are they all scaled by 3/7 or only the non-RGB values? Since there are only 7 layers it might be best to expand the ellipses and describe each layer along with the weights used. You introduce a “Global Average Pooling Layer” in the figure that is not described or explained in the text. Where are the 512-dimensional features in this figure or figure 1? All “trees” are identical and have the same label (Tree1{..}). Is this intentional? It would be helpful to have the response variable (Slum indicator for 67,204 clusters) documented in this figure or figure 1. Clearer descriptions of your workflow and the input/output of each step would be more useful than the bottom portion of the figure illustrating XGBoost decision tree splits. This is documented elsewhere.
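To pin down the ambiguity about the 3/7 factor, my reading of the weight-copy step (line 256) can be sketched as follows; the `scale_all` flag toggles between the two interpretations (all names here are mine, not the authors'):

```python
import numpy as np

def expand_conv_weights(w_rgb, n_channels=7, scale_all=True):
    """Expand a pretrained 3-channel conv kernel to n_channels.

    w_rgb : array (out_ch, 3, k, k) of RGB weights from ResNet.
    The extra channels receive the mean of the RGB weights; then
    either all channels (scale_all=True) or only the new ones are
    rescaled by 3/n_channels to preserve activation magnitude.
    """
    mean_w = w_rgb.mean(axis=1, keepdims=True)           # (out_ch, 1, k, k)
    extra = np.repeat(mean_w, n_channels - 3, axis=1)    # averaged channels
    w = np.concatenate([w_rgb, extra], axis=1)
    if scale_all:
        w = w * (3.0 / n_channels)
    else:
        w[:, 3:] *= 3.0 / n_channels
    return w

# With scale_all=True and a uniform input, the 7-channel kernel
# reproduces the original 3-channel response: 7 * (3/7) == 3.
w7 = expand_conv_weights(np.ones((1, 3, 1, 1)))
assert np.isclose(w7.sum(), 3.0)
```

Stating which of these two variants was used (and showing the 512-dimensional feature output explicitly in Figure 1 or 2) would resolve the ambiguity.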
Fig 3 – Check title. These models are across geographical regions and income groups (line 304-5) not income only.
Fig 4a - Nice
Fig 4b – Nice. The bar charts may be more informative as a stacked bar chart like Fig 5. I concede that incorporating the different scale of Europe and Central Asia may preclude this suggestion, so I leave it to the authors' discretion.
Fig. 6 – It is unusual to introduce a figure like this in discussion. These belong in results. Figure title should be more objective and avoid the authors' interpretation.
Fig 7 – Same as Fig 6 re: section. The caption needs more explanation and titles for each subplot.
7(a) The asterisks should be explained in the caption. What are the n for each boxplot? Are (a) and (b) related; are the “Other literature” values in (b) the averages of (a), and if so why are some cities in one but not the other? If not, it's confusing to have some cities (e.g., Rio de Janeiro) as a range in (a) and a single value in (b). More caption information would help clarify this.
7(c) – What are each dot? A country, select cities?
7(d) – Should the legend entry for Other literature be UN-Habitat? Should UN-Habitat be in the label for the x-axis?
Recommend hiding the top and right spines in (a) and (c) to match the other two plots.
Abstract
Line 32 - … outperforming previous benchmarks. I don’t recall mention or reference to previous benchmarks in this paper.
Supplement
Figure S1 – This is a slightly difficult map to interpret. I would suggest emphasizing the 129 countries in the Global South (or de-emphasizing the non-129), and perhaps a gradient based on the number of surveys in each country. This would highlight the countries with more, less, and no training data. However, I leave this to the authors' discretion.
Minor Comments
142 – deprive? Should this be derive?
158 – 161. Seems to be missing 1 or more citations.
172 – 174. Why is it particularly suitable for long-term trends? Either provide citations of its use or explain.
249 – what are fine-grained mapping tasks?
280 – The claim that XGBoost is renowned needs citations.
305 – A brief explanation of how you set/acquired the income levels for each country would be useful. A map of the 9 geographic/income regions would be helpful but is entirely optional.
328 – The 129 countries are from UN Finance Center? This should be introduced on first mention of the 129 countries. Is there a citation?
673 – 675. RMSE values ranging from … when compared to our slum indicators derived from DHS surveys? Consider expanding on the text a bit to make the conclusion more impactful.
Citation: https://doi.org/10.5194/essd-2025-260-RC2
Data sets
Geospatial micro-estimates of slum populations in 129 Global South countries using machine learning and public data Dan Li https://zenodo.org/records/13779003