the Creative Commons Attribution 4.0 License.
A 100-m gridded population dataset of China’s seventh census using ensemble learning and geospatial big data
Abstract. China has undergone rapid urbanization and internal migration in past years and its up-to-date gridded population datasets are essential for diverse applications. Existing datasets for China, however, suffer from either outdatedness or failure to incorporate the latest seventh national population census data conducted in 2020. In this study, we develop a novel population downscaling approach that leverages stacking ensemble learning and geospatial big data to produce up-to-date population grids at a 100-m resolution for China from the seventh census data at both county and town levels. The proposed approach employs random forest, XGBoost, and LightGBM as base models for stacking ensemble learning and delineates the inhabited areas from geospatial big data to enhance the gridded population estimation. Experimental results demonstrate that the proposed approach exhibits the best fit performance compared to individual base models. Meanwhile, the out-of-sample town-level test set indicates that the estimated gridded population dataset (R²=0.8936) is more accurate than existing WorldPop (R²=0.7427) and LandScan (R²=0.7165) products for China in 2020. Furthermore, with the inhabited areas enhancement, the spatial distribution of population grids is more reasonable intuitively than the two existing products. Hence, the proposed population downscaling approach provides a valuable option for producing gridded population datasets. The estimated 100-m gridded population dataset of China holds great significance for future applications and it is publicly available at https://figshare.com/s/d9dd5f9bb1a7f4fd3734 (Chen et al., 2024).
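The downscaling idea summarized in the abstract can be illustrated with a minimal sketch. It assumes a generic dasymetric scheme, not necessarily the authors' exact procedure: a model supplies a non-negative density weight per grid cell, cells outside the inhabited areas are zeroed, and the census count of the enclosing unit is split in proportion to the remaining weights. The function name `downscale_unit` and the even-split fallback are hypothetical.

```python
def downscale_unit(unit_population, weights, inhabited):
    """Split one census unit's population across its grid cells.

    weights   -- predicted density weight per cell (non-negative, any scale)
    inhabited -- one boolean per cell; uninhabited cells receive zero
    """
    # Zero out the weights of cells outside the inhabited mask.
    masked = [w if ok else 0.0 for w, ok in zip(weights, inhabited)]
    total = sum(masked)
    if total == 0:
        # Hypothetical fallback: no usable weight, so split evenly.
        return [unit_population / len(weights)] * len(weights)
    # Proportional allocation keeps the cell sum equal to the census count.
    return [unit_population * m / total for m in masked]


cells = downscale_unit(1000, [2.0, 1.0, 5.0, 1.0], [True, True, False, True])
```

Because only the relative size of the weights matters, the cell values always sum to the census count of the unit, whatever scale the model's predictions are on.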
Status: open (until 05 May 2024)
RC1: 'Comment on essd-2023-541', Anonymous Referee #1, 06 Apr 2024
Downscaling census data into population grids can address the limitations of census data with irregular units. This paper proposed a new population downscaling method using ensemble learning and geospatial big data. The method is adopted to generate a 100-m gridded population dataset of China's seventh census. The accuracy assessment on the generated population dataset shows that it has higher accuracy than the two existing datasets of WorldPop and LandScan. In general, the paper is well-written, clear, concise and complete in structure. I believe the generated dataset is important to a wide range of geoscience applications. However, the following comments need to be addressed:
1. Line 100: what resampling method was used for the variable?
2. How to get the number of POIs and the road length for each grid in Section 2.2?
3. Although the authors provided the downscaling procedure (Fig. 4), it is still unclear why 'population density' is used instead of the direct population count.
4. In Section 5.2, how do you get the feature importance for stacking? How about the three base models?
5. Section 5.3 presents the parameter selection, what's the search interval within the search space?
6. Please add the caption of Fig. 1c.
7. Please enlarge the font size of Figs. 1-3.
Citation: https://doi.org/10.5194/essd-2023-541-RC1
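Comment 2 above asks how per-cell POI counts could be obtained. One common way, shown here as a sketch and not as the authors' procedure, is to bin projected point coordinates (in metres) into 100-m cells by integer division. The grid origin, the row direction (rows increase with y), and the function names are all assumptions made for illustration. Road length per cell would additionally require clipping line geometries to cell boundaries, which this sketch does not cover.

```python
import math
from collections import Counter

CELL_SIZE_M = 100.0  # grid resolution discussed in the paper


def cell_index(x, y, origin_x=0.0, origin_y=0.0, cell_size=CELL_SIZE_M):
    """Return the (row, col) of the cell containing a projected point."""
    col = math.floor((x - origin_x) / cell_size)
    row = math.floor((y - origin_y) / cell_size)
    return row, col


def count_points_per_cell(points, **grid):
    """Count points (e.g. POIs) falling in each grid cell."""
    return Counter(cell_index(x, y, **grid) for x, y in points)


# Hypothetical POI coordinates in a metric projection.
pois = [(10.0, 20.0), (99.9, 0.0), (100.0, 0.0), (250.0, 130.0)]
counts = count_points_per_cell(pois)
```

A `Counter` returns zero for cells without any point, so empty cells need no special handling.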
CC1: 'Comment on essd-2023-541', Lingling Li, 09 Apr 2024
1. The Tencent user positioning density data are instantaneous and differ significantly between day and night, and even between time periods. At what time were the data used in this paper acquired? Regardless of the time period, however, such data are not well suited to census data and are better suited to predicting population density from survey data.
2. Similarly, please explain how the feature importance is obtained for the stacking ensemble. Given China's complex landforms, is the feature importance ranking applicable to the population distribution of all regions?
3. The article claims to have used an extensive dataset of 60 million POIs. Which specific POI categories were used? The impact of POI categories on population varies, so they should be selected and weighted.
4. The road data should also specify the categories used. For example, expressways are not suitable for inclusion, whereas expressway toll stations are.
Citation: https://doi.org/10.5194/essd-2023-541-CC1
RC2: 'Comment on essd-2023-541', Anonymous Referee #2, 28 Apr 2024
The authors develop a novel population downscaling approach that applies stacking ensemble learning and geospatial big data to produce 100-m population grid data. RF, XGBoost, and LightGBM are combined in the stacking ensemble. The results show high accuracy, and the proposed downscaling approach indeed provides a valuable option for producing gridded population datasets. However, there are still some problems to solve.
1. The Introduction is well written. However, the authors still need to add some research progress on population estimation. The authors should also say more about machine learning and ensemble learning methods, and how they compare.
2. In line 110, the authors used the NTL data and resampled them to 100 m. The authors should clarify the resampling method.
3. Some expressions should be more rigorous; for example, “100 grid” in line 105 should be “100m grid”.
4. The authors used the DEM and slope features from ALOS data for the population downscaling. However, do the DEM and slope features correlate strongly with population density? In western China, for example in Chongqing, the population is also distributed in areas with high elevation and steep slopes.
5. The authors applied the inhabited area data to enhance the accuracy of the gridded population estimates. The authors should clarify the data and method used to extract the inhabited areas. This step also introduces some errors; how can they be minimized?
6. The training data and test data are essential for the population estimation. The authors should introduce more about this.
7. Stacking ensemble learning is the central part of this work. What is the technical innovation of the proposed method compared with other methods? What are the technical references from similar studies?
8. The discussion should go deeper into the methods, result comparison, advantages and disadvantages, model transferability, validation, future research prospects, etc.
9. The figures should be improved. In Figure 4, the text is too large. Figure 5 should explain the ensemble learning more clearly. Figure 7 should be more standard and polished. Figure 8 should show the comparison between the existing products and the estimated results, not only the existing products. The unit of the y-axis in Figure 10 should be clarified.
10. The paper has a lot of good information and findings to highlight. What is the main contribution of this work from the perspective of the technical problem? Please add this to the abstract and conclusion accordingly.
11. The manuscript still exhibits minor issues in language expression, including spelling and grammatical errors. For example, some sentences in the introduction are overly complex, and the transitions between different viewpoints could be made smoother.
Citation: https://doi.org/10.5194/essd-2023-541-RC2
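Several of the comments above concern validation against held-out town-level census counts. As a hedged illustration, the sketch below sums gridded estimates per town and computes the coefficient of determination against the census counts. It assumes the common definition R² = 1 − SS_res / SS_tot; the paper may use a different definition (for example, the squared Pearson correlation). The town identifiers, numbers, and function names are invented for the example.

```python
from collections import defaultdict


def aggregate_to_units(cell_values, cell_unit_ids):
    """Sum gridded population estimates per census unit (e.g. town)."""
    totals = defaultdict(float)
    for value, unit in zip(cell_values, cell_unit_ids):
        totals[unit] += value
    return dict(totals)


def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Invented toy data: five cells belonging to three towns.
cell_estimates = [100.0, 200.0, 50.0, 150.0, 300.0]
cell_towns = ["A", "A", "B", "B", "C"]
census = {"A": 310.0, "B": 190.0, "C": 300.0}

estimated = aggregate_to_units(cell_estimates, cell_towns)
towns = sorted(census)
r2 = r_squared([census[t] for t in towns], [estimated[t] for t in towns])
```

Evaluating at the town level only makes sense for towns that were not used as the redistribution constraint, which is why an out-of-sample test set is needed.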
Viewed
- HTML: 888
- PDF: 231
- XML: 44
- Total: 1,163
- BibTeX: 29
- EndNote: 26