the Creative Commons Attribution 4.0 License.
A new upgraded high-precision gridded precipitation dataset considering spatiotemporal and physical correlations for mainland China
Abstract. Precipitation is a critical driver of the water cycle, profoundly influencing water resources, agricultural productivity, and natural disasters. However, existing gridded precipitation datasets exhibit marked deficiencies in capturing the spatiotemporal and physical correlations of precipitation, which limits their accuracy, particularly in regions with sparse meteorological stations. Therefore, this study proposes a completely new gridded precipitation generation scheme to address these issues. Long-term daily observations from 3,476 gauges, together with 11 precipitation-related variables, were used to characterize the correlations of precipitation. By employing an improved inverse distance weighting method combined with the machine learning-based light gradient boosting machine (LGBM) algorithm, a new high-precision, long-term, daily gridded precipitation dataset for mainland China (CHM_PRE V2) was developed, which aims to improve upon and surpass the CHM_PRE V1 dataset developed in our previous work. Validation against 63,397 high-density gauges demonstrated that CHM_PRE V2 significantly outperforms existing datasets, achieving a mean absolute error of 1.48 mm/day and a Kling-Gupta efficiency of 0.88, representing improvements of 12.84 % and 12.86 %, respectively, compared to the previously optimal dataset. Regarding precipitation event detection, CHM_PRE V2 achieved a Heidke skill score of 0.68 and a false alarm ratio of 0.24, surpassing other datasets by 17.24 % and 29.17 %, respectively. Feature importance analysis revealed that spatiotemporal and physical correlations contributed 37.10 %, 34.11 %, and 28.78 % to precipitation retrieval, underscoring the necessity of incorporating temporal and physical correlations. CHM_PRE V2 markedly enhances precipitation measurement accuracy, reduces overestimation of precipitation events, and provides a reliable foundation for hydrological modelling and climate assessments.
This dataset features a resolution of 0.1°, spans from 1960 to 2023, and will be updated annually. Free access to the dataset can be found at https://doi.org/10.5281/zenodo.14632157 (Hu and Miao, 2025).
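For readers unfamiliar with the four scores quoted in the abstract, the following is a minimal sketch of how they are commonly computed. It is not the authors' evaluation code: the Kling-Gupta efficiency uses the original 2009 formulation, the 0.1 mm/day wet-day threshold is an assumed value, and all function names are illustrative.

```python
import numpy as np

def mae(obs, sim):
    """Mean absolute error, in the units of the inputs (e.g. mm/day)."""
    return float(np.mean(np.abs(sim - obs)))

def kge(obs, sim):
    """Kling-Gupta efficiency (original 2009 form, assumed here)."""
    r = np.corrcoef(obs, sim)[0, 1]        # linear correlation
    alpha = np.std(sim) / np.std(obs)      # variability ratio
    beta = np.mean(sim) / np.mean(obs)     # bias ratio
    return float(1 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2))

def contingency(obs, sim, threshold=0.1):
    """Counts of hits, false alarms, misses and correct negatives.
    The wet-day threshold (mm/day) is an assumption of this sketch."""
    o, s = obs >= threshold, sim >= threshold
    hits = int(np.sum(o & s))
    false_alarms = int(np.sum(~o & s))
    misses = int(np.sum(o & ~s))
    correct_neg = int(np.sum(~o & ~s))
    return hits, false_alarms, misses, correct_neg

def far(obs, sim, threshold=0.1):
    """False alarm ratio: share of predicted events that did not occur."""
    a, b, _, _ = contingency(obs, sim, threshold)
    return b / (a + b)

def hss(obs, sim, threshold=0.1):
    """Heidke skill score: detection skill relative to random chance."""
    a, b, c, d = contingency(obs, sim, threshold)
    return 2 * (a * d - b * c) / ((a + c) * (c + d) + (a + b) * (b + d))
```

A perfect product scores MAE = 0, KGE = 1, FAR = 0 and HSS = 1, so lower is better for MAE and FAR and higher is better for KGE and HSS.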
Status: open (until 24 Apr 2025)
RC1: 'Comment on essd-2025-20', Guoqiang Tang, 16 Mar 2025
This study introduces a new precipitation dataset, CHM_PRE V2, for China, demonstrating notable accuracy improvements over its predecessor, CHM_PRE V1, as well as several other existing precipitation datasets. The work represents a valuable contribution to the field, particularly for researchers seeking high-quality precipitation data in China, and is well-suited for publication in ESSD. My comments are below; I hope they help the authors further improve the manuscript.
Comment:
My only concern is about the terminology used in this manuscript. The phrase “interpolation considering spatiotemporal and physical correlations” appears to introduce a new term for a method that has been widely used in prior research. While the authors aim to highlight the integration of spatial, temporal, and physical factors in their interpolation approach, the study does not explicitly quantify real correlation coefficients or provide a transparent framework for how these correlations are incorporated; instead, it employs a black-box approach where the spatiotemporal and physical correlations are not directly tangible.
For precipitation estimation, precipitation can be treated as a predictand, with various predictors such as static variables (latitude, longitude, elevation, slope) and dynamic variables (gridded precipitation datasets, soil moisture, precipitation climatology) as outlined in Table 1. Categorizing them strictly into spatial, temporal, and physical correlations (as done in Table 1 and elsewhere in the manuscript) may not accurately reflect their complex interdependencies. Many variables exhibit overlapping spatial, temporal, and physical correlations simultaneously. In addition, classifying GLDAS and satellite precipitation under “physical correlation” does not make sense, as it essentially implies that “precipitation correlates with precipitation”. Your approach looks more like merging multiple sources of precipitation data. Additionally, the physical correlation of NDVI on a daily scale is questionable and warrants further justification, since vegetation does not show an immediate response to precipitation.
Furthermore, the importance analysis based on these correlation classifications may not be reliable. For instance, if additional features are added to a specific category, I think this could artificially inflate the perceived importance of that category (e.g., Figure 5d).
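The inflation effect can be illustrated with a minimal sketch. All feature names and importance values below are hypothetical and are not taken from the manuscript; the helper `category_shares` is illustrative only. The sketch also assumes that an added feature brings its own importance, whereas in a real tree model importances would be redistributed after retraining.

```python
def category_shares(importance, category):
    """Sum normalized per-feature importances within each category."""
    total = sum(importance.values())
    shares = {}
    for name, value in importance.items():
        group = category[name]
        shares[group] = shares.get(group, 0.0) + value / total
    return shares

# Hypothetical features, each with the same raw importance.
category = {"lat": "spatial", "lon": "spatial",
            "lag1_prec": "temporal", "soil_moisture": "physical"}
importance = {name: 10.0 for name in category}
before = category_shares(importance, category)

# Add one more equally important feature to the "physical" category.
category["ndvi"] = "physical"
importance["ndvi"] = 10.0
after = category_shares(importance, category)

print(before)  # physical share is 0.25
print(after)   # physical share rises to 0.40
```

Here the "physical" share rises from 0.25 to 0.40 although no individual feature became more informative. Reporting the mean importance per feature alongside the category totals would make such comparisons less sensitive to category size.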
Given these concerns, I recommend that the authors avoid introducing a new term that may not accurately describe the method’s nature, especially given the extensive body of research on precipitation estimation. The authors’ approach of using new predictors (i.e., Table 1) can improve accuracy, but it falls within the field of feature engineering, which should be clarified in the manuscript.
Minor comments:
- About the title, the two words “new upgraded” seem redundant.
- There are three versions of datasets published at https://zenodo.org/records/14634575. What’s the difference among them?
- Lines 23 and 25: I suppose those improvements use CHM_PRE V1 as a benchmark. Please specify this.
- Please introduce the methodological difference between V2 and V1 datasets in the abstract.
- Line 26: What do the three numbers represent?
- Line 51: What interpolation method?
- Line 60: “are” should be “is”. Besides, putting “historical precipitation data” in this sentence is weird.
- Line 95: I think the station data cannot be freely accessed from this website ...
- Lines 222-223: This does not seem to be a solid reason for selecting LGBM.
- Section 3.3: The description of data training is unclear to me. I recommend that the authors use a few bullet points to explain the inputs and outputs of CHM_PRE production. For example, after reading Section 3.3 and looking at Table 1, I am still not sure which samples were used in model training.
Citation: https://doi.org/10.5194/essd-2025-20-RC1
RC2: 'Comment on essd-2025-20', Anonymous Referee #2, 22 Mar 2025
The study presents a novel and interesting approach to developing high-resolution gridded precipitation data (CHM_PRE v2.0) by integrating station data, multiple covariate factors, and machine learning techniques. Given the significant spatial and temporal variability of precipitation, the development of reliable and credible gridded precipitation datasets is crucial for hydroclimatological research. The authors have clearly put considerable effort into this work, particularly by incorporating covariate factors beyond traditional spatial interpolation methods. This dataset is likely to have broad applicability in the field. However, several issues need to be addressed to improve the clarity, rigor, and impact of the manuscript:
- The introduction mentions that the authors previously developed CHM_PRE v1.0. It is unclear how much innovation or improvement has been achieved in v2.0 compared to v1.0. The authors should provide a detailed explanation of the differences between the two versions and justify why a new release (v2.0) is necessary instead of simply updating v1.0. This is critical for readers to understand the added value of this new version.
- The authors fused data from 2,816 stations to produce 0.1-degree gridded precipitation data. However, the rationale for choosing 0.1-degree resolution over finer resolutions (e.g., 0.05-degree or 1 km) is not explained. Given the availability of high-resolution precipitation datasets in China, including sub-daily data, the authors should discuss why 0.1-degree resolution was selected and whether finer resolutions were considered.
- The terms "Spatiotemporal correlated data" and "Physically correlated data" are introduced but not clearly defined. A more detailed explanation of these terms is necessary to ensure readers fully understand the methodology and its theoretical basis.
- The manuscript contains numerous abbreviations, which hinder the readability and flow of the text. The authors should minimize the use of abbreviations or provide a glossary for reference.
- In CHM_PRE v1.0, the authors used ADW (Anisotropic Distance Weighting) interpolation, but in v2.0, they reverted to IDW (Inverse Distance Weighting). The rationale for this change is not explained. The authors should clarify why IDW was chosen for v2.0 and how it compares to ADW in terms of performance.
- The term "CMA-HD" is used but not defined. The authors should provide a clear explanation of what this term refers to.
- The authors evaluated the dataset using 63,397 station data points. However, instead of interpolating these station data to 0.1-degree grids using IDW or ADW, they averaged the station values within each 0.1-degree grid for accuracy assessment. The rationale for this approach should be explained, as interpolation might provide a more consistent comparison.
- The units of variables in Equations 1-8 are not provided. The authors should include the units to ensure clarity and reproducibility.
- The relative importance plot in Figure 5 is not well explained. Specifically, it is unclear how the relative importance values were calculated and what "2nd-day prior Prec." and "5th-day prior Prec." represent. Are these cumulative values? A more detailed explanation is needed.
- Figure 6 shows notably high absolute errors in the NC and SCC regions. The authors should discuss the potential reasons for these high errors and whether they are related to regional characteristics or methodological limitations.
- From the perspective of RSD (Relative Standard Deviation), it appears that GSMaP might have higher accuracy than CHM_PRE v1.0. The authors should address this observation and discuss how CHM_PRE v2.0 compares to GSMaP in terms of performance.
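The grid-cell averaging mentioned in the validation comment above can be sketched as follows. This is an illustration only, not the authors' code: it assumes cells aligned to multiples of the resolution, and the function name `cell_means` and the sample coordinates are hypothetical.

```python
from collections import defaultdict
import numpy as np

def cell_means(lon, lat, value, res=0.1):
    """Average station values falling inside each res x res degree cell.

    Cells are indexed by (floor(lon / res), floor(lat / res)); this
    alignment is an assumption of the sketch.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x, y, v in zip(lon, lat, value):
        key = (int(np.floor(x / res)), int(np.floor(y / res)))
        sums[key] += v
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

# Three hypothetical stations: the first two share a cell, the third does not.
means = cell_means(lon=[110.02, 110.07, 110.15],
                   lat=[30.03, 30.08, 30.03],
                   value=[2.0, 4.0, 6.0])
```

Unlike interpolation, this leaves cells without stations empty, so a gridded product is compared only where observations actually exist.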
Citation: https://doi.org/10.5194/essd-2025-20-RC2
Data sets
CHM_PRE V2: A new upgraded high-precision gridded precipitation dataset considering spatiotemporal and physical correlations for mainland China. Jinlong Hu and Chiyuan Miao. https://doi.org/10.5281/zenodo.14632157