A high-quality gap-filled daily ETo dataset for China during 1951–2021 from synoptic stations using machine learning models

Zhou, Ning Shan; Wu, Li Feng; Yang, Qi Liang; Dong, Jianhua; Yang, Ling; Li, Yue

DOI: https://doi.org/10.5194/essd-2024-229

Preprints

22 Jul 2024

Status: this discussion paper is a preprint. It has been under review for the journal Earth System Science Data (ESSD). The manuscript was not accepted for further review after discussion.

Abstract. Reference evapotranspiration (ETo) is essential for research on agricultural water consumption and the land-water cycle. Synoptic data from meteorological stations can provide reliable ground data for ETo estimation with the FAO-56 Penman-Monteith equation. However, the five primary variables this equation needs, namely maximum temperature (Tmax), minimum temperature (Tmin), sunshine duration (SSD), wind speed (Wind), and relative humidity (RH), often suffer severe data loss in synoptic records due to force majeure events. This data loss directly introduces severe gaps into the complex ETo records. Machine learning algorithms can fill various data gaps with low error rates; however, to achieve high data quality, the algorithms must be selected to suit the distinct types of data loss and trained independently. Here, based on the data characteristics, we investigated and classified the data gaps in the synoptic dataset into 2 major types: the common, minor data loss gaps including Tmax loss/Tmin loss/SSD loss/Wind loss/RH loss/Wind and SSD loss/Wind and RH loss, and the other 19 types of data loss, which lose more information but rarely occur. Our results show that the XGBoost model achieved the best accuracy among the 3 machine learning models, with high statistical scores. For the other 19 types of data gaps, LSTM models were trained separately for each site and achieved average R², RMSE, and nRMSE of 0.9, 0.5 mm d^-1, and 38 %, respectively, across all 2419 stations. We thus present a high-quality, gap-filled daily ETo dataset for China during 1951–2021 in which the proportion of large errors (daily ETo errors of more than 1.5 mm d^-1) is below 0.2 %. Our results also reveal that the degree of entanglement among synoptic variables varies considerably from region to region in China. Although most research indicates that wind speed is not very important for ETo estimation with machine learning models, our findings reveal that wind speed played a more significant role in ETo estimation in most areas of China during the years before the 21st century. However, the impact of wind speed on ETo has diminished in recent years. This ETo dataset for China is available online at https://doi.org/10.5281/zenodo.11496932 (Zhou et al., 2024).
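To make the role of the five variables concrete, the following is a minimal Python sketch of the daily FAO-56 Penman-Monteith calculation for a day with no missing inputs. It is an illustration, not the authors' implementation. The function name `eto_fao56` and its parameter names are hypothetical. The sketch assumes that RH is a daily mean, that wind speed is already expressed at 2 m height, that the default Angstrom coefficients (0.25 and 0.50) and an albedo of 0.23 apply, and that daily soil heat flux is zero.

```python
import math


def eto_fao56(tmax, tmin, rh_mean, u2, sunshine_h, lat_deg, elev_m, doy):
    """Daily FAO-56 Penman-Monteith ETo in mm d^-1.

    tmax, tmin in deg C; rh_mean in %; u2 in m s^-1 at 2 m (assumed);
    sunshine_h in hours; lat_deg in decimal degrees; elev_m in metres;
    doy is the day of year (1-366). Soil heat flux G is taken as 0.
    """
    tmean = (tmax + tmin) / 2.0

    def e0(t):
        # Saturation vapour pressure (kPa) at air temperature t (deg C).
        return 0.6108 * math.exp(17.27 * t / (t + 237.3))

    es = (e0(tmax) + e0(tmin)) / 2.0                  # mean saturation vapour pressure
    ea = rh_mean / 100.0 * es                         # actual vapour pressure from mean RH
    delta = 4098.0 * e0(tmean) / (tmean + 237.3) ** 2 # slope of the vapour pressure curve
    pressure = 101.3 * ((293.0 - 0.0065 * elev_m) / 293.0) ** 5.26
    gamma = 0.665e-3 * pressure                       # psychrometric constant

    # Extraterrestrial radiation Ra and daylight hours from latitude and date.
    # math.acos fails poleward of the polar circles; fine for mid-latitudes.
    phi = math.radians(lat_deg)
    dr = 1.0 + 0.033 * math.cos(2.0 * math.pi * doy / 365.0)
    decl = 0.409 * math.sin(2.0 * math.pi * doy / 365.0 - 1.39)
    ws = math.acos(-math.tan(phi) * math.tan(decl))
    ra = (24.0 * 60.0 / math.pi) * 0.0820 * dr * (
        ws * math.sin(phi) * math.sin(decl)
        + math.cos(phi) * math.cos(decl) * math.sin(ws)
    )
    daylight_h = 24.0 / math.pi * ws

    # Solar radiation from sunshine duration (Angstrom formula), then net radiation.
    rs = (0.25 + 0.50 * sunshine_h / daylight_h) * ra
    rso = (0.75 + 2e-5 * elev_m) * ra                 # clear-sky radiation
    rns = (1.0 - 0.23) * rs                           # net shortwave, albedo 0.23
    sigma = 4.903e-9                                  # Stefan-Boltzmann, MJ K^-4 m^-2 d^-1
    rnl = (
        sigma * ((tmax + 273.16) ** 4 + (tmin + 273.16) ** 4) / 2.0
        * (0.34 - 0.14 * math.sqrt(ea))
        * (1.35 * rs / rso - 0.35)
    )
    rn = rns - rnl

    numerator = 0.408 * delta * rn + gamma * 900.0 / (tmean + 273.0) * u2 * (es - ea)
    denominator = delta + gamma * (1.0 + 0.34 * u2)
    return numerator / denominator


# Example call with illustrative (made-up) midsummer inputs at about 51 deg N.
print(round(eto_fao56(21.5, 12.3, 73.5, 2.078, 9.25, 50.8, 100.0, 187), 2))
```

Because every term depends on at least one of Tmax, Tmin, SSD, Wind, or RH, a gap in any of them prevents the direct calculation for that day, which is the situation the gap-filling models are meant to address.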

Received: 13 Jun 2024 – Discussion started: 22 Jul 2024

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Status: closed

RC1: 'Comment on essd-2024-229', Anonymous Referee #1, 19 Aug 2024
Reference ET is a critical component of water cycling, impacting hydrological and biogeochemical processes. The authors reconstructed a long-term history of daily ET using machine learning approaches, which could be highly influential in both the research and management communities. However, several major issues must be addressed before publication in ESSD. The comments are detailed below.
Comments:
There are several reference ET datasets already developed for China (e.g., Ding and Peng, 2021; Tian et al., 2018; Wang et al., 2017; Yao et al., 2014). The authors should thoroughly review these datasets and identify the knowledge gaps. The innovation of this dataset needs to be clarified.

The machine learning modeling is unclear. The paragraph starting at line 242 and the flow chart suggest that the data with gaps were used for gap-filling. However, it is unclear which data were used for training and validation. The authors need to provide a more detailed explanation of their machine learning process.

The results are not sufficiently presented. As a reader, I expect to see the implications of this dataset, such as the temporal trends, spatial patterns, and the drivers behind these patterns. The authors should explore multiple dimensions of this dataset and add some insightful discussion.

The introduction and method sections are overly lengthy and wordy. Many parts can be condensed for clarity and conciseness. There are also numerous grammar issues, typos, and non-academic expressions that are very distracting. More importantly, the content needs to be reorganized. For example, the last paragraph of the introduction includes some results, which confuses readers.

The dataset's robustness and reliability should be validated through comparisons with existing products. To build confidence among users, the authors should perform and present comparative analyses from different perspectives, demonstrating the strengths and potential limitations of their dataset against established reference datasets.

I personally think that expanding these points to a spatial map would significantly increase the value of this dataset. This dataset would become a more powerful tool for researchers and policymakers. It would make the dataset more comprehensive, offering a multi-dimensional perspective that is crucial for in-depth analysis and application.

Ding, Y., & Peng, S. (2021). Spatiotemporal change and attribution of potential evapotranspiration over China from 1901 to 2100. Theoretical and Applied Climatology, 145, 79–94. https://doi.org/10.1007/s00704-021-03625-w
Tian, Y., Zhang, K., Xu, Y. P., Gao, X., & Wang, J. (2018). Evaluation of potential evapotranspiration based on CMADS reanalysis dataset over China. Water, 10(9), 1126.
Wang, Z., Xie, P., Lai, C., Chen, X., Wu, X., Zeng, Z., & Li, J. (2017). Spatiotemporal variability of reference evapotranspiration and contributing climatic factors in China during 1961–2013. Journal of Hydrology, 544, 97-108.
Yao, Y., Zhao, S., Zhang, Y., Jia, K., & Liu, M. (2014). Spatial and decadal variations in potential evapotranspiration of China based on reanalysis datasets during 1982–2010. Atmosphere, 5(4), 737-754.
Citation: https://doi.org/10.5194/essd-2024-229-RC1
RC2: 'Comment on essd-2024-229', Anonymous Referee #2, 19 Aug 2024
The manuscript used four machine learning algorithms to produce a high-quality gap-filled daily ETo dataset for China during 1951-2021. Generally, the subject falls within the scope of the journal, and this dataset is important because ETo is in high demand in many fields, such as meteorology and agriculture. However, the quality of the figures and tables, the expressions, the organization, and the level of English should be improved substantially before possible publication. My detailed comments and suggestions are below:
Please provide the full name when an abbreviation first appears.

In the introduction, the organization should be strengthened. The connections between paragraphs are weak. It should not be a list of references. More recent literature should be cited.

The research objectives and significance of this study in the last paragraph are not clear. The sentences in lines 128-129 are incomplete.

The justification for the selection and adjustment of model parameters is unclear. More detailed explanations are needed.

The quality and style of some figures and tables should be improved.

The results section needs improvement. It is too simple at present.

The discussion section is weak. It should use more recent references and compare the results with relevant studies.

The treatment of the study's limitations and future research is insufficient. The authors should explore these aspects further.

There are seven climate zones in the manuscript. The number of stations in each climate zone is not clear.

The units in Figures 2 and 4a and Tables 1 and 3 should be “d”, not “d^-1”. Please check for other occurrences and revise.

Table 2 needs improvement; it is too ugly.

I suggest combining Fig. 3 with Table 2.

Is Figure 7 correct? Please check. I suggest changing "true" to "measured".

Table 4 is too ugly.
Citation: https://doi.org/10.5194/essd-2024-229-RC2
CC1: 'Comment on essd-2024-229', SHAOBO SUN, 29 Sep 2024

The authors produced a reference evapotranspiration dataset for China by using gap-filled synoptic data. However, the manuscript was poorly written and organized. There are too many language problems. The subheadings are also incomplete or improper. I tried to follow the methods used to fill the data gaps, but the methods were not clearly introduced. The advances offered by your dataset were not discussed. How the dataset outperforms existing datasets should be discussed. Moreover, the details on the machine learning algorithms are redundant.
Specific comments:
1. The title is too long and complex.
2. "Tmax loss/Tmin loss/SSD loss/Wind loss/RH loss/Wind and SSD loss/Wind and RH loss": this is difficult for readers.
3. "our findings reveal that wind speed played a more significant role in ETo estimation in most areas of China during the years before the 21st century." If possible, explain why.
4. "The ground-information-based ETo products can offer higher spatial-temporal resolution": this is odd. Ground-based estimates are at site scale.
5. Line 55: "recording method". What do you mean?
6. under limited data - using limited data
7. "machine learning algorithms could achieve high-quality regression processes undersufficient information and simulate different types of data automatically ": this exaggerates machine learning models.
8. " These methods are famousfor their robustness and convenience in computation. ": this exaggerates machine learning models.
9. accuracy. . - accuracy.
10. large climate models - What does this mean?
11. creat - create?
12. complex daily ETo dataset for China - What does this mean?
13. complex meteorological dataset - What does this mean?
14. " where ∆𝑀! is Ut rutrum, sapien et vulputate molestie, augue velit consectetur lectus, bibendum porta justo odio lobortis ligula.In in urna nec arcu iaculis accumsan nec et quam. Integer ut orci mollis, varius justo vitae, pellentesque leo. Ut." - Should these sentences be placed in another section?
15. information source - Strange term
16. This dataset provide - This dataset provides
17. Subheadings 2.2.1 and 2.2.2 are the same.
18. Lines 260-290: You do not need to introduce details of the machine learning algorithms, which are well known to readers.
19. RMSE (Root Mean Squared Error) - Root Mean Squared Error (RMSE). The same problem occurs when defining other abbreviations.
20. 195-2021 should be 1951-2021.
21. 3.2 Simulate Results; 3.3 Large error; too simple

Citation: https://doi.org/10.5194/essd-2024-229-CC1

Data sets

A high-quality gap-filled daily ETo dataset for China during 1951-2021 from synoptic stations, Ning Shan Zhou et al., https://doi.org/10.5281/zenodo.11496932

Viewed

Total article views: 1,278 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	BibTeX	EndNote
973	197	108	1,278	46	53

Views and downloads (calculated since 22 Jul 2024)

Month	HTML	PDF	XML	Total
Jul 2024	84	26	6	116
Aug 2024	133	27	8	168
Sep 2024	73	10	3	86
Oct 2024	48	2	33	83
Nov 2024	55	5	44	104
Dec 2024	51	8	5	64
Jan 2025	39	8	1	48
Feb 2025	26	4	0	30
Mar 2025	16	14	2	32
Apr 2025	14	9	0	23
May 2025	12	10	2	24
Jun 2025	25	12	0	37
Jul 2025	25	7	1	33
Aug 2025	74	16	0	90
Sep 2025	278	14	1	293
Oct 2025	20	25	2	47

Viewed (geographical distribution)

Total article views: 1,261 (including HTML, PDF, and XML) Thereof 1,261 with geography defined and 0 with unknown origin.

Latest update: 28 Oct 2025

Short summary

We created a highly precise dataset for daily water needs in China from 1951–2021, using machine learning to fill data gaps at 2419 weather stations. Independent models were trained for minor gaps, and LSTM models addressed severe gaps. Our research also examined the relationships between various weather parameters affecting water needs.
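The split between minor and severe gaps can be illustrated with a short Python sketch that labels one daily record by which of the five variables are missing. This is a sketch under stated assumptions, not the authors' code: the names `VARIABLES`, `MINOR_PATTERNS`, `is_missing`, and `classify_gap` are hypothetical, the seven minor patterns are taken from the list in the abstract, and every other non-empty combination is simply treated as severe.

```python
import math

# The five inputs of the FAO-56 Penman-Monteith equation named in the abstract.
VARIABLES = ("Tmax", "Tmin", "SSD", "Wind", "RH")

# The seven common, minor loss patterns listed in the abstract.
MINOR_PATTERNS = {
    frozenset({"Tmax"}),
    frozenset({"Tmin"}),
    frozenset({"SSD"}),
    frozenset({"Wind"}),
    frozenset({"RH"}),
    frozenset({"Wind", "SSD"}),
    frozenset({"Wind", "RH"}),
}


def is_missing(value):
    """Treat None and NaN as missing (an assumption about how gaps are encoded)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def classify_gap(record):
    """Label one daily record (a dict keyed by variable name).

    Returns "complete" if all five variables are present, "minor" if the set
    of missing variables is one of the seven listed patterns, else "severe".
    """
    missing = frozenset(name for name in VARIABLES if is_missing(record.get(name)))
    if not missing:
        return "complete"
    if missing in MINOR_PATTERNS:
        return "minor"
    return "severe"


day = {"Tmax": 25.1, "Tmin": 14.0, "SSD": 8.2, "Wind": None, "RH": None}
print(classify_gap(day))  # prints "minor" (the "Wind and RH loss" pattern)
```

Under this reading, "complete" days could be computed directly with the Penman-Monteith equation, "minor" days would go to a model trained for that specific loss pattern, and "severe" days would go to the per-station LSTM models; how the authors actually route records is not described on this page.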

