the Creative Commons Attribution 4.0 License.
A high-quality gap-filled daily ETo dataset for China during 1951–2021 from synoptic stations using machine learning models
Abstract. Reference evapotranspiration (ETo) is essential for estimating agricultural water consumption and for land-water cycle research. Synoptic data from meteorological stations provide reliable ground data for estimating ETo with the FAO-56 Penman-Monteith equation. However, the five primary variables this equation needs, namely maximum temperature (Tmax), minimum temperature (Tmin), sunshine duration (SSD), wind speed (Wind), and relative humidity (RH), often suffer severe data loss in synoptic records due to force majeure events. This data loss directly introduces severe gaps into the ETo records derived from them. Machine learning algorithms can fill various data gaps with low error rates; however, to achieve high data quality, the algorithms must be selected properly for the distinct types of data loss and trained independently. Here, based on the characteristics of the data, we investigated and classified data gaps in the synoptic dataset into 2 major types: the common, minor data loss gaps including Tmax loss/Tmin loss/SSD loss/Wind loss/RH loss/Wind and SSD loss/Wind and RH loss, and the other 19 types of data loss, which lose more information but rarely occur. Our results show that the XGBoost model achieved the best accuracy of the 3 machine learning models, with high statistical scores. For the other 19 types of data gaps, LSTM models were trained separately for each site and achieved average R², RMSE, and nRMSE of 0.9, 0.5 mm d-1, and 38 %, respectively, across all 2419 stations. We thus present a high-quality, gap-filled daily ETo dataset for China covering 1951–2021, in which the proportion of large errors (daily ETo errors greater than 1.5 mm d-1) is below 0.2 %. Our results also reveal that the degree of entanglement among synoptic variables varies considerably from region to region in China. Although most research indicates that wind speed is not very important for ETo estimation with machine learning models, our findings reveal that wind speed played a more significant role in ETo estimation in most areas of China during the years before the 21st century. The impact of wind speed on ETo has, however, weakened in recent years. This ETo dataset for China is available online at https://doi.org/10.5281/zenodo.11496932 (Zhou et al., 2024).
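The abstract summarizes accuracy with R², RMSE, nRMSE, and the share of large errors (daily ETo errors above 1.5 mm d-1). The following is a minimal sketch of how such scores could be computed for one station, not the authors' code. The function name `eto_error_metrics` is hypothetical, and two definitions are assumptions because the abstract does not state them: R² is taken as the coefficient of determination, and nRMSE as RMSE divided by the mean of the reference values.

```python
import math


def eto_error_metrics(observed, predicted, large_error_mm=1.5):
    """Score gap-filled daily ETo (mm per day) against reference values.

    Assumptions (not confirmed by the source): R2 is the coefficient of
    determination, and nRMSE is RMSE normalized by the mean reference value.
    """
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]
    mean_obs = sum(observed) / n

    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)  # total sum of squares
    rmse = math.sqrt(ss_res / n)

    # Count days whose absolute error exceeds the large-error threshold.
    n_large = sum(1 for e in errors if abs(e) > large_error_mm)

    return {
        "r2": 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan"),
        "rmse": rmse,
        "nrmse_pct": 100.0 * rmse / mean_obs if mean_obs != 0 else float("nan"),
        "large_error_pct": 100.0 * n_large / n,
    }


# Toy values for illustration only; they are not taken from the dataset.
scores = eto_error_metrics([2.0, 4.0, 6.0, 8.0], [2.0, 4.0, 6.0, 10.0])
print(scores)  # r2 ≈ 0.8, rmse = 1.0, nrmse_pct ≈ 20, large_error_pct = 25
```

A different normalization for nRMSE (for example, by the range of the reference values) would change only the `nrmse_pct` line.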
Status: closed
RC1: 'Comment on essd-2024-229', Anonymous Referee #1, 19 Aug 2024
Reference ET is a critical component in water cycling, impacting hydrological and biogeochemical processes. The authors reconstructed a long-term history of daily ET using machine learning approaches, which could be highly influential in both the research and management communities. However, several major issues must be addressed before its publication in the ESSD. The comments are detailed below.
Comments:
- There are several reference ET datasets already developed for China (e.g., Ding et al., 2021; Tian et al., 2018; Wang et al., 2017; Yao et al., 2014). The authors should thoroughly review these datasets and identify the knowledge gaps. The innovation of this dataset needs to be clarified.
- The machine learning modeling is unclear. The paragraph from line 242 and the flow chart suggest that the data with missing gaps were used for gap-filling. However, it is unclear which data were used for training and validation. The authors need to provide a more detailed explanation of their machine learning process.
- The results are not sufficiently presented. As a reader, I expect to see the implications of this dataset, such as the temporal trend, the spatial patterns, and the drivers behind these patterns. The authors should explore multiple dimensions of this dataset and add some insightful discussion.
- The introduction and method sections are overly lengthy and very wordy. Many parts can be condensed for clarity and conciseness. There are also numerous grammar issues, typos, and non-academic expressions that are very distracting. More importantly, the content needs to be reorganized. For example, the last paragraph of the introduction includes some results, which confuses readers.
- The dataset's robustness and reliability should be validated through comparisons with existing products. To build confidence among users, the authors should perform and present comparative analyses from different perspectives, demonstrating the strengths and potential limitations of their dataset against established reference datasets.
- I personally think that expanding these points to a spatial map would significantly increase the value of this dataset. This dataset would become a more powerful tool for researchers and policymakers. It would make the dataset more comprehensive, offering a multi-dimensional perspective that is crucial for in-depth analysis and application.
Ding, Y., & Peng, S. (2021). Spatiotemporal change and attribution of potential evapotranspiration over China from 1901 to 2100. Theoretical and Applied Climatology, 145, 79–94. https://doi.org/10.1007/s00704-021-03625-w
Tian, Y., Zhang, K., Xu, Y. P., Gao, X., & Wang, J. (2018). Evaluation of potential evapotranspiration based on CMADS reanalysis dataset over China. Water, 10(9), 1126.
Wang, Z., Xie, P., Lai, C., Chen, X., Wu, X., Zeng, Z., & Li, J. (2017). Spatiotemporal variability of reference evapotranspiration and contributing climatic factors in China during 1961–2013. Journal of Hydrology, 544, 97-108.
Yao, Y., Zhao, S., Zhang, Y., Jia, K., & Liu, M. (2014). Spatial and decadal variations in potential evapotranspiration of China based on reanalysis datasets during 1982–2010. Atmosphere, 5(4), 737-754.
Citation: https://doi.org/10.5194/essd-2024-229-RC1
RC2: 'Comment on essd-2024-229', Anonymous Referee #2, 19 Aug 2024
The manuscript (MS) used four machine learning algorithms to produce a high-quality gap-filled daily ETo dataset for China during 1951-2021. Generally, the subject falls within the scope of the journal, and this dataset is important because ETo is required in many fields, such as meteorology and agriculture. However, the quality of the figures and tables, the expressions, the organization, and the English need to be improved considerably before possible publication. My detailed comments and suggestions are below:
- Please provide the full name when an abbreviation first appears.
- In the introduction, the organization should be strengthened. The connections among paragraphs are weak. It should not be a list of references. More recent literature should be cited.
- The research objectives and significance of this study in the last paragraph are not clear. The sentences in lines 128-129 are incomplete.
- The detailed justification for the selection and adjustment of model parameters is unclear. More detailed explanations are needed.
- The quality and style of some figures and tables should be improved.
- The results section needs to be improved. It is too simple at present.
- The discussion section is weak. It should use more recent references and compare the findings with relevant studies.
- The treatment of the limitations of the study and of future research is insufficient. The authors should explore these aspects further.
- There are seven climate zones in the MS. The number of stations corresponding to each climate zone is not clear.
- The units in Figure 2, 4a and Table 1, 3 should be “d”, not “d-1”. Please check for other places and revise.
- Table 2 needs to be improved; it is too ugly.
- I suggest combining Fig. 3 into Table 2.
- Is Figure 7 correct? Please check. I suggest changing “true” to “measured”.
- Table 4 is too ugly.
Citation: https://doi.org/10.5194/essd-2024-229-RC2
CC1: 'Comment on essd-2024-229', SHAOBO SUN, 29 Sep 2024
The authors produced a reference evapotranspiration dataset for China by using gap-filled synoptic data. However, the manuscript was poorly written and organized. There are too many language problems. The subheads are also incomplete or improper. I tried to follow the methods used to fill data gaps; however, the methods were not clearly introduced. The advancements of your dataset were not discussed. How the dataset outperforms existing datasets should be discussed. Moreover, details on the machine learning algorithms are redundant.
Specific comments:
1. The title is too long and complex.
2. "Tmax loss/Tmin loss/SSD loss/Wind loss/RH loss/Wind and SSD loss/Wind and RH loss": this is difficult for readers.
3. "our findings reveal that wind speed played a more significant role in ETo estimation in most areas of China during the years before the 21st century." If possible, explain why.
4. "The ground-information-based ETo products can offer higher spatial-temporal resolution": this is odd. Ground-based estimates are at site scale.
5. Line 55: "recording method". What do you mean?
6. "under limited data" should be "using limited data".
7. "machine learning algorithms could achieve high-quality regression processes under sufficient information and simulate different types of data automatically": this exaggerates machine learning models.
8. "These methods are famous for their robustness and convenience in computation.": this exaggerates machine learning models.
9. "accuracy. ." should be "accuracy."
10. "large climate models": What does this mean?
11. "creat" should be "create"?
12. "complex daily ETo dataset for China": What does this mean?
13. "complex meteorological dataset": What does this mean?
14. " where ∆𝑀! is Ut rutrum, sapien et vulputate molestie, augue velit consectetur lectus, bibendum porta justo odio lobortis ligula.In in urna nec arcu iaculis accumsan nec et quam. Integer ut orci mollis, varius justo vitae, pellentesque leo. Ut.": Should these sentences be placed in another section?
15. "information source": strange term.
16. "This dataset provide" should be "This dataset provides".
17. Subheads 2.2.1 and 2.2.2 are the same.
18. Line 260-290: You do not need to introduce details of the machine learning algorithms, which are well known to readers.
19. "RMSE (Root Mean Squared Error)" should be "Root Mean Squared Error (RMSE)". The same problem applies to the definitions of other abbreviations.
20. "195-2021" should be "1951-2021".
21. The subheads "3.2 Simulate Results" and "3.3 Large error" are too simple.
Citation: https://doi.org/10.5194/essd-2024-229-CC1
Data sets
A high-quality gap-filled daily ETo dataset for China during 1951-2021 from synoptic stations Ning Shan Zhou et al. https://doi.org/10.5281/zenodo.11496932
Viewed
- HTML: 417
- PDF: 73
- XML: 99
- Total: 589
- BibTeX: 15
- EndNote: 16