A benchmark dataset for half-hourly evapotranspiration estimation in China from 2000 to 2024
Abstract. Latent heat flux (LE) provides a direct representation of terrestrial evapotranspiration (ET) and plays a critical role in hydrological cycle studies, land surface model development, and the evaluation of remotely sensed evapotranspiration products. Although flux observations based on the eddy covariance technique are widely regarded as essential benchmark data for evapotranspiration estimation, existing ChinaFlux observations are generally limited by short observation periods and extensive data gaps, which substantially constrain their applicability in long-term change analyses and multi-scale studies. To address these limitations, we developed a gap-filling and temporal prolongation framework specifically designed for half-hourly LE and established a continuous ground-based benchmark dataset covering China for the period 2000–2024 based on observations from 50 ChinaFlux sites. The framework is built upon an automated machine learning approach (AutoML-H2O) and integrates ERA5-Land reanalysis data with MODIS vegetation indices, enabling accurate gap-filling within observation periods and reliable prolongation beyond observation intervals. Comprehensive evaluations demonstrate that the AutoML framework achieves high accuracy at the half-hourly scale across different gap-length scenarios, with an overall correlation coefficient (CC) of 0.862 and a root mean square error (RMSE) of 33.75 W m-2, and it substantially outperforms conventional methods under long-gap conditions of 7 d and 30 d. The forward and backward prolongation results show high consistency (CC values of 0.902 and 0.896, respectively) and exhibit robust temporal stability under varying training data lengths. Multi-timescale validations further indicate that the prolonged LE data reasonably reproduce diurnal variations, seasonal cycles, and interannual variability from half-hourly to daily and monthly scales. 
Comparisons with ChinaFlux observations under strict quality control reveal good consistency across different temporal scales, underlying surface types, and climate zones. SHAP-based interpretability analysis indicates that energy supply consistently dominates LE variability, while vegetation state and water availability modulate their relative importance under different environmental conditions. Overall, we present the first continuous half-hourly ground-based LE benchmark dataset covering China for the period 2000–2024. This dataset provides essential data support for the evaluation of remotely sensed ET products, land surface model validation, and studies of regional water–energy cycles and climate change, and it is freely available via the following repository: https://doi.org/10.5281/zenodo.18194590 (Qian et al., 2026).
General comments:
I thank the authors for their valuable contribution to the collection of in-situ latent heat flux data from Chinese ecosystems across the major climate zones.
The manuscript describes a continuous dataset based on half-hourly ET data from eddy covariance measurements complemented by site-specific ML approaches to fill data gaps and extend the time series beyond the observational periods.
The article is appropriate to support the publication of the dataset, but it could be shortened in some parts, as indicated by the examples below.
The dataset is unique and useful in terms of its duration and its treatment of Chinese ET flux data. Public availability of such a dataset in an open-access repository is relevant, even though the gap-filled and prolonged dataset could be reproduced if the observed data and the software code were available.
The daily, monthly and yearly aggregates could be omitted, as users could easily derive those time series themselves. If those files are published, I’d recommend changing the naming of the individual files within each archive: each file name should contain the time resolution, as is already done for the compressed archives.
Within a framework such as the one established for this manuscript, I would expect more information on the uncertainty of the final flux products, which could, for example, be derived from random repetitions of the procedures.
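To illustrate how little is needed to derive the aggregates, and what a resolution-bearing file name could look like, here is a minimal sketch. The site ID, the naming pattern and the synthetic values are my own assumptions, not taken from the published archives.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import fmean

# strftime patterns that define the grouping key for each resolution
KEY_FORMAT = {"daily": "%Y-%m-%d", "monthly": "%Y-%m", "yearly": "%Y"}

def aggregate(records, resolution):
    """Average half-hourly LE values (W m-2) over each day, month or year.

    records: iterable of (datetime, LE) pairs from a gap-free series.
    """
    groups = defaultdict(list)
    for timestamp, le in records:
        groups[timestamp.strftime(KEY_FORMAT[resolution])].append(le)
    return {key: fmean(values) for key, values in sorted(groups.items())}

def output_name(site_id, resolution):
    """Hypothetical naming scheme that carries the time resolution."""
    return f"{site_id}_LE_{resolution}.csv"

# Two synthetic days of half-hourly data (illustrative values only)
start = datetime(2020, 1, 1)
records = [(start + timedelta(minutes=30 * i), 100.0 if i < 48 else 50.0)
           for i in range(96)]
daily = aggregate(records, "daily")
```

With such a scheme, a file extracted from its archive would still identify its own resolution.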
Data quality:
The 25-year ET time series is of good quality, and the MODIS and ERA5 input data for the variables that drive ET seem reasonable.
Specific comments:
What about uncertainty estimates of the gap-filled ET fluxes e.g. based on random repetitions of the procedures?
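As an illustration of what I mean by random repetitions, the sketch below refits a model on bootstrap resamples of the training data and reports the per-step spread of the predictions. The one-variable linear fit is only a toy stand-in for the authors' AutoML models, and all names are hypothetical. Plain resampling also ignores the autocorrelation of half-hourly data, so a block bootstrap may be more appropriate in practice.

```python
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (toy stand-in for the real model)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def ensemble_predict(train_x, train_y, gap_x, n_repeats=50, seed=0):
    """Refit on bootstrap resamples; return per-step mean and standard deviation."""
    rng = random.Random(seed)
    n = len(train_x)
    runs = []
    for _ in range(n_repeats):
        # Draw n training samples with replacement
        idx = [rng.randrange(n) for _ in range(n)]
        a, b = fit_line([train_x[i] for i in idx], [train_y[i] for i in idx])
        runs.append([a * x + b for x in gap_x])
    # zip(*runs) groups the predictions of all repetitions by time step
    mean = [statistics.fmean(col) for col in zip(*runs)]
    spread = [statistics.stdev(col) for col in zip(*runs)]
    return mean, spread
```

The spread returned for each filled half-hour could be published alongside the LE value as a simple uncertainty estimate.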
Several references in the text are missing, even though listed in the ‘References’ section. Please check.
Naming of the model: AutoML-H2O or H2O AutoML or only AutoML? Please use the abbreviation consistently.
Specific remarks related to the text:
Line 26: ‘…conventional methods for long-gap conditions of 7 or 30 days, respectively.’? You might reformulate the sentence as those gap conditions are artificially introduced. Under normal conditions, data gaps vary in duration.
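To make the distinction concrete, the sketch below inserts one artificial gap of fixed length into a half-hourly series and, separately, measures the lengths of the gaps actually present in a series. Function names are hypothetical; missing values are represented as None.

```python
import random

STEPS_PER_DAY = 48  # half-hourly resolution

def insert_gap(series, gap_days, rng):
    """Mask one contiguous block of gap_days days at a random position."""
    n_gap = gap_days * STEPS_PER_DAY
    start = rng.randrange(0, len(series) - n_gap + 1)
    masked = list(series)
    for i in range(start, start + n_gap):
        masked[i] = None
    return masked, start

def gap_lengths(series):
    """Lengths (in time steps) of all runs of missing values."""
    lengths, run = [], 0
    for value in series:
        if value is None:
            run += 1
        elif run:
            lengths.append(run)
            run = 0
    if run:
        lengths.append(run)
    return lengths
```

Comparing the output of `gap_lengths` for the observed records with the fixed 7 d and 30 d scenarios would show how representative the artificial gaps are.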
Line 30: does the ‘strict quality control’ relate to the EC data from measurements or to the modelled ET data?
Line 65: ‘EC observations provide half-hourly measurements of latent heat flux (LE) in combination with its driving variables,…’. (EC itself delivers only fluxes)
Line 69: ‘Taking the FLUXNET2015 dataset as an example,…’
Lines 85ff: You claim that Chinese flux sites are underrepresented in integrative flux analyses. Might this also be due to the fact that data from ChinaFlux are usually not accessible to non-Chinese researchers? Data sent to the FLUXNET portal should be accessible via the FLUXNET Shuttle (https://data.fluxnet.org/). Also see Papale, D.: Ideas and perspectives: enhancing the impact of the FLUXNET network of eddy covariance sites, Biogeosciences, 17, 5587–5598, https://doi.org/10.5194/bg-17-5587-2020, 2020.
Line 131 and later line 160: is the quality control related to EC data? What are the site selection criteria?
Lines 158ff: references for the processing and quality assurance steps should be added.
Lines 160 and 161ff: references and details for ChinaFlux and FLUXNET procedures as well as previous studies should be added.
Line 178: did you use the data from the 10 sites with pre-gap-filled time series in the same way as the data from the other 40 sites, i.e. with only artificial gaps introduced? Are the gap-filled data of those sites marked as such? If these gap-filled data are used for training, ‘no new information is generated’.
Lines 182-183: ‘The half-hourly LE data form the foundation for subsequent gap-filling,…’ is that what you want to say? If yes, please re-formulate accordingly
Line 193: citation for the ERA5 data is missing in the text (e.g.: Copernicus Climate Change Service (2022): ERA5-Land hourly data from 1950 to present. Copernicus Climate Change Service (C3S) Climate Data Store (CDS). DOI: 10.24381/cds.e2161bac (Accessed on 11‑11‑2025)).
Line 203: I am not familiar with the ERA5-Land data, but is Rn really net solar radiation, which would be related to albedo? Rn is commonly used for net radiation, including the longwave components. Later in the text (line 571, line 577), Rn is used for net radiation.
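For reference, the usual surface radiation balance keeps the two quantities apart. The symbols below are mine and not the manuscript's: \alpha is the surface albedo, S_\downarrow the incoming shortwave radiation, and L_\downarrow and L_\uparrow the incoming and outgoing longwave radiation.

```latex
R_n = \underbrace{(1 - \alpha)\, S_\downarrow}_{\text{net solar radiation}}
      + \underbrace{L_\downarrow - L_\uparrow}_{\text{net longwave radiation}}
```

If the variable at line 203 covers only the first term, a different symbol from the Rn used at lines 571 and 577 would avoid confusion.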
Line 213: reference for ‘official documentation’?
Lines 210ff: citation for MODIS products
Line 216: what is the temporal resolution of MODIS products (‘relatively low’)? And is the assumption of constant NDVI and LAI ‘within each compositing period’ sufficient for fast growing plants, disturbances etc.?
Line 228: re-formulate: ‘Each flux tower site was treated…’ – it is not the tower that is of relevance.
Line 230: How was quality screening performed? According to standard FLUXNET and ChinaFlux procedures? See above
Lines 240ff: was the modelling tool adopted from somewhere else (reference!) and what is the contribution of the authors?
Line 249: Which models were selected by AutoML? Very different models? How many different models?
Lines 280ff: ERA5 data are used instead of onsite measured meteorological variables. How do those compare with locally measured data? What about the additional uncertainty?
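A comparison of this kind could be as simple as the following sketch, which computes bias, RMSE and correlation for paired reanalysis and tower values of one variable. The function name and input layout are assumptions for illustration; missing values are represented as None.

```python
import math
from statistics import fmean

def compare(reanalysis, observed):
    """Return bias, RMSE and Pearson correlation for paired values.

    Pairs in which either value is None (missing) are skipped.
    """
    pairs = [(r, o) for r, o in zip(reanalysis, observed)
             if r is not None and o is not None]
    r_vals = [r for r, _ in pairs]
    o_vals = [o for _, o in pairs]
    diffs = [r - o for r, o in pairs]
    bias = fmean(diffs)
    rmse = math.sqrt(fmean([d * d for d in diffs]))
    r_mean, o_mean = fmean(r_vals), fmean(o_vals)
    cov = sum((r - r_mean) * (o - o_mean) for r, o in pairs)
    var_r = sum((r - r_mean) ** 2 for r in r_vals)
    var_o = sum((o - o_mean) ** 2 for o in o_vals)
    cc = cov / math.sqrt(var_r * var_o)
    return {"bias": bias, "rmse": rmse, "cc": cc}
```

Reporting these statistics per site and per driver would indicate how much additional uncertainty the reanalysis inputs introduce.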
Line 316: the repetitive sentence can be removed here (These prolonged datasets provide the basis for subsequent construction and analysis of multi-temporal-scale LE products.)
Lines 332-333: does that mean that if only one half-hourly value were missing within a 1 d aggregate, the 1 d value would get the flag F? Hardly any ET time series from EC measurements is complete, owing to unfavourable atmospheric conditions. As a result, any daily (or 7-day or monthly) aggregate is based on a mixture of measured and gap-filled data. How are the aggregates flagged then?
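One option would be to report the share of each half-hourly flag within an aggregate and to derive the aggregate flag from it. The sketch below assumes the flags T, F and P (true, filled, prolonged) used in the manuscript; the 0.9 threshold and the decision rule are purely my own assumptions.

```python
from collections import Counter

def flag_shares(flags):
    """Fraction of half-hourly values carrying each flag (T, F or P)."""
    counts = Counter(flags)
    return {key: counts.get(key, 0) / len(flags) for key in "TFP"}

def aggregate_flag(flags, min_measured=0.9):
    """Hypothetical rule: P if fully prolonged, T if mostly measured, else F."""
    shares = flag_shares(flags)
    if shares["P"] == 1.0:
        return "P"
    return "T" if shares["T"] >= min_measured else "F"
```

Publishing the shares themselves next to each aggregate would let users apply their own thresholds.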
Line 432: suggestion: ‘…the 6-year training scenario represents more typical conditions and weather regimes at most sites.’
Chapters 3.3.2 and 3.3.3: this analysis and the accompanying figures seem redundant and add very little information for a data paper, as the monthly values are just aggregated from the daily values, which contain gap-filled data. In addition, figure 11 again shows some typical sites, but with very different time periods, varying from only 2 years to 10 years. Why were examples with such different periods chosen if they are compared to check for seasonal variation?
Line 515: what exactly is ‘official ChinaFlux observations’? Please add reference!
Line 547: what is meant by ‘stratification’ in this context?
Lines 560ff: Especially for desert and shrubland, evaporation becomes more dominant compared to transpiration. So it is clear that vegetation-related variables explain less variability. This point might be considered here as well.
Lines 577ff: same as above, more soil is exposed, so more evaporation compared to dense canopy covers.
Figures: Readers should be able to interpret figures with the figure description. Most figures need more descriptive text.
Fig. 1b): more description needed.
Fig. 1d) length of observation periods (in years) for all sites
Figure 2: a bit overwhelming, but a good overview still. I don’t see QA/QC for EC-data (which contributes to gaps). The figures for comparison with MDS and ML methods and also the ones for performance do not add any value due to their small size. The text is not readable and legends are missing. Even if the content becomes clear for the results in the lower right corner, instead of T, F, and P for true, filled and prolonged you might use additional colour indication as T, F and P are not in the figures anyway.
Fig. 5 and 6: what is the measure for the significant bias marked by the blue boxes?
Fig. 7, a) and b): the y-axis needs a label. More explanatory text is needed in the figure description, e.g. for ‘Relative density’.
Fig. 8a): from the figure description it is not clear whether the bars or the lines relate to the left or right y-axis. Please provide more information in the figure description.
Fig. 9: might be removed or moved to the appendix. Instead give more details about statistics in the text in chap. 3.3.1
Fig. 10 and 11: same as for Fig. 9, as daily and monthly values are just aggregates of half-hourly values when less than 10% of the data are missing.
Fig. 12: ‘l’ is missing in word ‘Daily’ in c) and d)
Table A1: make sure that words in the table header are not wrapped (looks ugly)
Line 669: What is meant by ‘..that this station provides only interpolated data.’? Are these the same 10 sites with gap-filled data mentioned above? Are those data treated as measured data?