Comment on essd-2021-61 Anonymous Referee # 2 Referee comment on " A Harmonized Global Land Evaporation Dataset from Reanalysis Products Covering 1980 – 2017

Lu et al., 2021 present a new merged global land evaporation dataset based on three existing products. While the paper is interesting for publication is this journal, my current recommendation is for a major revision for the following reasons. First, there is the issue of considering GLEAM as an observational dataset. GLEAM is really just another data source, and the authors even show that the individual models they are comparing outperform GLEAM in terms of R and RMSD compared to flux sites. I think the comparisons with GLEAM are okay, as long as authors specifically mention that it is not used for validation. My other major concerns are some inconsistencies I see in figures, that need to be reevaluated or at least better explained (see more detailed comments below). One example in Figure 7, high NDVI values are incorrectly linked solely to humidity conditions, and the authors do not discuss the potential saturation issues at high NDVI values often seen in remote sensing datasets. The authors omit some necessary details (i.e. the use and source of NDVI is never explained in methods). Lastly, there are some grammar errors which should be addressed.

ratio analysis (Duan and Bastiaanssen, 2013) have been widely used. As for land ET, several relevant studies have evolved through simple average to complex methods including the weighted average (Hobeichi et al., 2018), reproducing flux observations combined with the original land evaporation product (Yao et al., 2017a), or seeking consistency between land evaporation and water cycle related products such as precipitation, runoff and land water storage (Aires et al., 2014;Munier et al., 2014). Diversified data merging methods, such as Kalman filtering algorithm (Pipunic et al., 2008;Liu C et al., 2013), Bayesian Model Average (BMA) and Empirical Orthogonal Function (EOF), can improve regional ET estimation by 70 merging multiple ET products (Yao et al., 2014(Yao et al., , 2016Feng et al., 2016;Zhu et al., 2016). Since land ET is a complex variable coupling energy, hydrology and carbon budget, it is difficult to accurately determine the conditional density in BMA. In addition, EOF not only requires high computational cost, but also introduces bias due to the lack of distinction between good and bad quality Pixel in the refactoring scheme. Simultaneously, their complexity affects the efficiency of calculating the weight of individual data set, which limits their wide application. Simple average (SA) (Ershadi et al., 2014) 75 and simple Taylor skill's score (STS) merging (Yao et al., 2017b) have been adopted for global ET merging. However, SA assumes the same uncertainty in each data set, which is actually unreasonable, and STS is highly dependent on the accuracy of single data, which makes it highly quality-demanding for the data sets involved in the merging process. Khan et al. (2018) analyzed the sources of uncertainties for three different ET products using triple collocation (TC) method, which estimates the random error standard deviations for three datasets of the same variable according to statistical relations, and provides an 80 uncorrelated absolute and relative error structure among data sets. It has a strict requirement for the number of data sets participating in the merging process, which affects the flexibility of data merging greatly. Lately, Jimenez et al. (2018) proposed an error variance unweighted merging method with the local weights deduced according to the variance of the differences between the outliers of the flux tower and simulated land evaporation. The above merging methods effectively reduce the uncertainties of simulations by estimating the weight of multiple products to generate reliable merged products 85 (Zhu et al., 2016). However, previous studies have mostly focused on the evaluation of ET simulation at regional scales and landscape (Yang et al., 2016).
Compared to the simple method, Reliability Ensemble Average method (REA) extracts the most reliable information from each model by minimizing the impact of "outliers" or underperforming models, subsequently reducing the uncertainty range in simulated changes, which also stands out in terms of computational efficiency. These standards are regional rather than 90 global, as most models tend to show anomalous behavior or poor performance from one region to another. The REA method also produces a quantitative measure of reliability, which increases the overall reliability of simulation changes. On one hand, REA method considers model performance, that is, the ability of models reproducing current climate, which is defined by the difference between the simulation and observation. On the other hand, the model convergence, a factor to measure the reliability of the modes is taken into consideration as well. It is the distance between the changes of the given model and that 95 of REA average. REA has been widely used for meteorological variables such as precipitation and temperature. Giorgi & Mearns (2002)  under two emission scenarios simulated by 9 atmosphere-ocean circulation models in the late 21st century. However, studies on the application of land evaporation are few. This study aims to develop a long-term high-quality global land ET using merging technology. Merging multiple single data sets is expected to reduce the uncertainties of land ET effectively. The 100 merged product can provide a basis for water cycle research and global water resources management. Hence, the systematic and in-depth studies on land evaporation merging are urgently needed.

Data types
Four widely used land ET data sets were selected for merging, including Global Land Evaporation Amsterdam Model 105 (GLEAM; Miralles et al., 2011), the fifth-generation ECMWF Re-Analysis (ERA5; Hersbach et al., 2020), the second Modern-Era Retrospective analysis for Research and Applications (MERRA2; Gelaro et al., 2017), and Global Land Data Assimilation System ET (GLDAS; Sheffield & Wood, 2007). The differences in spatial and temporal resolution among the ET products were rescaled to a daily timescale and 0.25°, with the time span from 1980 to 2017. GLEAM was used as the reference data due to its independence from other data sets participating in the merging process. The data involved in 110 merging includes ERA5, MERRA2 and GLDAS. Eddy Covariance (EC) ET was used to evaluate the merged product compared with other data sets involved in the merging process. The spatial and temporal resolutions of these ET datasets are shown in Table 1, which is briefly described below.

115
GLEAM algorithm estimates land evaporation mainly based on the parameterized physical process, which uses extensive independent remote sensing observations as the basis for calculating land evaporation and its different components including transpiration, bare-soil evaporation, interception loss, open-water evaporation and sublimation separately (Priestley and Taylor, 1972). The empirical parameters contained in this algorithm have been obtained from the findings in different fields.
On a global scale, GLEAM has been validated with the observations obtained from the eddy covariance instrument, 120 indicating that it can be used to describe terrestrial ET in different ecosystems (Miralles et al., 2011). The version of the data set used in this study is 3.2a, which spans a 38-year period through 1980 to 2017, gridded with 0.25 degree. It is available from https://www.gleam.eu/.

The fifth-generation ECMWF Re-Analysis (ERA5) ET
Followed by ERA-15, ERA-40 and ERA-Interim, the fifth generation of ECMWF reanalysis data ERA5 has been released, 125 which is envisioned to replace ERA-Interim reanalysis (Hersbach et al., 2020). Compared with ERA-Interim, some of the key climatic information of the EAR5 has been improved. The most updated version of the Earth System Model and data assimilation techniques used at ECMWF have been applied in ERA5, including more sophisticated parametrization of geophysical processes in comparison to the previous versions used in ERA-Interim. ERA5 covers from 1979 to the near real time period (on a regular basis), moreover, temporal and spatial resolution have been improved in ERA5, from 6-hourly in 130 ERA-Interim to hourly, from 79 km to 31 km in the horizontal dimension and 60 to 137 in vertical levels. ERA5 has a better balance of global precipitation and evaporation (Albergel et al., 2018). It is available from https://www.ecmwf.int/en/forecasts/datasets/reanalysis-datasets/era5/.

The second Modern-Era Retrospective analysis for Research and Applications (MERRA2) ET
MERRA2 (Gelaro et al., 2017) is an advanced atmospheric reanalysis data set, which absorbs mass of satellite data, 135 including the new observation types such as hyperspectral radiation, microwave and aerosols. The MERRA2 is unique in modern reanalysis data set that contains aerosol data assimilation (Randles et al. 2017). It combines satellite and more traditional weather observations with simulated atmospheric behavior, making an attempt to get the optimal possible estimation of the Earth system state. MERRA2 is the second version of MERRA, which has undergone several major upgrades, including an observation-based precipitation bias correction . The land surface model used in 140 MERRA2 is the Catchment Land Surface Model (Koster et al., 2000), where land evaporation is calculated as part of the energy balance at the land surface. Hourly data with a 0.625× 0.5°spatial resolution are provided by the Goddard Earth

Global Land Data Assimilation System ET (GLDAS) ET
GLDAS is a global high-resolution land modeling System based on North American Land Data Assimilation System (NLDAS), which combines data assimilation technology to produce land surface state and fluxes in real time. In order to generate better land surface products in different LSM, GLDAS generates the optimal field of surface state and flux by 150 absorbing satellite and surface observation data, taking advantage of advanced land surface modeling and data assimilation technology (Rodell et al., 2004). Different LSMs were used including Mosaic, Noah, CLM, and VIC, where only Noah has continued until now. Recently, there is more and more evidence to show that GLDAS-1.0 has serious discontinuities due to forcing data such as large precipitation and temperature errors in 1996 and 2000-2005. Therefore, monthly land evaporation data of GLDAS2 combined with Noah LSM (GLDAS2-Noah) has been used in this study, whose

Eddy Covariance (EC) ET
Latent heat flux (LE) from 182 effective flux towers across the world is used to evaluate the performance of multiple data 165 sets with different estimates, which are available from http://fluxnet.fluxdata.org/. Eddy covariance is a widely used energy flux measurement method that provides continuous measurement of interchange of water and energy (Mu et al., 2011). The period of latent heat flux is 1 to 19 years, where 2 to three 3 are the most common, shown in Fig. 1. The observed land evaporation is calculated by latent heat flux in the following Eq. (1), where λ is the conversion factor with the fixed value of 2.45MJ kg -1 . The flux towers are located in ten land cover types

Methods
In this study, we investigated whether more accurate ET results could be obtained through the weighted combination of ERA5, GLDAS and MERRA2 estimation. Ideally, the weight assigned to each product should be based on an accurate description of the uncertainty during the merging process. Therefore, three data sets have been weighted based on GLEAM, and the performance of the merged ET products have been studied at the selected sites. A set of weights need to be defined 180 for weighted combination of three products, usually based on an individual uncertainty of each product. The simplest strategy used in previous studies is to assume that all three products have the same uncertainty, thus the merged product is a simple average of each product. A more detailed strategy used in this study is to weigh the product based on their uncertainties. The expected goal is to develop a product that minimizes Root Mean Square Deviation (RMSD) and maximizes correlation with tower ET, which we call optimal in the context of our merging strategy.

Variable Coefficient
The coefficient of variation (CV), also known as the relative standard deviation in probability theory and statistics, was used to evaluate the consistency of multiple sets of reanalysis ET data. It is a statistic to measure the degree of variation in the data. The consistency decreases with the increase of CV. It is calculated from the following equation: where S represents the standard deviation and x represents the average of multiple sets of reanalysis ET data in each pixel.

190
This approach is superior to standard deviation in evaluating consistency, which can eliminate the influence of different units and/or average on the degree of variation of two or more data. In this paper, multiple sets of reanalysis land evaporation data are blended into a single product based on their performance and convergence criteria.

Reliability Ensemble Averaging
Reliability Ensemble Averaging (REA) method (Giorgi and Mearns, 2002;Xu et al., 2010) were used to combine multiple 195 sets of reanalysis ET data into a single product. Two reliability criteria were considered in the method: model performance and model convergence, in other words, the model's performance in reproducing the current climate and the convergence of simulated values between models.
In our REA method, the average ET is given by a weighted average of all the ensemble members.
where A  represents the REA averaging and Ri represents the model reliability factor defined as: The merging is extended to pixels rather than just flux tower level, based on the selection of independent GLEAM as reference to compensate for the limited EC measured ET. RB,i and RD,i in Eq. (4) is a measure of the model performance and convergence criteria respectively. RB,i is a factor to measure the reliability of the model through the bias (BET,i) between the simulated ET and the reference, that is, the larger the bias the lower the reliability of the model. RD,i is a factor to measure the reliability in the aspect of the distance (DET,i) between the simulated ET and the ensemble average, that is, the higher the 205 distance the lower the reliability of the model. The parameters m and n in Eq. (4) are used to measure the relative importance of the two criteria. In this work, assuming the importance of the two criteria is equal, m and n were assigned with 1. However, if the two criteria are given different weights, they may be different. The parameter ϵ in Eq. (4) is a measure of natural variability in 38-yr ET. In order to calculate ϵ, we estimated moving averages after linearly detrending the 38-yr time series data in each pixel. Then, the 210 differences between maximum and minimum of the moving averages are computed as ϵ. In addition, when B and D is less than ϵ, RB and RD are set to 1 respectively. In essence, Eq. (4) indicates that the model is reliable when the bias and the distance from the ensemble average are within the limits of natural variability, where RB=RD=R=1. With the bias or distance growing, the reliability of a given model decreases.

215
The error metrics of Pearson correlation coefficient (R), root mean square deviation (RMSD) and unbiased root-mean-square deviation (ubRMSD) were used to verify the blending product. The statistic values are defined as follows: where n represents the sample size; Mi and refi respectively represents multiple sets of reanalysis ET data and reference data at time i. M and ref represent the average of Mi and refi.

220
The consistency of the three data sets has been illustrated in Fig. 2, where Fig. 2a-c shows the differences of land evaporation between each data set and ensemble mean. In the high latitudes of the northern hemisphere, GLDAS-Noah2 ET is more than 20% higher than ensemble mean, while ERA5 ET is almost the same with it, even MERRA2 ET more than 20% lower. In the middle latitudes, ERA5 ET is more than 20% higher than ensemble mean, while GLDAS-Noah2 ET is more than 20% lower than it. As for MERRA2 ET, there are a few areas higher than ensemble mean. In the western part of South

225
America and parts of Oceania in the southern hemisphere, ERA5 ET is higher than ensemble mean while GLDAS-Noah2 is lower than it, MERRA2 showing no significant difference. In these regions, the three data sets are significantly different, indicating that the estimation of land evaporation is of great uncertainty and low consistency. In order to reduce the risk of   The spatial distribution of weights has been depicted in Fig. 3, representing the contribution of each data set to the merged product. It is not difficult to find that the weights are within the scope of 0.  The latitudinal distribution of the multiyear mean land evaporation of five products is shown in Fig. 4a. General consistency in spatial pattern was shown in spite of differences in intensity. However, large differences appeared in the interannual 255 variation of these products (Fig. 4b), though the long-term trends generally showed good consistency. Anomalies for all data https://doi.org/10.5194/essd-2021-61 Open Access obvious differences between the probability density distributions of multiple data sets were clearly visible. In general, the consistency between the merged product and ERA5 ET was relatively better, which may be greatly related to the climatology of ERA5 used in the merging process. Due to the discrepancies in the driving data and calculation formulations 265 for land evaporation, climatology varies from data to data.

Figure 5. a-e) Scatter plots and f) Taylor diagrams of daily Ground-measure ET and ET from the different products. Linear fits are plotted in blue and the 1:1 line is depicted.
Figure 5a-e shows the scatter diagram of multiple data sets and station-observed data at daily scale. Relatively more points 270 were found to be concentrated above the 1:1 line, which makes it clear that land evaporation was somewhat overestimated.
Among the five data sets, the correlation coefficient between REA and station-observed data was the highest, reaching 0.72, followed by ERA5, other three data sets were basically the same. Therefore, the correlation coefficient was significantly improved through REA data merging, indicating more consistent changes between REA and station-observed data. Similarly, as for RMSD, REA was the smallest, only 0.91 mm. Therefore, the deviation between REA data merging product and station-observed data was significantly reduced compared with other data, indicating that the accuracy was greatly improved.
The Taylor chart in Fig. 5f describes the ratio of standard deviation, correlation coefficient and unbiased root-mean-square deviation between five data sets and station-observed data under daily scale, with the ratio of standard deviation representing the ratio between each data set and station-observed data. The ratio of standard deviation of REA was the smallest, nearly 1, indicating almost the same with station-observed data and the smallest fluctuation among all data sets considered. The 280 ubRMSD values of multiple data sets were slightly smaller than RMSD respectively, while the rank of them remains the same, specifically REA < ERA5 < GLEAM3.2a < GLDAS-Noah2 < MERRA2. The correlation coefficient between REA and station-observed data was the highest, meanwhile the ratio of standard deviation, RMSD and ubRMSD are the lowest, indicating that REA performed optimally under all assessment criteria.   Previous studies showed that there is a close relationship between the quality of land evaporation data sets and vegetation . The correlation coefficients between multiple data sets and station-observed data under different vegetation conditions (0.3 to 0.9) were compared in order to understand how the quality of these data sets changes with vegetation. The results showed that the quality of all data sets first increased and then decreased with the increase of 300 vegetation density, with the highest quality captured when NDVI was around 0.55. It is worth noting that all data sets were in a good quality range with a correlation of more than 0.6 when NDVI was between 0.4 and 0.7. When NDVI was greater than 0.7, the case of very humid conditions, the quality of each data set was relatively low and showed a rapid decline with the increase of vegetation density. The merged product demonstrated the ability to capture land evaporation dynamics in a wide range of vegetation densities, which performed best in all data sets when NDVI ranges from 0.34 to 0.75.

305
There are unique advantages and limitations of the existing land ET data sets for specific land cover types, however, quite few are globally suitable for meteorology and hydrology. In different climatic regions, the performance of land ET products varies from the model response. Feng et al. (2018) analyzed correlation between land ET estimated based on the Budyko hypothesis and reanalysis ET products, with results showing that great problems existed in MERRA2 when describing annual variation and long-term trend of land ET in China, mainly due to the higher variance amplitude of MERRA2 than that 310 of other reanalysis products. Further, great uncertainty has been captured in semiarid, semihumid and humid regions according to MERRA2. However, Dembele et al. (2020) found that MERRA2 was still one of the best data sets in estimating land ET in Volta River basin from 2003 to 2012 despite its low spatial resolution, which is probably due to the high temporal but low spatial resolution. In contrast, ERA5 ET behaved poorly in the region. Baik et al. (2018)  showed the highest correlation in farmland, grassland, and shrub. However, GLDAS2 and GLEAM ET were found overestimated in most climatic regions and land cover classifications. The difference between model input variables can effectively explain the difference between estimated ET (Yao et al., 2017b). Specifically for Amazon tropical forests, MERRA2 tends to overestimate daily net radiation and incident solar radiation, while GLDAS2 tends to overestimate daily 320 radiation and underestimate incident solar radiation (Gomis-Cebolla et al., 2019).
In our study, GLEAM showed high performance due to the information of soil moisture integrated in the calculation of land ET, which is expected to perform well in the hydrological simulation. Khan et al. (2020)     Oceania. The reduction trend in Amazon has been observed in all data sets, with MERRA2 showing the most significant and intense one. The decreasing trend in the Congo basin was detected in ERA5 and MERRA2, while an opposite one was observed in GLEAM3.2a. Burnett et al. (2020) found Congo basin becoming sunnier and less humid in recent years through the analysis of environmental data. In general, GLEAM is fairly close to land ET in tropical Africa (Schuttemeyer et al., 355 2007;Opoku-Duah et al., 2008;Andam-Akorful et al., 2015;Liu et al., 2016), while MERRA2 ET had the maximum temporal variability over Congo basin (Burnett et al., 2020;Crowhurst et al., 2020). The increase in the east of North America, west of Europe and south of Asia were detectable in all data sets. The increasing trend in the north of Oceania was also detected in GLEAM3.2a and MERRA2, but not in ERA5.
Varying degrees of uncertainties exist in models based on satellites according to their theories, structural assumption and 360 parameterization of the inputs. These limitations are mainly affected by changes in landscape, climatic and hydrological conditions (Xu et al., 2015). Changes in environmental conditions and extensive vegetation types in regional and global ET estimation can lead to great uncertainties in ET products (Yilmaz et al., 2012;Khan et al., 2018). Apart from a few ET products, at present, the validation and analysis have been rarely adopted to reduce uncertainties. A few researches have made some attempts to reduce the uncertainties in hydrological applications  (Zhu et al., 2016), so far no such performance has been achieved yet. An emerging new technology REA method, which has ability to combine different ET products (GLDAS, GLEAM, ERA5 and MERRA2) and is becoming increasingly successful to resolve the issue of hydrological uncertainties and holds greater significance for in-depth assessment.
The uncertainties during the merging of ET products are driven by various factors including input errors, scaling effect and merging algorithm. Input errors are derived from single ET product and EC ground measurements. EC ground measurement determines the accuracy of the merged ET products, as it is considered to be the "true" value used to calibrate single product, which persists an error of about 5-20%. They are usually found relatively accurate for ET acquisition (Foken et al., 2006).
The uncertainties from scaling effects are caused by the mismatch between the spatial resolution of the model and the tower footprint. The uncertainties of merging algorithms are caused by differential calculations of the weight of each product. In addition, there are other factors that lead to uncertainties. For example, uncertainties in remote sensing and meteorological 375 data may affect the calculation of modeled ET. With regard to the model estimation, some common modeling assumptions such as estimation of potential evaporation, and shared inputs such as surface radiation make ET estimates from models quite dependent, making the predicted errors correlated (Jimenez et al., 2017).
In general, reasonable information about hydrological variables on a global scale can be extracted from satellite-and reanalysis-based data sets, which is useful in regions lacking sufficient observations (Kim et al., 2018). For the performances 380 of reanalysis and satellite data are discrepant under different mechanisms, the merging method can be used to verify the complementary strategy well to reflect the strengths of both. However, the complex structure of these merging methods affects the efficiency of calculating the weight and limits their wider application. In contrast, the simplicity, computational efficiencies and reliable accuracy of REA method make it more preponderant when considering multi-source data sets (Giorgi & Mearns, 2002). On the basis of previous studies, this study focuses on the global performance of land ET merged 385 product. To reduce the complexity, we introduced REA method to improve land ET estimation and use it as benchmark by merging three reanalysis data sets produced by separate algorithms. This method considers not only the performance of single model, but also the convergence of models involved in the merging process. Compared with a single product, REA merged ET products were found to outperform with significantly reduced root mean square error (RMSE = 0.05~0.27mm per day on average) by. However, these merging methods only take into consideration the combination and relationship 390 between each product and the reference data set. Although they are statistically significant, data sets both participating in the merging process and used as reference have not been improved substantially. Therefore, the performance of REA method is highly dependent on the weight of individual data set, which are calibrated with reference data. Future research needs to determine the physical mechanism of the inherent error of each product and strengthen the global quantification of ET products by combining the surface residual energy balance and water balance method without using any reference data set 395 (Yao et al., 2017b;Baik et al., 2018).

Data availability
All data used in this study are freely available with the links given in section 2. A convenience copy of the merged global land evaporation product available at the time this paper was created has been registered with Zenodo and is available at https://doi.org/10.5281/zenodo.4595941 (Lu et al., 2021). Both daily and monthly dataset are provided.

Conclusion
In conclusion, we used CV as the indicator to select the merging regions with high data consistency, and the regions with low consistency were excluded from the merging scope, including the north of North America, west of South America, desert regions of mid-latitude Africa and Asia. We merged three land ET data sets, ERA5, GLDAS and MERRA2, respectively using REA method to generate a set of long sequence global daily ET data with a spatial resolution of 0.5 405 degree and a time span of 38 years. The quality of the merged product was found significantly improved when compared with the three individual data sets selected for merging. Both the correlation coefficient and RMSE results suggested that under different vegetation conditions, the quality of the merged product outperformed well among all data sets.
The spatial distribution of the merged product was found highly consistent with other four data sets, indicating that the product successfully captures the spatial difference of land ET effectively. In terms of variation trends of global land ET, 410 conclusions differ from numerous researches due to uncertainties of data used, and no agreement could be reached. Our merged product shows that there is a significant decreasing trend in Amazon plain in South America and Congo Basin in central Africa, and an increasing trend in the east of North America, west of Europe, south of Asia and north of Oceania.
This is the first time we have used REA method more precisely in land ET data merging by avoiding the likely errors to be generated by physical mechanism. 415 Based on model weighting, the method of combining information into a new data set reflects the uncertainty behind climate change predictions, which should be explored in depth. A simple and flexible framework is provided for the exploration according to REA method. As we expected, the error expectation was non-stationary. The weight of each day can be estimated by running a time window centered on the day for making the weight change over time. Shorter time windows produce more dynamic weights, nonetheless, the values may be noisier due to fewer samples available to estimate variability 420 in the time series. Under the framework, cross-variable merging can be realized, that is, multiple related variables can be included for weight calculation (Xu et al., 2010). In future studies, the quality of merged products could be improved by adopting an appropriate time window to calculate the dynamic weight and considering more relevant variables to further reduce the uncertainty of the merged product.