A global long-term (1981-2000) land surface temperature product for NOAA AVHRR

, Land Surface Temperature (LST) plays an important role in the research of climate change and various land surface processes. Before 2000, global LST products with relatively high temporal and spatial resolutions are scarce, despite of a 10 variety of operational satellite LST products. In this study, a global 0.05°×0.05° historical LST product is generated from NOAA AVHRR data (1981-2000), which includes three data layers: (1) instantaneous LST, a product generated by integrating several Split-Window Algorithms with a Random Forest (RF-SWA); (2) orbital drift corrected (ODC) LST, a drift corrected version of RF-SWA LST; (3) monthly averages of ODC LST. For an assumed maximum uncertainty in emissivity and column water vapour content of 0.04 and 1.0 g/cm 2 , respectively and evaluated against the simulation data set, the RF-SWA method 15 has a Mean Bias Error (MBE) of less than 0.10 K and a Standard Deviation (STD) of 1.10 K. To compensate the influence of orbital drift on LST, the retrieved RF-SWA LST was normalized with an improved ODC method. The RF-SWA LST were validated with in-situ LST from Surface Radiation Budget (SURFRAD) sites and water temperatures obtained from the National Data Buoy Center (NDBC). Against the in-situ LST, the RF-SWA LST has a MBE of 0.03 K with a range of -1.59 K – 2.71 K and STD is 1.18

for correcting AHVRR LST orbital drift, which fits a DTC model to component temperatures of neighbourhood pixels and was reported to achieve good accuracy. 100 As a part of the Global LAnd Surface Satellite (GLASS) product suite , the objective of this study is to develop a long-term global LST product (1981LST product ( -2000 from historical NOAA AVHRR datasets. Section 2 describes the simulation datasets used for developing consistent SWAs, the input datasets for LST product generation, and the in-situ datasets for validating the retrieved LST. In section 3, a practical approach for generating a single optimized LST product is proposed, which integrates several well-established SWAs through the Random Forest, which is termed RF-SWA. Finally, the retrieved 105 RF-SWA LST is normalized with an improved orbital drift correction method. Furthermore, emissivity estimation for bare soil is improved by using ASTER Global Emissivity Dataset (GED) and yields more accurate estimates of land surface emissivity in section 3.3. Section 4 describes the results and provides implementation details of the LST retrieval method, LST validation, and give an example of the LST product. Data availability shows in section 5. Conclusions and outlooks are provided in section 6. 110

AVHRR datasets
The advanced very-high-resolution radiometer (AVHRR) is a sensor onboard NOAA polar-orbiting satellite series. The orbital period is 101.4 minutes and designed over-pass time at the equator is between 13:30 and 14:30 (solar time) depending on the 115 satellite. The second AVHRR version (AVHRR/2) has five spectral channels, including a visible band (0.55-0.68 μm), a nearinfrared band (0.75-1.1 μm), a middle-infrared band (3.55-3.93 μm), and two thermal bands (10.5-11.3 μm and 11.5-12.5μm). Figure 1 shows the spectral responses of the two AVHRR thermal channels of NOAA-07/09/11/14. Nadir spatial resolution of the TIR channels is 1.1 km×1.1 km, and scan angles range between -55° and 55°. AVHRR covers the Earth's surface twice daily and has been widely used to generate various local or global land/sea surface parameters, e.g. the normalized difference 120 vegetation index (NDVI) and SST (Casey et al., 2010). In this study, the AVHRR datasets from Long-Term Datasets Records (Pedelty et al., 2007) (LTDR, https://ltdr.modaps.eosdis.nasa.gov/) are used, including AVH02C1 and AVH13C1, for which spatial resolution has been processed to 0.05°×0.05° (Table 1). Those two datasets include the top-of-atmosphere BT of the TIR channels, NDVI, view zenith angle (VZA), view time, and quality control (QC) flags, which provide a reference for distinguishing pure and cloudy pixels. 125

ASTER Global Emissivity Dataset (GED)
ASTER onboard the Terra satellite and has five TIR channels (Fig. 1). The ASTER GED used in this study was generated from clear sky ASTER TIR data between 2000 and 2008 with the TES algorithm and the water vapour scaling atmosphere correction method (Hulley et al., 2015). The products are output at 3″ (~100 m) and 30″ (~1 km) spatial resolution on 1°×1° tiles. Channel temporal mean emissivity, LST, and NDVI, as well as their standard deviation, global DEM, and land-sea mask, are part of the 130 GED. In this study, the ASTER GEDv3 with a 1-km spatial resolution was used to determine the global background emissivity of bare land.

Atmospheric profiles and forward simulation datasets
Global forward simulation datasets with good representativeness are necessary for developing and evaluating LST retrieval algorithms. This requires a reliable atmospheric profile dataset as input. In this study, the well-established SeeBor V5.0 (Borbas 135 et al., 2005) and TIGR2000 V1.2 (Chedin et al., 1985) atmospheric profiles were used to construct the forward simulation datasets. Zhou et al. (2019b) derived a global atmospheric profile dataset (GAPD) by screening the SeeBor V5.0 atmospheric profiles and removing cloud-contaminated and redundant profiles. The GAPD dataset has been used for developing the LST retrieval algorithm for NOAA-20/VIIRS and Sentinel-3/SLSTR (Liu et al., 2019b;Yang et al., 2020): it contains 549 global profiles with a column water vapour content (CWVC) range of 0.014 -7.939 g/cm 2 and near-surface air temperature (NSAT) 140 range of 224.25 K -309.05 K. Here, the GAPD was used to generate a training dataset (TRA-G): globally representative observation conditions were simulated by varying the viewing geometry and land surface characteristics over a realistic range for a limited profile dataset (Zhou et al,2019b), i.e. for each profile 10 surface temperatures (Ts), 15 view zenith angles (VZA), and 48 land surface emissivities (LSE, ε) were set. Specifically, Ts was set relative to NSAT with the difference (Ts-NSAT) covering the range of -16 K to +20 K at an interval of 4 K; VZA was set to values from 0° to 70° at an interval of 5°; emissivity 145 was obtained from Johns Hopkins University (JHU) spectral emissivity library by convolving the emissivity spectra with the spectral response functions of NOAA-07/09/11/14 AVHRR (Fig. 1); the corresponding emissivity ranges are provided in Table   2. For the remaining 4761 SeeBor clear-sky profiles (ATP-S) and 506 TIGR clear-sky profiles (ATP-T), the corresponding simulations were performed and used as evaluation datasets VAL-S and VAL-T, respectively. In contrast to GAPD, for each profile in ATP-S (ATP-T), we randomly set 10 (10) VZAs between 0° and 70°. The corresponding LSE has been assigned 150 according to the LCT over which a profile is located (Snyder et al., 1998) and Ts for VAL-S and VAL-T was set to the corresponding NSAT. Table 3 summarizes the three profile datasets and the corresponding simulation datasets. More details can be found in Zhou et al. (2019b) and Yang et al. (2020).

Ancillary data used for LST retrieval
Four ancillary datasets were used for LST retrieval: NSAT, CWVC, LCT, and soil type. The MERRA-2 reanalysis dataset 155 (M2T1NXINT) provides NSAT and CWVC (variables in datasets: T2M and TQV, respectively) with 0.5°×0.625° spatial resolution and hourly temporal resolution; nearest neighbour sampling was used to match up with AVHRR pixel and over-pass time. AVHRR LCTs were obtained from the University of Maryland (UMD) dataset (Defries and Hansen, 2010), which provides 14 LCTs (0:Water; 1:Evergreen Needleleaf Forest; 2:Evergreen Broadleaf Forest; 3:Deciduous Needleleaf Forest; 4:Deciduous Broadleaf Forest; 5:Mixed Forest; 6:Woodland; 7:Wooded Grassland; 8:Closed Shrubland; 9:Open Shrubland; 160 10:Grassland; 11:Cropland; 12:Bare Ground; 13:Urban and Built). The spatial resolution of the UMD LCT dataset is 1 km × 1 km. To adapt its resolution of AVHRR, the dominant LCT within each 0.05° grid was used as the LCT for AVHRR. The soil type dataset employed for estimating AVHRR LSE is provided by the United States Department of Agriculture, which is mainly based on the world soil map of FAO-UNESCO. Its spatial resolution is 2′ (~ 0.03°) and the soil type of each AVHRR pixel was also set to the dominant type. 165

In-situ datasets
In-situ measurements from the Surface Radiation Budget (SURFRAD) network and the National Data Buoy Center (NDBC) were used to validate the retrieved AVHRR LST. The details and geographical distribution of the selected in-situ sites are provided in Table 4 and Fig. 2. SURFRAD was established in 1993 and focuses on validating Earth's radiation budget. Quality control is an integral part of the design and operation of the SURFRAD network, which results in datasets of high quality and 170 well-defined measurement uncertainties (https://www.esrl.noaa.gov/gmd/grad/surfrad/). Therefore, SURFRAD data have been widely used for validating satellite-retrieved LST products (Guillevic et al., 2014;Liu et al., 2019b;Martin et al., 2019;Wang and Liang, 2009) and other fields. Six sites providing in-situ data between 1995 and 2000 were selected. At these sites, upwelling and downwelling longwave radiances are measured with highly accurate Eppley Precision Infrared Radiometers (PIR; wavelength: 4-50 μm) at an observation interval of 3 minutes. The PIRs were set up ~10 m above the ground, giving 175 them a field of view (FOV) covering approximately 70×70 m 2 (Guillevic et al., 2014). Historical data from the NDBC (https://www.ndbc.noaa.gov/historical_data.shtml) provide hourly samples of bulk water temperature, it is measured with electronic thermistors and highly accurate and quality controlled by NDBC (https://www.ndbc.noaa.gov/qc.shtml).
Considering the thermal homogeneity of the water surface, buoy temperatures are usually representative of the satellite pixel scale, even if it covers large areas. To avoid mixed land-water pixels, only buoys at least 20 km from the coastline were selected. 180

Methodology
LST retrieval algorithm from TIR remote sensing, especially with SWAs, is a well established and validated method. However, no single algorithm performs best under all conditions, even if it generally achieves good accuracy (Yu et al., 2009). This suggests that a more stable and robust LST retrieval algorithm may be obtained by integrating various individual LST retrieval algorithms. In this study, the Random Forest (RF) ensemble method (Breiman, 2001) was utilized for integrating multi-LSTs 185 (mLSTs) obtained with several-SWAs into a global AVHRR LST product. First, widely used candidate SWAs were trained and evaluated; these SWAs have been studied in previous work (Yang et al., 2020;Zhou et al., 2019b), and are shown in Table 5 for readers' convenience. Second, estimates of land surface emissivity were improved by combining the NDVI threshold method and ASTER GED. Third, the LSTs from the trained candidate SWAs were integrated with the RF method: thus, the approach is termed RF-SWA. Then, the instantaneous RF-SWA LST was normalized to 14:30 (solar time) using an improved 190 orbital drift correction (ODC) method and the RF-SWA LST and ODC LST products were validated against in-situ LST. Finally, for the user's convenience of users, a monthly averaged LST was also generated from the ODC LST with a sample averaging procedure.
Each forward simulation yields channel-specific top-of-atmosphere radiances and BTs in dependence of NSAT, CWVC, and VZA. To simulate BTs measured by satellites more realistically, Gaussian-distributed noise with a noise equivalent differential temperature (NE∆T) of 0.12 K, of which is the design goals for the AVHRR TIR channels, was added to the simulated BTs. 200 More details on the simulations are provided in Zhou et al. (2019b).
Multiple regression was performed on the simulated training datasets, TRA-G, to determine the coefficients of the candidate SWAs in Table 5. The TRA-G dataset was divided into 480 groups based on NSAT, CWVC, VZA, and Ts-NSAT as follows: (i) the atmospheres were divided into Cold-ATM and Warm-ATM with a NSAT threshold of 280 K; (ii) the data were divided into CWVC classes with an interval of 0.5 g/cm 2 . This resulted in 3 subgroups of Cold-ATM and 13 subgroups of 205 Warm-ATM; (iii) the VZAs were divided into intervals of 5°; (iv) based on Ts-NSAT, the data were divided into two subgroups e.g. [-16,4] K and [-4, 20] K, approximately representing daytime and nighttime cases, respectively. Based on regression against these training datasets, look-up tables (LUT) with coefficients for each candidate SWA were established. The candidate algorithms were then analyzed w.r.t. the standard error of the estimate (SEE) and coefficient of determination (R 2 ) and a sensitivity analysis was performed for the main input parameters, e.g. LSE and CWVC, to test the stability and accuracy of 210 the trained SWAs. Being consistent with the uncertainty level in Zhou et al. (2019b), the various uncertainty sources were grouped into 2 levels: (i) L1: |δε11|max≤ 0.02, |δε12|max≤ 0.02, and |δCWVC |max≤1.0 g/cm 2 ; (ii) L2: |δε11|max≤ 0.04, |δε12|max≤ 0.04, and |δCWVC |max≤1.0 g/cm 2 . These uncertainties will be added to ε11, ε12, and CWVC as random noises. Datasets without added uncertainty were labelled as L0. All trained candidate algorithms were evaluated against the simulation datasets VAL-S and VAL-T. 215

Multi-LSTs ensemble
Based on the training and evaluation results (see section 4.1), a multi-LST ensemble method is proposed, which hopefully achieves a more stable retrieval by integrating the most robust and stable SWAs. The method used to integrate the selected SWAs is the Random Forest (RF) method proposed by Breiman (2001). Compared to detailed analytic expressions for explaining complicated nonlinear relationships, the RF method has several advantages, including the ability to process large 220 databases with high efficiency, unbiased estimation, and especially minimizing the risk of overfitting (Hutengs and Vohland, 2016). Therefore, the RF method has been widely used in remote sensing applications, e.g. land cover classification (Rodriguez-Galiano et al., 2012), land surface parameter downscaling (Zhao et al., 2018), and estimating vegetation cover parameters (Mutanga et al., 2012).
The RF method utilizes an ensemble of many decision trees. In the implementation of the RF ensemble method, a random 225 vector Θk is selected from the input training datasets (mLSTs, LST) with the Bootstrap sampling method. Here, k is the number of samplings; mLSTs are the LSTs retrieved with the individual SWAs, i.e. the predictors; LST, i.e. the target variable is known from the forward simulations. The sample size of each sampling is two-thirds of the observations; for each sampling, a tree is grown using the training set and Θk, which results in a tree predictor T(Θk). Finally, the LST predicted with the RF is formed by averaging over the k trees (Eq. 1), 230 (1) Along with the predicted LST, the importance of each variable can be calculated using the residual sum of squares (RSS), which usually has larger values for more influential mLSTs. Additionally, the simple average (SA) method and Bayesian Model Averaging (BMA) method are implemented for comparison. To cover the real natural variability as much as possible, datasets TRA-G (L0, L1, and L2), VAL-S (L0), and VAL-T (L0) are used as training datasets for the LST ensemble model. The 235 remaining datasets VAL-S and VAL-T at uncertainty levels L1 and L2 are used for evaluating the ensemble model. For the later generation of global LST products, only mLSTs from the selected SWAs are needed.

Estimating LSE
LSE is a key parameter in retrieving LST from TIR remote sensing data. Depending on the spectral channels of the sensor and the available temporal sampling, there are various LSE estimation methods, e.g. the NDVI-threshold method (Sobrino et al., 240 2008), land cover-based (LC-based) method (Snyder et al., 1998;Wan, 2014), TES method (Gillespie et al., 1998), day-night method (Becker and Li, 1990b), and Kalman filter method (Li et al., 2013b;Masiello et al., 2015). The NDVI-threshold and LC-based methods are widely used in retrieving LST (Sobrino et al., 2008;Wan, 2014). However, those methods require that the emissivity of the land cover or the vegetation and bare (background) soil is known. In the TIR, emissivity spectra of dense vegetation are relatively similar and, therefore, can be taken from spectral libraries; these spectra can then be convolved with 245 the sensor's spectral response functions to obtain channel effective emissivities. In contrast, the emissivity of bare soil varies considerably, mainly due to variations of its components, roughness, water content, and surface structure. Therefore, this study employs a practical and robust method that combines the ASTER GED and the NDVI-threshold method to determine LSE.
First, the land surface is classified into pure bare soil, mixture of vegetation and bare soil, and pure vegetation. The emissivity in mixed areas (ɛλ) is obtained as the weighted sum of vegetation emissivity (ελ,v) and bare soil emissivity (ελ,s), 250 where the fraction of vegetation cover (fv) determines the weights (Carlson and Ripley, 1997;Hulley et al., 2015): here, the fv can be calculated as Eq. (3), where NDVImax and NDVImin are the thresholds for separating into vegetation areas, mixed areas, and bare soil areas. In order 255 to obtain globally consistent fv values, NDVImax and NDVImin were set to 0.5 and 0.2 (Sobrino et al., 2001), respectively.
According to Eqs. (2) and (3), the ASTER thermal channel emissivity of bare soil can be calculated as, where AST , ,v AST and ,s AST are the ASTER emissivity for the observation, dense vegetation, and bare soil in channel j (j=10,…,14), respectively. The ASTER thermal channel emissivities for dense vegetation is given in Meng et al. (2016). 260 In order to convert bare soil emissivities from ASTER spectral channels to AVHRR spectral channels, the following linear relationship is fitted to channel emissivities obtained from the JHU bare soil spectral library (Salisbury, 1991): ,s = 0 + 1 10,s AST + 2 11,s AST + 3 12,s AST + 4 13,s AST + 5 14,s AST (5) where ,s AVH (i=11, 12) is the AVHRR bare soil emissivity in channel centred at i and (k=0,…,5) are coefficients (Table   6). 265 Figure 3 illustrates LSE estimation, which consists of two main parts: Part I describes how static bare soil emissivity is obtained. After preparing the ASTER GED datasets, global mean NDVI and channel emissivity maps are obtained and mean fv is calculated via Eq. (3). In combination with the LUT for ASTER vegetation emissivity from Meng et al. (2016), an initial global ASTER bare soil emissivity map is obtained via Eq. (4).
However, due to regions with persistent cloud cover and over areas with dense vegetation (i.e. no visible bare soil fraction), 270 the obtained global emissivity maps for bare soil still have considerable data gaps. These missing values in the bare soil emissivity maps are filled with the average emissivity of the same soil type within the 3×3 neighbourhood pixels. If there is no neighbour valid pixel for averaging, the neighbourhood is enlarged until all data gaps are filled.
Part II describes the estimation of the daily dynamic emissivity. Firstly, the global ASTER background bare soil spectral channel emissivities are converted to AVHRR spectral channels via Eq. (5). Then, AVHRR channel emissivities are obtained 275 via Eq. (2) with NDVI values from the AVH13C1 dataset. Vegetation emissivities are taken from a look-up table (Table 7), which is based on AVHRR LCTs and vegetation emissivities from Pinheiro et al. (2006). Furthermore, emissivities of built-up areas and water are used for separating these areas from other non-vegetated areas.

Orbital Drift Correction
The orbital drift of the NOAA-series satellites is a serious limitation for applications of AVHRR LST. Therefore, an orbital 280 drift correction (ODC) would be highly useful and beneficial for many users. The actual overpass times of the NOAA-series afternoon satellites are between 13:00-17:30. In order to include the four-afternoon satellites (Table 1) (284 + DOY)) (Elagib et al., 1998).
Similar to the component emissivity in Eq. (2), LST can be approximated as the weighted sum of the component 290 temperatures of the vegetation and bare soil areas (Quan et al., 2018): where Tveg and Tsoil are the component temperatures of vegetation and bare soil, respectively.

Generation of LST products 310
The product generation executable (PGE) code includes four Modules. Module I is for generating the multi-LST with the selected SWAs. Three different types of input data enter this module: (i) the satellite data: BTs from AVH02C1, NDVI from AVH13C1, bare soil emissivity (see section 3.3), and AVHRR LCTs from UMD; (ii) look-up tables: coefficients of the SWAs (see section 3.1), emissivities of vegetation, water, and built-up areas (see Table 7); and (iii) ancillary data: NSAT and CWVC from MERRA and land-sea mask. The QC flags in AVHR02C1 are also used to identify cloudy pixel. If a pixel contains cloud 315 or cloud shadow, its LST is not calculated. Therefore, the output of Module I is multi-LST under clear sky conditions. Module II is for integrating the multi-LST with the trained RF ensemble model. The inputs include the multi-LST from Module I and the RF ensemble model; the output is the ensemble LST, which is termed RF-SWA LST. Module III is for normalizing the LST affected by orbital drift to 14:30 solar time. In this Module, the input datasets include the RF-SWA LST and NDVI; the latter is used for calculating the fraction of vegetation. The output of Module III is orbital drift corrected LST, 320 which is termed OCD LST. Module IV is for generating monthly average LST: the module first groups ODC LST by month, sums the valid LST in each month up, and divides them by the respective number of valid LST. The output from this Module is monthly averaged ODC LST. All LST data are stored in standard HDF-EOS format. Table 8 shows the variables provided in the three types of LST data files.
Before the validation was performed, in-situ LST and AVHRR LST were accurately matched up in terms of geolocation and acquisition time (nearest-neighbour interpolation and, depending on the site, time differences of less than 3 min, 15 min or 30 min). Furthermore, VZAs were limited to less than 40°. Additionally, three-sigma (3σ) filtering (Eq. 11) was employed to remove the samples contaminated by undetected clouds (Göttsche et al., 2016;Pearson, 2002). 335 where xk are the LST differences between the retrieved and in-situ values; x med is the median of the residuals. Matchups with residuals greater than x med +3S or less than x med -3S are regarded as outliners.

Training results and selection of SWAs 340
For NOAA-07/11 AVHRR, the candidate SWAs in Table 5 were already trained and evaluated by Zhou et al. (2019b). Here, the SWAs are additionally trained and evaluated for NOAA-09/14 AVHRR. Generally, the SWA training results for the four sensors are consistent with each other. Therefore, only the result for NOAA-14 AVHRR is listed here. The candidate algorithms OV1992, FO1996, and FOW 1996 show the worst regression accuracy regardless of atmospheric conditions with standard errors of the estimate (SEE) higher than 1.49 K, 1.48 K, and 1.32 K, respectively. For Warm-ATM, the SEE of PP1991 increases 345 rapidly with increasing CWVC, while it shows good accuracy for Cold-ATM with SEE between 0.33 K and 0.75 K. The SEE values for UC1985 and MT2002 were larger than those of most other SWAs, even though they are still lower than for OV1992, FO1996, FOW1996, and PP1991: therefore, these six SWAs were disregarded in the further analysis. For the remaining 11 SWAs, a sensitivity analysis was performed for the TRA-G simulation dataset with uncertainties levels L1 and L2. The results showed that SO1991 and CO1994 are sensitive to uncertainties in LSE and CWVC. Consequently, these two SWAs were also 350 excluded from the candidate algorithm list. More details on the training and sensitivity analysis are provided in Zhou et al.

(2019b).
The nine remaining SWAs for NOAA-09/14 AVHRR were then tested with the simulation datasets VAL-S and VAL-T.
For completeness, Fig. 4 shows these results together with those obtained for NOAA-07/11. It can be seen that the retained nine SWAs have low RMSE values, which range between 0.38 K and 0.49 K for VAL-S and between 0.47 K and 0.68 K for 355 VAL-T. Since the atmospheric profiles used to generate VAL-S and VAL-T are globally distributed, we conclude that these nine SWAs should perform well globally. The results for VAL-S in Fig. 4 reveal that BL-WD and WA2014 show the highest overall accuracy, followed by BL1995, PR1984, and VI1991. The RMSE values of these four SWAs are ~0.48 K. For VAL-T, BL1995 and BL_WD show the highest accuracy, followed by WA2014, VI1991, and PR1984; in this case, the RMSE of these four SWAs is ~0.60 K. For all nine SWAs, the accuracy decreases as the VZA increases. While BL-WD achieves the highest 360 accuracy, no obvious differences between the other eight SWAs are observed. From the 17 LCTs over which the atmospheric profiles are located, taking VAL-T for NOAA-14 AVHRR as an example, BL-WD performs best for nine LCTs, VI1991 and BL1995 for three LCTs. In contrast, there is no LC type over which PR1984, SR2000, GA2008, and UL1994 perform best. It is because those SWAs show different sensitivity to the emissivity, of which was set by the LCTs of profiles located. When assessing the effect of different atmospheric conditions, in Cold-ATM the highest accuracy is found for BL1995 and BL-WD. 365 In Warm-ATM, when Ts-NSAT is within [-4, 20] K, BL-WD performs the best for CWVC below 2.5 g/cm 2 , while PR1984 performs the best when CWVC exceeds 4.5 g/cm 2 . When Ts-NSAT is within [-16,4] K, WA2014 shows the best performance for CWVC below 3.5 g/cm 2 ; with increasing CWVC, BL1995 and BL-WD show the highest accuracy. Overall, it is found that no single SWA achieves the highest accuracy under all conditions.

Multi-LSTs ensemble 370
The nine SWAs were integrated with the RF ensemble method. For comparison, the simple averaging (SA) method and Bayesian Model Averaging (BMA) method were also employed. In contrast to Zhou et al. (2019b), here we used LST retrieved with SWAs trained with TRA-G (L0, L1, and L2) and VAL-S/T (L0) to simulate a more realistic situation with uncertainty.
Generally, the MBE of the RF ensemble method and the BMA model method is negligible (of the order of 10 -4 K or less), while the MBE of SA method and single SWA is larger (of the order of 0.1 K). It can be concluded that the two ensemble 375 methods (i.e. RF and BMA) similarly reduce systematic error. In terms of training accuracy, the RF model shows obvious advantages with a STD of less than 0.50 K for the four NOAA AVHRR sensors while the STD of SA and BMA is larger and varies between 1.27 and 1.35 K. Figure 5 highlights the importance of variance for forming the RF ensemble: the most important SWA is BL1995 with an importance value of 0.67, 0.64, 0.68, and 0.83 for NOAA-07, 09, 11, and 14, respectively.
The second most important SWA for NOAA-07, 09, and 11 is ULW1994, while WA2014 is the second most important SWA for NOAA-14. SR2000 is also of some importance for the ensemble process. Figure 5 confirms that the most important SWA, i.e. BL1995, is consistent with the most accurate SWA under different atmospheric conditions. Figure 6 shows the SEEs of the three ensemble methods for different CWVC zones and VZA subranges for NOAA-14 AVHRR. Compared to the BMA and SA models for all atmospheric conditions and VZAs, the RF ensemble model achieves an obvious improvement in LST accuracy. For Cold-ATM, the SEE of RF increases slowly with increasing CWVC and VZA 385 and varies from 0.21 K to 0.45 K. In contrast, the SEEs of BMA and SA show larger variations for both, increasing CWVC and VZA, and range from 0.72 K to 1.23 K. For Warm-ATM and CWVC less than 3.0 g/cm 2 , there is no obvious increase in SEE with increasing CWVC or VZA. However, SEE increases noticeably with increasing VZA when CWVC exceeds 3.0 g/cm 2 , especially at VZA larger than 35°. However, the SEE of RF is always smaller than that of BMA and SA: RF SEE only exceeds 1.0 K when CWVC is larger than 5.0 g/cm 2 and VZA exceeds 60°. Under the same conditions, the SEE of BMA and 390 SA is larger than 2.0 K. Therefore, it is concluded that the RF ensemble method achieves a higher training accuracy than the BMA and SA methods, with a RF training accuracy of less than 1.0 K under most conditions.
To assess the stability and sensitivity of the RF model, the LST estimated with RF, BMA, and SA method were evaluated against the VAL-T and VAL-S datasets at uncertainty levels L1 and L2. Figure 7 shows the evaluation of the three methods for NOAA-14 AVHRR. For VAL-S at L1, STD and RMSE of about 0.7 K are found for all three methods; however, the biases of 395 RF (MBE=-0.04 K) and BMA (MBE=-0.03 K) are smaller than that of SA (MBE=-0.11 K) and negligible (i.e. less than ±0.1 K). For VAL-S at L2, the bias for all three methods is negligible. However, considerable improvements are obtained with the RF method in terms of STD/RMSE, which is about 0.25K lower than for BMA and SA. For VAL-T at L1, the RF method has a slightly larger bias (MBE=-0.1 K) than the SA/BMA methods; however, its STD/RMSE is smaller. For VAL-T at L2, the three methods have a similar bias. However, RF has a significantly smaller STD/RMSE of 1.02/1.03 K than SA (1.41/1.42 K) 400 and BMA (1.38/1.39 K). Similar results were found for NOAA-07/09/11 AVHRR.

Validation of RF-SWA LST against in-situ LST
First, the generated RF-SWA LST was validated against in-situ LST from SURFRAD sites. Figure 8 shows a scatterplot between RF-SWA LST and SURFRAD in-situ LST and some statistic indicators, i.e. MBE, RMSE, STD, R 2 , and N (i.e. sample size). High correlations are found between RF-SWA LST and in-situ LST with a R 2 range of 0.91-0.96. MBE varies between 405 -1.59 K and 2.71 K and RMSE between 2.25 K and 3.86 K. Compared to LST products for MODIS, AATSR, and VIIRS, which were also validated against SURFRAD in-situ LST (Duan et al., 2019;Liu et al., 2019b;Martin et al., 2019), RF-SWA LST shows a similar accuracy and precision. It should be noted that the large MBE at BND, GWN, and TBL are probably due to a lack of in-situ LST representativeness at the satellite scale, e.g. BND, a seasonal bias variation is observed. During the dormancy season, the surface within the ground radiometer's FOV and the corresponding AVHRR pixel are both fairly 410 homogeneously covered by bare soil and grassland, which leads to smaller LST differences. In contrast, during the growing season, most of the area within the AVHRR pixel is covered by cropland and the fraction of vegetation cover depends on the crop's growth stage: especially in the early growing and harvesting season, there are many bare areas between crop rows, which causes larger LST differences. If only BND matchups during the dormancy season are considered, the corresponding MBE and RMSE between RF-SWA LST and in-situ LST reduce to 1.56 K and 2.58 K, respectively. At GWN, the land cover 415 within the ground radiometer's FOV is also grassland; however, the corresponding AVHRR pixel includes several nearby forest areas. Therefore, the daytime LST observed on the pixel scale tends to lower than the in-situ LST. At TBL, RF-SWA LST is lower than in-situ LST for in-situ LST larger than 300 K: this may be explained by a larger vegetation area southeast of the site, which is included in the AVHRR pixel, while the ground radiometer's entire FOV is covered by bare soil. Figure 9 shows scatterplots between RF-SWA LST and NDBC lake surface water temperature (LSWT) and some statistic 420 indicators for buoys in the East Pacific, the Big Lakes, Gulf of Mexico, and western Atlantic. As shown in Table 4, the number of buoys for each area differs. The RF-SWA LST shows a good correlation with in-situ LSWT with R 2 ranging from 0.94 to 0.98. The plots also show low systematic errors and high precision, e.g. MBEs are less than 0.26 K and RMSE ranges from 0.77 K to 0.89 K. Overall, the validation results resemble those obtained for the simulation datasets in section 4.2. Furthermore, the validation results meet WMO's requirements for applications of LST/LSWT in different fields, e.g. an uncertainty of 2.0 425 K for Agricultural Meteorology and of 1.0 K for Climate Monitoring (WMO, 2020).

Orbital Drift Correction
The retrieved RF-SWA LST were normalized for the orbital drift of the NOAA-series satellites using the orbital drift correction method described in section 3.4. Since water surface temperature varies relatively slowly, only the retrieved surface temperatures over land were normalized. The orbital drift corrected LST (ODC LST) was then validated against the same in-430 situ data as in section 4.3. Figure 10 shows a boxplot of the residuals (TAVHRR-Tin-situ) for the ODC LST and RF-SWA LST. The plot shows that the bias of the ODC LST over the 6 sites is similar to that of the uncorrected RF-SWA LST. From the six SURFRAD sites, BND has the highest positive bias, while GWN and TBL show negative biases. Following the explanation in section 4.3, this is probably due to less representative in-situ measurements. The standard deviations (STD) of the ODC LST residuals at the six SURFRAD sites are 3.62 K (BND), 2.34 K (DRA), 3.38 K (FPK), 3.45 K (GWN), 2.57 K (PSU), and 435 3.69 K (TBL). The STD variations (ODC LST -RF-SWA LST) ranges from 0.06 K to 1.15 K. This indicates that the ODC LST maintains the good accuracy of RF-SWA LST and its performance primarily depends on surface conditions. This is understandable because the improved ODC method uses adjacent pixels to compensate for the lack of temporal information.
Nevertheless, the improved ODC method provides a practical way to correct the effect of orbital drift on LST retrieved from NOAA AVHRR data. 440 The LST show an obvious annual variation as seasons change with Earth's revolution around the Sun. In March and September ( Fig. 11 a and c), the Sun is overhead near the equator, which then receives most of the solar energy. However, the highest LST are observed north and south of the equator, i.e. over the Sahara and Australia. This is due to the equatorial regions' dense 445 coverage with tropical rainforests, e.g. in the Amazon and Congo basins. The lowest LST are observed in the northern hemisphere around 45°N and around the Qinghai-Tibet Plateau. In June (Fig. 11 b), the Sun is more overhead in the northern hemisphere and the area with the highest LST is located around 30°N, e.g. over the Sahara Desert, the Arabian Peninsula, and the Iran-Pamir Plateau. At the same time, LST also increases north of 45°N. In December (Fig. 11 d), the area with the highest LST is mainly located over Oceania and parts of South America. It should be noted that the large areas with invalid data, which 450 are mainly observed at latitudes larger than 45°, are caused by the strict cloud filtering algorithms, which frequently recognize snow and ice as cloud, and polar night when no visible data are available to calculate NDVI (related to LSE). Furthermore, in In order to demonstrate the temporal consistency between satellites, Figure 12 shows time series of monthly averaged ODC LST from 1981-2000 for the Amazon basin, the Arctic pole, and the Tibetan plateau (areas are shown in Fig. 1): no 455 significant orbital drift or inconsistencies can be seen, indicating that the ODC method adequately normalized the retrieved AVHRR LST. The larger annual variations over the North pole (Fig. 12 b) are related to the specific variation of solar radiation in high latitude areas, i.e. polar day and polar night. For the Amazon basin (Fig. 12 a), the North Arctic pole (Fig. 12 b), and the Tibetan plateau (Fig. 12 c), the linear regressions (blue lines) show different trends with rates of 0.048±0.024 K/year (p-value=0.046), 0.087±0.221 K/year (p-value=0.695), and 0.081±0.103 K/year (p-value=0.433), respectively. However, these 460 rates may be affected by averaging over large areas and by the frequently missing data due to clouds. Therefore, more in-depth analyses, especially with in-situ observations and reanalysis data, are needed (Liu et al., 2008;Rigor et al., 2000;Schneider and Hook, 2010;Wu et al., 2013).
The 17 trained candidate SWAs generally showed consistent results for the four sensors. The candidate algorithms 475 OV1992, FO1996, FOW1996, PP1991, UC1985, and MT2002 had larger SEE than the other SWAs while SO1991 and CO1994 were sensitive to uncertainties in LSE and CWVC. Therefore, these SWAs were rejected. The nine remaining SWAs were evaluated based on the simulation datasets VAL-S and VAL-T. The results show that the trained nine SWAs have RMSE ranging between 0.38 K and 0.55 K for VAL-S and between 0.53 K and 0.69 K for VAL-T. Since the atmospheric profiles used to simulate and evaluate were chosen to be globally representative, we conclude that these nine SWAs should perform well 480 globally.
The RF ensemble method was then been applied to the nine selected SWAs. Compared to individual SWAs, sample averaging, and the BMA ensemble method, the RF ensemble method showed the best accuracy when evaluated against the simulation datasets. The RF ensemble algorithm yielded an accuracy better than 0.8 K for a maximum LSE uncertainty of 0.02 and a maximum CWVC uncertainty of 1.0 g/cm -2 ; the accuracy was still better than 1.10 K when the maximum LSE uncertainty 485 increased to 0.04. Based on these results, the algorithm theoretically satisfies the target accuracy requirement of WMO, i.e. an accuracy better than 1.0 K at a spatial resolution of 5 km. Furthermore, it is concluded that the RF method outperforms the SA and BMA methods and has the greatest potential for improving LST retrieval accuracy.
The RF-SWA LST and ODC LST are validated against in-situ LST from SURFRAD sites and NDBC buoys. Against SURFRAD LST, the MBE of RF-SWA LST varies from -1.59 K to 2.71 K and its STD varies from 2.26 K to 2.76 K, which 490 is similar to LST products retrieved from other sensors, e.g. MODIS. Against NDBC data from 1981-2000, RF-SWA LST also shows good accuracy and precision: with its small MBE (less than 0.10 K) and a STD ranging from 0.84 to 1.05 K, its performance against in-situ water temperature is similar to that for the simulated datasets. When validated against the same SURFRAD LST, the MBE of ODC LST ranges from -1.05 K to 3.01 K, which is similar to the MBE obtained for RF-SWA LST; its STD increases and ranges from 2.34 K to 3.69 K. Overall, it is concluded that both RF-SWA LST and ODC LST 495 achieve similar accuracy.
The generated global AVHRR LST is well suited to meet the needs of many applications and studies, e.g. global climate change, radiation budget, and energy balance, mapping of land cover change. However, further research should address the following points: first, the developed LST products were validated against in-situ LST data from North America, while they need to be validated globally, e.g. against AsiaFlux measurements, historical air temperature, reanalysis data, etc. Second, 500 ODC LST is obtained at a single overpass time, which required using prior knowledge on temporal parameters. If additional information on LST would be available, e.g. from modelling datasets, geostationary satellite datasets, and AVHRR nighttime datasets. Third, over some areas, there are many invalid values, e.g. southwest China, which frequently experiences cloudy and rainy weather. It is expected that future work can utilize recent progress in generating global all-weather LST products (Martins et al., 2019;Zhang et al., 2020) to help integrating multi-source data, e.g. passive microwave brightness temperature 505 and reanalysis data.  Tables   Table 1 Details    Becker and Li (1995) GA2008 = 0 + 1 11 + 2 ( 11 − 12 ) + 3 ( 11 − 12 ) 2 + ( 4 + 5 + 6 2 )(1 − ) + ( 7 + 8 )∆ Galve et al. (2008) Note: subscripts 11 and 12 denote channels centred at approximately 11μm and 12μm, respectively, while T11 and T12 and ε11 and ε12 are their associated BTs and LSEs; ε= (ε11+ε12)/2, Δε=(ε11-ε12); Ai are coefficients; T0 in PP1991 is 273.15 K; w is 785 CWVC and θ is VZA.