An all-sky 1 km daily land surface air temperature product over mainland China for 2003–2019 from MODIS and ancillary data

. Surface air temperature (T a ), as an important climate variable, has been used in a wide range of fields such as ecology, hydrology, climatology, epidemiology, and environmental science. However, ground measurements are limited by poor spatial representation and inconsistency, while reanalysis and meteorological forcing datasets suffer from coarse spatial resolution 10 and inaccuracy. Previous studies using satellite data have mainly estimated T a under clear-sky conditions, or with limited temporal and spatial coverage. In this study, an all-sky daily mean land T a product at 1 km spatial resolution over mainland China for 2003–2019 has been generated mainly from the Moderate Resolution Imaging Spectroradiometer (MODIS) products and the Global Land Data Assimilation System (GLDAS) dataset. Three T a estimation models based on random forest were trained using ground measurements from 2384 stations for three different clear-sky and cloudy-sky conditions. The random 15 sample validation results showed that R 2 and root mean square error (RMSE) values of the three models ranged from 0.984 to 0.986 and 1.342 K to 1.440 K, respectively. We examined the spatiotemporal patterns and land cover type dependences of model accuracy. Two cross-validation (CV) strategies of Leave-Time-Out (LTO) CV and Leave-Location-Out (LLO) CV were also used to evaluate the models. Finally, we developed the all-sky T a dataset from 2003 to 2009, and compared it with the China Land Data Assimilation System (CLDAS) dataset at 0.0625° spatial resolution, China Meteorological Forcing Data 20 (CMFD) dataset at 0.1° spatial resolution, and GLDAS dataset at 0.25° spatial resolution. Validation accuracy of our product in 2010 was significantly better than other datasets, with R 2 and RMSE values of 0.992 and 1.010 K, respectively. In summary, the developed all-sky daily mean land T a dataset has achieved satisfactory accuracy and high spatial resolution simultaneously, which fills the current dataset gap in this field and plays an important role in the studies of climate change and hydrological cycle. This dataset is freely available at http://doi.org/10.5281/zenodo.4399453 (Chen et al., 2021b) and the University of 25 Maryland (http://glass.umd.edu/Ta_China/) currently. A sub-dataset that covers Beijing generated from this dataset is also available at

atmosphere, but usually require additional processes to obtain Ta. The Moderate Resolution Imaging Spectroradiometer (MODIS) atmospheric profile product has been used for this purpose (Bisht and Bras, 2010;Borbas and Menzel, 2017;Famiglietti et al., 2018;Zhu et al., 2017). Generally, traditional statistical methods were commonly used but have reported low accuracy. In recent years, machine learning methods, particularly deep learning methods, such as support vector machine (Zhang et al., 2016), artificial neural network (Jang et al., 2004;Zhang et al., 2016), M5 model trees (Emamifar et al., 2013), 70 random forest (RF) models (Noi et al., 2017;Xu et al., 2014;Zhang et al., 2016), cubist models (Meyer et al., 2016;Noi et al., 2017;Rao et al., 2019), and advanced deep learning methods (Shen et al., 2020), have been gradually applied to Ta estimation from satellite data because of their stronger learning ability to capture the complex nonlinear relationship between various factors.
Most LST-based Ta estimation methods mentioned above are suitable only for clear-sky conditions as the current LST 75 datasets are mainly derived from satellite thermal infrared radiances (TIR) that are susceptible to cloud contamination Ma et al., 2020). Currently, there are two main strategies for estimating all-sky Ta based on LST: one is to first derive Ta from the available LST and then fill the Ta gaps (Rosenfeld et al., 2017;Zhang, 2017); the other is to first fill the LST gaps to develop a seamless product and then estimate the all-sky Ta (Kilibarda et al., 2014;Li et al., 2018;Rao et al., 2019). For example, Zhang et al. (2017)

estimated Ta under clear-sky conditions based on MODIS LST, and the Atmospheric 80
Infrared Sounder (AIRS) standard Ta products were used to fill the cloudy-sky pixels after a downscaling process, with a mean absolute error (MAE) of 1.2 K and a root mean square error (RMSE) of 1.6 K overall. According to the research conducted by Kilibarda et al. (2014), the 8-day composite LST was interpolating into a daily dataset and then combined with topographic layers and geometric temperature trend to interpolate the all-sky daily Ta, and the results reported an RMSE value of 2-4 ℃.
In addition, Zhu et al. (2017) developed a parameterization scheme to estimate all-sky instantaneous daytime Ta only relying 85 on MODIS atmospheric profile product. They first established the relationship between LST and Ta under clear-sky conditions, and then estimated Ta under cloudy-sky conditions based on the established relationship, with RMSE values ranging from 2.50 ℃ to 2.56 ℃.
Currently, several studies have been conducted to develop all-sky Ta datasets based on remotely sensed data. For instance, Li et al. (2018) used a 3-step hybrid gap-filling method to attain seamless LST first, and then developed daily geographically 90 weighted regression (GWR) models to interpolate Ta using gap-filled LST and elevation, and finally developed a 1 km daily minimum/maximum Ta dataset in urban and surrounding areas in the conterminous U.S. for 2003-2016. The cross-validation results reported that the RMSE values were 2.1 °C and 1.9 °C for daily minimum and maximum Ta, respectively. In the recent work conducted by Yao et al. (2020), the MODIS 8-day composite LST was averaged to obtain monthly mean LST, and then combined with enhanced vegetation index (EVI), solar radiation, topographic index and other features to establish a cubist 95 model for generating 1 km monthly maximum/mean/minimum Ta products in China, and the RMSE of the estimated monthly mean Ta was 0.629 °C. Rao et al. (2019) first filled the gaps of LSTs, and then used the gap-filled LSTs and some radiation products to build cubist models for estimating all-sky daily mean Ta, with an RMSE of 1.87 °C. Finally, a 0.05° × 0.05° daily 4 mean Ta product over the Tibetan Plateau for 2002-2016 was developed. In addition, there exists multiple reanalysis and meteorological forcing datasets covering large areas or global areas, which are usually generated by data assimilation or data 100 interpolation, such as the Global Land Data Assimilation System (GLDAS) (Rodell et al., 2004), Modern-Era Retrospective Analysis and Research and Application, version 2 (MERRA-2) (Gelaro et al., 2017), China Meteorological Forcing Data (CMFD) (Yang and He, 2019), and China Land Data Assimilation System (CLDAS) (Shi et al., 2011). However, these datasets have coarse spatial resolution (generally ≥ 0.1° except for CLDAS with a spatial resolution of 0.0625°) and regional inaccuracy, which may limit their potential to accurately capture the spatial heterogeneity of Ta in the urban and mountainous areas and 105 lead to uncertainties for applications at local to regional scales (Jang et al., 2014;Li et al., 2018;Zhu et al., 2017). To our knowledge, there are a lack of long time series all-sky Ta products covering vast areas with both high spatial and temporal resolution currently.
The main objective of this study is to develop an all-sky 1 km daily mean land Ta over mainland China for 2003China for -2019 integrating satellite data products, model simulations, and ground measurements. For the first time, assimilated Ta was applied 110 to supplement and substitute MODIS LSTs and provide the initial values of model prediction. In order to solve the issue of missing LST, a simple temporal gap-filling method was used to fill the gaps of MODIS LSTs first. Considering the differences in the relationship between Ta and other features under different weather conditions, we divided all data pairs into three types of weather conditions: (1) clear-sky conditions; (2) cloudy-sky conditions case Ⅰ; (3) cloudy-sky conditions case Ⅱ, and then established three machine learning models to estimate daily mean Ta under different weather conditions. The structure of this 115 paper is organized as follows: Section 2 describes the study area and used data; Section 3 summarizes the overall research method; Section 4 reports the validation results and discusses the model performance; Section 5 compares the developed dataset with the existing datasets; and Sect. 6 presents the overall conclusion.

Meteorological station data 120
This study was conducted in mainland China. Station observed daily mean Ta from 2003 to 2019 were collected from 2384 standard meteorological stations in mainland China for model training and validation. During the production process of this dataset, it experienced strict quality control. Figure 1 shows the study area and the geographical location of the meteorological stations used in this study. Each dot represents a station, and different colors correspond to different land cover types. The land cover data used in the study is Finer Resolution Observation and Monitoring of Global Land Cover (FROM-GLC) version2 125 (2015_v1), which is a 30 m resolution global land cover maps (Gong et al., 2013).

Remotely sensed data 130
Satellite datasets used in this study are listed in Table 1.  (Wan et al., 2015).
Three all-sky products from the Global LAnd Surface Satellite (GLASS) products suite Liang et al., 135 20202021) were used, including the GLASS 1 km 8-day surface broadband albedo (ALB) product GLASS02A06 , GLASS 0.05° daily downward shortwave radiation (DSR) product GLASS05B01 (Zhang et al., 2019), and GLASS 1 km 8-day leaf area index (LAI) product GLASS01A01 (Xiao et al., 2014). For the ALB product, we used black-sky albedo of shortwave (BSA_sw), visible (BSA_vis), and near-infrared (BSA_nir) bands. As radiation products, DSR and ALB determine the shortwave solar radiation received at the surface and the fraction of total radiation reflected and absorbed by the surface, 140 respectively.

Methods
The overall framework of this study is shown in the Fig. 2. Firstly, all datasets from 2003 to 2019 were pre-processed into 145 identical spatial and temporal resolutions. Second, we filled the gaps of MODIS LSTs and then divided all data pairs into three weather conditions according to the gap-filling results. Next, the values of all datasets were extracted by the nearest neighbour

Data pre-processing
Because the spatial and temporal resolutions of all datasets were not completely consistent, we pre-processed all remotely sensed datasets and reanalysis datasets from 2003 to 2019 into identical 1 km and daily spatial and temporal resolutions, respectively. DSR, elevation and assimilated Ta were resampled to the spatial resolution of 1 km by the nearest neighbour method. As LAI and ALB datasets both have an 8-day temporal resolution, we first combined them into a time series, and then 160 interpolated the time series by linear interpolation method to obtain the daily datasets. For GLDAS assimilation data with a 3hourly temporal resolution, we averaged all assimilated instantaneous Ta in a day to acquire the assimilated daily mean Ta for all days.
Then, the values of all datasets were extracted by the nearest neighbour method according to the geographical locations of stations and then matched with the in situ Ta to obtain data pairs. Next, we used a temporal gap-filling method to fill the 165 MODIS LST gaps and divided all data pairs into three weather conditions according to the gap-filling results. The detailed gap-filling method and strategy for weather conditions division is described in the Section 3.2. Then, the data pairs under different weather conditions from 2003 to 2016 were randomly divided into training, validation, and test sets (ratio: 3:1:1).
Among them, training set was used for model training, validation set was used to determine the best model parameters, and test set was used to evaluate the final model performance. 170

Strategies for LST gap-filling and weather conditions division
MODIS LSTs were produced under strict quality control, with each pixel marked as either a clear-sky or cloudy-sky observation. Pixels under cloudy-sky conditions, had missing LST value, because of which LST-based method could not be applied to estimate Ta. In this study, a simple multi-temporal method was used to fill the MODIS LST gaps. First, we set a time threshold (± 2 days), and the missing pixel value was replaced by the clear-sky value of the nearest date within the set 175 time threshold. If no clear-sky pixel was found within the time threshold, the missing pixel was not filled to avoid introducing a high uncertainty caused by a huge temperature change between dates with a large difference. This multi-temporal method was used to fill the gaps of all four MODIS LSTs each day.
Considering the differences in the relationship between Ta and other features under different weather conditions, we divided data pairs into clear-sky conditions and cloudy-sky conditions according to the LSTs gap-filling results. When all four LSTs 180 in a day were all under clear-sky conditions, the data pair was identified as being under clear-sky conditions; otherwise, it was identified as being under cloudy-sky conditions. To control the uncertainty introduced by LST gap-filling, cloudy-sky conditions were divided into two cases: case Ⅰ and case Ⅱ. In particular, a data pair was identified as being under cloudy-sky conditions case Ⅰ when there were LST gaps in the data pair and the gaps could be filled through the method mentioned above.
If the LST gaps could not all be filled, the data pair was identified as being under cloudy-sky conditions case Ⅱ. Therefore, we 185 finally divided all data pairs into three types of weather conditions: (1) clear-sky conditions, (2) cloudy-sky conditions case Ⅰ, and (3) cloudy-sky conditions case Ⅱ. The detailed criteria for dividing weather conditions are shown in Fig. 3. Next, we established three machine learning models (clear-sky model, cloudy-sky model Ⅰ, and cloudy-sky model Ⅱ) and 190 trained separately for different weather conditions. Daily LSTs were used in models for clear-sky conditions (clear-sky model) and cloudy-sky conditions case Ⅰ (cloudy-sky model Ⅰ), but not for cloudy-sky conditions case Ⅱ (cloudy-sky model Ⅱ).
GLDAS assimilated Ta, GLASS DSR, GLASS ALB, GLASS LAI, elevation, and temporal and locational information were also used in all three models as input features. For clear-sky model, the utilized features included four clear-sky LSTs in a day.
The qualification for a pixel of a given day to be judged as clear-sky may be harsh, but this ensured the use of completely 195 clear-sky LSTs. The features of cloudy-sky model Ⅰ included gap-filled LST(s), which increased the availability of LST, but the simple gap-filling strategy also introduced errors to the models. To avoid instilling a high uncertainty caused by a large temperature change between dates with a large difference, cloudy-sky model Ⅱ did not use LST to estimate Ta.

Random forest
The RF method is an ensemble learning method based on Classification And Regression Tree (CART) proposed by Breiman 200 et al. (1984). Since it was proposed, it has attracted the attention of quite a few fields and been applied to various applications in remote sensing in recent years (Gislason et al., 2006;Ham et al., 2005;Li and Zha, 2019;Xu et al., 2014).
A decision tree is a tree-like prediction model composed of nodes and directed edges. In each internal node of the decision tree, the sample set is segmented by selecting the optimal splitting feature until the segmentation termination condition is reached. Each path from the root node to the leaf nodes of a decision tree forms a classification. There are many algorithms 205 for decision tree, such as ID3 (Quinlan, 1986), C4.5 (Quinlan, 1992), and CART. These algorithms all adopt the top-down greedy algorithm, and each internal node chooses the feature with the best classification effect to split, to achieve the goal of dividing samples into subsets that are as homogenous as possible, with the fastest speed. In the generation algorithms of ID3 and C4.5 decision tree, information gain or information gain rate is used as the criterion to judge the optimal segmentation.
Another type of optimal segmentation criterion is Gini impurity, which is utilized in the CART decision tree. In the RF model, 210 multiple CART decision trees are included. The bagging method (Breiman, 1996) is used to generate independent identically distributed training sample sets for each tree and train on them.
Although the application of RF at present is mainly focused on classification, it can be also used in regression analysis effectively, which can usually achieve higher accuracy than traditional regression analysis methods. The training and prediction process of the RF regression model is shown in Fig. 4. First, the bootstrapping method is used to acquire k datasets 215

Model training and validation
During the model training process, training set was used for model training, validation set was used to determine the models 225 with the optimal hyper-parameters.
Compared with artificial neural network, RF regression model does not need to carry out complicated parameter tuning work and changing some insignificant parameters of the RF model may not cause substantial fluctuations in model performance.
The two most critical hyper-parameters, ntree and mtry, need to be determined during training. Among them, ntree refers to the number of decision trees in the RF model. Increasing ntree is conducive to improving the model performance and stability, 230 but also affects the computational efficiency of the program. Mtry refers to the maximum number of features used in a single decision tree. When mtry is less than the total number of features, the segmentation of a node is determined based on partial features that are randomly selected rather than all features. Increasing mtry allows nodes to consider more features when splitting, but also reduces the diversity of individual trees, thus increasing the risk of overfitting. Therefore, both parameters need to be properly balanced and selected, and we used the validation set to evaluate the model performance with different 235 combinations of parameters to obtain the optimal hyper-parameters.
Assuming the total number of features of a sample is m, the values of mtry include log2m, sqrt(m) and m, and ntree is set to 5-200. To analyse the RF model performance sensitivity to hyper-parameters, the RMSE values of the three models for different weather conditions were calculated when setting different parameters, and the result is shown in Fig. 5. It can be seen from the results that with the change of model parameters, the three models showed similar variation patterns. With the increase 240 of ntree, the RMSE value decreased gradually until it became almost constant (when ntree ≥ 100). Continue increasing of ntree made very little contribution to improving the model performance but affected the computing efficiency. For mtry, we can see that using partial features (mtry = log2m or sqrt(m)) performed significantly better than using all features (mtry = m). Overall, setting mtry to log2m and sqrt(m) presented similar performance, and the setting of sqrt(m) performed slightly better than log2m when ntree was larger than 175. Therefore, we set ntree to 200 and mtry to sqrt(m) in all models. 245 RMSEtrue, and the result is used as FI, as shown in the Eq. (2). A large FI value means that the model performance decreases 255 significantly after shuffling this feature, which indicates that this feature has a great impact on the accuracy of prediction results. On the contrary, if the model performance does not deteriorate significantly, it is obvious that this feature has less influence in the prediction process, or that other linearly dependent features are included in the model to make this feature redundant.
where ypre refers to model prediction result, and yobs refers to the corresponding station observation. RMSEtrue is the RMSE of the validation set, and RMSEi refers to the RMSE of the validation set after feature i is shuffled.
The Ta predicted by the models was compared with the corresponding station observations. RMSE, MAE, and R 2 were selected as criteria for model evaluation. In order to comprehensively evaluate the performance of the models, we adopted 265 three model validation strategies: random sample validation, LTO CV, and LLO CV. For random sample validation, test set (1/5 of the total data from 2003 to 2016 selected randomly) was used to evaluate the performance of the final Ta estimation models. The results were grouped by elevation range, land cover type, and month to evaluate the model performance under 13 different situations. For LTO CV and LLO CV, we divided all data pairs into 14 groups according to calendar year and 7 groups according to geographical location. In each iteration, one group of data was used for validation, and the other groups 270 of data were used as the training set for model training. The modeling and validation process were repeated 14 and 7 times until each year's data and each cluster of data was validated. These two CV strategies have been used in some studies to evaluate the performance of spatiotemporal models in unknown time or unknown space (Liu et al., 2020;Ploton et al., 2020;Xiao et al., 2018).

Overall accuracy and model comparison
Approximately 3/5 and 1/5 of the data pairs from 2003 to 2016 were randomly selected for training and tuning the models, respectively, and the remaining 1/5 of the total data pairs were used to evaluate the performance of the final Ta estimation models. Validation statistics of models for different weather conditions and the overall accuracy of all estimated daily mean Ta are shown in Table 2. The three models presented similar validation statistics, with R 2 , MAE, RMSE, and bias ranging from 280 0.984 to 0.986, 1.033 K to 1.100 K, 1.342 K to 1.440 K, and 0.012 K to 0.051 K, respectively. The overall R 2 , MAE, RMSE, and bias of the estimated all-sky Ta were 0.985, 1.068 K, 1.409 K, and 0.03 K, respectively. Compared with the in situ Ta, the estimated Ta of all models showed a high correlation with little difference, confirming the great potential of RF method to estimate all-sky daily mean Ta over a wide spatial and temporal range. In addition, to further investigate the distribution of the prediction results and the differences between the three models, density scatter plots of the estimated Ta against the in situ Ta for the three models are shown in Fig. 6. In the three density scatter plots, most points were very concentrated near the 1:1 line, which also confirmed that these three models have achieved satisfactory accuracy in estimating daily mean Ta under different weather conditions. Among all the models, the clear-sky model had the highest stability and overall accuracy statistically, with the highest R 2 and the lowest MAE and RMSE. It could 290 predict Ta under clear-sky conditions from less than 250 K to more than 300 K accurately and steadily. Compared with the clear-sky model, cloudy-sky model Ⅰ had a relatively large error, which demonstrated that the LST gap-filling strategy adopted in this study introduced errors into the model to some extent, thereby increasing the uncertainty in estimating Ta under cloudysky conditions case Ⅰ. The accuracy of the cloudy-sky model Ⅱ was statistically similar to that of the clear-sky model, and it could predict a moderate temperature range close to 275 K with satisfactory performance. However, it can be seen from the 295 density scatter plot for cloudy-sky model Ⅱ that some discrete points deviated from the 1:1 line in the low-temperature range, which indicated that there may be much uncertainty in predicting the low-temperature range, especially at temperatures less than 260 K. 300 Figure 6. Density scatter plots of the estimated Ta against the in situ Ta for three models.
Many studies have proved that land cover type and elevation have a significant impact on the heterogeneity of Ta (Benali et al., 2012;Good et al., 2017;Lin et al., 2012;Marzban et al., 2017). Therefore, to comprehensively analyse the performance of the Ta estimation models, we grouped the results by land cover type and elevation range, and then compared the model performance for different groups. The model performance for different land cover types are listed in Table 3. All models 305 showed relatively good performance (RMSE < 1.5 K) for cropland, shrubland, water, and impervious surface while RMSE values were higher for grassland and bare land, which was consistent with the findings of Shen et al. (2020). The model performance for different elevation ranges is also listed in Table 4. With the increase of elevation, RMSE values of all models had a certain upward trend. However, as shown in the Fig. 7, the elevation of the stations used in this study is mainly distributed in the range 0-2000 m, so the quantity of training samples in this elevation range have an absolute superiority, while the 310 samples of higher elevation (elevation > 2000 m) occupy only a small part. The problem of class imbalance may contribute to the relatively large errors when predicting Ta at high elevation. In addition, factors such as complex and varied topography, vertical variation in Ta, and scale differences between remotely sensed image pixels and station observation data points will lead to high difficulty and uncertainty in Ta estimation at higher elevations (Rao et al., 2019).

Figure 7. Elevation histogram of stations used in this study.
We further evaluated the error distribution of the three models at the stations. Due to the absence of in situ Ta of some ground 320 stations on some days, only the stations that recorded more than 20 days for all three weather conditions were taken to include.
And the results of 2320 valid stations were finally obtained, shown in Table 5. In general, the models showed good performance at most stations, with a mean RMSE value of 1.383 K. Moreover, there were 97 % stations with RMSE values less than 2 K and only 1 of the 2320 statistical stations with an RMSE value greater than 3 K. The clear-sky model also had the best performance at the station scale, with the lowest mean RMSE of 1.231 K. And 508 stations had RMSE values less than 1 K, 325 2286 stations had RMSE values less than 2 K, while only 2 stations had RMSE values greater than 3 K. For cloudy-sky model Ⅰ, the mean RMSE reached 1.432 K. RMSE values of 2256 stations were less than 2 K, and only one station had an RMSE greater than 3 K. For cloudy-sky model Ⅱ, the mean RMSE was 1.440 K, close to cloudy-sky model Ⅰ, and 121 stations had RMSE values less than 1 K. However, 13 stations had RMSE values greater than 3 K for cloudy-sky model Ⅱ, and most of these stations had RMSE values less than 3 K for the other two models. 330 For model comparison, as expected, the clear-sky model that used absolutely clear-sky LSTs performed better than cloudysky model Ⅰ and cloudy-sky model Ⅱ in almost every aspect and presented the highest stability. Cloudy-sky model Ⅰ, which contained gap-filled LSTs, did not perform as well as the clear-sky model because although the time threshold (± 2 days) of the LST gap-filling method was relatively small, the LST value of a missing pixel of a date may be replaced by a clear-sky 335 value with a difference of up to 2 days. However, the LST can vary considerably in just a few days, so the LST gap-filling process can introduce large errors into the model, thus affecting the accuracy of Ta estimation. Surprisingly, the cloudy-sky model Ⅱ that did not use LST features achieved a comparative accuracy with the clear-sky model (RMSE = 1.396 K vs. 1.342 K) statistically. However, when we further analysed the model performance in specific situations, we detected the differences in the performance of the three models. There may be considerable uncertainty for cloudy-sky model Ⅱ in predicting the low 340 temperature range, especially at less than 260 K. Notably, the cloudy-sky model Ⅱ performed poorly on wetlands with an RMSE of 2.063 K, while both clear-sky model and cloudy-sky model Ⅰ performed well on this type of land cover. This may be because wetlands are a mixture of water and land, with diverse complex ecological environments. Using LST can significantly improve the Ta estimation accuracy of this land cover type.
In summary, because of the strong correlation between Ta and LST, adding daily LSTs as features to models can improve 345 the model stability and robustness. In the absence of LST, assimilated Ta can be used as a substitute for LST to provide an initial value or first guess for the model to estimate Ta with acceptable accuracy when combined with other features. However, the resolution of the reanalysis product is relatively coarse, and some local details were ignored when sampling from a larger scale (0.25° × 0.25°) to a smaller scale (1 km × 1 km), thus causing certain uncertainties for cloudy-sky model Ⅱ to predict low-temperature range or some regions, especially some specific land cover types or regions with complex terrain. Overall, 350 none of the three models showed significant differences in the model performance, and the model performance discrepancies for different land cover types and elevation ranges were acceptable. The proposed models can perform well in different situations and are suitable for Ta estimation under different weather conditions.

Cross-validation
In addition to random sample validation, two CV methods were used to further evaluate model performance. For LTO CV, we 355 divided the data pairs from 2003 to 2016 into 14 groups by calendar year. In each iteration, 13 groups of data were used as training set for model training, and the remaining one group of data was used for validation. The modeling and validation 18 process were repeated 14 times until each year's data was validated. The results are shown in Fig. 8. The RMSE values of validation results for different groups of data ranged from 1.359 K to 1.665 K. The minor difference between the LTO CV results proved that these models have good extensibility in time. 360

Figure 8. Density scatter plots of LTO CV results for three models.
Then, for LLO CV, we divided 7 clusters in the Chinese region by using the similar separation strategy of Xiao et al. (2018).
Stations used in this study were divided into different clusters according to their spatial locations, and all data pairs were 365 divided into 7 groups according to the cluster of station. In each iteration, 6 groups of data were used as training set and the remaining one group of data was used for validation. The modeling and validation process were repeated 7 times until the data of each group was validated. The total validation results of the models under three weather conditions are shown in Fig. 9, with RMSE values ranging from 1.615 K to 1.957 K. As expected, the error of LLO CV increased relative to random sample validation. This is because the relationship between Ta and other features varies with geographical location. The prediction 370 error of the northwest and southwest clusters was larger than that of other clusters. RMSE values of these two clusters exceeded 2.5 K under cloudy-sky conditions Ⅱ while RMSE values of the other clusters were about 1.5 K. This is consistent with the analysis of the spatial distribution of model accuracy in section 4.4 of the manuscript. The meteorological stations in northwest and southwest China are distributed discretely and far away from other stations in China, leading to a large difference between the training set and the test set, and ultimately resulting in the relatively poor performance in the LLO CV strategy in these 375 two regions. Furthermore, the LLO CV results of the cloudy-sky model Ⅱ were worse than those of the clear-sky model and cloudy-sky model Ⅰ, indicating that LSTs help to reduce the spatial overfitting of the models.

Feature importance analysis
To quantitatively evaluate the contribution of each feature included in the RF models, the FI of every feature for the three models was calculated by permutation method described in Section 3.4, and then ranked. To reduce the impact of contingency on the experimental results, we repeated the experiment 30 times and took the average value of all experimental results as the final FI of each feature for each model. The FI results are shown in Fig. 10, with the importance decreasing from top to bottom. 385 The grey line indicates the FI range of each feature for multiple repeated experiments. All features are divided into four types and represented by different colors, among which the blue rectangles represent MODIS LSTs, the orange rectangles represent GLDAS assimilated Ta, the red rectangles represent radiation products including DSR and ALB, and the green rectangles represent other features. For clear-sky model, Terra nighttime LST was of the highest importance (FI = 2.92), followed by assimilated Ta (FI = 2.48), indicating that the prediction accuracy of the clear-sky model was significantly reduced after permuting these two features.
They were followed by Aqua nighttime LST (FI = 1.3) and two daytime LSTs (FI = 0.49 and 0.21, respectively). For cloudysky model Ⅰ, assimilated Ta ranked first (FI = 4.59), followed by Terra nighttime LST (FI = 1.03). For cloudy-sky model Ⅱ that 400 did not include LST as features, assimilated Ta played a more importance role (FI = 6.65) than it did for cloudy-sky model Ⅰ.
The FI of radiation products and other features were all less than 1 for all the models, showing that they only slightly improved the model performance.
The energy exchange between the land surface and the near-surface atmosphere takes the form of longwave radiation, evapotranspiration and turbulent exchange, or other phenomena. LST and land surface emissivity (LSE) determine the 405 longwave radiation in land surface radiation and energy budgets . Thus, there is a strong and complicated physical correlation between LST and Ta. It can be seen from Fig. 8 that all four daily LSTs, especially nighttime LSTs, had relatively high FI for both clear-sky model and cloudy-sky model Ⅰ. Among all the daily LSTs, nighttime LSTs outweighed daytime LSTs, and Terra nighttime LST was of higher importance than Aqua nighttime LST, which was consistent with the findings of many studies (Benali et al., 2012;Li and Zha, 2019;Zhang et al., 2011). This phenomenon is largely due 410 to the fact that the pass time of Terra was at an approximate local solar time of 10:30 p.m. during the night, when the measured LST is closer to daily mean Ta. In Lin's study, the MAE between LST and Ta during the day and during the night were calculated separately, finding that there was better agreement between LST and Ta during the night (Lin et al., 2012). In addition, because of the lack of solar radiation and its influence on the thermal infrared signal, remotely sensed nighttime LST products usually have higher stability (Benali et al., 2012;Vancutsem et al., 2010). 415 Assimilated Ta also mattered considerably for Ta estimation models. Its FI was second only to Terra nighttime LST for clearsky model and highest for cloudy-sky model Ⅰ and cloudy-sky model Ⅱ. For cloudy-sky model Ⅰ, originally missed LSTs were replaced with clear-sky values of a near date, and the error introduced by this simple LST gap-filling strategy resulted in a decrease in the overall LST accuracy, thereby leading the FI of assimilated Ta to exceed that of LSTs. Compared with the cloudy-sky model Ⅰ, assimilated Ta was of higher importance with a FI of 6.65 for cloudy-sky model Ⅱ, indicating that it 420 became the absolute dominant factor in Ta estimation when LST was not included in the Ta estimation model. Cloudy-sky model Ⅱ also achieved satisfactory accuracy in the validation results. This demonstrates that although the spatial resolution of the assimilated Ta is relatively coarse, it can be the supplement and substitute of MODIS LSTs and provide the initial value or first guess for models to predict Ta with a higher resolution.
Radiation products and other features helped to improve the accuracy of Ta estimation models to a small extent. Among 425 them, latitude, longitude, elevation and day of year had relatively high importance in all three models. Latitude and longitude determine the relative position of the sun influencing day length, and thus the distribution of total solar radiation the surface receives throughout the year, which in turn affects the patterns of Ta (Benali et al., 2012). Elevation affects how the ground is heated and how much radiation energy is absorbed by the atmosphere, resulting in vertical variations in Ta. In addition, the relationship between Ta and LST has great heterogeneity in different regions and at different times and is greatly affected by 430 23 surface characteristics and atmospheric conditions. The day of year helps to explain the seasonal changes in atmospheric physical conditions, chemical composition, and surface characteristics to distinguish the different relationships between Ta and LST in different seasons and then improve the accuracy of Ta estimation (Yao et al., 2019;Zhang et al., 2011). For LAI, DSR, and ALB, it is likely that other collinear features in the models made the information provided by them redundant, so their FI was relatively low in the Ta estimation models. However, in the analysis of the results of some stations, it is found that adding 435 radiation features to the models helped improve the Ta estimation accuracy on some days. The radiation features can play a supplementary role in the case of some other features that do not perform well. Therefore, we finally decided to retain the radiation features in the Ta estimation models.

Spatial distribution of accuracy
The RMSE value was calculated for each meteorological station that recorded more than 20 days for all three weather 440 conditions. To obtain a deeper understanding of the spatial distribution of model performance, the RMSE spatial distribution of stations for the three models was mapped, as shown in Fig. 11. It is evident that the model performance varied at different geographical locations for all three models. The clear-sky model presented the most stable results in different regions compared with cloudy-sky model Ⅰ and cloudy-sky model Ⅱ, with RMSE values of all stations ranging from 0.566 K to 3.453 K. The RMSE range of cloudy-sky model Ⅰ was 0.823-4.370 K, and that of cloudy-sky model Ⅱ was 0.809-4.198 K. The spatial 445 patterns of cloudy-sky model Ⅰ and cloudy-sky model Ⅱ were generally similar, but for cloudy-sky model Ⅱ there were more stations with good performance (RMSE < 1 K) and poor performance (RMSE > 3 K), showing relatively poor stability.
Overall, the stations in central, eastern, and southern China presented high levels of accuracy for all three models, with RMSE values of most stations in these places less than 1.5 K. Most stations with large RMSE values were located in southwest, northwest, and northern China, which was consistent with the results of Shen et al. (2020), and the RMSE values of cloudy-450 sky model Ⅱ in these positions were larger than those of clear-sky model and cloudy-sky model Ⅰ. On the one hand, the spatial heterogeneity of model performance is largely because of the uneven distribution density of meteorological stations. As can be seen from the geographical locations of the meteorological stations used in this study in Fig. 1, it is obvious that stations in central, eastern, and southern China are densely distributed, while stations in northern and western China are relatively rare, which may contribute to the uneven distribution of model performance. Additionally, the terrain environment in central, eastern, 455 and southern China is not complex, while high elevation and some climate types will increase the uncertainty of Ta estimation in northern and western China. The climate types of stations with poor performance were mostly temperate continental and plateau mountain climates, and the land cover types were mainly bare land and grassland. It can be seen from Table 4 that cloudy-sky model Ⅱ showed relatively poor performance for these two land cover types. Therefore, there was a certain uncertainty when only assimilated Ta and other features except LSTs were included to predict Ta in places with these climate 460 and land cover types. Overall, although the spatial distribution of the model performance was relatively uneven, the Ta estimation models for different weather conditions all showed satisfactory performance.
(c) RMSE spatial distribution for cloudy-sky model Ⅱ. Figure 11. RMSE spatial distribution of stations for three RF models.

Seasonal distribution of accuracy 470
The model performance at the monthly scale was also evaluated, and the RMSE monthly distribution for the three models is shown in Fig. 12. The RMSE range of the clear-sky model was 1.109-1.508 K, cloudy-sky model Ⅰ was 1.178-1.692 K, and cloudy-sky model Ⅱ was 1.056-1.777 K. It is obvious that there was temporal heterogeneity in the model performance, and the estimation accuracy presented similar seasonal variation patterns for all three models. The RMSE values were lower in summer and autumn, and higher in spring and winter, reaching a peak in February and reaching a bottom in July or August. 475 We can conclude that models performed better in warm days, with RMSE values of all three models below 1.22 K in July and August. This finding was consistent with the validation results at the monthly scale of Yao et al. (2019) and Li and Zha (2019).
This phenomenon may be partly due to the fact that China is vast in territory with a latitudinal difference between the northernmost station and the southernmost stations of about 30°, so the range of Ta is wider in cold days than in hot days. Monthly differences in model performance also indicated that the relationship between Ta and other factors varied seasonally 480 and may have been more consistent in the same month. It was confirmed in the research of Yao et al. (2019) that modeling data of the same month together could achieve more accurate results. Therefore, although day of year was used in the modeling in this study, this temporal difference was not completely eliminated. Modeling the datasets of all seasons together in this 26 study may increase the temporal heterogeneity of accuracy. It is worthwhile to consider grouping the data of the same month to establish monthly models in the future, which may be conducive to further improving the accuracy of Ta estimation. 485 Figure 12. RMSE monthly distribution for three RF models.

Comparison with existing datasets
For a more comprehensive evaluation of the estimated daily mean Ta, we compared it with three reanalysis and meteorological forcing datasets including CLDAS, CMFD, and GLDAS in terms of validation statistics and spatiotemporal patterns. The 490 station observations in 2010 were used to validate the accuracy of these four Ta datasets. It should be noted that we estimated daily mean Ta for the period ending at local midnight rather than 24:00 UTC. To ensure the time consistency, we calculated the average value of all simulations on a local day as the daily mean Ta for the reanalysis and meteorological forcing datasets.
The statistical results and the density scatter plots are shown in Table 6 and Fig. 13, respectively. It can be seen that compared with the reanalysis datasets, the RF Ta presented the highest consistency with the station observations, with the best 495 performance in all accuracy assessment criteria (R 2 , MAE, RMSE, and bias values were 0.992, 0.680 K, 1.010 K, and 0.063 K, respectively). The points in the density scatter plot of the RF Ta were more concentrated near the 1:1 line. CLDAS Ta and CMFD Ta both showed near zero bias with the station observations, but their RMSE values were both close to 2 K. GLDAS 27 Ta reported slight underestimation (bias = 0.900 K). In general, this comparison confirmed the applicability of RF method in Ta estimation and the higher accuracy of our estimated Ta compared to the reanalysis products. 500  In addition, the spatiotemporal patterns of these four Ta datasets were compared. We calculated the monthly mean Ta in 505 2010 for all datasets. The RF monthly mean land Ta mappings over mainland China in February, May, August, and November 28 2010 are shown in Fig. 14 (a-d). The CLDAS (Fig. 14 (e-h)), CMFD (Fig. 14 (i-l)), and GLDAS (Fig. 14 (m-p)) monthly mean Ta mappings in the same months are also shown in Fig. 12. The spatial resolutions of RF, CLDAS, CMFD, and GLDAS monthly mean Ta are approximately 0.01° × 0.01°, 0.0625° × 0.0625°, 0.1° × 0.1°, and 0.25° × 0.25°, respectively. We used GLDAS assimilated Ta and GLASS LAI in Ta estimation, which have no value in most water bodies, so Ta of these areas was 510 also not estimated.
As can be seen from Fig. 14, it's clear that these four datasets basically showed a high degree of consistency in the spatiotemporal patterns over mainland China overall. China has a vast territory and its topography is high in the west and low in the east. The spatial patterns of Ta over mainland China present great seasonal heterogeneity. In winter, the sun shines directly in the southern hemisphere and the northern hemisphere receives less solar energy consequentially. The Ta in northern 515 China and Tibetan Plateau are generally low, and the Ta difference between the north and the south exceeds 50 K. On the contrary, in summer, as the sun shines directly in the northern hemisphere, Ta in most parts of China are generally high except for the Tibetan Plateau, with little Ta difference between the north and the south. As an expectable consequence of higher spatial resolution, the RF Ta mappings were capable of providing more details about the Ta spatial patterns than the reanalysis and meteorological forcing Ta, especially in mountainous areas with complicated terrain. GLDAS Ta presented an obvious 520 pixel effect because of the relatively coarse spatial resolution. In summary, the all-sky daily mean land Ta product developed in this study has achieved satisfactory accuracy and high spatial resolution simultaneously, which can reveal the seasonal variation trend and the spatial patterns of Ta over China well. This product can provide a long time series of daily mean Ta with the spatial resolution of 1 km over mainland China, which fills the current dataset gap in this field. Moreover, this product is also conducive to observing and analysing the climate characteristic of China and plays an important role in the studies of 525 climate change and hydrological cycle. In order to make this big dataset easier to understand and use, we made a provincial sub-dataset with a smaller geographic coverage. An all-sky 0.01° daily Ta product over Beijing (2003Beijing ( -2019 was generated from the developed dataset over mainland China after resampling and clipping, and it is publicly available at http://doi.org/10.5281/zenodo.4405123 (Chen et al., 2021a).

Conclusion
Ta is a key variable in climate and global change research. In this study, we developed an all-sky 1 km daily mean land Ta product for 2003-2019 over mainland China mainly based on MODIS and GLDAS data using the RF method. An efficient temporal gap-filling method was first used to fill MODIS LST gaps under cloudy-sky conditions. We predicted Ta under three 550 different weather conditions separately: clear-sky conditions (when the daily LSTs are all clear-sky), cloudy-sky conditions case Ⅰ (when the daily LST gap(s) can be filled), and cloudy-sky conditions case Ⅱ (when the daily LST gap(s) cannot all be filled). The validation results using station measurements (1/5 of the total data from 2003 to 2016 selected randomly), which were not used for model training, showed that R 2 values were 0.986, 0.984, and 0.984, RMSE values were 1.342 K, 1.440 K, and 1.396 K for clear-sky model, cloudy-sky model Ⅰ, and cloudy-sky model Ⅱ, respectively. In general, the models showed 555 excellent performance at most stations, with a mean RMSE of 1.383 K, and there were 97 % stations with RMSE values less than 2 K and only 1 of 2320 stations with an RMSE value greater than 3 K. In addition, we examined the spatiotemporal patterns and land cover type dependences of model accuracy and concluded that model performance under all conditions was acceptable overall, despite some heterogeneity under different conditions. The relative contributions of different features to models were also quantitatively analysed, and it was found that LST and assimilated Ta were of great significance in Ta  560 estimation. Finally, we compared the Ta dataset in 2010 with CLDAS, CMFD, and GLDAS datasets, finding great consistency in the spatiotemporal patterns. The estimated Ta in 2010 reported significantly higher accuracy against the station observations, with R 2 , RMSE, and bias values of 0.992, 1.010 K, and 0.063 K, respectively.
Overall, this study developed a robust scheme to use machine learning method to estimate all-sky daily mean Ta over a large spatial and temporal range. This approach can be applied globally. The generated all-sky Ta product have achieved a high 565 degree of accuracy compared with the existing datasets, which fills the current dataset gap in this field and plays an important role in many scientific fields such as climate change, hydrological cycle, and energy balance. Future work should focus on developing better LST gap-filling methods, experimenting with more advanced deep learning methods that take into account the spatial and temporal dependence of Ta.

Author contributions 570
SL and YC contributed to the design of this study and developed the overall methodology. HM, BL, and YC collected and pre-processed the data. YC carried out the experiments. YC, BL, TH, and QW produced the product. YC wrote the first draft.
All authors revised the manuscript.