Deep‐learning‐based harmonization and super‐resolution of near‐surface air temperature from CMIP6 models (1850–2100)

Future global temperature change will have significant effects on society and ecosystems. Earth system models (ESM) are the primary tools to explore future climate change. However, ESMs have great uncertainty and often run at a coarse spatial resolution (usually about 2°). Accurate high‐spatial‐resolution temperature dataset are needed to improve our understanding of temperature variations and for many other applications. We apply Super resolution (SR) in computer vision using deep learning (DL) to merge 31 ESMs data. The proposed method performs data merging, bias‐correction, and spatial downscaling simultaneously. The CRU TS (Climate Research Unit gridded Time Series) data is used as reference data in the model training process. To find a suitable DL method, we select five SR methodologies with different structures. We compare the performances of the methods based on mean square error (MSE), mean absolute error (MAE) and Pearson correlation coefficient (R). The best method is used to merge the projected monthly data (1850–1900), and monthly future scenarios data (2015–2100), at the high spatial resolution of 0.5°. Results show that the merged data have considerably improved performance compared with individual ESM data and their ensemble mean (EM), both spatially and temporally. The MAE shows significant improvement; the spatial distribution of the MAE widens along the latitudes in the Northern Hemisphere. The MAE of merged data is ranging from 0.60 to 1.50, the South American (SA) has the lowest error and the Europe has the highest error. The merged product has excellent performance when the observation data is smooth with few fluctuations in the time series. This work demonstrates the applicability and effectiveness of the DL methods in data merging, bias‐correction and spatial downscaling when enough training data are available. Data can be accessed at https://doi.org/10.5281/zenodo.5746632.

that the merged data have considerably improved performance compared with individual ESM data and their ensemble mean (EM), both spatially and temporally. The MAE shows significant improvement; the spatial distribution of the MAE widens along the latitudes in the Northern Hemisphere. The MAE of merged data is ranging from 0.60 to 1.50, the South American (SA) has the lowest error and the Europe has the highest error. The merged product has excellent performance when the observation data is smooth with few fluctuations in the time series. This work demonstrates the applicability and effectiveness of the DL methods in data merging, bias-correction and spatial downscaling when enough training data are available. Data can be accessed at https://doi.org/10.5281/zenodo.5746632.

K E Y W O R D S
climate change, CMIP6, data-merging, deep-learning, temperature 1 | INTRODUCTION From the Sixth Assessment Report by the Intergovernmental Panel on Climate Change (IPCC), global air temperature increased by 1.09 C from 2011 to 2020 compared to 1850-1900(Masson-Delmotte et al., 2021. Global warming has significant impact on the natural environment, mankind, the water cycle (Yamaguchi et al., 2020), food safety (Hasegawa et al., 2021), extreme disaster (Quante et al., 2021;Touma et al., 2021), socioeconomic (Naumann et al., 2021), diversity of species (Avaria-Llautureo et al., 2021) and other fields. Besides, the most direct influence is the decline of ice cover in the ocean. In past decades, the South Pole experienced an unprecedented warming speed (three times more than the whole earth) (Clem et al., 2020) and ice cover in Arctic gone through a decrease of average 12% every decade (Zhukov and Gushcha, 2020). The loss of mass ice will also bring an extra increase of temperature at a long time scale (Wunderling et al., 2020). If temperature have an extra increase 2 C will result in an 20 cm rise of the global ocean (Jevrejeva et al., 2016). In this background, the society and scientific research groups pay a lot of attention to the global temperature changes (Choi et al., 2020). Hence, in order to improve our understanding of the spatial and temporal changes of future temperature, a more precise future temperature dataset at high spatial resolution is needed.
The traditional meteorological stations can provide long time series point data. However, with the development of remote sensing technology, the number of satellites specially used for meteorological observation are increasing; this provides a large number of meteorological satellite observation data. The satellite observation data have relatively shorter time series which are not useful for future climate change prediction. The climate model and Earth system models (ESMs) that consider comprehensive physical and biological processes act as major tools for projecting future climate variations. The projected data of ESMs provide avenues to investigate climate change and the potential influence of those changes in global and regional contexts (Zhukov and Gushcha, 2020;Yuan et al., 2021). To address the increasing range of scientific questions posed by several research communities, the coupled model inter-comparison project (CMIP) team continues revising the experimental design and organization of CMIP . The latest is the Phase6 CMIP (CMIP6) that has a substantial augmentation of model numbers and range of experiments. Compared with the previous Phase5 CMIP (CMIP5), ESMs of CMIP6 have spatial resolution improvements, making progress in physical parameterizations; it also includes extra Earth system processes and components (Eyring et al., 2019). The response of the Earth system to variations in radiative forcings, as well as humanity's response through technology, lifestyle, economies and policy have implications for the environment and society through climate change. (Moss et al., 2010). Shared socioeconomic pathways (SSPs) (van Vuuren et al., 2014;Riahi et al., 2017) are new scenarios based on the anticipated challenges to adaptation and mitigation applied in CMIP6 (O'Neill et al., 2016). The SSPs include five alternative pathways for socioeconomic development, including sustainable development (SSP1) , middle-of-the-road development (SSP2) , regional rivalry (SSP3) , inequality (SSP4)  and fossilfueled development (SSP5) . These new scenarios can help to understand climate and socioeconomic futures better.
There are many studies analysing future climate changes based on the earth system model (ESM) data (Parsons et al., 2020;Gillett et al., 2021). These studies generally highlight three main problems using ESM data: (1) spatial downscaling of ESMs data to generate data at higher spatial resolutions; (2) bias reduction or correction of ESMs; (3) merging of the multi-ESMs to obtain improved accuracy. The output of ESMs have certain errors and raw low spatial resolutions due to limited computational resources, several simplified hypothesis and uncertainties in model structures and parameterizations (Phillips and Gleckler, 2006). Current ESMs are run at coarse spatial resolutions and can only provide outputs at spatial resolutions ranging from 0.5 to 3.75 (the majority are at about 2 ); these spatial resolutions are not sufficient for elaborate studies (White and Toumi, 2013). Spatial downscaling methods are widely used to improve the resolution of ESMs (Shrestha et al., 2014;Baghanam et al., 2020). It can be divided into statistical and dynamical downscaling techniques (Kannan and Ghosh, 2013).
Dynamical downscaling is mainly based on the highresolution regional climate model (RCM), using the forced lateral boundary from the global climate model (GCM) (Adachi and Tomita, 2020). Statistical downscaling builds the relationship between the ESM and the local-scale meteorological data to generate highresolution data (Maraun and Widmann, 2018). Because the dynamical downscaling methods are complex and has high computational cost, statistical downscaling approaches are more commonly used. The most popular techniques of statistical downscaling are regression-based approaches such as the multiple and generalized linear regression models (Asong et al., 2016;Das and Akhter, 2019); they have low requirement of computational resources and realize simply. Despite the resolution enhancements, considerable bias still exists in both statistical and dynamical downscaling methods (Miao et al., 2015).
To reduce the bias, many bias correction approaches have been developed. These include statistical characteristics bias correction (Ho et al., 2012;Fang et al., 2015), probability distribution bias correction (Jakob Themeßl et al., 2011) and non-stationary bias correction methods (Miao et al., 2016). The downscaling techniques get closer results to observation data through the bias correction methods (Fan et al., 2021). However, the correction and downscaling methods bring some uncertainties. Lafferty et al. (2021) found the loss of production from extreme disasters was underestimated by the bias-corrected and downscaled ESM data. In different applications of the ESM data bias correction, different bias correction methods get the best performance. For instance, Gudmundsson et al. (2012) found two nonparametric transformation approaches with better performance compared with nine other methods in precipitation bias correction in Norway. A regression technique outperforms 11 other methodologies in terms of correlation in temperature bias correction in Spain (Gutiérrez et al., 2013). Some limitations exist in current ESM bias correction approaches; the ESM mean bias correction only considers the ESM mean bias and quantile correction brings an extra bias to the spatial gradient of variables (Colette et al., 2012;Bruyère et al., 2014). Additionally, majority of bias correction methods are applied to individual model datasets, which introduces greater uncertainty in future climate projection scenarios.
Considering different ESMs have different strengths which can be potentially combined to complement each other, many efforts have been made to develop methods to merge multiple ESMs to obtain improved outputs. Due to the instability of individual ESMs, the ensemble mean (EM) method has been considered in previous works as a simple and effective way to merge ESMs (A. P. Weigel et al., 2008). The evaluation results of the CMIP6 model reveal that no single model performs best in all regions under different evaluation rules (Papalexiou et al., 2020). Because ESMs have varying resolutions and different advantages in different regions, EM method has better precision in analysing future temperature dataset both regionally and globally (Fan et al., 2020;Lovino et al., 2021;You et al., 2021;Tang et al., 2021a;2021b). Although the EM method is easy and effective, there is still room for improvement.
With the rapid increase of "big earth data" such as earth observation and model simulation data in recent years, DL based methods play an increasingly important role in Earth Science because of their better capacity in processing big data (Reichstein et al., 2019). DL approaches have been successfully used in many fields of Earth Science (Toms et al., 2020;Yuan et al., 2020), such as data fusion and downscaling (Huang, 2020), land cover mapping (Tong et al., 2020), information reconstruction (Barth et al., 2020;Kadow et al., 2020;Tang et al., 2021a;2021b), information prediction (Ham et al., 2019;Zheng et al., 2020) and environmental parameter retrieval (Tian et al., 2021;Zang et al., 2021). Those DL models all have the comparable result with traditional methods. For instances, Ham et al. (2019) used a deep learning model consisted of convolution network to predicate the ENSO events which has the better performance than the dynamical forecast systems. Kadow et al. (2020) used the DL model to reconstruct the missing temperature dataset and the result outperformed the principle component analysis (PCA) and Kringing interpolation methods. The DL algorithms also have great potential to improve the accuracy of ESM data Feng et al., 2022).
The traditional methods need to solve downscaling, bias-correction and data merging separately in different processes, introducing extra uncertainties into future projection climate data. New DL methods provide new opportunities to address challenges with the use ESMs. Robust DL super-resolution (SR) algorithms in computer vision address these challenges simultaneously. Unlike the traditional methods, we can input various coarse resolution temperature datasets and get one high-resolution temperature dataset; these models can learn information from different ESMs and perform data merging, bias correction and data downscaling synchronously, and reduce uncertainty. An early deep learning-based approach named SRCNN proposed by Yoon et al. (2015) indicated that this method performed considerably better. Afterwards, many DL approaches have been proposed in SR applications and increasing efforts are being made to improve the performances of these algorithms (Shi et al., 2016;Seif and Androutsos, 2018;Blau et al., 2018;Chen et al., 2020). More different scenarios and historical ESMs temperature from CMIP6 are now available. The output data from CMIP6 models will reach an estimated 30 petabytes (Stockhause and Lautenschlager, 2017). The increasing data quantity can provide enough training samples for the deep learning models.
In this work, we train five different DL networks based on the CRU TS temperature and 31 ESMs temperature over the period of 1901 to 2014. We compare the DL networks and identify the best model to merge the different land temperature under SSP1-2.6, SSP2-4.5, SSP3-7.0, and SSP5-8.5 scenarios during 1850-1900 and 2015-2100.

| CMIP6 model data
In this study, we retrieve global air temperature from 31 CMIP6 ESMs that provides both historical and future data. The most commonly used in previous studies is the first realization of each model (Norris et al., 2021); so, we also use the first realization simulations (r1ilp1f1, except where unavailable). These models are summarized in   (Lu et al., 2021;Feng et al., 2022). In CMIP5 future climate simulations, the Representative Concentration Pathway (RCP), including RCP2.6, RCP4.5, RCP6.0 and RCP8.5, represent the radiative forcing which will reach about 2.6, 4.5, 6.0 and 8.5 WÁm −2 in 2100, respectively. In CMIP6, three new RCPs (RCP1.9, RCP3.4 and RCP7.0) have been proposed to meet more research needs. We design five sets of experiments: one historical simulation from 1850 to 2014 and four future projection scenarios (SSP1-2.6, SSP2-4.5, SSP3-7.0, SSP5-8.5) from 2015 to 2100 which are combined SSP and RCP. All GCM air temperature dataset are used as input data for the model. The raw resolutions of most ESMs are low. The DL SR model is effective in downscaling data of integer multiples such as 2, 4 or 8. The raw resolution of ESMs are very different; we therefore unify their resolutions before further analyses. About 1/3 of the total of ESMs used in this work have raw resolutions less than 2 . To retain more information in the original data, we resample all ESMs data to 2 × 2 resolution using bilinear interpolation. The input data are 31 ESM temperature dataset and the output is one merged data at 0.5 × 0.5 resolution. We split the data into 80% training set  and 20% validation set (1993-2014).

| Observation data
The CRU TS (Climate Research Unit gridded Time Series) is an extensive monthly land climate dataset which includes temperature and precipitation (http://doi.org/10/ gbr3nj) over global land, except the Antarctica, with a high spatial resolution at 0.5 × 0.5 from 1901 to 2019. It was made from the analysis of more than 4,000 independent weather station and has high credibility (Harris et al., 2020). From the birth of the first version of CRU TS data in 2000, it has been widely used in many applications such as data evaluation (Fan et al., 2020), climate variability analysis (Wang et al., 2013) and independent climate mode (Renard and Tilman, 2019) due to its high quality. Due to the resolution mismatch of the observation data in the ocean such as ERSST (Huang et al., 2017a) and the CRU dataset time-series, our study concentrates only on the land. We use the 2 m air temperature from the latest version of the CRU TS as the reference data to train our model.

| Methods
In this study, we aim to establish a relationship between the historical CMIP6 ESMs temperature data and the observation data based on the DL models from 1901 to 2014. In essence, deep learning methods is a data merge and downscaling technology based on a nonlinear relationship. We then use the optimal model to predict the future data under four scenarios (SSP1-2.6, SSP2-4.5, SSP3-7.0, and SSP5-8.5) and extend the historical data from 1850 to 1900. The DL super resolution models mainly consist of a feature extraction part and up-sample part. First, the features of the input data are extracted which are up-sampled to the target resolution as output.
The flowchart of this study is presented in Figure 1 ResNet solves network depth limitation due to the vanishing gradient problem (He et al., 2016). It achieved great success and is widely applied in different fields of computer vision. An example is the image super resolution method SRResNet (Ledig et al., 2017) with structural improvements later by proposed Wang et al. (2019a) and named EDSR. The most obvious feature is the removal of the batch normalization layers in the network as seen in Figure S1a. Besides improved performance, this new baseline model requires few parameters, saving about 40% computational time.
DenseNet is a novel network that also deals with the vanishing gradient problem. The main structure is that the input of each next layer is the output of all previous layers; this means every layer has direct access to the final output. It has a narrow internal structure and few parameters (Huang et al., 2017b). Because most deep CNN-based SR networks do not fully utilize hierarchical features from the initial low-resolution (LR) image, an RDN with a creative residual dense network based on ResNet and DenseNet ( Figure S1b) is proposed to deal with this problem (Xu et al., 2018).
An attention mechanism is an indispensable part of natural language processing (NLP) algorithms in applications such as speech recognition and question answering (Vaswani et al., 2017). Attention mechanism solve a problem that long distance information is very difficult to remember. After the creation of the transformer network structure, mainly based on attention blocks (Vaswani et al., 2017), attention mechanism became popular in the other fields of computer vision, such as image SR. Researchers invented a residual in residual (RIR) structure to form a very deep network, adding channel attention to a residual block as shown in Figure S1c (Zhang et al., 2018). RCAN can learn low and high frequency information and adaptive rescaling of channel characteristics by considering the interdependence between channels.
The GAN comprises two models: a generative model G to capture the feature and a discriminative model D to estimate the result from the G. The task of D is to minimize the probability of G making an error (Goodfellow et al., 2014). To retain finer texture detail at large upscaling factors, an image SR method-SRGAN-is proposed based on GAN (Ledig et al., 2017). However, the result of the SRGAN is often accompanied by unpleasant artefacts; the ESRGAN is designed to improve the performance of this network (Wang et al., 2019c).
A CNN has a fixed model geometric structure that restricts geometric transformations. To enhance model transformation, the DCN is proposed for the enlargement of the spatial sampling regions . The EDVR applies the DCN and Pyramid structure to design a Pyramid, Cascading and Deformable (PCD) network as illustrated in Figure S2 (Wang et al., 2019b). Although this method is proposed for video restoration, it also achieves a state-of-art performance in video SR. Unlike the single image SR algorithm, the EDVR needs a time series sequence input to learn spatial and temporal information and output the central frame image. Climate data have a strong relationship in the temporal dimension; so we also use the method to have an attempt to merge the ESM data.

| Evaluation metrics
Training and validation metrics To evaluate the different SR models, we adopt the same loss function for all networks. In the training process, the mean square error (MSE) is used as the loss function. The equation is described as follows: where N is the number of samples, Y n is the observation temperature and P n is the output of model data. mean absolute error (MAE), Pearson correlation coefficient (R) and MSE are used for a more comprehensive results evaluation in the validation phase. MAE and R are calculated as follows: The parameter meaning is the same as formula (1). Besides, Taylor diagrams (Taylor, 2001) are used to compare the observation, merged data and model data. The R, standard deviations of error and unbiased root-meansquare deviation (ubRMSD) are calculated as follows: Besides some common error metric for the evaluation, we also apply a novel SPAtial-EFficiency metric (SPAEF) which consists of the Pearson correlation coefficient, coefficient of variation and histogram intersection. A single metric is unable to reflect the whole information pattern. The three individual parts of SPAEF supplement each other in a meaningful way (Koch et al., 2018).
where β is the ratio of the coefficient of variation between the observation data and the model simulated data. σ sim and μ sim are standard deviation and mean value of model output respectively; σ obs and μ obs are standard deviation and mean value of model output respectively. γ is the histogram intersection for the histogram K of the observation data and the histogram L of the output of model, each containing n bins. To compare different units variables and to guarantee bias insensitivity, the z score method is used to calculate γ .The closer this value is to 1, the better the data quality.

| Model training settings
Due to the lack of suitable ocean data, we use zero padding to cover the ocean regions to reduce the effect during the training process. We divide global land into six regions (AUS, SA, NA, EUR, AFR and ASIA) as shown in Figure 2. Considering the boundary effects which means the output of the DL models will have more errors in the boundary regions because of the convolution struc- ture, we reserve duplicate areas between different zones. In the training process, we introduce the land-ocean mask to eliminate ocean interference in the results. For a fair comparison and the selection of a suitable model, we use the same parameter settings such as the loss function (MSE), learning rate and training epochs for all models. The learning rate is 0.0001 at the beginning of the training and we use the learning rate attenuation strategy (learning rate divide by 3 per 20 epochs) to reach a better result. We set 100 epochs for each model  The evaluation metrics for the validation phase of all models in various regions are shown in Table 2. It is noteworthy that three metrics (MSE, MAE and R) have little differences between various models. The EDSR achieve the best performance in all regions except AUS. The EDVR has the best scores for the MAE and MSE in AUS, and is slightly better than the EDSR; however, the R of EDSR is better than that of EDVR. All regions have very high R values with relatively large differences in MSE and MAE values. The MSE and MAE values for EUR and ASIA are larger compared with other regions. The SA and AFR regions have the least error. NA and ASIA have relatively higher error results compared with the rest of regions. In summary, the EDSR model outperforms the other DL approaches; therefore, we choose EDSR for the data merging. To understand the spatial error distribution, we calculate the mean MAE by pixel of all model outputs in the validation dataset. We also compute the mean spatial deviation by the EM data and compare with the DL results. For an intuitive comparison, we use bilinear interpolation to resample all the ESM model data to a 0.5 × 0.5 resolution, same as the observation and merged data. Figure 4 shows the spatial distribution of the MAE for various methods in the validation phase. Figure 4a-e shows the results of DL models and Figure 4f reveals the mean error of EM methodology. The results of the DL models show significant improvement in almost every pixel compared with the EM results. Overall, the performance of a DL model is to a great extent determined by the input ESM data. From the EM results, the whole Northern Hemisphere, including Asia, Europe and North American show high errors (MAE is more than 2 in most regions), indicating that the raw model data is not accurate. In the Southern Hemisphere, the situation is much better; only some sub-regions show high errors (MAE exceeds 2) with most areas having relatively low errors (MAE less than 1.5). Remarkably, the DL model results have similar spatial patterns of MAE. Majority of the regions have considerably small errors except the upper northern latitudes. An evident regularity can be observed; the error becomes increasingly larger along the latitudes in North Hemisphere. The Tibetan plateau region is one of the polar regions in the world, very hard to simulate. From our results, a significant improvement of the error in the Tibetan plateau area is observed. The same conclusion can be obtained that the EDSR is the optimal model in our task. Although DL methods have similar spatial distributions of MAE, a few high MAE regions are observed in a detailed inspection of the EDSR model result in Figure 5a. We randomly select six pixels in individual continents to evaluate performance in the temporal dimension. The time series is shown in Figure 5. From Figure 5, the merged data has a remarkably better performance than the EM data compared to the observation data. Generally, the peak high and low temperatures tend to have higher bias over the time series. Six pixels can be classified into 3 group based on the results. Figure 5a,d is in one group; the observation data have a lot of small fluctuations besides the cyclic variation; the merged data, notwithstanding, reduce the bias but not reconstruct the wave motion of the whole series. The second group consists of Figure 5b,c; the observation data exhibits a relatively smooth period and the merged data has a good fit with the reference data. Figure 5e,f comprises the last group; the observed data also have a smooth period; however, there are fluctuations in place of the high and low in the time series. The merged data learns this wave despite the considerable errors in these places. The merged data have excellent results where the observation data have few fluctuations. On the contrary, where the observation data have a lot of waves, the performance of the merged data decreases.

| RESULTS AND DISCUSSION
We use the Taylor diagrams to further evaluate the DL merged data, EM data and the observation. Figure 6 depicts the Taylor diagrams for results of the monthly mean temperature from 1993 to 2014 in different datasets. It is not surprising to find that the merged data achieve stable and superior performance (low bias and high R) in all regions compared to other data. Overall, all ESMs and EM data have very high R, exceeding 0.95. The EM data have a relatively better performance than the individual models globally. On the continental scale, there are clear differences between individual ESMs. In Asia and North America, as shown in Figure 6c,f, the performances of individual ESMs have no obvious variations and get results similar to the global. In other continents, the entire ESMs results show a dispersion trend and a relatively low performance. For R, the EM data also achieve excellent results, second to the merged data. However, for standard deviation and ubRMSD, the performance of the EM data is not as good as the merged data, even not better than single ESM data in some areas. The worst result is observed in Africa as shown in Figure 6g. All ESMs with R less than 0.95 and ubRMSD larger than 0.5 have more scattered distributions.
The Taylor diagrams for mean temperature show that the DL merged data has the best performance. Besides, we also use the SPAEF to compare DL data with individual ESM and EM data. The SPAEF of the merged data, EM data and different ESMs data depicted in Figure 7. It is obvious that the merged data has the highest performance on global and continental scales. On the contrary, the SPAEF of EM and ESMs data have different outcomes in global and regional areas. From Figure 7a, the EM data also have excellent SPAEF scores, second to the merged data globally. On the continental scale, the performances of EM data are not stable, similar to the performances of the ESMs data. In Figure 7c,g, the SPAEF of EM data in Asia and Africa are obviously lower than that of some single ESM results. It is noteworthy that the range of SPAEF is not restricted between 0 and 1 like some other correlation metrics. Generally, global scale scores of all data exceed 0.6; nonetheless, several negative values appear in some regional results of Asia, Australia, Europe and North America as seen in Figure 7c-f. Almost all data SPAEF performances decrease, moving from global to regional scale. Three ESMs (KACE-1-0-G, MCM-UA-1-0 and MIROC-ES2L) data achieve almost negative SPAEF in Australia, Europe and North America; this is mainly due to the raw low resolution (less than 2 × 2 ) of these models as shown in Table 1.
Based on the ESMs data, we use the best DL model not only to merge the future temperature dataset but also the historical data from 1850 to 1900. From the Figure 8a, we can see anomalies of the merged data and all CMIP6 models from 1850 to 2100, using the observation data (1995)(1996)(1997)(1998)(1999)(2000)(2001)(2002)(2003)(2004)(2005)(2006)(2007)(2008)(2009)(2010)(2011)(2012)(2013)(2014) as the reference period. Obviously, the highest temperature increase occurred in SSP-5.85. The uptrends are also observed in SSP3-7.0, SSP2-4.5 and SSP1-2.6 scenarios. Overall, the EM data tend to be lower in temperature than the observation data, while the DL merged data have results closer to the global observation data. The merged data is slightly higher than the EM data in both historical period and future scenarios. The consistency of F I G U R E 5 Time series temperature of the observation, DL merged data and EM data in one pixel during validation; (a)-(f) represent the pixels in different continents. The locations of these pixels are (77.75 W, 7.25 S), (87.75 E, 31.25 N), (148.25 E, 35.25 S), (10.25 E, 46.75 N), (44.75 W, 68.75 N) and (13.75 E, 19.25 S), corresponding to (a)-(f) respectively [Colour figure can be viewed at wileyonlinelibrary.com] the results prove the stability of the DL model in this work. We therefore conclude that the merged data is credible. In Figure 8b, we use the EM and DL data to calculate the temperature variations respectively in different periods. In order, we select the mean historical temperature (1851-1870) and mean future temperature (2081-2100) to match the reference period of 1995 to 2014. The results of the DL data show slight temperature increase in historical and different future scenarios than the EM data. This phenomenon reveals the EM data have high estimation of temperature increase; it is observed in South America from the regional results in Figure S4 and Table S1. Moreover, the decadal trend of the four future scenarios in Figure S5 showing an upward trend.
From the evaluation metrics, the DL merged data outperforms the individual ESM and EM data in all aspects. Overall, the performance of EM data is second to the DL merged data although poorer than single models on a regional scale. Deep-learning methods are data-driving algorithms (Reichstein et al., 2019) and their performances are F I G U R E 6 The global and regional monthly mean temperature; Taylor diagrams are used to compare each of the ESMs, the observation, DL merged data, and EM data in the validation phase. The vertical coordinate is the standard deviation. Green concentric circles of the dashed lines are ubRMSD. The angular coordinate shows the R [Colour figure can be viewed at wileyonlinelibrary.com] F I G U R E 7 SPAEF comparison of CMIP6 models, observations, DL merged data and EM data from 1901 to 2014 on global and continental land [Colour figure can be viewed at wileyonlinelibrary.com] to a large extent determined by the input data. From our results and other studies (Fan et al., 2020), the errors of raw ESM data in the Northern Hemisphere are very high (MAE of most regions exceeds 2). The DL merged data have great improvements in Northern Hemisphere except for some regions far north where high errors exist (MAE exceeds than 1.5). It is worth noting that the bias in the Tibetan plateau area is relatively low in the DL merged data(MAE less than 1) under some raw model have a 5 under estimate in this area (Fan et al., 2020). The outcome proves the effectiveness of the DL model bias correction during spatial downscaling and date merging.
For the pixel time series, the DL merged data cannot achieve a good outcome when the temperature has a lot of small waves in a big cycle and in the other condition it shows a good performance. All ESMs do not perform well in Africa compared to other continents from the time series evaluation ( Figure 6g); this is mainly due to the limited availability of ground measurements for CRU in Africa (Collins, 2011). The merged data have a relatively low progress in the temporal dimension compared to the spatial dimension. This is because the selected DL model is a single-image DL approach which does not consider information in a temporal dimension. The temperature is showing an obviously increase trend under historical and four future scenarios, this result is similar with other result (Fan et al., 2020;Wu et al., 2022). The DL merged data have showed a relatively small rise than the EM data globally. From the regional scale, researchers found that EM mean temperature data underestimates temperature in three main river basins (Yellow River basin, Yangtze River basin and Southwest River basin) of China (Lu et al., 2022).
In previous studies, the DL methods based on simple architectures were reported to have good performance in the spatial downscaling (Huang, 2020). The DL downscaling method has been applied successfully in the CMIP6 soil moisture datasets of China, the DL method also performed a large improvement than the raw CMIP6 soil moisture datasets (Feng et al., 2022). In our work, we apply methods with relatively complex architectures. The best model according to the validation metrics is not the newest algorithm and does not outperform the other methods in other computer vision tasks. Besides, the video SR algorithm considers information in both spatial and temporal dimensions (Chan et al., 2021); this method should have been ideal to analyse the spatio-temporal relationship of climate data and achieve better results compared with a method that considers spatial features only. However, the EDVR did not achieve the best performance in our work. This outcome calls for more comparative experiments to be done for specific applications of computer vision in other disciplines.
Our work used DL methods to generate highresolution future temperature datasets under four scenarios (SSP1-2.6, SSP2-4.5, SSP3-7.0, SSP5-8.5). However, there are some limitations of this work. First, the observation data we use is a reanalysis gridded data collected from meteorological stations; some errors that cannot be avoided affect the accuracy of this work. In the future work, weather station data need to be combined with the reanalysis data to achieve better results. Second, no suitable ocean observation data are available to match the spatial resolution of the CRU temperature datasets and time series simultaneously; we only merge high-resolution data in the land area. Additionally, the lack of SST also makes us unable to consider SST in the model via the training process. Taking the ocean into account in future studies will most likely improve the model outputs because the DL F I G U R E 8 Time series of global land annual mean temperature anomalies of CMIP models, DL merged data and observation data (a), and temperature change in historical and different future scenarios calculated by EM data and DL data (b). The solid lines in (a) are EM data of CMIP6 models and the shaded areas are the standard deviation [Colour figure can be viewed at wileyonlinelibrary.com] models can learn extra information between the ocean and land. Third, our work does not explicitly take advantage of the temporal information of climate data. In other applications such as temperature reconstruction (Kadow et al., 2020), making full use of temporal information is a challenge which is worth looking into in future studies.

| CONCLUSIONS
In this study, we use deep learning (DL) methods to generate global land temperature datasets (except Antarctica) at a higher spatial resolution (0.5 ) based on 31 different CMIP6 ESMs. Our methods can perform bias correction, spatial downscaling and data merging simultaneously. Five different DL methods are evaluated and the optimal model is selected to generate the historical data from 1851 to 1900, and future scenarios simulation data (SSP1-2.6, SSP2-4.5, SSP3-7.0 and SSP5-8.5) from 2015 to 2100. The merged data have a remarkably better quality compared with the individual ESMs and the ensemble of all ESMs in both spatial and temporal dimensions. The error metrics (e.g., R and ubRMSD) also show that the merged data have better accuracy in both global and regional areas.
The merged data have great improvement in spatial error. Besides the high latitudes of Asia and North America, most areas have a low MAE (less than 1 ). It means the merged data take full advantage of the raw CMIP6 model data. From the pixel time series, the DL merged data can achieve excellent performance when the observation is smooth with few fluctuations. Overall, the merged data progresses in the temporal dimension. The merged data have better results in spatial distribution than in temporal variation since the DL model mainly focuses on spatial features.
Our work is one of the first to perform spatial downscaling, bias correction and data merging simultaneously. Our results demonstrate the successful application of DL SR models for data merging and downscaling tasks when enough training data are available. The generated merged data with improved accuracy have great potential for many applications such as spatial-temporal change of future climate, and the influence of climate change on water resources and agriculture.