STEAD: A high-resolution daily gridded temperature dataset for Spain

. Using 5,520 observatories covering the whole territory of Spain (about 1 station per 90 km 2 considering the whole period), a daily gridded maximum and minimum temperature was built covering a period from 1901 to 2014 in peninsular Spain and 1971-2014 in Balearic and Canary Islands. A comprehensive quality control was applied to the original data and the 10 gaps were filled on each day and location independently. Using the filled data series, a grid of 5 x 5 km spatial resolution was created by estimating daily temperatures and their corresponding uncertainties at each grid point. Four daily temperature indices were calculated to describe the spatial distribution of absolute maximum and minimum temperature, number of frost days and number of summer days in Spain. The southern plateau showed the maximum values of maximum absolute temperature and summer days, while the minimum absolute temperature and frost days reached their maximums at northern 15 plateau. The use of all the available information, the complete quality control and the high spatial resolution of the grid allowed for an accurate estimate of temperature that represents a precise spatial and temporal distribution of daily temperatures in Spain. STEAD dataset is publicly available at http://dx.doi.org/10.20350/digitalCSIC/8622 and can be cited as Serrano-Notivoli et al. (2019).


Introduction 20
Despite a clear improvement over the last decades in meteorological measurement techniques, the inclusion of automated systems with near-real-time information submission, or the increasing number of stations with a growing number of recorded variables, the existing climatic information is still unrepresentative in many territories. The low density of stations in isolated areas and the great variability in the number and location of observations over time represent a substantial problem. Despite these problems, or perhaps due to them, different teams dedicated great effort to creating reliable gridded climatic datasets 25 covering large time periods. Between all the climate variables, temperature datasets are among the most popular, such as the CRUTEM (1850-2017) (Jones et al., 2012), the dataset of Willmott and Matsura (2001) covering 1900-2014period, WorldClim (1970-2000 (Fick and Hijmans, 2017), GISTEMP (1880GISTEMP ( -2017 (Hansen et al., 2010), or BEST (1850-2017) (Rohde et al., 2013). In this regard, temperature has been also widely studied in Spain, in terms of its spatio-temporal distribution (e.g. Peña-Angulo et al., 2016) and temporal trends (e.g. González-Hidalgo et al., 2015;2018). Nevertheless, most of the existing works addressed coarse temporal scales or used individual stations for detailed regions (e.g. Villeta et al., 2018).
Monthly gridded datasets have enabled a better understanding of the climatic dynamics of the planet, especially as an element of quantification and validation of climate change, due to their ability to reproduce the mid-frequency variability of temperature. However, most of the methodologies used in those works are not suitable for addressing the variability of 5 temperature at the daily scale, due to the higher spatial and temporal variability and because of the larger number of input stations required to build a reliable dataset. Although climate change in respect to temperature is often quantified in terms of changes in its mean values, most of the risks attributed to climate change are, however, related to temperature extremes that occur at shorter time scales, such as the daily scale. The study of climate change signals in temperature extremes is therefore still largely pending due to the absence of reliable daily datasets representative of most of the territories. This absence of daily 10 gridded datasets, with several remarkable exceptions (e.g. Cornes et al., 2018;Lussana et al., 2018), is linked to: i) an absence of contrasted reconstruction methodologies owing to the different temporal and spatial structure of daily data in contrast to monthly or annual values; and ii) an absence of contrasted quality control protocols for daily time series.
In addition, the reliability of a dataset not only depends on the resolution but on the consistency of the data. Quality control processes are crucial to create trustworthy datasets and, although the many existing approaches (e.g. Haylock et al., 2008;15 Klein-Tank et al., 2002, Klok andKlein-Tank, 2009) apply basic procedures, some others go beyond and check for spatial consistency (e.g. Lussana et al., 2018;Feng et al., 2004), which is recommended when using high-density networks.
The same problems intervene, though more severely, in global datasets. At the sub-regional and local scales, the understanding of high-resolution climatic variability is of key importance in a context of global change, and these datasets often are not adequate to address specific research questions such as extremes or small variations affecting other components of the natural 20 system, due to a low spatial or temporal resolution.
Daily scale in temperature information is of key importance in many areas. The E-OBS dataset (Cornes et al., 2018), at a maximum spatial resolution of 0.1 degree, is the best example of daily gridded dataset for large international areas thanks to the integration of thousands of transboundary climate data. However, it does not pretend to be comprehensive for specific regions (Van Den Besselaar et al., 2015) and a deeper analysis with more information is required for higher spatial scales. The 25 Spanish territory exhibits a great climatic variability with very different regimes in a relatively small area that leads to high risks such as increments in the frequency and magnitude of extreme events. Currently, there are only two daily gridded datasets available for Spain: E-OBS (the Spanish part of the European dataset) and Spain02 (Herrera et al., 2016). Although both of them have been checked for their reliability, and are useful for specific purposes, they have limitations that prevent several climatic analyses. For instance, in their construction they did not consider all the available information but only a few stations 30 as basis for creating the grid (229 and 250, respectively), prioritizing the longest data series over a higher spatial density. This approach is suitable for wide-ranging temperature studies, yet insufficient when addressing small areas with great variability.
The experience acquired in the SPREAD dataset (Serrano-Notivoli et al., 2017a) development set the basis for a solid and reliable daily gridded precipitation datasets creation. Using the same framework with a complete renewal of the core calculations, we developed a new methodology for daily temperature datasets reconstruction and grids creation.
This article introduces the STEAD (Spanish TEmperature At Daily scale) dataset, a new high-resolution daily gridded (maximum and minimum) temperature dataset for Spain covering the period 1901-2014 for peninsular Spain and1971-2014 for Balearic and Canary Islands. Based on the available quality-controlled temperature information in Spain (more than 5,000 5 stations), we used the same spatial resolution as SPREAD, its corresponding precipitation dataset. We propose: 1) a methodology for an exhaustive quality control; and 2) a reconstruction methodology using all the available information and based on local regression models.
Section 2 describes the input data and section 3 explains the methodology used to apply the quality control, fill the gaps in the original series and the gridding process. Section 4 shows the results of the method applied to the Spanish temperature network 10 as well as the validation of the reconstruction and gridding procedures. Also, a brief description of four climatic indices based on daily temperature is shown. Results are discussed in section 5 and summarized in the conclusions at section 7 after the specification of the availability of the dataset in section 6.

Input data
We used the full total of available 5,520 observatories covering the whole territory of Spain, which was divided in three areas 15 to compute the grid: 1) Peninsular Spain (492,175 km2) with 5,056 stations covering the period 1901-2014; 2) Balearic Islands  The mean number of stations per year increased all over the studied period ( Figure 2b). The first years of the 20 th century had only a few stations available (Brunet et al., 2006;González-Hidalgo et al., 2015) with a great distance between them. Then, the number increases with the break of the Civil War (1936 -1939) until the decade of the 1990s. Until then, all the information 5 sourced from AEMET and from that, MAGRAMA stations began to register data until the end of the period (2014). As noted in González-Hidalgo et al. (2015), the mean distance between stations (65.9 km for the whole period) barely changed from middle century as well as their mean elevation (between 500 and 550 m a.s.l.). Before that, the mean altitude experimented hard changes due to the removing or relocation of existing stations and new incorporations. . This higher inter-and intra-annual variability in the first years than the rest of the period showed that the few available stations were very different between each other. The variability is reduced from 1950 onwards while the number of stations is increased.
We used a 5 x 5 km regular grid covering the whole peninsular Spain, Balearic and Canary Islands to estimate maximum and 10 minimum temperature values from the quality-controlled and serially-complete original series. The predictor parameters (i.e. latitude, longitude, altitude and distance to the coast) for each grid point were computed as the median of all the possible values of those parameters, covering an area of 5 x 5 squared km in which the grid point is the centroid. Despite the differences in the data availability through time, the methodological process creates spatial references that are standardized with the temporal structure of the series to avoid biases or incoherencies. In this regard, the chosen spatial resolution accurately reflects the local 15 characteristics of daily temperature in most of the temporal period, while the provided uncertainty values help to understand the reliability of the estimates when the original data have higher variability.

Methods
The first stage is a quality control of the original dataset to remove the most obvious wrong data. From this starting point the process (Figure 3) is based on the computation of reference values (RV), which are computed for each location and day and 5 then compared with the original values to assess the quality of the data. After a process of standardization, new values are estimated for those days without observations (or removed in the quality control process) to obtain serially-complete data series. In a final stage, the complete series are the basis to create new data series for specific pairs of coordinates that may or not form a regular grid, including a measure of uncertainty for each location and day. The key stages of the methodological process (calculation of RV, quality control, gap filling and gridding) are the same to that used to create the SPREAD dataset 10 (Serrano-Notivoli et al., 2017a). However, the method basics are completely different since the RV creation has been refined, the quality control has been adapted to temperature data, and the gap filling and gridding processes include now an improved standardization procedure.

Initial quality control
The initial quality control (iQC) includes five basic criteria over maximum (Tmax) and minimum (Tmin) temperature: i) 5 Internal coherence, ii) removal of months containing less than 3 days of data, iii) the removal of those days out of range considering: Tmax >= 50 ºC or Tmax<= -30 ºC and Tmin >= 40 ºC or Tmin <= -35ºC; iv) removal of all days in a month with a standard deviation equal to zero (suspect repeated values in the series); and v) removal of all days in a month if the sum of the differences between maximum and minimum temperatures is equal to zero (suspect duplicated values in TMAX and TMIN). 10

Reference values (RV) as key process for quality control and reconstruction
Further steps in the quality control and the reconstruction process are based on the computation of reference values (RV). RV are obtained by using a k-nearest neighbors regression approach, which is applied to maximum and minimum temperature and to each day and location independently. The RV are estimated through the combined use of Generalized Linear Mixed Models (GLMM) and Generalized Linear Models (GLM). The predictors (independent variables) were latitude, longitude, altitude and distance to the coast. The inclusion of these four independent variables and the building of independent models for each site and day considering only the neighborhood of the site of interest allows for large flexibility and enables capturing local features that may not be captured using other methods which result in larger spatial and temporal regularization. 5 The methodological procedure is as follows: i) Rough Monthly RV, which are average monthly estimates (i.e. a climatology), obtained using GLMM and all the available data; ii) then, Fine Monthly RV, which are monthly time series of temperature, computed using GLM and data from only the 15 nearest neighbours, and including the Rough Monthly RV obtained in the previous stage as a covariate; iii) finally, Daily RV are computed using GLM and data from the 15 nearest daily observations, plus the Fine Monthly RV of the corresponding month as added covariate. The whole process is explained in detail in the 10 following sections.

Rough Monthly RV (rmRV)
Monthly time series of daily temperature means and standard deviations were computed. Only the months with complete daily observations were used to fit the model. We used latitude, longitude, altitude and distance to the coast as fixed factors, and the year as random factor. The introduction of the year as random factor allows for isolating the fact that one particular year might be colder or warmer than the average on the whole dataset, and thus eliminates the random variability arising from the fact 5 that the time period with observed data changes from station to station. The model, fitted independently for each month of the year and for each one of the four dependent variables defined above, can be represented as: 10 where is a known vector of observations with mean ( ) = and variance ( ) = -; is an unknown vector of fixed effects; is an unknown vector of random effects, with mean ( ) = and variance-covariance matrix ( ) = ; and and are known model matrices containing the values of the fixed and random variables for the observations . The models were fit by the maximum likelihood method using the R package lme4 (Bates et al., 2015).
Once the model parameters , and are obtained, best linear unbiased predictions (BLUPs) of the mean and standard 15 deviation of daily temperature are calculated for each station (i), year (y) and month (m). At this stage a global model is fit, since all the data are used to fit the model and therefore the coefficients are assumed to be constant in space, an assumption that it is a rough simplification of reality. On the other hand, this configuration allowed us to include all the data for estimating the random year effect. The obtained estimates of mean and standard deviation for maximum and minimum temperature represent highly spatially regularized patterns of monthly temperature, and do not consider local spatial variability in the 20 influence of the covariates.

Fine Monthly RV (fmRV)
In a second stage, monthly time series of the mean and standard deviation of daily temperature were computed again, but using a local (k-nearest neighbors) regression approach. For each station, the model was fit to data from the 15 nearest observations of each month plus the rmRV values calculated in the previous step. Inclusion of the latter as if they were legitimate 25 observations incorporates a certain amount of spatial regularization that helps alleviating a problem that may arise when using a purely local regression approach, i.e. an excess of spatial variability, especially in areas where the model extrapolates (in latitude, longitude, altitude or distance to the coast) with respect to the neighboring locations. A Generalized Linear Model was thus built for each station and month: were ′ is the local neighbourhood dataset, including the Rough Monthly RV with mean ( ′) = ′ ′; ′ is an unknown vector of local fixed effects; ′ is a known model matrix containing the values of the covariates at the 15 neighbouring sites; and is an unknown random error, which in the case of the mean temperature was assumed to be normally distributed with zero mean, ~ (0, ′ -) , and in the case of the temperature standard deviation was modelled as following a Poisson 5 distribution, thus taking only positive values. The obtained estimates of mean and standard deviation for maximum and minimum temperature incorporate the local variability that was lacking in the estimations of the previous step.
An example of rmRV and fmRV for a specific month is shown in the supplemental material ( Figure S2).

Daily RV (dRV)
In a third stage, daily maximum and minimum temperatures were predicted based on the 15 nearest observations and the fmRV 10 for the corresponding month, using once again a GLM with Gaussian link: where ′′ is the local daily neighbourhood dataset, including the fmRV with mean ( ′′) = ′′ ′′ and variance 20

Quality control
After the initial quality control and the RV calculation, we have the original dataset without the most obvious anomalies and an estimate for each observation. Clearly, the iQC is not enough to remove inconsistencies in temporal and spatial fields. Here is presented a novel approach of an exhaustive quality control over daily temperature data based on paired comparisons 25 between observations ( =,>,?,@ ) and standardized predictions ( _ =,>,?,@ ). All stages of this process are carried out independently for maximum and minimum temperature. What we call deep quality control (dQC) considers similarities between observations and estimates through: i) a correlation analysis between daily observations and predictions at each analyzed location, year and month and ii) a quantification on how the differences between daily observed and predicted values (anomalies) are spatially and temporarily distributed. 30 The process is iterative, which means that when dQC finishes in the first run, and the suspect detected data are removed from the original dataset, the RV are computed again over this dataset and the dQC runs again. This is iterated until dQC does not detect any suspect data.

Correlation analysis
This analysis is based on the correlation between observations and standardized predictions, and it is independent for each series of daily observations in each month. Considering a single site ( ), month ( ) and year ( ) the correlation ( =,?,@ ) between observed and predicted daily data is computed and then compared with the correlation of its 15 nearest stations in the 10 same month and year ( =,?,@ ). The index ZCOR ( =,?,@ ) is then computed as the number of standard deviations that the observed correlation deviates from the observed at their neighbours (5) At this point, the correlations and their deviations are used to remove all daily data from the site and month if:

a)
=,?,@ ≤ 0 or, b) =,?,@ > 0 =,?,@ < 0 H =,?,@ T < 0.001 being the p-value of =,?,@ . A negative =,?,@ indicates a lower observed correlation than the neighbours and H =,?,@ T < 0.001 indicates that it is highly unlikely that the correlation is plausible in the referred spatial and temporal context. 5 This part of the quality control procedure aims to detect, amongst others, (partially) repeated sequences of data, duplicated months or sequences in consecutive years, shifted dates in series or, for instance, sequences of data extremely abnormal in their spatial context. All of these anomalies, which are hard to detect with classical approaches, can be potentially identified in this stage.

Daily differences 10
Using the differences between the observations and the standardized predictions ( =,>,?,@ = =,>,?,@ − _ =,>,?,@ ), two types of anomalies are computed: • Spatial anomaly: Each difference is compared with the differences of their 15 nearest stations ( _ =,>,?,@ ). The index _ =,>,?,@ is then computed as the number of standard deviations that the observed difference deviates from their neighbours (6) The daily data from the site and month is removed if the absolute value of the mean of both spatial and temporal anomalies is higher than the value representing the probability of < 0.0001 (8): When the suspect data has been removed using the daily similarities and differences criteria, the RV are computed again and the quality control process starts over. This procedure is repeated until no suspect data is detected and removed (see Figure S6  25 in supplemental material).

Gap filling
Once quality control process is finished, a final set of RV are computed from the cleaned dataset for those locations and days with missing data. These RV corresponding with days without observations will fill the gaps, completing the series of original debugged observations ( _ =,>,?,@ ).

Gridding and uncertainty 5
With the aim of obtaining a gridded product (a regularly distributed set of data series over space), new RV are created for each location ( ), month ( ) and year ( ) of the grid in three steps: 1) Grid fmRV ( _ =>,?,@ and _ =>,?,@ ) are created using the filled dataset ( _ =,>,?,@ ); 2) Grid dRV ( =>,>,?,@ ) are created using the original filled dataset ( _ =,>,?,@ ) and the computed mean monthly references; 10 3) Finally, the estimates are standardized using the standard deviation monthly references (9) In addition to the estimates of temperature for each grid point (in the second step of gridding process), we computed their corresponding uncertainty, which was calculated as the standard error of the difference between the predicted and the observed 15 values of the 15 neighbours in each day and location.

Validation
The validation process consisted in the comparison between the observations and the estimates computed for each one of those observations. The assessment was carried out through seven statistical and graphical analyses: i) A graphical comparison and Pearson correlation coefficient calculation of the means of all the 5,520 stations considered in 20 the study. Also, the 95 th percentile of maximum temperature and 5 th percentile of minimum temperature were considered to ascertain the accuracy of the extremes' prediction; ii) a graphical representation of the Pearson correlation frequencies, by months, to show the agreement between observations and estimates; iii) a graphical representation of counts of temperature values, by categories based on absolute values. This is useful to show 25 potential biases in specific ranges of temperature; iv) a collection of statistical tests to compare observations and estimates by altitudinal ranges using daily values. The tests include the mean of observations (OBSm), the mean of estimates (PREDm), the mean absolute error (MAE), the mean error (ME), the ratio of means (RM) and the ratio of standard deviations (RSD); v) same as (iv) but by months instead of elevations; 30 vi) a graphical representation of the count of temperature differences between observations and estimates; and vii) a graphical representation of the temporal evolution of mean annual uncertainty.

Example of applications: daily spatial distribution and uncertainties of temperatures
Based on the gridded dataset created from the original data and with the described method, we computed four indices to show the potential applications of the grid: 1) the mean annual maximum value of daily maximum temperature; 2) the mean annual minimum value of daily minimum temperature; 3) the average annual count of days when daily minimum temperature is below 0 ºC (frost days); and 4) the average annual count of days when daily maximum temperature is over 25 ºC (summer days). All 5 the indices were presented with their corresponding uncertainty estimate.

Quality control
Although the quality control was carried out separately, it removed approximately a 7.4% of the original daily values both in maximum and minimum temperatures ( Table 1). The initial quality control process (iQC) removed a sum of 59 days out of 10 range (less than 0.01% of the total) and 1,349 months (0.53%) containing 4,308 days (0.04%) that did not fulfill with the minimum standards set at the beginning (see section 3.1). Furthermore, the deep quality control (dQC) removed between 4.5 and 5.6% of the months and days considering the similarities between the observations and the estimates, being the number of removed data slightly higher in minimum temperatures. Most of the correlations in removed data were negative or very low  The removed values do not necessarily correspond with climatic extremes but with values that are out of the spatio-temporal context of its neighboring observations. The fact that the maximum frequency of removed data matches with the average of maximum and minimum temperatures (Figure 6a and d) suggests that there is no bias in the suspect data detection and, indeed, the deletions are due to errors in original data (which we intend to detect) and not to climatic extremes. Furthermore, when 5 looking at the removed data by differences between observations and estimates (Figure 6b and e), it is noted that the maximum frequency of deletions corresponds to differences near to ±10 ºC, which is not unusual if we think that, probably, those removals are due to recording or transcription errors, related with missing decimals.
Despite the fact that the magnitudes of some of the removed data do not represent anomalous values (Figure 6c and f), they correspond to significant anomalies in their spatial and temporal context. Beyond the magnitude in absolute terms, the 10 differences between observations and estimates suggest, with an < 0.0001, that those values are very unlikely to be representative in their spatio-temporal context.  Using the reconstructed series, we built a 5x5 km spatial resolution gridded dataset of maximum and minimum temperature.
The values were estimated for 1901-2014 period in peninsular Spain and for 1971-2014 period in Balearic and Canary Islands.
A measure of uncertainty was added to each day and grid point of the dataset.

Quality-controlled dataset: Observations -estimates comparison 5
Daily temperatures were estimated at the same location and days as the original data series but without considering the original values in each case using a leave-one out cross validation (LOO-CV). The comparison between the estimated temperatures and the observations showed very high correlation considering the average by stations for maximum ( Figure 7a

Figure 7. Comparison between observations and estimates, by stations (n = 5520), of the mean maximum temperatures (a) and their 95 th percentiles (b) and of the mean minimum temperatures (c) and their 5 th percentiles comparison (d). Dashed lines represent ±1 standard deviation of the data.
The mean Pearson correlations between the daily observations and the estimates, by months, were 0.87 and 0.82 in maximum 5 and minimum temperature, respectively ( Figure 8). However, more than 80% of the months in maximum temperature and more than 68% in minimum got a correlation higher than 0.8. Low correlations (Pearson < 0.5) represented 3% and 5% of the months in maximum and minimum temperature, respectively. The frequency of observed temperature and their estimates (Figure 9) showed a good general agreement. Although maximum temperature was slightly overestimated in lower values (from 0 to 10 ºC), it was slightly underestimated in higher ones (from 20 to 35 ºC). The higher differences in minimum temperature were found in low values (an overestimation from -5 to 0 ºC) and in mid values (an underestimation from 10 to 20 ºC).

Figure 9. Comparison of frequencies by categories between observed (solid) and predicted (transparent) maximum (red) and minimum (blue) temperatures.
In regard of monthly aggregates (Table 2)  The number of stations decreases as the elevation increases (Table 3). In Spain, only 1.8% of temperature stations are over 2,000 m a.s.l. while a 37.40% are below 300 m a.s.l. This great difference, also shown in precipitation (Serrano-Notivoli et al., 2017a), necessarily affects the estimation of the variable. A slight underestimation was observed in TMIN from 1,300 to 2,000 m a.s.l. and an overestimation in TMAX from 1,500 m a.s.l. The figures showed also a good agreement at all elevation ranges 5 with the largest differences at high elevations (slight overestimation in TMAX and underestimation in TMIN). The MAE values were increased along with the elevation in TMAX from 0.65 to 1.21, while in TMIN were more constant. The ME also experimented an increase with the elevation in TMAX, but in TMIN all the values were near to zero.   Approximately, a 70% of the differences between observations and estimates both in TMAX and TMIN were lower than 1 ºC 15 (Figure 10a), which assures the feasibility of the predicted series. Also, most of the spatial and temporal anomalies (approximately 80%) were lower than 1 (Figure 10b and c). The uncertainty of the estimates showed a decreasing temporal evolution (Figure 11a) from the 1960s, while a positive trend 5 was found in the first half of the period, especially in the first 15 years, coinciding with the moment of less observations and higher distance between them (see Figure 2b, c). The values in maximum temperature were lower than minimum until the 1950s an then they were similar during a couple decades until they converge at the end of 1980s, when they diverged and the uncertainty in maximum temperature decreased at a higher rate than minimum. Likewise, the annual mean error (Figure 11b) showed a great variability until the end of 1940s with similar values in TMAX and TMIN, when they separate being TMAX 10 over TMIN in the rest of the series. Since the 1970s, an approach between the two series is shown, being nearer to zero TMAX than TMIN, which is always in negative values. maximum temperature. The uncertainty of this variable (Figure 12d) was lower than the previous one with almost all the Spanish territory below 1 ºC. The lowest values were found at northwest IP and the highest ones in coastal areas of Mallorca.  The mean annual number of summer days (Figure 13c) showed a similar spatial pattern than the maximum temperature but with a stronger effect of the orography. The highest values were found in the Guadalquivir Valley with more than 150 days, as well as in southern part of the Mediterranean coast and eastern Canary Islands. The lowest number of summer days (< 25) coincided with highest elevations of Central and Iberian Range, Pyrenees and Sierra Nevada. Also, all along the Cantabric coast showed values lower than 75 summer days. The uncertainty related to this index (Figure 13d) was higher than the frost 5 days, with a clear gradient from less than 2 days in central southern IP to more than 3 days in all northern IP, the Iberian Range, the Canary Islands and most of the inner areas of the Balearic Islands. Although the occurrence of summer days in both groups of islands is relatively high, they obtained considerable values of uncertainty due to the high variability of temperatures between stations in these small areas.

Discussion
This work introduces two important novelties in regard of high-resolution climatic analysis: i) a new methodology to reconstruct in situ temperature data series over time and space, and ii) a new daily gridded temperature dataset for Spain. 15 The method, which is based on reference values (RV) computed from nearest observations instead of reference series (RS), follows the protocol developed for precipitation reconstruction in Serrano-Notivoli et al. (2017b). However, we included the distance to the coast as a source of variation of the local models in addition to the three used in the previous work (altitude, latitude and longitude). This parameter has been proved to be important for temperature estimation (Fick and Hijmans, 2017) and lets the models be more flexible. 5 One of the most valuable keys of the approach presented here is the use of all the available climatic information, which is crucial for a high-resolution output due to the observations network density has a major influence in gridded datasets results, controlling the skill of the final estimate of the variable . This is especially true in high percentiles, with a disproportionate effect in extreme values and, therefore, in extreme indices (Hofstra et al., 2010). Hence, a method using all the information instead of longest data series seems appropriate. Indeed, there are several temperature estimation methods in 10 literature, and the choice of one or another is not a trivial matter since the gridded dataset will be built from estimates. The inference or interpolation of any climatic variable in different locations from the recording sites always implies some kind of variation in final estimates regarding the observations. The aim is, therefore, using an approach minimizing these errors.
Previous comparatives of interpolation methods do not conclude on any definitive one. For instance, Shen et al. (2001) make a review of daily interpolation methods resolving that almost all of them smooth the data, and Jarvis et al. (2001) did not found 15 large differences between them either. However, Hofstra et al. (2008) accept as more appropriate a global kriging for they work at European scale and others as Jeffrey et al. (2001) use simpler methods as thin splines.
In this work, we use GLMMs and GLMs as a general approach to the daily temperature estimation, using as support monthly estimates based on daily data of months with complete observations. This part gives consistency to all the temporal structure of the data series, as similar approaches used in previous works (e.g. Jones et al., 2012). On the other hand, the use of 20 regressions in temperature estimation is not new. For example, several works establish that regression models are more reliable than other interpolation methods for monthly temperatures (Kurtzman and Kadmon, 1999;Güler and Kara, 2014). Li et al. (2018) built a high-resolution grid for urban areas in USA using geographically weighted regressions (GWR) and reported Pearson correlations between 0.95 and 0.97, similarly to the present work. However, Hofstra et al. (2008) found that, for European scale, daily temperature regressions worked worse than other interpolation methods. 25 The present work constitutes a novelty regarding previous methodological approaches mainly due to: i) all the available information is used, being the longer series supported by shorter ones, and ii) it includes a comprehensive iterative quality control checking the spatial and temporal consistency of the data until no suspect values are detected. In addition to the developed validation process, the results in the form of spatial coherence show that the method is able to reproduce realistic climatic situations. The new approach of the quality control detects a number of suspect data in line with previous research, 30 assuring the deletion of anomalies in a spatial and a temporal dimension. Although many works dedicate little efforts to this part of the reconstruction, it is one of the most important since it will have a decisive weight in the final result. For instance, Jeffrey et al. (2001) simply remove those data exceeding a fixed threshold in regard of the residuals of the splines; and in ECA&D (Klok and Klein-Tank, 2009) the quality tests are absolute, without a comparison with neighbouring data series.
Nevertheless, a similar approach to our spatial check of quality was developed in Estévez et al. (2018) but comparing nearest stations in all their data series instead of daily individual data. They used the spatial regression test (SRT) following You et al (2008) and Hubbard and You (2005). Others such Durre et al. (2010) also applied spatial consistency checks to test if temperature data lied significantly outside their neighbours. They flagged as suspect a 0.24% of all the (worldwide) data considering temperature, precipitation, snowfall, and snow depth, a quite low figure. 5 One of the key elements in any gridded dataset creation is to provide an uncertainty value, which informs about the reliability of the data and should be a standard to all the climatic information. The uncertainty values presented in this work come from each of the individual models for each timestep and location, therefore it apprises about the changes in the reliability of the day-to-day data. Now, several datasets provide this kind of information, though it can be obtained from different methods. In our attempt to create a useful reconstruction and gridding methodology, some of the stages of the method imply arbitrary 25 decisions that could be changed based on user-defined options. For instance, we use 15 neighbouring observations to build the model but there is not an objective number. We are building these models with 4 cofactors, which need certain degrees of freedom. An increase in the neighbours could lead to a loss of local representativeness, but also a gain of statistical robustness and lower influence of anomalous data.
In respect of the quality control process, the initial thresholds were set at the beginning only to remove outliers that are 30 sometimes included in the original datasets (e.g. -999 or nonsenses as 54354 that is one of the removed values in this work) but that sometimes have a meaning to identify specific situations or local codes. There are a lot of sources and types of errors as repeated series, duplicities or coding errors, that we try to identify through a simple collection of criteria. For example, we use the correlation between series and the differences between observations and estimates to remove data based on probability thresholds that we defined based on our experience, but maybe others could be useful depending on the dataset.
The effort of this research has been mainly dedicated to create an accurate estimate of temperature using all the available information and providing a validation as complete and transparent as possible, as well as afford an uncertainty measure tailored for each value allowing the assessment of data in each day and location. 5

Data availability
The STEAD dataset is freely available in the web repository of the Spanish National Research Council (CSIC). It can be accessed through http://dx.doi.org/10.20350/digitalCSIC/8622, and cited as Serrano-Notivoli et al. (2019). The data is arranged in 12 files (daily maximum and minimum temperature estimations and their uncertainties for peninsular Spain, Balearic Islands and Canary Islands) in NetCDF format that allows an easy processing in scientific analysis software (e.g. R, Python…) and 10 GIS (list of compatible software at http://www.unidata.ucar.edu/software).

Conclusions
We present a new high-resolution daily maximum and minimum temperature dataset for Spain (STEAD). Using all the available daily temperature data (5,520 stations, representing about 1 station per 90 km2 considering the whole period), a 5 x 5 km spatial resolution grid was created. The original data were quality-controlled and the missing values were filled based on 15 the monthly estimates and using the 15 nearest observations. A serially complete dataset was obtained for all stations from 1901 to 2014 for peninsular Spain and from 1971 to 2014 for Balearic and Canary Islands. Based on this dataset, daily temperatures were calculated for each grid node, resulting in a high-resolution gridded dataset that we used to compute four daily temperature indices: mean annual absolute maximum and minimum temperatures, mean annual number of frost days and mean annual number of summer days. 20 The spatial distribution of mean annual maximum and minimum temperatures showed a strong relationship with the altitude, (decreasing along with the elevation) and with the distance to the coast, revealing a high effect of continentality with increased values of the indices in both inner mainland Spain and islands. The mean annual number of frost days was higher in northern half of peninsular Spain and in high-elevation areas of the south, while the mean number of summer days obtained the highest values at south, in the Guadalquivir Valley and southern Mediterranean coast, progressively decreasing to the north. 25 The use of all the available information in combination with a methodology based on local variations of temperature over a high-resolution grid, provided a daily dataset that is able to reproduce the high spatial and temporal variability.