Edinburgh Research Explorer Meteorological and evaluation datasets for snow modelling at 10 reference sites: description of in situ and bias-corrected reanalysis data

. This paper describes in situ meteorological forcing and evaluation data, and bias-corrected reanalysis forcing data, for cold regions’ modelling at 10 sites. The long-term datasets (one maritime, one arctic, three boreal, and


Introduction
In the past decade, several long-term datasets aimed at providing high-quality continuous meteorological and evaluation data for cold regions modelling have been published (Table 1). The importance of such datasets is twofold. Their primary value is scientific: they help us to understand key surface processes by enabling the development and evaluation of existing and new geophysical models for climate research and forecasting. The second, perhaps less obvious, value of having multiple long-term datasets is for meta-research; the smaller the studies or sample size, the less likely research findings are to be true (Ioannidis, 2005). In a snow modelling context, this is corroborated by Rutter et al. (2009) who found low correlations in performance statistics for the same snow models but in different years.
Here, we describe 10 long-term datasets (Table 1) from reference sites chosen to force and to evaluate models participating in the Earth System Model-Snow Model Intercomparison Project (ESM-SnowMIP) (Krinner et al., 2018), an international coordinated modelling effort that investigates snow schemes. ESM-SnowMIP is closely aligned with the Land Surface, Snow and Soil Moisture Model Intercomparison Project (LS3MIP; van den Hurk et al., 2016), which is a contribution to the Coupled Model Intercomparison Project Phase 6 (CMIP6), including global offline land model experiments with meteorological forcing data provided by the Global Soil Wetness Project Phase 3 (GSWP3; Kim, 2017). Two meteorological datasets are described for each site: one compiled from on-site measurements, the other derived from GSWP3. Previous iterations of SnowMIP have provided 19 site years of data from four sites in SnowMIP1 (Essery and Etchevers, 2004) and 9 site years of data from five sites in SnowMIP2 Rutter et al., 2009); ESM-SnowMIP totals 136 site years of in situ data from 10 sites and 300 site years derived from GSWP3.
Measurement details at five of the sites have been described in dedicated publications within the last 8 years. The other five sites are partially described in a number of publications which, combined, give a broad overview of the data. All of the in situ measurements and the GSWP3 data are freely available either on the web or on request, but, previously, post-processing would have been required to homogenise the in situ datasets compiled by different teams or to downscale the reanalyses. This situation causes two major issues. Firstly, different modelling teams are likely to apply different post-processing methods, leading to numerous versions of the same dataset being used for scientific studies. Secondly, although time spent identifying and processing data has never been quantified in scientific literature, it is, to the best of our knowledge, a well-known but underacknowledged time-consuming task for modellers.
The aim of this collaborative work is to provide easy-touse, quality-controlled data in a format adopted by the climate modelling community to facilitate consistency, continu-ity, and reproducibility in snow research (Menard and Essery, 2019). As such, it complies with efforts in geosciences to foster best practices on data accessibility and documentation (Gil et al., 2016). The seven teams who collated the in situ datasets have provided updates since previous publications and details about instrumentation, gaps in the original data, and methods for gap filling. Such additions are first steps towards being able to quantify uncertainty in observed data, without which "meaningful evaluation of a model is impossible" (Clark et al., 2011). As the sites have already been described in previous publications (Table 1), Sect. 2 describes the in situ meteorological and evaluation data with the user in mind, highlighting differences and similarities between the sites but also areas where both instrumentation and modification of the data through modelling may increase uncertainty. Section 3 introduces the GSWP3 data and the site-specific downscaling methods. Finally, the Discussion highlights the importance of sharing data to identify errors and to improve practices in geosciences.

Data
Broad geographic characteristics and the climate of each site, described by a snow cover classification and the Köppen climate classification from seasonal precipitation and air temperature, are shown in Table 2.
Both meteorological and evaluation data contain uncertainties and errors, partly due to instrument accuracy and calibration, gap filling of missing data, or subjective choices. Fully quantifying these uncertainties and errors is beyond the scope of this paper, but the information provided here, complemented by the Supplement which includes a list of instruments, details about missing data, and gap-filling methods, aims to highlight singular features as well as potential weaknesses in the data at each site.

Meteorological forcing data
All of the models participating in ESM-SnowMIP (Krinner et al., 2018) operate on energy balance principles, requiring incoming shortwave and longwave radiation fluxes, solid and liquid precipitation rates, air temperature, humidity, wind speed, and air pressure as forcing data. Figure 1 shows climatological monthly averages of all meteorological forcing variables except air pressure at all sites; note that "climatological" here refers to the time period for which variables are available at each site. Wind speeds provided in the datasets are measured at variable heights but are normalised for Fig. 1g to 10 m height assuming a logarithmic wind profile such that where u is wind speed measured at height z 1 , d is a displacement height (2/3 of vegetation height at BERMS and 0 at   other sites), and z 0 is a roughness length (1/10 of vegetation height at BERMS and 0.1 m at the other sites).

Differences and similarities between sites
Of the 10 sites, 5 are mountainous (Col de Porte and Weissfluhjoch in the European Alps; Reynolds Mountain East, Senator Beck and Swamp Angel in the western USA), 3 are in the Canadian boreal forest (the Boreal Ecosystem Research and Monitoring Sites, BERMS, the acronym hereafter collectively describing the Old Aspen, Old Black Spruce and Old Jack Pine sites), 1 lies above the Arctic circle (Sodankylä) and 1 is urban (Sapporo). Most sites are in artificial forest gaps or in sheltered environments. All are situated in the Northern Hemisphere.
Sodankylä is the only site without incoming solar radiation in winter (14 d) and uninterrupted daylight in spring/summer (44 d). Air temperatures drop to −35 • C in most years at Sodankylä and BERMS; the lowest temperature recorded at any site was −41 • C at Old Black Spruce (Fig. 2). There is little in the forcing data to differentiate the three boreal sites other than wind speed, which is lower at the Old Aspen site than at the other two sites. Vegetation and soil characteristics are what distinguishes the boreal sites most (Table 2, Sect. 2.2).
All mountain sites are located within a narrow 10 • latitude strip, but there is a difference of 2400 m between the lowest (CDP) and the highest (SNB) sites. Of the 10 sites, the mountain sites experience the most snowfall (WFJ, SNB, SWA, RME, and CDP in decreasing order), with Weissfluhjoch being the only site where snow falls year-round. On the other hand, Col de Porte and Reynolds Mountain East are the mountain sites with the warmest annual average temperature and can have rain in any month of the year. Sapporo has the highest annual mean (9.3 • C) and minimum (−15.8 • C) temperatures, although Col de Porte is generally warmer from December to February. Of the 10 sites, Senator Beck is the only one to have an annual mean temperature below freezing (−1 • C) (Fig. 2).
Col de Porte is situated in a dedicated experimental area (60 m ×50 m) in the south-east corner of a larger clearing (270 m× 360 m) within a spruce forest. As mentioned in Morin et al. (2012), all trees sheltering the north side of the experimental area were cut in summer 1999; mean wind speed at 10 m height was 1 m s −1 prior to the event but 1.26 m s −1 afterwards. Swamp Angel is also situated in a forest clearing, unlike the nearby exposed Senator Beck, where average wind speed in December and January is more than 4 times higher. Mann-Kendall (MK) tests show significant increasing trends in wind speed at these three sites. At Col de Porte, the trend starts in 1999 despite tree regrowth mentioned in Lejeune et al. (2019). MK shows a significant decreasing trend in wind speed at Reynolds Mountain East. It is unknown why such trends occur.

Site-specific measurement methods
The data presented here were prepared for a model intercomparison project but are expected to be used beyond this immediate purpose. As such, it is important that an understanding of possible errors, caveats, or singularities in data measurements used to force models are made clear to users, most of whom will not have visited any or all of the sites.
One such singularity concerns measurements of air temperature in Col de Porte, where the temperature sensor is generally moved weekly to keep it at a constant height above the snow. Temperature sensors are kept at fixed heights at the other sites, so it is recommended that measurement heights used in models be adjusted according to observed or simulated snow depths because this can have a significant im-pact on turbulent flux computations. Another issue with air temperature is that of instrument ventilation: depending on wind speed and solar radiation, unventilated instruments can overestimate air temperature by up to a daytime average of 2.5 • C (Georges and Kaser, 2002) or up to 10 • C for individual measurements (Huwald et al., 2009). Such errors are not corrected for in Reynolds Mountain East, the Senator Beck basin sites, or Sodankylä (temperature sensors at the other sites are artificially ventilated).
Humidity is measured using capacitive sensors at all of the sites except Weissfluhjoch. These sensors respond to changes in relative humidity (Anderson, 1995), but vapour fluxes in models are driven by specific humidity gradients. At temperatures below 0 • C, there are two possible definitions of relative humidity because of the different saturation vapour pressures over water and ice, but sensors calibrated following the World Meteorological Observation (WMO) convention of reporting relative humidity with respect to water at all temperatures are used at all of the sites except Weissfluhjoch. The consequences of this choice at three example sites (SAP, OJP, and SWA) are compared to measurements at Weissfluhjoch in Fig. 3. Although the consequences are not very significant at warmer sites such as Sapporo or Col de Porte (not shown), they are clear in data from colder sites such as Old Jack Pine and Swamp Angel, where relative humidity with respect to water is never observed much above the ice saturation point for a particular temperature. However, measurements from a chilled mirror dew point hygrometer at Weissfluhjoch show relative humidity with respect to ice can reach 100 %. In homogenising the datasets, relative humidity has been limited to a maximum of 100 % and converted to specific humidity using the site calibrations. To avoid the ambiguity in relative humidity, only specific humidity is provided in the datasets.
Snowfall measurements are notoriously difficult; they are often underestimated and prone to large errors because much is lost to sublimation or displaced by wind. Such difficulties are acknowledged by the WMO which, rather than imposing a standardised method, advises that adjustment methods be chosen depending on environmental conditions and gauge types (Goodison et al., 1998;Nitu et al., 2018). Precipitation at all sites is measured either with tipping buckets or weighing gauges, and six different methods are applied by the seven collecting teams to correct for undercatch: yearly or constant scaling factors, model simulations, matching against snow water equivalent (SWE) or replicate gauges. As weighing gauges do not provide information on the type of precipitation, further choices have to be made about how to partition snowfall and rainfall. Figure 4 shows how the different methods used at each site affect the solid fraction of precipitation as a function of air temperature; total precipitation at Swamp Angel and Senator Beck is assumed to be same because of their proximity, so only the latter is shown. Partitioning methods include using dew point (RME, SAP, SWA, SNB) or air temperature (BERMS, SOD, WFJ) functions or thresholds and ancillary data such as snow depth and albedo measurements (CDP). More information about instrumentation, correction for undercatch, and partitioning is provided in the Supplement.
Radiation measurements are also prone to errors and/or missing data because snow can settle on upward-looking sensors. In the absence of natural (wind) or forced ventilation and heating to prevent snow and frost accumulation, data are only reliable after the instruments have been wiped clean. Three of the sites (SAP, SOD, and WFJ) are located near staffed research stations, which allows frequent (daily to subweekly) and regular maintenance of all instruments (i.e. not restricted to radiometers). Col de Porte, Reynolds Mountain East, and the sites in the Senator Beck basin (SWA and SNB) are accessible from nearby research facilities, allowing regular (weekly to fortnightly) maintenance visits. Intensive monitoring associated with the BERMS project took place in the first years after the instruments were installed, but visits to the sites during winter have become sporadic. Methods for gap filling during snowfall events or while instruments are obstructed by snow vary; details for all sites are in the Supplement.

Modelling and modification to in situ data
Raw data are rarely used in snow modelling. At the very least, some time averaging of samples measured over very short intervals (seconds) is required. The longest time period used for averaging in the data occurs for air pressure for which the temporal coefficient of variation in pressure is always very small: a single value averaged for the site elevations is used when continuous measurements are not available (CDP, RME, SNB and SWA). For other variables, modelling fills, modifies, or provides consistency in a dataset.
At Col de Porte and Sapporo, where data outside of the snow season have not been published, all meteorological data are filled with downscaled (CDP) and bias-corrected (SAP) meteorological reanalysis data (publication of summer data for Col de Porte started in 2015; Lejeune et al., 2019).
Radiation and wind speed at Sodankylä are measured above the canopy, but evaluation data are measured in a nearby clearing. For consistency, the meteorological variables were modified by Essery et al. (2016) to emulate belowcanopy measurements. Sky view fraction and transmissivity were calculated from hemispherical photographs to modify shortwave radiation such that the effects of shading were accounted for. Sky view fraction was also used to account for longwave emission from nearby trees in the modification of longwave radiation. Wind speed was scaled down to 2 m height using a ratio obtained from an anemometer installed for 1 week in the clearing.
Finally, at Reynolds Mountain East, where the data starts in 1988, longwave radiation measurements started in 2002. For consistency, all longwave radiation is modelled, but measured data are used to provide information on seasonal and diurnal variations (e.g. cloud cover, turbidity, canopy and terrain exposure conditions). Details of the methods used to model longwave radiation are in Reba et al. (2011).

Evaluation data
The largest uncertainties associated with snow in climate change predictions relate to its albedo and to its insulative properties. Successive IPCC reports have noted that Earth system models (ESMs) often underestimate soil temperatures at high latitudes (Randall et al., 2007;Flato et al., 2013;Koven et al., 2013), thus having implications on assessing the permafrost carbon feedback, i.e. the amplification of surface warming from carbon emissions released by thawing permafrost. Equally, model spread over snow-albedo feedback remains a major source of uncertainty in quantifying the contribution of decreasing snow cover to climate warm-ing. Long-term datasets as presented here are therefore essential to evaluate model performance and to improve model representations of snow-soil-atmosphere interactions.

Snow depth and water equivalent
Although automatic sensors are increasingly being used to measure SWE, the most reliable methods to obtain snow mass are still manual (Pirazzini et al., 2018). They work by weighing snow mass in samplers of known volume or area, such as small cutters in snow pits or tubes to extract vertical snow cores. Nevertheless, such measurements are prone to errors: wet snow can stick to instruments, manual measurements can never be replicated in the same place because they are destructive, and subjectivity and skill do play a part; consistency can be hard to achieve if multiple people collect the data.
One way to quantify uncertainty caused by measurement errors, spatial variability, or a combination of both is to use replicate measurements of SWE and snow depth. Rootmean-square difference (RMSD) in snow depth can be calculated at all sites as all have both automatic and manual measurements. RMSD is shown in Table 3, along with maximum and minimum peak snow depth to normalise the difference. At Senator Beck, the snow pits cannot be collocated with the automated snow depth, so the spatial variability of snow is intrinsic to any comparisons between the manual and automated measurements. RMSD in SWE could only be calculated at two sites. At Col de Porte, three replicate weekly snow pits are available, two of which are used to calibrate automatic SWE measurements. Mean standard deviation is 17 kg m −2 , and, although it increases with increasing snow amount, it is generally less than 10 % of mean SWE. At Reynolds Mountain East, a snow pillow next to a snow course is visited approximately 10 to 15 times during the snow season. RMSD between the two methods is 40 kg m −2 , for annual maximum SWE ranging from 186 kg m −2 (1992) to 838 kg m −2 (1989).
Climatological averages of measured SWE and snow depth are shown in Figs. 5 and 6 respectively. Although all sites are situated in the Northern Hemisphere, and only one is above the Arctic Circle, the snow season characteristics provide a diverse range of scenarios for the evaluation and development of snow models, e.g. cold sites (e.g. SNB, SWA, SOD) with a well-defined snow season (snowpack building in autumn and winter, melting in spring/summer), warmer sites with occasional early-to mid-season snowmelt (CDP and SAP), forest sites with interception of snowfall by the canopy (BERMS), and sites with frequent summer (WFJ) and early autumn snowfall (CDP, WFJ) that can form snow cover that melts before the winter snowpack accumulates.

Albedo
Reflected shortwave radiation is measured at all sites except Sodankylä and Reynolds Mountain East, thus allowing calculations of albedo (Fig. 7). Daily effective albedos have been calculated at all sites with reflected shortwave radiation measurements using the method described in . Hourly data are rejected during snowfall if incoming shortwave radiation is less than 20 W m −2 or if reflected shortwave radiation is less than 2 W m −2 . For days with more than 5 h of data remaining after rejection, an albedo is calculated by dividing the sum of reflected shortwave radiation measurements by the sum of incoming shortwave radiation measurements. Information about errors and uncertainties in albedo due to incoming radiation measurements is in Sect. 2.1.2. Three of the sites have lower than expected albedo because of impurities in the snowpack. Frequent dust storms dirty the snow surface at Senator Beck and Swamp Angel (Painter et al., 2012); this is more noticeable during melt when other non-forested sites with comparable snow depths show higher albedo (Fig. 7b). Although not obvious from Fig. 7, model simulations suggest that the high concentrations of black carbon found in the Sapporo snowpack reduce albedo by 0.05 in winter and by 0.18 during melt (Aoki et al., 2011;Niwano et al., 2012). Figure 7b shows hysteresis at all of the sites, with snow cover of the same depth having lower albedo when melting than when accumulating.

Surface and soil temperature
Surface temperature (Fig. 8a) and soil temperatures (Fig. 8b) are available at eight of the sites. Surface temperature was calculated from measured outgoing longwave radiation assuming black body radiation except at the Senator Beck basin sites, where infrared temperature sensors are used. The pyranometers measuring outgoing longwave radiation are above the snow cover at Col de Porte, Sapporo, and Weissfluhjoch and above the canopy at BERMS.
The strong insulating effect of snow is apparent in Fig. 8bc for all sites with average winter air temperatures below 0 • C. Although freezing occurs in some individual years, daily climatological averages of winter soil temperatures, even at a shallow depth (10 cm), remain above freezing at six sites. The two exceptions are Old Jack Pine and Sodankylä, which show the highest annual ranges of temperatures, with climatologically averaged winter temperatures down to −5 • C and summer temperatures above 12 • C, and are the only two sites where soil temperatures do not plateau during the snow season.

Large-scale meteorological forcing data for reference site simulations
The Land Surface, Snow and Soil Moisture Model Intercomparison Project (LS3MIP; van den Hurk et al., 2016) contribution to CMIP6 includes global offline land model experiments with meteorological forcing data provided by the Global Soil Wetness Project Phase 3 (GSWP3; Kim, 2017). GSWP3 forcing data were generated by a run of the Global Spectral Model at T248 (approximately 50 km) resolution nudged at each pressure level with meridional-and zonalwind and air temperature from the 20th Century Reanalysis (Compo et al., 2011), followed by bias corrections described in Weedon et al. (2011) using observations of precipitation, air temperature, and surface radiation. All of the variables required for forcing land surface models are provided on a 0.5 • global grid and 3 h time steps. For ESM-SnowMIP, 1980-2010 forcing data have been extracted for GSWP3 grid cells containing reference sites and interpolated to 1 h time steps. The longer time period provides more variability for investigating the sensitivity of models to trends in forcing data. These data would also allow rerunning LS3MIP experiments at reference sites with models that do not have capabilities for global runs, but a complication is immediately apparent from the comparisons of site and grid data in Fig. 9. The maritime, boreal, and Arctic sites (SAP, OAS, OBS, OJP, SOD) are in areas with low relief and lie close to the mean elevations of their GSWP3 grid cells, but snow study sites in mid-latitude mountains (CDP, RME, SNB, SWA, WFJ) are typically established at higher elevations with longer snow seasons; most of the ESM-SnowMIP mountain sites are hundreds of metres higher than grid elevations (Fig. 9a). Consequently, GSWP3 temperatures at the mountain sites are too high (Fig. 9b), total precipitation is too low (Fig. 9c), and snowfall is much too low (Fig. 9d).
Site-specific bias corrections were therefore required and have been applied to all GSWP3 meteorological variables at all sites for model forcing. Quantile mapping was used to correct relative humidity within the 0 % to 100 % range, but only mean biases for overlapping data periods were removed from the other variables to retain the interannual and shorter variability in the large-scale forcing; the aim is to stay as close as possible to the global GSWP3 simulations without introducing gross elevation-dependent errors in site simulations. Offsets were applied to air temperature, pressure, and longwave radiation data, and multipliers were applied to precipitation, wind speed, and shortwave radiation data to avoid negative or spurious non-zero values. Site wind speeds were first normalised to the GSWP3 10 m reference height using a logarithmic profile and an assumed 0.1 cm roughness length.
Total precipitation rate P r in each time step was repartitioned into snowfall rate S f = f s P r and rainfall rate R f = (1 − f s ) P r depending on corrected air temperature T using a logistic curve: with site-dependent parameters T 0 and T 1 fitted to unadjusted GSWP3 data (Table 4). Figure 4 shows that the logistic curve fits the GSWP3 data well at all sites with the exception of Sapporo, which has the unusual feature of some precipitation at low temperatures falling as rain and a significant fraction of snowfall at temperatures above 5 • C. Anomalous features in precipitation phase partitioning based on surface observations have been attributed to the mechanisms of snow formation as cold continental air masses flow over the Sea of Japan (Jennings et al., 2018). Annual mean temperature and snowfall variations are shown in Figs. 10 and 11 for the in situ and bias-corrected GSWP3 data at all sites. Although only mean errors for the periods of overlap have been removed, there is generally good correlation between annual means of GSWP3 data and site observations for overlapping years. Table 5 gives linear trends fitted to the in situ and bias-corrected GSWP3 annual mean temperatures and snowfall. 1998-2009 observations at BERMS show decreasing temperatures and increasing snowfall after the Saskatchewan drought of the early 2000s, but there are negligible trends in the longer GSWP3 series. Sapporo also has increasing snowfall in recent years but little trend in GSWP3. Some sites show stronger warming trends in the GSWP3 data, which will be useful for investigating modelled snow responses to warming.

Discussion
A number of errors were identified in the datasets in the course of the study. Firstly, we noted that snowfall at the Old Aspen was much lower than at the Old Black Spruce and Old Jack Pine during the 2007/08 winter. It was subsequently found that a gauge malfunction in November and December 2007 was not identified at the quality control (QC) stage. Secondly, two other errors were identified by decomposing time series: trend analyses showed an increase in wind speed at the Senator Beck basin sites from October 2012 to the end of the dataset in October 2015. Both sites measure wind speed at two heights; the lower wind speed measurements were used for the first 17 years of the dataset, but the upper wind speed was accidentally used for the last 3 years. At the same sites, instrument recalibration led to a small but statistically significant increasing trend in longwave radiation. These errors were included in the preliminary ESM-SnowMIP results shown in Krinner et al. (2018); erroneous years will either be neglected in future publications, or models will be forced with the corrected datasets which are published alongside this paper (see Sect. 6).
While unfortunate, such errors are symptomatic of longterm datasets for which consistent maintenance and data collection are problematic. Firstly, by definition, long-term monitoring stations might have been installed before metadata were kept electronically (and before the word "metadata" was invented in 1983; Merriam-Webster, 2018) and when information about changes of instruments or recalibrations was in notebooks which might never have been digitised, have now been lost, or never even existed. Equally, improvements in data storage capacities mean that temporal sampling intervals are shorter than they were. For example, measurements at Reynolds Mountain East were initially made every 15 min and averaged to hourly values; currently, 10 s samples and 5 min averages are aggregated to hourly values for most variables. Such factors are known to affect the values of meteorological variables (Hupet and Vanclooster, 2001), but it is beyond the scope of this study to attempt to quantify their contributions to errors or variations in the datasets. Secondly, immediate use of the data allows instrument malfunctions to be identified quickly. For example, a power supply failure was not identified at Sodankylä for 52 d in September and October 2011 because data were being collected but not used; more frequent QC checks are now in place. Thirdly, long-term monitoring stations are susceptible to funding cycles and to changes in climate change policies by successive governments. For example, BERMS, which were established in 1994, had the most frequent site visits from 2001 to 2008, but changing priorities led to less frequent snow surveys after 2008, with only one in the 2009/10 snow season. Finally, while automated QC protocols are in place, some checks require a subjective interpretation of the data and can therefore depend on just one person to identify errors due to malfunction, snow deposition on instruments, etc. Reliance on subjectivity or local knowledge -which in some cases is advocated as mentioned in Sect. 2.1.2 to choose the best method to correct undercatch in precipitation -diminishes the likelihood of the dataset being reproducible. In a discipline like geoscience, in which uncertainties and errors are required to be quantifiable, it is important to acknowledge that subjectivity is not. The closest estimate comes from a survey in which more than 40 % of scientists in the field of earth and environment admitted to failing to have reproduced their own experiments (Baker, 2016); the figure increased to more than 60 % when trying to reproduce other researchers' experiments. Nevertheless, human errors, or more appropriately "mistakes", are not exclusive to data processing: Ménard et al. (2015) identified mistakes in the description files of the land surface model JULES that caused it to underperform considerably. Figure 10. Annual mean temperatures and fitted trends for years starting on 1 October at reference sites from GSWP3 and in situ data. Numbers show correlation (r) between GSWP3 and in situ air temperature for the n complete years of overlap.
A recent and growing push towards standardising methods for data sharing and publishing may lead to errors being identified more systematically as more people have access to data. One of the advantages of open-source software is that bugs are reported by users, and their correction is, at times, a community effort which allows software to be improved quickly (Wu et al., 2016). Sharing of geoscientific models' source code, although still a fairly recent development compared to the field of engineering software, has equally led to model improvements through the identification and fixing of bugs beyond the model development teams . One might expect a similar trend for data sharing, whereby identifying errors becomes an asset to the community because, as mentioned by Gil et al. (2016) in their proposal for a framework for best practices in the publication of data papers, "data sharing makes authors double-check their work, improving science at the first stage as well as future reuse". The more data are used, the more likely it is that mistakes, errors, and uncertainties are identified, and the less likely it will be that model results can, according to Clark et al. (2011), "at best be merely attributed to a nebulous mix of data and structural errors"; to this we can also add human errors.

Data availability
The data presented and described in this paper are available in the data repository PANGAEA: https://doi.org/10.1594/PANGAEA.897575. Figure 11. Annual snowfall and fitted trends for years starting on 1 October at reference sites from GSWP3 and in situ data. Numbers show correlation (r) between GSWP3 and in situ snowfall for the n complete years of overlap. The legend is as in Fig. 10.

Conclusion
It is hoped that one of the legacies of ESM-SnowMIP will be for the datasets presented in this paper to be used as benchmarks for model development and to facilitate improvements in snow modelling. Cold region processes have been a major source of uncertainties in previous IPCC reports. The sparsity of long-term high-quality datasets in cold regions in the past may have contributed to this if one considers that ESMs are run globally, but their snow schemes are generally evaluated at a small number of sites; the first iteration of SnowMIP (Etchevers et al., 2002) 16 years ago included only one longterm (15-year) dataset and three short-term (fewer than two snow seasons) ones. Meta-research argues that it is mislead-ing to emphasise statistically significant findings of any single team; what matters instead is the totality of the evidence (Ioannidis, 2005). It is equally misleading to draw conclusions on model performance when models are evaluated only at one or two sites for 1 or 2 years. The ease of use and availability of the datasets presented here, as well as further ESM-SnowMIP reference sites which will be located in more challenging conditions, should help model developers quantifyand reduce -model uncertainties and errors.