Reconstruction of a daily gridded snow water equivalent product for the land region above 45° N based on a ridge regression machine learning approach

Abstract. The snow water equivalent (SWE) is an important parameter of the surface hydrological and climate systems, and it has a profound impact on Arctic amplification and climate change. However, existing SWE products differ greatly from one another. In the land region above 45° N, the existing SWE products have limited time spans, limited spatial coverage, and coarse spatial resolution, which greatly limits the application of SWE data in cryosphere change and climate change studies. In this study, using a ridge regression model (RRM), a machine learning algorithm, we integrated various existing SWE products to generate a spatiotemporally seamless and high-precision RRM SWE product. The results show that it is feasible to use a ridge regression model to prepare SWE products on a global scale. We evaluated the accuracy of the RRM SWE product using hemispheric-scale snow course (HSSC) observational data and Russian snow survey data. The MAE, RMSE, R, and R² between the RRM SWE product and the observed SWE are 0.21, 25.37 mm, 0.89, and 0.79, respectively. The accuracy of the RRM SWE dataset is improved by 28%, 22%, 37%, 11%, and 11% compared with the original AMSR-E/AMSR2 SWE, ERA-Interim SWE, Global Land Data Assimilation System (GLDAS) SWE, GlobSnow SWE, and ERA5-land

terms of time series. Similarly, the GlobSnow SWE dataset also has serious gaps in its time series. Although the reanalysis SWE data have good spatial and temporal continuity and high data integrity, their accuracy is poor, with an MAE of 0.65 (Snauffer et al., 2016). The SWE data from stations and meteorological observations cannot meet the needs of hydrometeorological and climate change research, mainly because station SWE records are discontinuous in time and have many missing values. Furthermore, hydrometeorological studies often require spatiotemporally continuous gridded data (Pan et al., 2003). There are great differences among remote-sensing SWE, reanalysis SWE, data assimilation SWE, and observational SWE. For remote-sensing SWE, the spatiotemporal characteristics of different passive microwave SWE data differ significantly due to differences in sensors or retrieval algorithms (Mudryk et al., 2015a). Data assimilation SWE and reanalysis SWE data also tend to exhibit different spatiotemporal characteristics due to differences in model design, driving data, and assimilation methods (Vuyovich et al., 2014). In summary, although a variety of SWE data exist worldwide, their quality is uncertain.

Previous studies have shown that all kinds of SWE data in the Northern Hemisphere have advantages and disadvantages, and none of these data perform well in all aspects (Mortimer et al., 2020). An effective method was applied in a study by  Wang et al., 2020). Studies have shown that even the simplest multisource data average is more accurate than a single SWE product (Snauffer et al., 2018). However, the simple multisource data average cannot highlight the advantages of high-precision data, and it is easily affected by the weight given to low-precision data, which reduces the accuracy of the fused data (Mudryk et al., 2015a). Although the linear regression method can make good use of actual observational data to correct the original data, it is prone to overfitting, which causes an overall deviation (Snauffer et al., 2016). The "multiple" collocation method changes the size of the original SWE data before fusion, which easily causes data errors. The data assimilation method is sensitive to the accuracy of the input data, and it is difficult to fuse multisource data with it (Pan et al., 2015). In recent years, machine learning methods have been widely used in data fusion (Santi et al., 2021; Ntokas et al., 2021). Machine learning methods can not only integrate the advantages of multisource data but also make full use of site observational data as training samples, which makes it easy to generate SWE data products with large spatial scales and long time series (Broxton et al., 2019; Bair et al., 2018).

In summary, using a machine learning algorithm to fuse existing multisource SWE data products is an effective way to prepare SWE products with long time series and large spatial scales while retaining the advantages of the individual SWE data products. The ridge regression model is a biased estimation method specifically designed to address the problem of multicollinear data (Duzan and Shariff, 2015; Saleh et al., 2019). It has good tolerance to "ill-conditioned" data and is well suited to handling the multicollinearity among SWE data (Hoerl and Kennard, 1970b; Guilkey and Murphy, 1975). In this study, we integrated multisource SWE data products into the RRM SWE product using a ridge regression model, a machine learning algorithm. We selected ERA-Interim SWE, GLDAS SWE, GlobSnow SWE, AMSR-E/AMSR2 SWE, and ERA5-land SWE data with relatively complete time series as the original data for the production of the RRM SWE product. The missing parts of the ERA-Interim SWE, AMSR-E/AMSR2 SWE, and GlobSnow SWE data were filled by the spatiotemporal interpolation method. The HSSC dataset (Pulliainen et al., 2020) and Russian snow survey data (Bulygina et al., 2011) were used as training sample data of "true SWE", and the effect of altitude on the algorithm was also considered. Thus, we prepared a set of spatiotemporally seamless SWE datasets (RRM SWE) covering the land region above

The research region of the RRM SWE product is located in the land region north of 45° N (Fig. 1)

(Dee et al., 2011). The data provide a global assimilated numerical product of various surface and top atmospheric parameters from January 1979 to the present (https://apps.ecmwf.int/datasets/data/interim-full-daily/levtype=sfc/). We obtained the SWE dataset with a daily temporal resolution and a spatial resolution of 0.25° in NetCDF-4 format. The spatial range of the data is the land region above 45° N.

The GLDAS is a modelling system that describes global land surface information; it provides data such as global rainfall, water evaporation, surface runoff, underground runoff, soil moisture, surface snow cover distribution, temperature, and heat flow distribution (Rodell et al., 2004). This assimilation system includes data with spatial resolutions of 1°×1° and 0.25°×0.25° and temporal resolutions of 3 hours, 1 day, and 1 month. The GLDAS data are available for download from the Goddard Earth Sciences Data and Information Services Center (GES DISC). We obtained an SWE dataset with a daily temporal resolution and a spatial resolution of 0.25° in NetCDF-4 format.

To maintain consistency in the spatial and temporal resolutions of the fused data, we unified the ERA-Interim SWE data, GLDAS SWE data, GlobSnow SWE data, AMSR-E/AMSR2 SWE data, and ERA5-land SWE data to a daily temporal resolution, a spatial resolution of 0.25°, and the North Pole Lambert azimuthal equal-area projection.

i.e., the independence of training products and models. In addition, when integrating multiple SWE products, the accuracy of each SWE dataset is likely to differ. A small change in one of the SWE products involved in the training can cause a significant error in the final calculation results, whereas the ridge regression model remains accurate and stable for these "ill-conditioned" SWE data. In addition, the main advantage of this model is that SWE products with long time series and large spatial scales are easy to prepare. The principle equation of the ridge regression model is defined as follows:

The integration process of the RRM SWE product (Fig. 2) is described as follows:

1) The original ERA-Interim SWE data, GLDAS SWE data, GlobSnow SWE data, AMSR-E/AMSR2 SWE data, ERA5-land SWE data, and DEM data are preprocessed to a unified temporal resolution, spatial resolution, projection, spatial range, and unit.

2) The spatiotemporal interpolation method is used to fill in the missing data of AMSR-E/AMSR2 SWE, ERA-Interim SWE, and GlobSnow SWE in space and time. With this method, the missing AMSR-E/AMSR2 SWE data at low latitudes and the missing ERA-Interim SWE and GlobSnow SWE data in the time series are filled in.
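As an illustration of the ridge regression principle referred to before the integration steps, the following minimal sketch fits the standard ridge estimator, β = (XᵀX + λI)⁻¹Xᵀy, with an unpenalized intercept. It is not the code used to produce the RRM SWE product. The function names, the penalty value `lam`, and the toy numbers are assumptions made for illustration only, and the handling of elevation in the actual algorithm is not shown. The sketch uses only the Python standard library.

```python
"""Minimal ridge regression sketch for fusing collinear SWE products."""


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def ridge_fit(X, y, lam):
    """Return (intercept, coefficients) minimising
    sum((y - b0 - X b)^2) + lam * sum(b^2).

    X: list of rows, one column per SWE product (hypothetical layout).
    y: observed SWE at the same samples.
    """
    n, p = len(X), len(X[0])
    # Centre predictors and target so the intercept is not penalised.
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    # Penalised normal equations: (Xc^T Xc + lam I) b = Xc^T yc
    gram = [
        [sum(Xc[i][j] * Xc[i][k] for i in range(n)) + (lam if j == k else 0.0)
         for k in range(p)]
        for j in range(p)
    ]
    rhs = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    coef = solve_linear_system(gram, rhs)
    intercept = y_mean - sum(c * mu for c, mu in zip(coef, x_mean))
    return intercept, coef


def ridge_predict(intercept, coef, row):
    """Fused SWE estimate for one sample of product values."""
    return intercept + sum(c * v for c, v in zip(coef, row))


if __name__ == "__main__":
    # Toy values (mm): two strongly correlated products and station SWE.
    products = [[10.0, 12.0], [50.0, 47.0], [120.0, 125.0], [200.0, 196.0]]
    observed = [12.0, 55.0, 110.0, 210.0]
    for lam in (0.0, 100.0):
        b0, b = ridge_fit(products, observed, lam)
        print(lam, b0, b)
```

With `lam = 0` the fit reduces to ordinary least squares, whose coefficients can become large and of opposite sign when the products are nearly collinear; a positive `lam` shrinks them and keeps the system solvable even when two product columns are identical.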

According to the verification results in Fig. 3 and improve the accuracy of SWE products to some extent (better than AMSR-E/AMSR2 SWE and GLDAS SWE), the improvement of this method is still very limited. The RRM SWE product has a significant advantage over the multisource data average method, and its accuracy is much higher than that of the simple multisource data average method (Table 2).

Based on the above verification results, the accuracy of the RRM SWE is significantly improved; the RRM SWE dataset has higher accuracy than any single gridded SWE dataset, and it also fills the spatial and temporal gaps in the original SWE data.
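For readers who wish to reproduce this kind of verification, the four statistics reported for the RRM SWE product (MAE, RMSE, R, and R²) can be computed as in the sketch below. The definitions are the textbook ones and are an assumption here: any normalization applied to the reported MAE is not described in this section, and R² is taken as the square of the Pearson correlation, which is consistent with the reported pair 0.89 and 0.79. The function name is hypothetical.

```python
import math


def evaluation_metrics(predicted, observed):
    """Return MAE, RMSE, Pearson R, and R^2 for paired SWE samples.

    Both inputs are equal-length sequences in the same unit (e.g. mm).
    """
    n = len(observed)
    errors = [p - o for p, o in zip(predicted, observed)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)

    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    # Undefined (division by zero) if either series is constant.
    r = cov / math.sqrt(var_p * var_o)
    return {"MAE": mae, "RMSE": rmse, "R": r, "R2": r * r}
```

Calling `evaluation_metrics(fused_swe, station_swe)` on the test samples of each product yields values that can be compared across products in the same way as in Table 2.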

Based on the kernel density estimation method, we analyzed the density distribution of different SWE datasets (Fig. 4). The results show that the RRM SWE dataset is closer to the 1:1 line and has the highest accuracy. The RRM SWE dataset is particularly accurate for SWE estimation in the low-value region, and the test data are concentrated near the 1:1 line in the high-density region (kernel density estimation > 0.00015) (Fig. 4).

The estimation accuracy of the RRM SWE product for the high-value range of SWE (SWE > 400 mm) is lower than that for the low-value range of SWE (SWE < 400 mm) (Fig. 4). The main reason for this is that the training accuracy of the RRM model for the high-value range of SWE is affected by the small number of stations that observe the high-value range of SWE.
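The density values discussed for Fig. 4 come from a kernel density estimate over pairs of observed and estimated SWE. A minimal sketch of a two-dimensional Gaussian kernel density estimate is given below. The isotropic kernel, the shared bandwidth, and the function name are assumptions for illustration; the kernel and bandwidth actually used for Fig. 4 are not stated in this section.

```python
import math


def gaussian_kde_2d(points, query, bandwidth):
    """Estimate the density at `query` from (observed, estimated) SWE pairs.

    points:    list of (x, y) pairs, e.g. (observed_mm, estimated_mm)
    query:     (x, y) location at which to evaluate the density
    bandwidth: kernel standard deviation in the same unit (assumed isotropic)
    """
    h2 = bandwidth * bandwidth
    # Normalisation of an isotropic 2D Gaussian, averaged over all points.
    norm = 1.0 / (len(points) * 2.0 * math.pi * h2)
    qx, qy = query
    total = 0.0
    for x, y in points:
        d2 = (x - qx) ** 2 + (y - qy) ** 2
        total += math.exp(-0.5 * d2 / h2)
    return norm * total
```

Evaluating this function on a grid of (observed, estimated) values and colouring each scatter point by its density gives a plot of the same kind as Fig. 4; a product whose high-density region lies along the 1:1 line agrees well with the observations.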

However, there are still some uncertainties in the ridge regression machine learning algorithm used in this study to integrate SWE products. First, the model is strongly dependent on in situ observational data, and the fusion precision of SWE is poor in some areas with sparse observational stations. The fusion accuracy of SWE products is also affected to a certain extent because prior snow cover information is not considered. The RRM SWE product still underestimates SWE in cases of high SWE. Second, in addition to the DEM, meteorological elements, NDVI, land type, and other factors affect the SWE estimation. Unfortunately, the RRM presented here does not consider these factors as predictors, which is a limitation of the current RRM SWE product. Finally, in complex terrain with an elevation interval >1000 m, the RRM SWE product performed poorly, with an RMSE of 31.14 mm (Fig. 5).

Evaluations of the AMSR-E/AMSR2 SWE, ERA-Interim SWE, GLDAS SWE, GlobSnow SWE, and ERA5-land SWE product accuracies show that the accuracy of each SWE product varies across altitude gradients (Fig. 5). The accuracy of a single SWE product within a given altitude range differs from its overall accuracy. We consider the influence of altitude in the algorithm and make full use of the accuracy advantage of each SWE dataset at different altitude gradients.
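An evaluation by altitude of the kind summarized in Fig. 5 can be sketched as follows: validation samples are grouped into elevation intervals and the RMSE is computed within each interval. The bin width of 500 m, the sample layout, and the function name are hypothetical; the elevation intervals actually used in Fig. 5 are not given in this section.

```python
import math
from collections import defaultdict


def rmse_by_elevation(samples, bin_width_m=500.0):
    """Group samples into elevation bins and return the RMSE per bin.

    samples: iterable of (elevation_m, estimated_swe_mm, observed_swe_mm)
    Returns a dict mapping the lower edge of each bin (m) to its RMSE (mm).
    """
    squared_errors = defaultdict(list)
    for elevation, estimated, observed in samples:
        lower_edge = math.floor(elevation / bin_width_m) * bin_width_m
        squared_errors[lower_edge].append((estimated - observed) ** 2)
    return {
        lower_edge: math.sqrt(sum(values) / len(values))
        for lower_edge, values in sorted(squared_errors.items())
    }
```

Running the same function for each SWE product shows at which elevations a given product is most accurate, which is the information a fusion scheme that accounts for altitude can exploit.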

The RRM SWE product performs well in different regions: its RMSEs in Russia, Canada, and Finland are 26.39 mm, 29.31 mm, and 25.29 mm, respectively, so its performance is broadly similar across regions (Table 3). The RRM SWE product performs well not only at different altitudes but also in different regions, and it has good stability.

RRM SWE dataset is more reasonable for estimating the spatial distribution of SWE in the land region above 45° N, and the data integrity is higher. Moreover, based on the new machine learning algorithm, a variety of SWE data products covering different time periods are fused, which makes the RRM SWE dataset completely temporally and spatially continuous.

The relative difference between the RRM SWE data and GLDAS SWE data is the highest, and the relative difference is greater than 80% in most low-altitude regions (Fig. 7). The relative difference between the RRM SWE data and the GlobSnow SWE data is relatively small overall, especially in most high-latitude areas, where the relative difference is less than 10% (Fig. 7). Overall, the annual average relative differences between the RRM SWE data and the AMSR2 SWE, ERA-Interim SWE, GLDAS SWE, GlobSnow SWE, and ERA5-land SWE data are 37%, 41%, 54%, 25%, and 29%, respectively (Fig. 7).
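The exact formula behind the relative differences in Fig. 7 is not given in this section. The sketch below shows one common definition, the absolute difference divided by the reference value and expressed in percent, averaged over grid cells. Taking the RRM SWE as the reference, skipping cells with near-zero reference SWE, and the function name are all assumptions made for illustration.

```python
import math


def mean_relative_difference_percent(reference, other, min_reference=1e-6):
    """Mean of |other - reference| / reference * 100 over valid cells.

    reference: SWE values of the reference product (assumed: RRM SWE), in mm
    other:     SWE values of the compared product at the same cells, in mm
    Cells whose reference SWE is at or below `min_reference` are skipped
    to avoid dividing by zero over snow-free ground.
    """
    ratios = [
        abs(o - r) / r * 100.0
        for r, o in zip(reference, other)
        if r > min_reference
    ]
    return sum(ratios) / len(ratios) if ratios else math.nan
```

Because the result depends on which product is used as the denominator and on how snow-free cells are treated, percentages computed this way are comparable only when those choices are held fixed across products.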

Previous studies have shown that the accuracy of the SWE in the Northern Hemisphere estimated by GlobSnow SWE data is

In this study, we propose a method to fuse multisource SWE data using a ridge regression model based on machine learning. This new method was used to prepare a spatiotemporally seamless SWE dataset, the RRM SWE, by combining the original AMSR-E/AMSR2 SWE, ERA-Interim SWE, GLDAS SWE, GlobSnow SWE, and ERA5-land SWE datasets. The RRM SWE dataset covers the period 1979-2019 at a daily temporal resolution and a spatial resolution of 10 km, and its spatial range is the land region above 45° N.

The RRM SWE data product has the best accuracy, especially for the estimation of low SWE. The accuracy ranking of the SWE datasets verified by the test dataset is as follows: RRM SWE > ERA5-land SWE > GlobSnow SWE > ERA-Interim SWE > multisource data average SWE > AMSR-E/AMSR2 SWE > GLDAS SWE. The accuracy of the RRM SWE dataset is higher than that of the existing SWE products at most elevation intervals. The RRM SWE product has good performance and stability in different regions. Moreover, the RRM SWE dataset spatiotemporally fills in the missing data of the original SWE datasets.

Compared with traditional fusion methods, machine learning methods have a strong advantage. We find that a simple machine learning algorithm is not only highly efficient but also accurate in the preparation of SWE products on a global scale. Without losing the advantages of existing SWE products, this method can also make full use of station observational data to integrate the advantages of various SWE products. The model training process does not rely too much on any specific sample, and the model has a strong generalization ability. In addition, the influence of altitude on the preparation scheme is considered in detail in the model. The SWE datasets prepared by traditional methods have a spatial resolution of only 25 km, whereas this new method yields an SWE dataset with a higher spatial resolution of 10 km.

We propose that the RRM SWE dataset preparation scheme has good continuity and can prepare real-time and high-quality SWE datasets in the land region above 45° N. In addition, the new method proposed in this paper has the advantages of simplicity and high precision in preparing large-scale SWE datasets and can be easily extended to the preparation of other snow datasets. This dataset is an important supplement to the SWE database for the land region above 45° N and is expected to provide data support for Arctic cryosphere studies and global climate change studies.

DS and HL designed the study and wrote the manuscript; JW, XH, and TC contributed to the discussions, edits, and revisions. DS and WJ compiled the model code.

The authors declare that they have no conflicts of interest.