Daily soil moisture mapping at 1 km resolution based on SMAP data 1 for desertification areas in Northern China

: Land surface soil moisture (SM) plays a critical role in hydrological processes and terrestrial ecosystems in 11 desertification areas. Passive microwave remote sensing products such as the Soil Moisture Active Passive (SMAP) have been 12 shown to monitor surface soil water well. However, the coarse spatial resolution and lack of full coverage of these products 13 greatly limit their application in areas undergoing desertification. In order to overcome these limitations, a combination of 14 multiple machine learning methods, including multiple linear regression (MLR), support vector regression (SVR), artificial 15 neural networks (ANN), random forest (RF) and extreme gradient boosting (XGB), have been applied to downscale the 36 km 16 SMAP SM products and produce higher spatial-resolution SM data based on related surface variables, such as vegetation index 17 and surface temperature. Desertification areas in Northern China, which are sensitive to SM, were selected as the study area, 18 and the downscaled SM with a resolution of 1 km on a daily scale from 2015 to 2020 was produced. The results show a good 19 performance compared with in situ observed SM data, with an average unbiased root mean square error value of 0.057 m 3 /m 3 . 20 In addition, their time series are also consistent with precipitation and perform better than common gridded SM products. The 21 data can be used to assess soil drought and provide a reference for reversing desertification in the study area. This dataset is 22 freely available at https://doi.org/10.6084/M9.FIGSHARE.16430478.V5


Introduction 25
Surface soil moisture (SM) plays a very important role in water-energy cycle processes (Sandholt et  The accurate acquisition of SM is valuable to ecological conservation and revegetation in arid areas of Northern China. 29 In the past, SM data were mainly obtained through ground measurements or the assimilation of products based on land 30 surface models such as the Global Land Data Assimilation System (GLDAS) (Fang and Lakshmi, 2014; Zawadzki and Kędzior,31 2016; Liu et al., 2021). Although most accurate SM data at different soil depths can be obtained, field measurements and in 32 situ observations are limited due to the high cost and labor intensity involved in their collection and are generally not 33 representative of soil water status over larger areas (Rahimzadeh-Bajgiran et al., 2013;Zhao et al., 2018;Bai et al., 2019). 34 With the development of remote sensing technologies, continuous SM estimates can be generated at regional and global scales 35 (Peng et al., 2021). Compared to ground measurements, remote sensing products can provide good spatial and temporal 36 coverage of SM with a relatively low cost to the user (Zeng et al., 2015;Zhao et al., 2018;Meng et al., 2020). Data assimilation 37 SM products largely depend on the accuracy of the land surface model and the original inputs (Zawadzki and Kędzior, 2016). 38 They generally have low accuracy in areas where ground measurements are scarce, which is a problem that can be overcome 39 with remote sensing. 40 At present, there are many remotely sensed SM data, some of which are from microwave remote sensing satellites,  Table 1 below. Studies have compared these products and found that SMAP SM 45 products have higher accuracy and robustness than other remotely sensed SM products (Liu et al., 2019;Wang et al., 2021). 46 al., 2020). However, due to their coarse spatial resolution, microwave SM products have limited applicability to small-scale 49 areas. Compared to microwave sensors, optical satellites such as MODIS and Landsat have a finer spatial resolution. Some 50 observations generated from optical satellites provide good information about SM, such as vegetation index (VI) (usually 51 Normalized Difference Vegetation Index (NDVI) or Enhanced Vegetation Index (EVI)) and land surface temperature (LST) 52 MODIS provides continuous time-series predictors for important parameters, such as vegetation index and surface 120 temperature. This paper used MODIS products MOD09A1, MOD11A1, MOD13A2, MOD15A2H and MCD43D58 (Table 2). 121 The 1-km daily LST was provided by MOD11A1, and the 1-km 16-day EVI and NDVI was provided by MOD13A2. 122 MOD15A2H provided 8-day Leaf Area Index (LAI) with a spatial resolution of 500 m. MCD43D58 provided daily albedo 123 data with a spatial resolution of 30 arc second (~1 000 m). Some soil wetness related indexes, including NDWI, NSDSI, and 124 Land Surface Water Index (LSWI), were produced by MOD09A1. Their formulas are: 125 = ( 4 − 2 )/( 4 + 2 ) (1) 126 = ( 2 − 6 )/( 2 + 6 ) (2) 127 = ( 6 − 7 )/ 6 (3) 128 where B 2, B 4, B 6 and B 7 represent the MOD09A1 surface reflectance of the 2nd, 4th, 6th and 7th bands, respectively. 129 These MODIS products are available from NASA Earthdata (https://search.earthdata.nasa.gov). All data were obtained 130 from 2015 to 2020 and processed to a spatial resolution of 1,000 meters. 131

Topographic data 132
Topographic factors are strongly related to SM, including elevation, slope and aspect (Bai et  2015 were used in this study (Fig. 1). The Babao Monitoring Network covers 40 sites and provides hourly SM for the surface 146 layer (4 cm, 10 cm and 20 cm) from 2013 to 2017; 29 of the available sites have data after 2015 and their observations of the 147 first layer (4 cm) were used in this study (Fig. 1). To compare with the simulated results, they were all processed into daily 148 time series. 149

Precipitation and temperature data 150
The daily precipitation and temperature data were acquired from 131 meteorological stations from the China 151 Meteorological Data Service Centre (http://data.cma.cn). The spatial locations of these meteorological stations are shown in 152 Fig. 1. The average annual precipitation of most stations from 2015 to 2020 is less than 600 mm, and gradually decreases from 153 northwest to southeast (Fig. 1). 154

Other gridded SM datasets 156
Some other gridded SM data were used to compare the simulation results (Table 3). The SMAP Level-2 product 157 (L2_SM_SP) merges SMAP radiometer and processed Sentinel-1A/1B SAR observations. It is available at 3 km and 1 km 158 resolutions. The Global Change Observation Mission Water (GCOM-W1) AMSR2 product is produced by the Japan 159 Aerospace Exploration Agency (JAXA), and SM data at a 0.1° spatial resolution were selected for this study. The Copernicus 160 Climate Change Service (C3S) produces a global SM gridded dataset from 1978 to present from satellite sensors such as SMOS, 161 AMSR2 and SMAP. It has a spatial resolution of 0.25 degrees and offers three types of products: active, passive and combined. 162 The combined product that we used in this study is generated by merging the active and passive products. The fifth generation 163 reanalysis dataset (ERA5) produced by European Centre for Medium-Range Weather Forecasts (ECMWF) provides several 164 variables including volumetric soil water over several decades. In the dataset, the soil is divided into four layers and the depth 165 of the top layer is 0-7 cm. In this study, we downloaded the hourly volumetric soil water data of the top layer and processed 166 them as daily averages. Famine Early Warning Systems Network (FEWS NET) Land Data Assimilation System (FLDAS) 167 provides daily SM at a 0.01° spatial resolution over the Central Asia region (30-100° E, 21-56° N), which covers part of our 168 study area. The product consists of four layers of SM, and the SM at the top layer (0-10 cm) was selected for this study. 169

Downscaling approach based on multi-machine learning 171
According to the selected variable indicators (mainly including topographic data, soil data and some MODIS products) 172 and machine learning methods, we constructed a framework to downscale SMAP SM based on multiple machine learning 173 methods (Fig. 2). The XGB algorithm is a boosting-type ensemble of multiple CART decision trees (Chen and Guestrin, 2016). The 192 predicted result of the boosting-type tree ensemble model can be expressed as follows: 193 where is the space of regression tree, K is the total number of trees, which means the model uses K additive functions, 195 ( ) is the weighted score of the -th tree on -th input data ( ). 196 XGB adopts a regularized learning objective to optimize the simulation results. 197 where is the loss function, N is the total number of input, Ω is the regularization term to penalize the model complexity 199 and prevent overfitting.

The construction of 16-day regression model 204
The downscaling process is shown in Fig. 2. First, all data need to be preprocessed. Daily LST data are likely to be 205 affected by the cloud, so we performed quality control to MOD11A1 products using its quality control (QC) band and choose 206 high-quality cloud-free pixels. All selected variables, including LST, Albedo, LAI, NDWI, LSWI, NSDSI, NDVI, EVI, DEM, 207 slope, aspect, sand, silt and clay, were aggregated into a resolution of 1 km with a geotiff format. These variables were further 208 resampled to the spatial resolution of the SMAP SM data (36 km) using the nearest neighbor interpolation method. 209 Second, valid samples were obtained and split. Since it is severely affected by noise (such as clouds), MOD11A1 only 210 provides daily valid clear-sky LST values onto grids. In addition, each SMAP image has a narrow coverage and provides only 211 a small number of valid pixels per day. It means that there may be few or no valid samples if only the data of a certain day are 212 selected to build the regression. The variables from MOD13A2 and MOD15A2H are the best composite within 16 days and 8 213 days, respectively. To overcome the limitation, we chose to build regression models within 16 day periods (the lowest temporal 214 resolution from these dynamic variables). All valid data (including training and test sets) within 16 days were used as the 215 samples in the regression model. For instance, for NDVI and EVI on January 1, 2020, which are composite results from January 216 1 to January 15, the valid data during the period were used as samples. The number of valid samples for surface variables and 217 SMAP SM for each period in 2015-2020 is shown in Fig. 3. The day of year (DOY) is used to represent the corresponding 218 period. Since limited available SMAP SM grid data, there may be few valid samples we can obtain during cold seasons. The 219 valid samples for each period were divided into training and test sets, each accounting for 50% of the total number of samples. 220 In this study, stratified random sampling based on sampling date during the 16-day period was employed to split the training 221 and test sets. Moreover, to avoid excessively inconsistent training and test sets, the Kolmogorov-Smirnov (KS) test is adopted 222 to test the distribution consistency of them (Kovalev and Utkin, 2020). If the p-value of the KS test result is less than or equal 223 to 0.05, stratified random sampling is performed again, and until the requirements are met. 224 where is the SMAP SM, is the corresponding SM predicted by the regression model, Cov represents the covariance 241 function, Var is the variance, and is the number of valid samples for or . 242 The RMSE is used as the evaluation metric for hyperparameter turning. The tuning results of hyperparameters are shown 243 in Tables S2 and S3. According to the optimal hyperparameter, the corresponding model can be constructed. 244

Prediction of 1-km daily SM product 245
The accuracy of the five regression models is compared using the average RMSE of training and test sets. This average 246 RMSE can be expressed as: 247 where and are the RMSE of training and test sets for these models, respectively. 249 The regression model with the smallest ̅̅̅̅̅̅̅̅ was selected as the optimal model. Furthermore, we used the selected 250 optimal model and these surface variables with a resolution of 1 km within 16 days to simulate daily SM at 1 km resolution 251 on the corresponding date. Taking 16 days as a period, all daily SM data with a spatial resolution of 1 km from 2015 to 2020 252 were predicted. In addition, to obtain a more complete time series of SM data, we used the model of the previous period when 253 the number of valid samples was less than 100. 254

Evaluation method 255
The in situ SM measurements were used to validate the downscaled results. In addition to R and RMSE, bias and unbiased 256 RMSE (ubRMSE) were also used for accuracy evaluation. Bias indicates the overall level of overestimation or underestimation 257 of simulation results. ubRMSE can eliminate the influence of deviation. They were calculated according to: 258 where is the in situ observed SM, is the downscaled SM of the corresponding grid, and is the number of valid 261 samples for or . 262 The results of the test set show that XGB, RF and SVR perform better than ANN and MLR. Moreover, the evaluation accuracy was generally better in the cold season, when sample sizes were smaller. 280

Comparison with the in situ data and precipitation 288
The downscaled 1 km gridded SM were compared with the in situ SM observations of the Maqu Network and Babao 289 Network (Fig. 6). Due to the difference in sensors, soil depth and measurement scale (point observation in case of the in situ 290 measured SM and 1 km grid for the downscaled SM), there is a certain deviation between in situ observation data and the 291 downscaled gridded SM data. The downscaled SM of most sites at the Maqu Network and Babao Network are highly correlated 292 with the in situ measured SM (R>0.6). In the Maqu Network, the ubRMSEs with an average of 0.057 m 3 /m 3 are all less than 293 0.090 m 3 /m 3 , and the bias ranges from -0.10 to 0.22 m 3 /m 3 . In the Babao Network, the average ubRMSE of all sites is 0.081 294 m 3 /m 3 , and some of them exceed 0.1 m 3 /m 3 . In addition, their bias ranges from -0.07 to 0.45 m 3 /m 3 . It means that the validation 295 accuracy of Babao Network is generally lower than Maqu Network. That may be mainly because the measured soil depth at 296 the Babao Network is 4 cm, which means that there could be a systematic error between the datasets. Therefore, the validation 297 accuracy should mainly refer to the evaluation accuracy of Maqu network.

300
To better understand the reason for these poor results, the scatter plots comparing the two sets of data were drawn. Figure  301 7 shows the results of the 19 sites of the Maqu Network. All four statistical metrics, namely, R, RMSE, ubRMSE and bias 302 were calculated, and their fitting line of the scatter was also plotted. Not surprisingly, the relationship is generally improved 303 where there are more valid data. It means that the validation effect of in situ observations is affected by the amount of data. 304 The same conclusion can be drawn through 29 sites at the Babao Network (Fig. S1). 305

307
All SM products are compared with in situ SM. Figure 8 shows a significantly higher correlation between the downscaled 308 SM and in situ SM of the Maqu Network. The median ubRMSE of the downscaled SM is the smallest, and its RMSE is second 309 only to the C3S (0.25°) product. The bias of the downscaled SM is higher than that of some products, even higher than the 310 original SMAP L3 (36 km) data. Almost the same results can be obtained from in situ observations of Babao Network (Fig.  311   S2). The difference is that the bias of the downscaled SM is lower than the result of SMAP L3 (36 km). Compared with the 312 RF-based and the XGB-based downscaled SMs, the downscaled SM with multiple machine learning approaches performed 313 better, especially R and ubRMSE. 314

316
The observed SM of sites with a greater number of observed data were compared with these gridded SM data at different 317 resolutions and precipitation. Figure 9 shows

343
The annual average SM was also calculated (Fig. S3). Compared with the monthly average SM, the annual average SM 344 changed significantly less. Further, we compared the spatial patterns of the downscaled SM with the gridded SM products with 345 different resolutions. Figure 11 shows the daily average SM of these products from 2015 to 2020. The spatial patterns of the 346 downscaled SM and 36 km SMAP SM are basically consistent, but the downscaled data show better details in some areas such 347 as near rivers. The overall values of GCOMW SM are relatively small, and exhibit some obvious errors in some areas. For 348 example, SM in the Tarim Basin is higher than in the surrounding area, which is completely inconsistent with other SM data. 349 The spatial pattern of the C3S SM is close to the downscaled SM and the 36 km SMAP SM, but some details are not presented. 350 For example, SM in the Hetao Plain along the Yellow River is much higher than that in its surrounding area, which can be 351 found in the downscaled SM and the SMAP SM, but not in the C3S SM. There are obvious errors in the results of ERA5. The 352 average SM is significantly overestimated in the southern part of the study area, and underestimated in some areas in the 353 northern of the study area. The FLDAS SM has high resolution, and its overall spatial pattern is relatively consistent with the 354 downscaled SM and 36 km SMAP SM. The difference is that the FLDAS SM is significantly larger in higher elevation areas 355 of the west than in other regions, which is quite different from other products. This suggests that the FLDAS SM may be 356 overestimated in these regions. In addition, FLDAS SM does not show wetter soil along the river. The spatial patterns of the 357 RF-based and XGB-based downscaled SMs are both close to that of the downscaled SM with multiple machine learning 358 approaches, however, the the maximum SM based on RF is smaller than the results based on XGB and multi-model

364
To better demonstrate the differences in SM, a case of the Mu Us Desert was selected (Fig. 12). The Mu Us Desert is 365 located in a semi-arid area with annual average precipitation of generally less than 400 mm, decreasing gradually from 366 southeast to northwest. The main types of land cover are grassland and sandy land, and the salinization is serious in a few 367 areas. Desertification has been severe for a long time in the past but significantly reversed with artificial afforestation in recent 368 years. 369 SM shows an overall trend of gradual decrease from the southeast to the northwest (Fig. 12 (b)~(g)), which is consistent 370 with the distribution of precipitation. The average SM of the same location changes little from year to year. Overall, it is 371 relatively large in 2018 and relatively small in 2015, which is also roughly consistent with annual precipitation patterns. Land 372 cover types also have a certain influence on the spatial difference of SM. The northwestern portion of the Mu Us Desert is 373 mainly grassland, which is strongly dependent on precipitation (Fig. 12 (h)). The southeastern area is mainly cultivated land 374 and is less affected by precipitation as it relies on pumping groundwater rather than natural precipitation (Fig. 12 (j)).

Variable importance assessment 381
The selection of variables is an important step of a nonlinear regression model. The importance analysis of the variables 382 carried out for this research found that a larger number of variables can improve the regression effect of these models. Due to 383 the variables obtained in this study come from multiple data sources, their preprocessing may affect the construction of 384 regression models and their relationship with SM. Moreover, variables collinearity and hyperparameters also affects the 385 importance relationship of variables. This average result of the 128 regression models can give a relatively reasonable result. 386 Figure 13 shows the average importance scores of each variable for the RF and XGB models across all available days. The 387 importance scores of different variables in the RF based model and the XGB based model are similar. LST and surface albedo 388 both affect surface energy exchange and partition. LST is an important variable in both models, which is consistent with the 389 study of Zhao et al. (2018). NSDSI is the most sensitive soil moisture index compared to LSWI and NDWI, which was 390 demonstrated in Yue et al. (2019). Topographical factors also exhibite importance on SM, especially elevation. NDVI is more 391 sensitive to vagetation index than EVI and LAI. However, their effect was smaller than that of soil moisture index. It indicates 392 that the SM inversion method based only on LST and VI is inadequate. The influence of soil texture (sand, silt and clay) is 393 relatively weak. 394 The standard deviation of the importance scores of each variable is shown with error bars in Fig. 13. Its changes are 395 mainly affected by the samples used in the regression model and the temporal variations in surface variables. For static 396 variables such as soil structure and topographic factors, the changes in their importance scores mainly depend on the number 397 and the location of the samples. Figure 13 also shows that their standard deviation is relatively small. Compared with static 398 variables, the standard deviation of the importance scores of dynamic variables is significantly larger, especially for LST and 399 LAI. This indicates that it is not reliable to construct a single regression model for a long time series. 400 In general, the variable importance analysis suggests that the selected variables are suitable for the construction of the 401 regression model. Moreover, choosing 16 days as a time period to build a regression model benefits from obtaining a sufficient 402 number of samples, especially since the surface variables were found still unchanged during these intervals. 403  . This study showed that the simulation results of ANN have greater uncertainty, and the accuracy is 410 generally worse than that of RF (Figs. 4 and 5). The RF algorithm shows a good simulation ability, but in comparison, the 411 XGB algorithm also has a corresponding effect or even higher. We also compared our simulation results combining multiple 412 models and the RF-based simulation results. The results showed that the combined products have higher accuracy than the RF-413 based products, which is mainly reflected in the relatively more reasonable simulation of peaks and valleys (Table 4 and Fig  414   11). MLR has the worst effect compared to the other four models, which is likely to be affected by variable collinearity. In 415 fact, many algorithms, especially linear ones, exhibit more or less poor robustness when there is high collinearity between overfitting. In contrast, the difference between training and test accuracy of the RF model is smaller. It showed better stability 425 than XGB at several periods (Figs. 4 and 5). The training accuracy of MLR and SVR has a small difference from the test 426 accuracy, but the overall accuracy is obviously lower (Table 4), which might be due to variable collinearity. Some studies have 427 also proved that SVR may also perform better than ensemble algorithms (Yu et al., 2012;Fan et al., 2018). The fitting effect 428 of ANN varies greatly in different periods, indicating that its generalization is lower than other models (Piotrowski and 429 Napiorkowski, 2013). In general, the XGB and RF models provide the best combination of prediction accuracy and stability.

Uncertainty and Prospects 444
While this study greatly improved the spatial resolution of SM data from 2015-2020 in the desertification areas of North 445 China by downscaling SMAP SM products, it still presents some shortcomings. Due to the influence of snow, ice and frozen 446 ground, the number of valid SMAP pixels during cold seasons is very small, that makes the number of available samples is 447 limited. With a period of 16-day, the number of valid samples may still be less than 100 during cold seasons (Fig. 3). The 448 sample size affects the simulation accuracy. Figures 4 and 5 show that there are seasonal variations in R and RMSE, which is 449 likely to be affected by the sample size (Figure 3). In general, a larger sample size often mean more efficient sampling and 450 more reliable results, but not necessarily better evaluation accuracy. Likewise, insufficient samples can sometimes have good 451 evaluation accuracy, although the results are less reliable. In order to reduce the error caused by insufficient samples, this study 452 replaced the periods with less than 100 samples with the model of the previous periods. For this reason, the simulation results 453 sometimes perform poorly during cold seasons (Fig. 9). In addition, the upscaling (from 1 km to 36 km resolution) of surface 454 variables also has a certain impact on the accuracy of the model. 455 Our products have a good correlation with the in situ observation data. Although there are deviations, the results are 456 relatively reasonable. However, in situ observed SM data are limited in their representation of the entire 1 km ×1 km grid, 457 which adds to the uncertainty of our product. Figure 6 shows that the evaluation accuracy of different points varies greatly. 458 Through the investigation of past literature and our study, it is found that the relationship between in situ observation data and relatively limited, and their spatial distribution is concentrated in a certain part of of the study area, which is not representative. 463 It increases the uncertainty of the simulation results. In order to verify the accuracy of the data as much as possible, this study 464 also selected several sets of gridded SM products for comparison. The results showed that our products perform better in 465 temporal variability and spatial patterns (Figs. 9 and 11). 466

Data availability 467
The codes mainly used in this paper mainly includes sample selection, the building of the optimal regression model and 468 the result prediction. The downscaled daily SM dataset at 1 km spatial resolution is available at 469 https://doi.org/10.6084/M9.FIGSHARE.16430478.V5 (Rao et al., 2021). The data maps are all provided in Geotiff format, 470 and the value has expanded 10, 000 times to make them easier to store. The filenames reflect the production date in Julian Day 471 format. 472

Conclusions 473
In this study, an approach was proposed for downscaling 36 km SMAP SM products using MODIS optical products and 474 other surface variables (mainly topographic data and soil data) based on multiple machine learning methods. Overall, the 475 regression performance of the five methods is, in order: XGB>RF>SVR>ANN>MLR. Compared with MLR, SVR and ANN, 476 XGB and RF have much better accuracy, and they were used in combination to produce daily 1 km downscaled SM in a period 477 of 16 days. The validation shows that the downscaled SM are highly related to most in situ measured SM. The ubRMSE with 478 an average of 0.057 m 3 /m 3 is generally less than 0.090 m 3 /m 3 at the Maqu Network. Time series of SM data from in situ 479 observation sites are also compared. The results show that the downscaled SMs are highly related to SMAP SMs, and provide 480 a more complete time series and match better with the in situ measured SM. Compared with some commonly used gridded 481 SM products such as SMAP L2 (l km or 3 km), GCOMW/ASMR2, C3S, ERA5 and FLDAS SMs, the downscaled SM data 482 not only have higher spatial resolution, but also have a more reliable accuracy whether in time series or spatial distribution. 483 The maps of downscaled SM show larger values from June to September, which coincides with the vegetation growing 484 season. The difference in annual mean SM is small. Spatially, SM is relatively large in Qinghai Province and in northeastern 485 Inner Mongolia, especially in summer. In arid areas such as the Tarim Basin, SM is relatively small throughout the year. 486 Moreover, precipitation and temperature both have a great influence on SM in the study area. Precipitation has a greater impact 487 on SM in the eastern part of the study area, while the effect of temperature appears to be more pronounced in the west. 488 This approach makes it possible to more accurately assess the soil moisture status in the study area. The results can support 489 regional agricultural planting and revegetation efforts and can be applied to limit desertification in other areas in the future.