Daily soil moisture mapping at 1 km resolution based on SMAP data 1 for areas affected by desertification in Northern China 2

Pinzeng Rao, Yicheng Wang, Fang Wang, Yang Liu, Xiaoya Wang, Zhu Wang 3 State Key Laboratory of Hydroscience and Engineering, Department of Hydraulic Engineering, Tsinghua University, Beijing 4 100084, China. 5 State Key Laboratory of Simulation and Regulation of Water Cycle in River Basin, China Institute of Water Resources and 6 Hydropower Research, Beijing 100038, China. 7 State Key Laboratory of Remote Sensing Science, Faculty of Geographical Science, Beijing Normal University, Beijing 8 100875, China. 9 *Correspondence to: Fang Wang (657563390@qq.com) 10 Abstract: Land surface soil moisture (SM) plays a critical role in hydrological processes and terrestrial ecosystems in areas 11 affected by desertification. Passive microwave remote sensing products such as the Soil Moisture Active Passive (SMAP) have 12 been shown to monitor surface soil water well. However, the coarse spatial resolution and lack of full coverage of these 13 products greatly limit their application in areas undergoing desertification. In order to overcome these limitations, a 14 combination of multiple machine learning methods, including multiple linear regression (MLR), support vector regression 15 (SVR), artificial neural networks (ANN), random forest (RF) and extreme gradient boosting (XGB), have been applied to 16 downscale the 36 km SMAP SM products and produce higher spatial-resolution SM data based on related surface variables, 17 such as vegetation index and surface temperature. Areas affected by desertification in Northern China, which are very sensitive 18 to SM, were selected as the study area, and the downscaled SM with a resolution of 1 km on a daily scale from 2015 to 2020 19 was produced. The results show a good performance compared with in situ observed SM data, with an average unbiased root 20 mean square error value of 0.049 m/m. In addition, their time series are also consistent with precipitation and perform better 21 than some common gridded SM products. The data can be used to assess soil drought and provide a reference for reversing 22 desertification in the study area. This dataset is freely available at https://doi.org/10.6084/M9.FIGSHARE.16430478.V5 (Rao 23 et al., 2021). 24


Introduction 26
Surface soil moisture (SM) plays a very important role in water-energy cycle processes (Sandholt et   composites of the Level-2 products and are the most commonly used for applications. The Level-3 products are available in 103 three spatial resolutions: 36 km passive, 9 km active-passive, and 3 km active (O'Neill et al., 2010). Following the 104 malfunctioning of its radar in 2015, SMAP radar data were replaced with those of Sentinel-1, limiting the application of active 105 and active-passive products. 106 The SMAP Level-3 passive daily SM product (L3_SM_P, Version 6) with a grid resolution of 36 km has been produced 107 since March 31, 2015. Zeng et al. (2015) showed that most of remotely sensed SM products were slightly better during daytime 108 than during nighttime, and the same conclusion for the SMAP SM product was confirmed by Zhao et al. (2018). Therefore, 109 the SMAP Level-3 SM product with the descending overpass time of 6:00 AM was used in this study. The data were 110 downloaded from NASA Earthdata (https://search.earthdata.nasa.gov). 111

MODIS products 112
MODIS provides continuous time-series predictors for important parameters, such as vegetation index and surface 113 temperature. This paper used MODIS products MOD09A1, MOD11A1, MOD13A2, MOD15A2H and MCD43D58 (Table 2). 114 The soil wetness related indexes, including NDWI, LSWI and NSDSI, were produced using bands of the MOD09A1 product. 115 Their formulas are: 116 = ( 4 − 2 )/( 4 + 2 ) (1) 117 = ( 2 − 6 )/( 2 + 6 ) (2) 118 = ( 6 − 7 )/ 6 (3) 119 where B 2, B 4, B 6 and B 7 represent the MOD09A1 surface reflectance of the 2nd, 4th, 6th and 7th bands, respectively. 120 These MODIS products are available from NASA Earthdata (https://search.earthdata.nasa.gov), and all data were 121 obtained from 2015 to 2020. 122 The in situ SM measurements were collected from the data provided by the Maqu Monitoring Network (Zhang et al., 133 2020) and the Babao Monitoring Network (Kang et al., 2017). The Maqu Monitoring Network covers 26 sites and provides 134 SM for the surface layer (0-5 cm) at 15-minute intervals from 2009 to 2019; 19 of the available sites which have data after 135 2015 were used in this study (Fig. 1). The Babao Monitoring Network covers 40 sites and provides hourly SM for the surface 136 layer (4 cm, 10 cm and 20 cm) from 2013 to 2017; 29 of the available sites have data after 2015 and were used in this study 137 ( Fig. 1). To compare with the simulated results, they were all processed into daily time series. 138

Precipitation data 139
The precipitation data were acquired from 131 meteorological stations from the China Meteorological Data Service 140 Centre (http://data.cma.cn). The spatial locations of these meteorological stations are shown in Fig. 1. 141

Other gridded SM datasets 143
Some other gridded SM data were used to compare the simulation results (Table 3). The SMAP Level-2 product 144 (L2_SM_SP) merges SMAP radiometer and processed Sentinel-1A/1B SAR observations. It is available at 3 km and 1 km 145 resolutions. The Global Change Observation Mission Water (GCOM-W1) AMSR2 product is produced by the Japan 146 Aerospace Exploration Agency (JAXA), and SM data at a 0.1° spatial resolution were selected for this study. The Copernicus 147 Climate Change Service (C3S) produces a global SM gridded dataset from 1978 to present from satellite sensors such as SMOS, 148 AMSR2 and SMAP. It has a spatial resolution of 0.25 degrees and offers three types of products: active, passive and combined. 149 The combined product that we used in this study is generated by merging the active and passive products. The fifth generation 150 ECMWF reanalysis dataset (ERA5) provides several variables including volumetric soil water over several decades. In the 151 dataset, the soil is divided into four layers and the depth of the top layer is 0-7 cm. In this study, we downloaded the hourly 152 volumetric soil water data of the top layer and processed them as daily averages. Famine Early Warning Systems Network 153 (FEWS NET) Land Data Assimilation System (FLDAS) provides daily SM at a 0.01° spatial resolution over the Central Asia 154 region (30-100° E, 21-56° N), which covers part of our study area. The product consists of four layers of SM, and the SM at 155 the top layer (0-10 cm) was selected for this study. 156

Downscaling approach based on multi-machine learning 158
According to the selected variable indicators (mainly including topographic data, soil data and some MODIS products) 159 and machine learning methods, we constructed a framework to downscale SMAP SM based on multiple machine learning 160 methods (Fig. 2).

Downscaling process 181
The downscaling process is shown in Fig. 2. First, due to the difference in spatial resolution and data format, all required 182 data were pre processed. All selected variables, including LST, Albedo, LAI, NDWI, LSWI, NSDSI, NDVI, EVI, DEM, slope, 183 aspect, sand, silt and clay, were aggregated into a resolution of 1 km with a geotiff format. These variables were further 184 resampled to the spatial resolution of the SMAP SM data (36 km) using the nearest neighbor interpolation method. The 185 regression model was then defined according to the selected machine learning method: 186

Evaluation method 207
The correlation coefficient (R) and the root mean square error (RMSE) were used to evaluate the accuracy of the 208 regression model based on these machine learning methods (MLR, SVR, ANN, RF and XGB). They are calculated as: 209 where is the SMAP SM, is the corresponding SM predicted by the regression model, Cov represents the covariance 212 function, Var is the variance, and is the number of valid samples for or . 213 The regression model with the smallest average RMSE of training and test datasets was selected as the optimal model. 214 where and are the RMSE of the training set and test set for these models, respectively. 216 We used the selected optimal model with these surface variables with a resolution of 1 km within 16 days to simulate SM 217 at 1 km resolution on the corresponding date. Taking 16 days as a period, we predicted all daily SM data with a spatial 218 resolution of 1 km from 2015 to 2020. In addition, to obtain a more complete time series of SM data, we used the model of the 219 previous period when the number of valid samples was less than 100. 220 The in situ SM measurements were used to validate the downscaled results. In addition to R and RMSE, unbiased RMSE 221 (ubRMSE) and bias were also calculated according to: 222  The correlation coefficient (R) and the root mean square error (RMSE) of each regression result for the training set and 232 the test set are shown in Fig. 4 and Fig. 5, respectively. For all models, R is greater than 0.8 and RMSE is less than 0.1 both 233 for the training and the test set. For the training set using XGB, Rs are all above 0.96, generally higher than for other methods; 234 Similarly, the RMSEs of XGB are all lower than 0.02, generally lower than those of other methods. The R of RF is second 235 only to that of XGB, and for some periods it is higher than for XGB; the RMSEs of RF are also generally lower than 0.02 and 236 are lower than those of XGB in some periods. SVR and ANN perform better in the cold season, and worse in other seasons. 237 In general, their results are inferior to those of XGB and RF. The simulation results of MLR are relatively poor both in terms 238 of RMSE and R. 239 The results of the test set show that XGB, RF and SVR perform better than ANN and MLR, and are better in the cold 240 season. Table 4

Comparison with the in situ data and precipitation 250
To evaluate the performance of the downscaling approach, the downscaled 1 km gridded SM were compared with the in 251 situ SM observations. The SM before and after downscaling were both compared with the in situ SM data of the Maqu Network 252 and Babao Network (Fig. 6)

262
To better understand the reason for these poor results, the scatter plots comparing the two sets of data were drawn.

269
The observed SM of sites with a greater number of observed data were compared with these gridded SM data at different 270 resolutions and precipitation. Figure 8

Mapping of the downscaled SM 287
SM varies greatly in different months in desertified areas. Figure 9 shows the average SM in each month in the study area. 288 The SM shows a monthly change pattern, and the values from June to September are bigger than in other months, especially 289 in southern Qinghai Province, eastern Inner Mongolia Province, and western Xinjiang Province, which is consistent with the 290 process of vegetation growth. The SM in some areas is low throughout the year, such as in the Tarim

294
The annual average SM was also calculated (Fig. S2). Overall, there is little variation in SM in different years. Further, 295 we compared the spatial patterns of the downscaled SM with the gridded SM products with different resolutions. Figure 10  is close to the downscaled SM and the 36 km SMAP SM, but some details are not presented. For example, SM in the Hetao 301 Plain along the Yellow River is much higher than that in its surrounding area, which can be found in the downscaled SM and 302 the SMAP SM, but not in the C3S SM. The average SM of the ERA5 products is polarised. In some areas the values are very 303 large, and in some small areas they are very small. The FLDAS SM has high resolution, and its overall spatial pattern is 304 relatively consistent with the downscaled SM and 36 km SMAP SM. The difference is that the FLDAS SM is significantly 305 larger in higher elevation areas of the west than in other regions, which is quite different from other products. This suggests 306 that the FLDAS SM may be overestimated in these regions. In addition, FLDAS SM does not show wetter soil along the river. 307

310
To better demonstrate the differences in SM, a case of the Mu Us Desert was selected (Fig. 11). The Mu Us Desert is 311 located in a semi-arid area with annual average precipitation of generally less than 400 mm, decreasing gradually from 312 southeast to northwest. The main types of land cover are grassland and sandy land, and the salinization is serious in a few 313 areas. Desertification has been severe for a long time in the past but has been significantly reversed with artificial afforestation 314 in recent years. 315 SM shows an overall trend of gradual decrease from the southeast to the northwest (Fig. 11 (b)~(g)), which is consistent 316 with the distribution of precipitation. The average SM of the same location changes little from year to year. Overall, it is 317 relatively large in 2018 and relatively small in 2015, which is also roughly consistent with annual precipitation patterns. Land 318 cover types also have a certain influence on the spatial difference of SM. The northwestern portion of the Mu Us Desert is 319 mainly grassland, which is strongly dependent on precipitation (Fig .11 (h)). The southeastern area is mainly cultivated land 320 and is less affected by precipitation as it relies on pumping groundwater rather than natural precipitation (Fig .11 (j)).

Regression variable importance 327
The selection of variables is an important step of a nonlinear regression model. The importance analysis of the variables 328 carried out for this research found that a larger number of variables can improve the regression effect of these models to some 329 extent. Figure 12 shows the average importance scores of each variable for the RF and XGB models across all available days.  the test accuracy of the XGB model is significantly reduced in several periods. This means that the simulation results of the 354 XGB model is likely to have a certain degree of overfitting. In contrast, the difference between training accuracy and test 355 accuracy of the RF model is even smaller. It showed better stability than XGB at some periods (Figs. 4 and 5). The training 356 accuracy of MLR and SVR has a small difference from the test accuracy, but the overall accuracy is obviously lower, which 357 is not suitable for remote sensing SM prediction (Table 4). Some studies have also proved that SVR may also perform better 358 indicating that its generalization is lower than other models (Piotrowski and Napiorkowski, 2013). In general, the XGB and 360 RF models provide the best combination of prediction accuracy and stability.  While this study greatly improved the spatial resolution of SM data from 2015-2020 in the desertifying areas of North 375 China by downscaling SMAP SM products, it still presents some shortcomings. For example, due to the image quality and 376 coverage of SMAP and the impact of noise from clouds on the MODIS products, the number of valid samples for a 16-day 377 period may still be less than 100 points. This study replaced the periods with less than 100 samples with the model of the 378 previous periods. Due to the limited number of available samples, the simulation in the cold season is relatively poor (Fig. 8). 379 In addition, the upscaling (from 1 km to 36 km resolution) of surface variables also has a certain impact on the accuracy of the 380 model. 381 The Chinese government focuses on desertification reduction through afforestation and the establishment of grasslands. 382 SM data with high temporal and spatial resolution can provide a reference for the next steps of revegetation. 383

Code and data availability 384
The codes mainly used in this paper mainly includes sample selection, the building of the optimal regression model and