An improved global remote-sensing-based surface soil moisture (RSSSM) dataset covering 2003–2018

Soil moisture is an important variable linking the atmosphere and terrestrial ecosystems. However, long-term satellite monitoring of surface soil moisture at the global scale needs improvement. In this study, we conducted data calibration and data fusion of 11 well-acknowledged microwave remote-sensing soil moisture products since 2003 through a neural network approach, with Soil Moisture Active Passive (SMAP) soil moisture data applied as the primary training target. The training efficiency was high (R² = 0.95) due to the selection of nine quality impact factors of microwave soil moisture products and the complicated organizational structure of multiple neural networks (five rounds of iterative simulations, eight substeps, 67 independent neural networks, and more than 1 million localized subnetworks). Then, we developed the global remote-sensing-based surface soil moisture dataset (RSSSM) covering 2003–2018 at 0.1° resolution. The temporal resolution is approximately 10 d, meaning that three data records are obtained within a month: for days 1–10, for days 11–20, and from day 21 to the last day of that month. RSSSM is shown to be comparable to the in situ surface soil moisture measurements of the International Soil Moisture Network sites (overall R² and RMSE values of 0.42 and 0.087 m³ m⁻³), while the overall R² and RMSE values for the existing popular similar products are usually within the ranges of 0.31–0.41 and 0.095–0.142 m³ m⁻³, respectively. RSSSM generally presents advantages over other products during the growing seasons and in arid and relatively cold areas; the latter is probably because of the difficulty in simulating the impacts of thawing and transient precipitation on soil moisture. Moreover, the persistent high quality during 2003–2018 as well as the complete spatial coverage ensure the applicability of RSSSM to studies on both the spatial and temporal patterns (e.g. long-term trend). RSSSM data suggest an increase in the global mean surface soil moisture.
Moreover, without considering the deserts and rainforests, the surface soil moisture loss on consecutive rainless days is highest in summer over the low latitudes (30° S–30° N) but mostly in winter over the mid-latitudes (30–60° N, 30–60° S). Notably, the error propagation is well controlled with the extension of the simulation period to the past, indicating that the data fusion algorithm proposed here will be more meaningful in the future when more advanced microwave sensors become operational. RSSSM data can be accessed at https://doi.org/10.1594/PANGAEA.912597 (Chen, 2020). Published by Copernicus Publications.
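As a minimal sketch of the 10-day periods described above, the following Python helper maps a period ordinal (1 to 36) within a year to its calendar dates. The function names and the parsing of 'YYYYDnn' codes are illustrative assumptions based on the notation used in the supplementary tables (for example, 2015D10); they are not part of any tool released with the dataset.

```python
import calendar
from datetime import date


def dekad_bounds(year: int, ordinal: int) -> tuple[date, date]:
    """Return the first and last day of the 10-day period `ordinal` (1..36).

    Hypothetical helper: each month holds three periods, namely days 1-10,
    days 11-20, and day 21 to the last day of the month.
    """
    if not 1 <= ordinal <= 36:
        raise ValueError("ordinal must be in 1..36")
    month_index, slot = divmod(ordinal - 1, 3)  # slot 0, 1 or 2 within the month
    month = month_index + 1
    first_day = 1 + 10 * slot
    if slot < 2:
        last_day = first_day + 9
    else:
        # The third period runs to the end of the month (28 to 31 days).
        last_day = calendar.monthrange(year, month)[1]
    return date(year, month, first_day), date(year, month, last_day)


def parse_period_code(code: str) -> tuple[date, date]:
    """Parse a code such as '2015D10' (assumed format: year, 'D', ordinal)."""
    year_text, ordinal_text = code.split("D")
    return dekad_bounds(int(year_text), int(ordinal_text))


print(parse_period_code("2015D10"))
print(parse_period_code("2018D36"))
```

Under these assumptions, 2015D10 resolves to 1–10 April 2015 and 2018D36 to 21–31 December 2018, which agrees with the examples given in the footnotes of Table S1.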

One may argue that if NARX (nonlinear autoregressive with external input) is applied instead, in which the soil moisture in the previous 10-day period is also incorporated as a predictor, precipitation data can be very beneficial to neural network training.
This is true because precipitation directly contributes to increases in soil moisture. However, NARX is not suitable for global-scale, long-term continuous soil moisture mapping because the base map (i.e., the soil moisture at the beginning of the simulation period) is difficult to determine. Moreover, in mid to high latitudes, the lack of soil moisture retrievals over frozen ground in winter will lead to missing data there in summer, when soil moisture data are otherwise available, because each NARX estimate requires the value of the previous period. Therefore, if NARX is adopted, we can only estimate long-term soil moisture in the tropics and subtropics, where air temperatures are consistently higher than 0 °C. Finally, if the soil moisture in the previous period and the current precipitation amount are both incorporated, they will largely conceal the role of satellite-observed signals. As shown in Figure S16b, the total contribution fraction of all four microwave soil moisture products is reduced to only 10.6%, while the roles of ASCAT, AMSR2-JAXA and AMSR2-LPRM are all negligible. Without taking full advantage of remote sensing, simulations based on previous soil moisture and current precipitation products will lead to errors in regions where soil moisture gains are mostly driven by glacier melting or in places with high levels of radiation-driven surface soil evaporation. The reliability of the derived soil moisture will also be reduced in irrigated croplands and in afforestation/deforestation areas.
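The gap-propagation argument above can be illustrated with a toy sketch. The two functions below are not the networks used in this study: their names and coefficients are made up purely for illustration, and NaN stands in for a period without a retrieval (e.g., frozen ground). The point is only that a model fed by its own previous output cannot recover after a gap, whereas a model driven by current inputs alone can.

```python
import numpy as np


def static_estimate(x):
    """Toy stand-in for a feed-forward network: uses current inputs only."""
    return 0.1 + 0.5 * x  # arbitrary illustrative coefficients


def narx_rollout(x, y0):
    """Toy autoregressive rollout: each step reuses the previous estimate."""
    y = np.empty_like(x)
    prev = y0
    for t in range(len(x)):
        # A NaN in either the previous estimate or the current input
        # yields NaN here, and that NaN is carried into every later step.
        prev = 0.6 * prev + 0.3 * x[t]
        y[t] = prev
    return y


# Six consecutive periods; the two NaN entries mimic winter gaps.
x = np.array([0.2, 0.3, np.nan, np.nan, 0.4, 0.5])

static_est = static_estimate(x)
narx_est = narx_rollout(x, y0=0.2)

print(np.isnan(static_est))  # gaps only where the input is missing
print(np.isnan(narx_est))    # gaps persist after the input returns
```

In this toy setting the autoregressive estimate stays undefined for the last two periods even though inputs are available again, which mirrors the winter-to-summer data loss described in the text.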
On account of all the above, precipitation data are neither included as an ancillary soil moisture indicator nor added as a 'quality impact factor' in this study.

Text S2. The screening and processing of ISMN sites' soil moisture records
It has been acknowledged that the scale difference between the records at ISMN sites and the 0.1° pixel-scale soil moisture data may lead to incomparability, especially for pixels with open water and inundated land (Loew, 2008). If the measurement site is located on land, away from water, yet the corresponding pixel contains much water, the pixel-scale soil moisture can be significantly higher than the site-measured values.
Conversely, if the site is in or close to open water or inundated areas but land also exists in the pixel, the soil moisture measured at the station will be much higher than the average pixel value. In such cases the absolute values cannot be matched, and the temporal variations cannot be directly compared either, because the moisture conditions of riverside (or wetland) soil and of land soil may respond differently to precipitation.
Therefore, the sites located in the pixels with an average annual maximal water area fraction greater than 5% according to SWAMPS data are excluded (for example, some sites in wetlands in Canada).
Some stations may have two or more sensors, producing multiple soil moisture values at the same time. In this case, where three or more sensors are available, obviously abnormal values retrieved by one of them can be excluded by comparison with the others.
Because the ISMN data are at an hourly scale, we first averaged them to the daily scale and then to the 10-day scale. To ensure data reliability, when calculating daily averages, days with fewer than 12 hours of valid records are assigned no data, and the soil moisture for a 10-day period is obtained only as the mean of at least 5 valid daily averages.
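The two-stage averaging described above can be sketched as follows. This is a minimal illustration, not the processing code used in the study: the function names and the input layout (a list of timestamp and value pairs, with None marking an invalid record) are assumptions, and it further assumes at most one record per hour so that counting valid records equals counting valid hours. Only the two thresholds (12 hours, 5 days) come from the text.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import fmean

MIN_HOURS_PER_DAY = 12   # from the text: fewer valid hours means no data
MIN_DAYS_PER_PERIOD = 5  # from the text: at least 5 valid daily averages


def dekad_key(day):
    """Return (year, month, slot); slot 0: days 1-10, 1: 11-20, 2: 21 to month end."""
    return day.year, day.month, min((day.day - 1) // 10, 2)


def daily_means(hourly):
    """Average hourly (datetime, value) records per day; None marks an invalid hour."""
    by_day = defaultdict(list)
    for stamp, value in hourly:
        if value is not None:
            by_day[stamp.date()].append(value)
    # Days without enough valid hours are dropped (treated as no data).
    return {d: fmean(v) for d, v in by_day.items() if len(v) >= MIN_HOURS_PER_DAY}


def dekad_means(daily):
    """Average valid daily means per 10-day period."""
    by_period = defaultdict(list)
    for day, value in daily.items():
        by_period[dekad_key(day)].append(value)
    # Periods without enough valid days are dropped (treated as no data).
    return {k: fmean(v) for k, v in by_period.items() if len(v) >= MIN_DAYS_PER_PERIOD}


# Hypothetical input: six complete days, then one day with only 5 valid hours.
start = datetime(2015, 4, 1)
hourly = [(start + timedelta(hours=h), 0.25) for h in range(24 * 6)]
hourly += [(datetime(2015, 4, 7, h), 0.9) for h in range(5)]

daily = daily_means(hourly)
periods = dekad_means(daily)
print(periods)
```

In this example the seventh day is discarded at the daily stage, so its outlying values never reach the 10-day mean; the period is still retained because six valid daily averages remain.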

Supplementary figures
Figure S2. The overall data accuracy comparison between RSSSM and the surface soil moisture of ERA5-Land: (a) the scatter plot between RSSSM and the measured soil moisture; (b) the scatter plot between ERA5-Land soil moisture and the measured values. All plots are represented as the density of points on a logarithmic scale.
Figure S12. The spatial and temporal pattern comparison between the neural network simulated soil moisture in this study (RSSSM) and other well-acknowledged products: (a) the latitudinal patterns of RSSSM and GLDAS Noah V2.1 surface soil moisture (averaged during 2003–2018); (b) the interannual trends of global mean surface soil moisture derived from RSSSM and GLEAM v3.3a products during 2003–2018. Note that according to our validation results, among previous well-known global long-term surface soil moisture products, GLDAS Noah V2.1 has the highest quality in terms of spatial pattern, whereas GLEAM v3.3a can best characterize the temporal variation.
Figure S13. The intra-annual variation of surface soil moisture indicated by the RSSSM data product: (a) the global spatial pattern of the lowest trough location in time of the calculated surface soil moisture (unit: 10 days); (b–c) the global spatial pattern of the (b) highest peak and (c) minimum trough surface soil moisture content within a year.
Figure S14. The relationship between precipitation and the calculated surface soil moisture (RSSSM): (a–b) the spatial map and the cumulative frequency curve of the correlation coefficient between the precipitation and the surface soil moisture during 2003–2018; (c) the cumulative frequency curve of the correlation coefficient between the intra-annual variations of precipitation and surface soil moisture fitted by Fourier functions.
Figure S15. The intra-annual variation in the surface soil moisture (RSSSM) decline after 10 consecutive dry days and its relationship with surface soil moisture: (a) the global map of the lowest trough location of the surface soil moisture decline on dry days (unit: 10 days); (b) the correlation coefficient map between the intra-annual variations of surface soil moisture decline on dry days and the surface soil moisture content fitted by Fourier functions; (c) the cumulative frequency curve of the correlation coefficient values in subfigure (b); (d) the global spatial pattern of the intra-annual range of the surface soil moisture decline after 10 dry days.
Figure S16. The role of precipitation data in the soil moisture simulations based on BP neural networks and NARX with microwave soil moisture products incorporated: (a) the contributions of different input features of a primary neural network, NN1-1-1 (including 4 predictor soil moisture products, 9 quality impact factors of microwave soil moisture retrieval, plus 1 probable ancillary soil water indicator: 10-day averaged precipitation), to the neural network training efficiency, indicated by the increased MSE; (b) the contributions of all the input features to the training efficiency if NN1-1-1 is changed into a NARX (nonlinear autoregressive with external input), in which the SMAP soil moisture for the previous period is also applied as a predictor.

Supplementary tables
Table S1. The basic information on the first round of neural network training (the substep 1 and 2).
ASCAT; AMSR2-JAXA NN1-1(2)-7 ASCAT; AMSR2-LPRM NN1-1(2)-8 e ASCAT
a 'NN' represents neural network, the first number is the round number, the second one is the number of substep while the last one indicates the priority order of different networks.
b There is no order among different soil moisture products.
c For NN1-1-X (X=1, 2, …, 8), the PROBA-V LAI is used whereas for NN1-2-X (X=1, 2, …, 8), the GLASS LAI is used.
d For NN1-1-X (X=1, 2, …, 8), the time period is 2015D10~2018D36 whereas for NN1-2-X (X=1, 2, …, 8), the time period is 2015D10~2017D36.
e This neural network is optional because it cannot further increase the spatial coverage of simulation outputs.
f D represents the ordinal of ten days' period in a year. For example, 2015D10 stands for April 1st to April 10th in 2015 while 2018D36 is December 21st to December 31st in 2018.

Table S2. The basic information on the first round of surface soil moisture simulation using the trained neural network (the substep 1 and 2).
'ASCAT; AMSR2-JAXA' means SMOS and AMSR2-LPRM data are lacking in that specific pixel.
b Order of neural network preference means: if the first neural network (the most preferred one) is available in the zone where the pixel is located, it is applied for soil moisture simulation in that pixel; otherwise, the following neural network is applied if it is available, and so on.
c 'SIM' represents the neural network simulated soil moisture, the first number is the round of simulation while the second one indicates the substep.
d For SIM-1-1, the data temporal coverage is 2014D01~2018D36 whereas for SIM-1-2, the data period is 2012D19~2013D36.
e Optional because unhelpful to increasing the spatial coverage of simulation outputs.
f NN1-1-X (X=1, 2, …, 8) are used for soil moisture simulation during substep 1 (the production of SIM-1-1) whereas the neural networks built in substep 2, that are labelled as NN1-2-X (X=1, 2, …, 8), are applied for the calculation of SIM-1-2.

Table S6. The basic information on the third round of surface soil moisture simulation using the trained neural network (the substep 1).
Network code | Training target | Network input soil moisture products | Input LAI product | Input data's time period | Number of 10 days' period
a Optional because these neural networks have already been included in substep 1.

Table S8. The basic information on the third round of surface soil moisture simulation using the trained neural network (the substep 2).
Available soil moisture products in the pixel | Order of neural network preference | The code of output soil moisture product and its temporal coverage

Table S9. The method applied in combining SIM-3-1 and SIM-3-2 to produce SIM-3.
Data availability of SIM-3-1 and SIM-3-2 in a specific pixel | The expression for SIM-3 | Other conditions
SIM-3=(SIM-3-1+SIM-3-2)/2
a 'Range' refers to the data range of SMAP_E surface soil moisture in a specific pixel from April 2015 to 2018.
b 'SMmax' is the maximum soil moisture value in a specific pixel reported by the SMAP_E dataset.
c 'SMmin' is the minimum soil moisture value in a specific pixel according to the SMAP_E dataset.

Table S10. The basic information on the fourth round of neural network training (the substep 1).
Table S11. The basic information on the fourth round of surface soil moisture simulation using the trained neural network (the substep 1).
Table S13. The basic information on the fourth round of surface soil moisture simulation using the trained neural network (the substep 2) a.
Available soil moisture products in the pixel | Order of neural network preference | The code of output soil moisture product and its temporal coverage