Median bed-material sediment particle size across rivers in the contiguous U.S.

10 Bed-material sediment particle size data, particularly for the median sediment particle size (D50), are critical for understanding and modeling riverine sediment transport. However, sediment particle size observations are primarily available at individual sites. Large-scale modeling and assessment of riverine sediment transport are limited by the lack of continuous regional maps of bed-material sediment particle size. We hence present a map of D50 over the contiguous U.S. in a vector format that corresponds to millions ofapproximately 2.7 million river segments (i.e., flowlines) in the National Hydrography Dataset Plus 15 (NHDplusNHDPlus) dataset. We develop the map in four steps: 1) collect and process the observed D50 data from 2577 U.S. Geological Survey stations or U.S. Army Corps of Engineers sampling locations; 2) collocate these data with the NHDplusNHDPlus flowlines based on their geographic locations, resulting in 1691 flowlines with collocated D50 values; 3) develop a predictive model using the eXtreme Gradient Boosting (XGBoost) machine learning method based on the observed D50 data and the corresponding climate, hydrology, geology and other attributes retrieved from the NHDplusNHDPlus dataset; 20 4) estimate the D50 values for flowlines without observations using the XGBoost predictive model. We expect this map to be useful for various purposes, such as research in large-scale river sediment transport using model- and data-driven approaches, teaching of environmental and earth system sciences, planning and managing floodplain zones, etc.

Median bed-material sediment particle size across rivers in the contiguous U.S.
The sediment transport modes can be classified into bed-material load and wash load (Garcia, 2008). The bed-material load consists of all sizes of particles existing in a river bed regardless of whether they are being transported along the bed (bedload) or in suspension (suspended load). Wash load consists of very fine particles (diameter less than 0.062 mm) that are always in 35 suspension in the water and rarely reside on the bed (Garcia, 19752008). Wash load is usually controlled by only land surface processes (soil erosion in hillslopes and transport from hillslopes into rivers), but not much by riverine hydraulic conditions (Garcia, 19752008). In this study, we focus on the bed-material sediment particle size data that are critical in applying sediment transport formulas to estimate bed-material load. For example, the Engelund-Hansen equation estimates bed-material load, and median bed-material sediment particle size (denoted as D50, i.e., the size larger than 50% of sediment particles) is one of 40 the most important parameters in the Engelund-Hansen equation (Engelund and Hansen, 1967).
Despite the importance of bed-material sediment particle size, such data has limited availability due to the expensive costs of measuring and analyzing such data. As one of the most data-rich countries in the world, the United States (U.S.) collects and disseminates the sediment particle size data mainly through two federal agencies: The U.S. Geological Survey (USGS) and the U.S. Army Corps of Engineers (USACE). USGS manages the most gauges and distributes the river-related measurements 45 on the U.S. rivers. As of April 2021, there are 424948 stations with field/laboratory samples in the USGS water quality portal, among which 1.2% (49913644) include bed-material sediment particle data for rivers over the contiguous U.S., and 0.56% (2277) have complete percentiles of bed-material sediment particle data to computefor computing D50.
Spatial approximation, i.e., interpolation or extrapolation, is a typical method to overcome data sparsity when there is no universal relationship between the variable of interest (e.g., D50) and other extensively available information. In the case of 50 sediment particle size, a simple spatial approximation should be conducted within the same river system, assuming similar geological and hydrological settings. Here we denote a river system as the whole river network discharging to the ocean (or inland lakes) via the same outlet. Such a simple spatial approximation is nevertheless not feasible in many river systems, where there are few or no measurement data to support meaningful interpolation and extrapolation. Several studies have reported empirical relationships between bed-material sediment particle size with river channel characteristics (e.g., channel slope) and 55 flow regimes (Niño, 2002;Zhang et al., 2017). Such relations are nonetheless site-specific and not universal enough to apply over various river systems.
An alternative approach is to establish complex correlations between sediment particle size and other data that are extensively available over the contiguous U.S. Such correlations can then be applied across the U.S. for predicting sediment particle size.
Conventional linear or nonlinear regression methods usually require good prior knowledge of the mechanisms controlling 60 sediment size distribution, and thus are not suitable for use to establish complex correlations when understanding of factors that control sediment size is somewhat limited. Machine learning offers an effective way forward because of its ability to establish nonlinear, complex predictive models without the prerequisite of sufficient process-based knowledge (Afan et al., 2016). Therefore, our objective is to develop a spatial map of D50 over the contiguous U.S. rivers by establishing a predictive model 65 between D50 and other extensively available hydroclimatological and geological data using state-of-the-art machine learning techniques. In the following, we describe the data in Section 2, introduce the machine learning model development in Section 3, and present our results in Section 4. We also explain the limitations of our method in Section 5, potential usage of the D50 map in Section 6, and data availability in Section 7. We finally conclude with Section 8.

Bed-material sediment particle size observations
The USGS sediment data are available to the public through the National Water Information System (NWIS) water quality data portal. There are 49913644 USGS stations with at least one sample of bed-material sediment particle size, but only 2277 stations have complete data to allow meaningful computation of D50, as shown in Figure 1a. There are 1367 USGS stations with incomplete percentiles of bed-material sediment particle data, which can be divided into three groups: 1) 1183 stations 75 have no effective percentiles provided; 2) 147 stations have only percentiles above the 50 th percentile; 3) 37 stations have only percentiles below the 50 th percentile. Therefore, we neglect these data in further analysis.
The USACE sediment particle size data are available in a technical report by Gaines and Priestas (2016). Gaines and Priestas (2016) include the bed-material sediment particle size samples taken at 442 locations along the Mississippi River main stem between Head of Passes, Louisiana and Grafton, Illinois. We exclude the locations without exact geographic coordinates and 80 eventually yield 300 locations, as shown in Figure 1a. In total, we have 2577 locations with complete bed-material sediment particle size percentiles to allow for the D50 calculation. At each location, the sediment particle size might have been sampled more than once inat different yearstimes, although almost half of the locations are sampled only once (see Figure 1b). Figure   1c shows for the years that histogram and Figure S1a for the spatial map). For about 94% of these stations, the latest samples were taken. About 94% of these locations have been sampled after 1970the 1970s (see Figure 1c for the histogram and Figure  85 S1b for the spatial map). We calculated the coefficient of variation (CV) for the 760 stations that have at least 5 samples over time. For the rest of the stations, the number of samples is too small for meaningful calculation of CV. For most of these 760 stations, the CV values range between 0.3 and 1.2, with a median of approximately 0.6 (see Figure S2). The small CV values indicate the good stability of D50 (at the same location) over time.
We compute the D50 values from the measured sediment particle size distributions in three steps: 1) the cumulative sediment 90 size distribution curve is drawn with log-2-transformed sediment size (in mm) following the concept of the Krumbein phiΨ scale (Krumbein, 1934Parker andAndrews, 1985). 2) A linear interpolation is performed between the percentiles smaller and larger than the 50 th percentile to obtain the D50 value. 3) For the stations with multiple sampling times, a representative D50 value is computed as the mean D50 value from all the sampling times. We take the mean as a representative D50 to simply account for possible uncertainties in sampling and measurement. Although the sampling and measurement procedures are 95 carefully designed (Edwards & Glysson, 1999), it is practically impossible to avoid uncertainties in such sampling and measurement procedures. Thus, we believe a representative D50 can be better estimated by taking a mean. The D50 values calculated following this procedure are denoted as ""observed D50 values"" to differentiate them from the predicted D50 values using machine learning techniques described later. Figure 1d shows the histogram of the computed D50 values in the Krumbein phiΨ scale. About 75% of these D50 values are between 0.0625 mm and 2.0 mm. Garcia (2008)It is suggested that 100 a river can be a sand-bed or gravel-bed river if the D50 value is below or above 2.0 mm. (Garcia, 2008). The D50 values computed from the observed sediment particle size distributions thus dominantlymostly reflect sand-bed river conditions, while only approximately 25% are gravel-bed rivers.
One might wonder how the sites with observed D50 values are distributed between small and large streams (e.g., whether or not smaller streams have more observed D50 data than larger streams). We use stream order ( Figure S3a and D50 samples ( Figure S3b, d), respectively. The total flowline length increases with the stream size (i.e., stream order or drainage area), which is expected since overall larger rivers have longer lengths. Interestingly, the number of D50 stations follows a bell distribution except for the largest stream order or drainage area, which is primarily due to the USACE measurements on the lower Mississippi River (198 sample locations). Therefore, there is no clear indication that larger or 110 smaller streams dominate the D50 data points.

Predictive variables
The predictive variables are retrieved from the NHDplusNHDPlus database (McKay et al., 2012) and additional attributes for the NHDPlus catchments from the ScienceBase dataset (Wieczorek et al., 2018). ScienceBase is a comprehensive scientific data and information management platform hosted by USGS (sciencebase.gov). In the medium resolution NHDplusNHDPlus,115 there are about 2.7 million stream segments (average length of 1.93 km, denoted as flowlines from now on). NHDplusNHDPlus directly provides 138 attributes of flowlines, most of which are descriptive instead of quantitative. We select eight quantitative attributes relevant to the channel geometry and hydrology, such as upstream drainage area, channel bed slope, mean annual flow velocity, sinuosity, etc. ScienceBase provides additional attributes related to the NHDplusNHDPlus watersheds (local drainage area corresponding to a single flowline) and associated upstream drainage areas in thirteen themes (Wieczorek et al., 120 2018). We select 68 hydroclimatological and geological attributes from ScienceBase, such as climate, hydrologic, topographic, soil, and geologic conditions. In total, 76 attributes are selected as potential predictive variables for input to the machine learning algorithm. We provide a detailed list of these predictive variables in Supplimentary Supplementary Table S1 and four illustrative maps in Supplementary Figure S1S4.
We then establish the spatial correspondence between the observed D50 values and the 76 predictive variables. In 125 NHDplusNHDPlus, there are ~26000 USGS stations associated with a portion of the flowlines through the common identifiers.
This common identifier is unique for every flowline, but several USGS stations may be located on the same flowline and have the same common identifier. We match the 2277 USGS stations (withthat have observed D50 values) with stations in NHDplusNHDPlus. Some of the 2277 USGS stations are not included in NHDplusNHDPlus, so we obtain 1530 matching stations. The 300 USACE sampling locations are collocated with the flowlines via their geographic coordinates. Several USGS 130 stations or USACEThere are 12 flowlines with 2 sampling locations may be on the same flowline.and 2 flowlines with 3 sampling locations. In that casethose cases, we assign the average of the D50 values of these USGS stations to the flowline.
The mean length of the 14 flowlines is 6.63 km. In such a length, only two or three sampling locations cannot capture the spatial variability in a meaningful way. Therefore, we simply calculate the average without making further assumptions. We further exclude a few flowlines with missing attribute values. Finally, we have a total of 1691 flowlines corresponding to the 135 observed D50 values. In other words, as shown in Figure 2. As such, in each of these 1691 flowlines, we have established a good correspondence between the observed D50 values and the 76 predictive attributes.

Model Development
Among various machine learning methods, eXtreme Gradient Boosting (XGBoost) is a version of the gradient tree boosting algorithm known for its high efficiency and superior performance in recent years (Chen and Guestrin, 2016;Zheng et al., 2019;140 Fan et al., 2021). The relations between the input predictors (e.g., watershed characteristics) and D50 are too complex to be established with traditional linear regression or dimensionless analysis methods. Therefore, we adopt XGBoost to develop a predictive model with the Optuna optimization framework (Akiba et al., 2019) for tuning hyperparameters and the SHapley Additive exPlanations (SHAP) (Lundberg and Lee, 20162017) for feature importance analysis and thus feature selection. We also consider the representativeness of input predictors in the feature selection. More details are explained as follows. 145

XGBoost: eXtreme Gradient Boosting
Tree boosting is a machine learning framework that combines weak learners to develop a strong learner, where the base learners are decision trees that are trained sequentially, with the latter focusing on mistakes made by the preceding one. Gradient boosting machines are a family of tree boosting techniques where errors are minimized by gradient descent algorithms.. One of the most recent offspring of gradient boosting techniques is the XGBoost, a scalable end-to-end tree boosting system (Chen 150 & Guestrin, 2016). It has been successfully utilized across a wide array of applications, such as snowpack estimation (Zheng et al., 2019) and water storage change in a large lake (Fan et al., 2021). XGBoost dataset is represented as = {( , ), = 1, 2, . . . , }, where = [ 1 , 2 , 3 , . . . , ] is a row vector with input features with real value elements and . The tree ensemble model employs M additive functions to predict the output of interest as where Ϝ is the space of regression trees. The model is trained in an additive manner by minimizing a regularized objective to learn the set of functions employed in the model. At each iteration, a differentiable convex loss function that measures the difference between the prediction ̂ and the target is computed, and the model is also penalized for the complexity of the regression tree functions.

Tuning Hyperparameters 160
Tuning hyperparameters is a cumbersome task and is often performed by reducing the parameter search space through randomized search and applying a grid search on the reduced space. Alternatively, hyperparameter optimization frameworks like Hyperopt (Bergstra et al., 20132015) and Optuna (Akiba et al., 2019) are commonly preferred since they can continually narrow down the bulky hyperparameter search space to an optimal space based on the preceding results. This study implements Optuna with a Tree-structured Parzen Estimator (TPE) parameter sampling framework to obtain the optimal hyperparameter 165 sets.
The procedure for tuning hyperparameters relies on two major components: cross-validation and evaluation metrics. Crossvalidation measures the model's predictive power with a given hyperparameter set by dividing a dataset into folds. In -fold cross-validation, the dataset is randomly split into mutually exclusive subsets of approximately equal size as, = { 1 , 2 , 3 , . . . , }. In each iteration, − 1 folds of are used for training, and the remaining one is used for validation. The 170 predictions resulting from a given set of hyperparameters are made by iterating through the folds, so the model is trained and validated times. Hence, model performance values and the mean value is reported as the model performance for this set of hyperparameters. Optuna allows the use of user-defined metrics for model evaluation during the -fold cross-validation.
Taking advantage of this structure, we use the Kling-Gupta Efficiency (KGE) (Gupta et al., 2009) as the model performance metric. 175 where σ is the standard deviation, μ is the mean, and is the linear correlation between the observed and simulated series. A perfect agreement between observation and simulation gives the theoretical maximum KGE value at 1.0. The higher the KGE value, the closer the match between the observed and simulated series. KGE offers some advantages over commonly used metrics like root mean squared errors (RMSE) or the coefficient of determination (R 2 ) because: 1) it is not dominated by 180 relatively large values; and 2) it simultaneously captures both the magnitude and phase differences between the observed and simulated series (Gupta et al., 2009).

Feature Selection
Feature selection is also an essential step in developing a simpler model that is still capable of reasonably predicting the target variable with fewer attributes. Feature importance is a technique of computing each predictive variable's degree of contribution 185 towards the optimal prediction model, which can be used for determining feature selection. The approaches of computing feature importance scores include correlation coefficient, the coefficients calculated as part of decision trees, or advanced approaches like SHAP (SHapley Additive exPlanations) (Lundberg and Lee, 20162017). In this study, we use the mean absolute SHAP values as feature importance measures.
Initially, we begin with 76 predictive variables. For feature selection purposes, we add a new ""predictor"" of randomly 190 generated real number values. We train the model and compare the feature importance scores (i.e., the mean absolute SHAP values) of all predictors. Then, all attributes with scores less than the random number attribute are dropped out. The procedure is repeated using the new set of predictors until the random number attribute is the least important feature. Lastly, the remaining features are utilized for tuning the final optimal set of hyperparameter values.
Then, we further examine the representativeness of the data by comparing the ranges of the selected features between the D50-195 available data and the nationwide data. We use the 2.5th and 97.5th percentiles to represent the lower and higher ends of ranges in the available data. We do not directly use the absolute min/max values to avoid the impacts of outliers. We then calculate the percentage of the nationwide data below the 2.5th percentile of the available data. A percentage value of no more than 10% indicates a good match of lower ends between the available and nationwide data. Similarly, we calculate the percentage of the nationwide data above the 97.5th percentile of the available data. A percentage value of no more than 10% indicates a good 200 match of upper ends between the available and nationwide data. Taking together, for any feature, if more than 80% of the nationwide data are located within the 2.5th and 97.5th percentiles of the available data, we consider that the available data is sufficiently representative of the nationwide data for this specific feature. Otherwise, a feature is considered non-representative thus is removed from model development. Lastly, the remaining features are utilized for tuning the final optimal set of hyperparameter values. 205

General Steps
The general steps of the model development procedure are as follows and illustrated in Figure 3. 1. The predictors are scaled using the Minimum-Maximum scaler method, i.e., all features will be transformed into a range of [0,1]. The main advantage of having this bounded range normalization is that it can supresssuppress the effect of outliers. 210 2. The dataset is randomly split into training (70%) and testing (30%) sets, respectively. Only the training data are used in steps 3 and 4, while the testing data are reserved for step 5.
3. Optuna and k-fold (k=5) cross-validation are used for tuning hyperparameters, with a maximum tree of 5000 and an early stopping value of 50. The objective function for the hyperparameter optimization procedure is to maximize the mean Kling-Gupta Efficiency (KGE) value returned from the k-fold cross-validation. 215 4. Feature selection is performed as described in section 3.3, so step 3 is repeated with the new and smaller set of predictors. Steps 3 and 4 are repeated until no more predictorpredictors can be excluded.
5. The final model is developed by fitting on the whole training data using the optimal hyperparameters, and evaluated using the testing data reserved in step 2. 6. The model from step 5 is used to predict the D50 values for the contiguous U.S. river flowlines. 220

Results
We discuss our results in three steps: the subset of flowlines as the basis to formulate our predictive model, the development and validation of our predictive model, and the national D50 map derived based on the predictive model.

Feature Selection
After iterations of feature selection (procedure described in sections 3.3 and 3.4), 1312 out of 76 predictive variables, or predictors, are eventually selected and shown in Table 1. These. Firstly, 13 variables are identified as more significant than a random-number input vector based on the mean absolute SHAP value., as shown in Table 1. 2 out of 8 channel characteristics 235 and 11 out of 68 basin characteristics remain as the significant predictors (see Table 1 for description). The most important predictor is found to be the soil erodibility factor (Soil_erod_factorTOT_KFACT), followed by average annual wet day (Ann_wet_daysTOT_WDANN) and mean annual snow as a percent of total precipitation (TOT_PRSNOWAnn_snow_perc).
These three basin-related predictors rank higher than the two channel-related characteristics. Channel slope (Slope) and distance between flowline and the river mouth (PathlengthChan_length) are found to be the most important channel 240 characteristics for predicting D50, which agrees with the downstream fining phenomena and sediment transport mechanisms (Nino, 2002). It is somewhat surprising that some hydraulic channel characteristics such as mean annual flow velocity are not included in the final feature selection. Studies on river hydraulics show relations between channel flow (i.e., velocity and water depth) and channel bed characteristics (i.e., slope and roughness), such as the Manning'sManning's equation, Chezy'sChezy's law, etc., and channel bed roughness can be related to bed sediment size (Garcia, 2008). However, the feature selection with 245 the XGBoost model and SHAP value indicates that mean annual flow velocity may not be a good predictor for D50 in this case. A possible reason is that mean annual flow velocity is dependent on some of the selected parametersfeatures such as TOT_WDANNAnn_wet_days, Slope, etc., so excluding this variable avoids overfitting the data.
It should be noted that the ranking of feature importance according to the mean absolute SHAP values is quite different from the correlation coefficients between D50 and predictors, as shown in Figure 3. TOT_KFACT4. Soil_erod_factor and 250 TOT_WDANNAnn_wet_days, the two most important features in Table 1, have correlation coefficients of only 0.08 and 0.06, respectively. TOT_PRSNOWAnn_snow_perc has the strongest correlation with D50, with a correlation coefficient of 0.29.
The individual scatter plots between D50 and alleach of the selected features do not show any apparent relationship between D50 and any single feature (see Supplementary Figure S2). This indicatesS5), indicating that the XGBoost model can revealthere might exist some higher-order interactions among the predictors for better predictionswhich the traditional 255 regression analysis cannot reveal.
We further examine the representativeness of the 1691 flowlines with observed D50 values of the rest flowlines included in the NHDPlus database in terms of the ranges of the 13 features. For convenience, we denote the subset of NHDPlus data associated with the 1691 flowlines as D50-available, and the whole NHDPlus database as nationwide. For most of the 13 features, the percentages of the nationwide data that are beyond the lower or higher ends of the D50-availableare no more than 260 10%, except for channel slope, i.e., "Channel_mean_slope". Table 2 also lists the relative difference in the 25th, 50th and 75th percentiles between the D50-available and nationwide data. For instance, for the 25th percentile, we calculated the relative difference as the ratio of the difference between the 25th percentile of the D50-available data and that of the nationwide data over the average between the 25th percentile of the D50-available data and that of the nationwide data. This relative difference is less than 0.5 for most of the features, again except for "Channel_mean_slope". For a better visual illustration, Figure 5 shows 265 the cumulative distribution functions (CDFs) and corresponding 5th-, 25th-, 50th-, 75th-, and 95th-percentiles. The CDFs are close between the D50-available and nationwide data, except for "Channel_mean_slope". A similar message can be seen from the box plots in Figure S6. In addition, we would like to point out that the 1691 sampling stations we use to train and test our model are located across the whole contiguous U.S., hence geographically representative. Therefore, we conclude that the 1691 flowlines with observed D50 values are representative enough of all the flowlines nationwide in terms of the 12 input 270 predictors, except for "Channel_mean_slope".
We remove "Channel_mean_slope" and use the remaining 12 predictors to develop the final model, following the same model training and testing procedures as before. Figure 6 shows the comparison of model performances between the previous and new models. The model performance metrics are similar. Actually, R 2 became slightly better in both training (0.834 vs. 0.830) and testing (0.405 vs. 0.367), while KGE became slightly worse in training (0.775 vs. 0.794) but better in testing (0.527 vs. 275 0.513). The slight decrease of KGE in training data is reasonable since the model hyperparameter tuning was based on the objective of maximizing KGE, and losing one predictor will slightly reduce the space of parameter tuning. Nevertheless, now the KGE value in the testing phase is closer to that in the training phase.
Although feature selection sheds light on the contribution of input variables to model outputs, a drawback of the machine learning technique is that it cannot explain mechanistically why selected features are more important than unselected ones. 280 Therefore, the goal of feature selection is to find the best (i.e., most robust) input variables to feed the best model for D50 predictions. If a different machine learning algorithm from XGBoost is used, the selected features, especially their rankings, can be different. Feature selection is dependent on the selection of the algorithm, so the selected features in this study should not be directly used in other models or studies. In the 12 selected predictors, only one is directly related to the channel processes. The remaining 11 are all land features, and their mechanistic connections with D50 are rather mysterious at this 285 stage, which could be considered as empirical evidence on the likely causal, yet highly complicated relationships between D50 and the land features, and hopefully inspire future studies to shed light on the underlying mechanisms. Table 23 shows the tuned hyperparameters of the best XGBoost model that is trained using the 1312 selected predictors and 70% of the training data set. For a detailed explanation of the hyperparameters please refer to Chen and Guestrin (2016). Figure  290 46 shows the performance of the optimal XGBoost model on the training and testing data sets, respectively. Here we consider an optimal model based on two criteria: 1) the model performance is satisfactory in both the training and testing phases, adand indicated by good metrics values (e.g., KGE in this study), and 2) the model performance is relatively consistent between the training and testing phases. Here the optimal XGBoost model gives the KGE value 0.79475 for training and 0.513528 for testing. The testing value is above 0.5, suggesting satisfactory model performance (Gupta et al., 2009;Knoben et al., 2019). 295

Model Hyperparameters and Performance
The performance on the testing data is noticeably worse than that on the training data, as expected. This difference is nevertheless acceptable given the complexity of the prediction problem. The relatively consistent model performance between the training and testing phase suggests that the model validation (via the testing phase) is successful.

Model UncertaintySensitivity Analysis 300
We carry out further analysis to shed light on how the modellingmodeling results may be sensitive to some of the key steps as outlined in Section 3.4. We focus on Steps 2 to 4 only because Steps 1 and 5 are standard practice, and Step 6 is to utilizeutilizes the modellingmodeling results.

For
Step 2, the 2/3 (train) and 1/3(test) split is typical in machine learning for splitting training and testing data. This can be 305 readjusted up to 4/5(train) and 1/5(test) if the total sample size is sufficiently large, which is nonetheless not the case here. For Step 3, theFor Step 3, we test the sensitivity on the choices of model performance metrics and k value, respectively. For the model performance metrics, we have also tried NSE and R 2 and found that using KGE gave better model performance ( Figure  S7) due to two reasons: 1) a much smaller percentage of bias (PBIAS) and 2) visually better alignment between the simulated and observed D50 values along the 1:1 line. The choice of k value is usually 5 or 10 depending on the training sample size. 310 We use 5 since using 10 significantly reduces the number of samples per fold, and the left-out sample will be too small for validation during cross-validation. Increasing k-fold to 6 or decreasing it to 4 still gives a similar satisfactory performance in both the training and testing phases, with training/testing KGE of 0.759/0.505 and 0.795/0.512, respectively. For Step 4, we evaluate the model sensitivity to each selected feature by dropping one of the 1312 variables at a time and repeating the same modellingmodeling procedure for the remaining 12 variables. Figure 57 shows that dropping the variables leads to the model 315 performance dropping below KGE = 0.475 in5 during the trainingtesting phase forin most features except for cases. The change in testing phase KGE ranges between 4 -22 %. The largest changes are observed when dropping R_factorTOT_SATOF and TOT_WBM_TAVMean_elev. Even with those two, the KGE difference between the training and testing phases increases from 0.28 to 0.36 by including them as predictors. Thus, all the variables remaining after feature selections play a significant role in the final model. 320

National Map
Using the developed machine learning model and NHDplusNHDPlus channel/basin characteristics data, we are able to produce a national map of bed sediment D50 values ( Figure 68). To our best knowledge, it is the first of its kind D50 data for the whole contiguous U.S. The spatial pattern of D50 in Figure 68 are generally consistent with the observed D50 in Figure 2. High D50 values are mostly distributed in the west coast, upper Missouri and Ohio regions, and low D50 values are concentrated in the 325 east coast. The consistency between Figures 22 and 68 suggests that the observed D50 data are reasonably representative of the whole contiguous U.S., despite the sparse distribution. Given that the testing data set is independent of the training dataset, we expect that the error statistics derived for the testing data should be relatively consistent with the error statistics in applying the model to derive the national map of D50. To our best knowledge, it is the first-of-its-kind D50 data for the whole contiguous U.S. Such a D50 map is mostly valuable to support the parameterization of large-scale sediment modelling at the regional or 330 national scale, which has been very challenging, if not impossible, before this map.

Limitations of the method
The predicted D50 values may be subject to several limitations despite using state-of-the-art machine learning techniques to develop the predictive model. These limitations include: (1) Limited data availability. Although the 1691 observed D50 values are adequately representative of the contiguous U.S. (i.e., consistent spatial patterns between Figures 23 and 68), limited data 335 availability prevents us from establishing a separate predictive model for each river basin. For example, there is little observed D50 data in the Rio Grande and Great Basin, so the predicted D50 values over these basins should be used cautiously.
(2) Our methodology is statistical in nature and lacks explicit process-based understanding. For example, Figure 46 shows the model tends to overestimate D50 for smaller D50 values (particularly < 0.25 mm) and underestimate D50 for larger D50 values (particularly > 1 mm). However, in various trials we have performed, the current result is closest to the 1:1 relationship based 340 on both the KGE metric and visual check. Further process-based understanding of this systematic bias is beyond the scope of this study and left for a future work. because it would require a) a highly-integrated, process-based model that considers at least sediment erosion, deposition and transport processes in both hillslopes and channels, and b) well-designed numerical experiments to isolate the dominant processes and controlling factors. (3) We have not explicitly incorporated the effects of lakes and reservoirs but rather assumed these effects have been indirectly reflected in the NHDplus hydrologic attributes 345 adopted in the predictive model. NHDPlus hydrologic attributes adopted in the predictive model. (4) The bed sediment at the gage station may not always be representative of the reach. Edwards and Glysson (1999) characterized how most of the bed sediment samples were collected and composited at a cross-section by the USGS over the years. Gage stations are established at cross-sections in the stream where flow measurements are convenient and with conditions conducive to high-quality flow measurementsthe issue of whether the bed sediment composition represents the reach is generally not taken into account 350 when the gage station location is established. As such, our predictive results are certainly not free from uncertainties. Therefore, we recommend using our D50 map for sediment modeling and assessment at the regional or national scales instead of local studies at the individual river segment.

Potential usage
The D50 map might be used for large-scale sediment transport modeling over the whole contiguous U.S., or a major river 355 basin such as the Mississippi River basin. For example, we have tested the usage of the new D50 dataset within a large-scale suspended sediment modeling framework , and our successful model validation against the USGS observed suspended sediment load over multiple stations suggests the good value of such a national scale D50 dataset. There is inevitably some uncertainty embedded in these mapsthis map sourced from the original D50 observations and NHDplusNHDPlus attributes, the XGBoost modeling, and the spatial extrapolation process. This uncertainty should be taken 360 into account while evaluating the uncertainties in the model simulationsutilizing this map for regional-scale assessment or modeling.
The D50 map may also be used to derive other important parameters for large-scale hydrological or hydraulic modeling. For instance, Manning's roughness coefficient is an important parameter for river routing modeling. It is, however, largely empirical and hard to directly measure, hence not readily available at a regional or national scale. Previous studies have 365 established some empirical relationships between Manning's roughness coefficient and D50 for river channels (Coon, 1998;Gillen, 1996;Julien, 2002;Meyer-Peter & Müller, 1948). Thus, one might derive a map of Manning's roughness coefficient over the contiguous U.S. based on the empirical relationship and the D50 maps.

Conclusions
We develop a new national map of the median bed sediment particle size by combining the USGS sediment observations, the channel and watershed characteristics from NHDplusNHDPlus and ScienceBase, and state-of-the-art machine learning 375 techniques. Despite the limitations, the map is highly valuable for sediment modeling and assessment at the regional and larger scales, which has not been feasible previously.