Deep-sea sediments of the global ocean

Abstract. Although the deep-sea floor accounts for approximately 60 % of the Earth's surface, there has been little progress in relation to deriving maps of seafloor sediment distribution based on transparent, repeatable, and automated methods such as machine learning. A new digital map of the spatial distribution of seafloor lithologies below 500 m water depth is presented to address this shortcoming. The lithology map is accompanied by estimates of the probability of the most probable class, which may be interpreted as a spatially-explicit measure of confidence in the predictions, and probabilities for the occurrence of five lithology classes (Calcareous sediment, Clay, Diatom ooze, Lithogenous sediment, and Radiolarian ooze). These map products were derived by the application of the Random Forest machine learning algorithm to a homogenised dataset of seafloor lithology samples and global environmental predictor variables that were selected based on the current understanding of the controls on the spatial distribution of deep-sea sediments. It is expected that the map products are useful for various purposes including, but not limited to, teaching, management, spatial planning, design of marine protected areas and as input for global spatial predictions of marine species distributions and seafloor sediment properties. The map products are available at https://doi.org/10.1594/PANGAEA.911692 (Diesing, 2020).

Referee 1 was generally positive about the submitted manuscript, but pointed out the following issues that should be addressed: 1. The importance of eliminating correlated input variables was not discussed. Considering that the Random Forests algorithm allows for non-linear input relationships, a justification for this does not seem obvious, particularly since its results contradict those of the Boruta algorithm, which deemed all 38 predictors as important.
The selection of predictor variables/features is a two-step process. The first step (Boruta algorithm) narrows the selection to those variables that are relevant. In this case, all variables were identified as relevant/important. It is important to note that the Boruta algorithm is an "all-relevant" feature selection method (Nilsson et al., 2007), which identifies all predictors that might be relevant for classification (Kursa and Rudnicki, 2010). It does not address the question of redundancy in the predictor variable data. To limit redundancy, the second step seeks to identify predictor variables that are correlated with other predictors of higher importance. To do so, it is necessary to define a critical value of the correlation coefficient r. This is arguably subjective, and the choice of r will influence the number of predictor variables that are selected. To investigate the influence of r on the OOB error of a Random Forest model (using default parameter values), several values of r between 0.1 and 1 (step 0.01) were trialled. When the OOB error is plotted over r (Figure below), it is apparent that the OOB error initially (at small values of r) drops off quickly but stabilises at values of r ≈ 0.4–0.5. As a large number of predictors potentially leads to overfitting, increases processing time and decreases the interpretability of the model, it was decided to select a small value of r = 0.5, leading to the selection of eight predictor variables (see section 4.1 Variable selection).
The general strategy for predictor variable selection was to initially find all predictors that are potentially relevant, then reduce redundancy by limiting the set of predictors to those that are uncorrelated. Such an approach has also been advocated by . Finding important predictors is achieved with the Boruta algorithm, an "all-relevant" feature selection method (Nilsson et al., 2007), which identifies all predictors that might be relevant for classification (Kursa and Rudnicki, 2010). Limiting redundancy is subsequently achieved by a correlation analysis. To do so, it is necessary to define a critical value of the correlation coefficient r. This is arguably subjective, and the choice of r will influence the number of predictor variables that are selected. To investigate the influence of r on the OOB error of a Random Forest model (using default parameter values), several values of r between 0.1 and 1 (step 0.01) were trialled. When the OOB error is plotted over r, it is apparent that the OOB error initially (at small values of r) drops off quickly but stabilises at values of r ≈ 0.4–0.5. As a large number of predictors potentially leads to overfitting, increases processing time and decreases the interpretability of the model, it was decided to select a small value of r = 0.5. This led to eight predictors being selected.
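The correlation-based redundancy step described above can be sketched in a few lines. The study itself was implemented in R; the following is an illustrative Python/NumPy sketch (the function name select_uncorrelated and the toy data are invented for demonstration): predictors are visited in order of decreasing importance, and any predictor whose absolute correlation with an already-selected predictor exceeds the threshold r is dropped.

```python
import numpy as np

def select_uncorrelated(X, importance, names, r_max=0.5):
    """Greedy selection: walk predictors in order of decreasing importance
    and drop any predictor whose absolute Pearson correlation with an
    already-selected predictor exceeds r_max."""
    order = np.argsort(importance)[::-1]          # most important first
    corr = np.abs(np.corrcoef(X, rowvar=False))   # |r| between all predictors
    selected = []
    for i in order:
        if all(corr[i, j] <= r_max for j in selected):
            selected.append(i)
    return [names[i] for i in selected]

# Toy data: x2 is almost a copy of x0, x1 is independent noise.
rng = np.random.default_rng(42)
x0 = rng.normal(size=200)
x1 = rng.normal(size=200)
x2 = x0 + rng.normal(scale=0.01, size=200)        # |r| with x0 close to 1
X = np.column_stack([x0, x1, x2])
kept = select_uncorrelated(X, importance=np.array([0.9, 0.5, 0.6]),
                           names=["x0", "x1", "x2"], r_max=0.5)
# x2 is dropped: it is highly correlated with the more important x0.
```

Varying r_max in such a routine, and recording the OOB error of a model fitted on each resulting subset, reproduces the trade-off curve discussed above.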
For clarity, section 3.2 Predictor variable selection was updated. Dutkiewicz et al. (2015) and Dutkiewicz et al. (2016) referred to the absence of iron concentration as a potential predictor variable from their model and correlation analysis, respectively. In this study, the author initially included iron concentration as a potential predictor variable. However, in the final model, iron concentration is not used. Dutkiewicz et al. (2016) highlighted that iron concentration in surface waters might be an important predictor, as phytoplankton blooms (e.g. of diatoms) are enhanced by iron fertilisation. However, they also pointed out that the Southern Ocean, where diatom oozes are abundant, receives very little iron. In fact, productivity is not the only factor determining the composition of deep-sea sediments; dissolution during sinking and dilution with other materials also have to be considered.

Iron concentration was initially included in this study. However, it was not included in the final model because it was correlated (above the selected threshold) with minimum temperature at the seafloor. Box plots of iron concentration versus seafloor lithology did not reveal high discriminatory power for this predictor. Because of that, and because it was intended to keep the variable selection process as "automated" as possible, it was decided not to intervene and force the inclusion of iron concentration in the final set of predictors.
No changes were made to the manuscript.

It would be interesting to discuss the available global CCD models (and their limitations) and whether or not they could be used as predictors.
The Carbonate Compensation Depth (CCD) is potentially another important predictor that was not included in the model. This was due to the absence of relevant datasets in the utilised databases. A literature search did not yield any publications with associated datasets that could be used. However, it is assumed that this missing information is (at least partly) provided by other predictors, such as water depth and bottom water temperature.
No changes were made to the manuscript.
6. The author uses the unscaled mean decrease in accuracy as a variable importance measure. Although this is the recommended approach (e.g. ), it would be better if the reasons behind this choice were stated in the text.

Agreed.
The text of section 3.4 Random Forest classification model has been updated accordingly and now reads: RF also provides a relative estimate of predictor variable importance. The importance() function of the randomForest package allows variable importance to be assessed as the mean decrease in either accuracy or node purity. However, the latter approach might be biased when predictor variables vary in their scale of measurement or their number of categories and was not used here. Variable importance is therefore measured as the mean decrease in accuracy associated with each variable when it is assigned random but realistic values and the rest of the variables are left unchanged. The worse a model performs when a predictor is randomised, the more important that predictor is in predicting the response variable. The mean decrease in accuracy was left unscaled as recommended by , and is reported as a fraction ranging from 0 to 1.

7. In the manuscript, it is not stated whether the RF sub-sampling of predictor variables is performed with or without replacement. In the provided script it seems to be with replacement (the default option in the used package). However, studies have shown that this approach can be biased when predictor variables vary in their scale and/or in their number of categories (e.g. ). Also, it is not clear whether any type of feature scaling has been applied to the predictor variables before modelling. In case the author would like to continue without feature scaling for the predictor variables, packages like party seem to provide more unbiased results. In any case, it would be good to provide a more comprehensive explanation, as model interpretability is one of the main targets of this work.
Agreed. The predictor variables have now been scaled and the subsampling of predictor variables is performed without replacement.
The manuscript was updated accordingly. Relevant changes were made to section 3.1.

8. The sample dataset is spatially imbalanced, with some areas having experienced heavier sampling efforts than others, making the concept of spatial cross-validation and/or spatial RF even more important. The author addresses this issue by setting a minimum distance among the training points and by removing duplicates.
Agreed. I have now implemented a spatial leave-one-out cross-validation scheme to test the accuracy of the model in a more robust way. The spatial autocorrelation distance is determined with the spatialAutoRange() function of the blockCV package. This value is utilised to determine the buffer size around each observation that serves as a test point. Details can be found in a new section "3.5 Spatial cross-validation".
A new section 3.5 Spatial cross-validation was introduced.
However, this meant that model tuning would have become very complex and even more time-consuming. As model tuning gave a very limited gain in performance, it was therefore decided to run the Random Forest with default parameter values.
The former sections on model tuning were deleted.
As a result of this more robust estimation of map accuracy, the accuracy of the model is now lower (or rather less inflated), but still significantly larger than the no information rate.

The section 4.3 Model accuracy was updated accordingly.
9. Despite the selected class simplification, the two main classes still account for 77.6 % of the training and test samples, resulting in relatively high overall accuracy, but with limited accuracy for the rare classes. Considering the availability of methods and algorithms that try to overcome such class imbalances (e.g. weights, penalty costs, over-/under-sampling, SMOTE, Isolation Forests), it would be interesting to see how much further the performance can be improved compared to the presented (baseline) model. The overall accuracy as a performance metric is not the best option in such situations (this is also mentioned by the first Reviewer). However, the no information rate is provided, showing that there is still gain.
The problem of class imbalances is now addressed by utilising a balanced version of Random Forest. This is achieved by using the strata and sampsize arguments of the randomForest() function. Here, we stratify by lithology class. The sampsize is then set to the same number for every class. This means that the class with the lowest frequency (Radiolarian ooze) determines the number of observations used to fit individual trees. However, each sample is still drawn from all available observations and hence this approach is likely more effective than downsampling the whole dataset prior to model building.
Relevant changes were made to section 3.4 Random Forest classification model (l.144-149).
In addition to the overall accuracy, I have now also included the balanced error rate, which is more appropriate for datasets with unbalanced class frequencies.
Relevant changes were made to sections 3.6 Accuracy assessment and 4.3 Model accuracy.
As a result, the final map has a different appearance in some areas of the global ocean. It now approximates hand-drawn maps of the distribution of deep-sea sediments in much more detail. For example, equatorial patches of radiolarian ooze in the Indian Ocean are now visible. A near-equatorial band of radiolarian ooze in the eastern Pacific is now visible, too.

10. The results show good overall agreement with the above-mentioned previous mapping efforts. However, the comparison demands parallel examination of the maps from the two papers. Given the availability of the results from Dutkiewicz et al., 2015, it would be interesting to include a direct categorical map comparison between the two approaches (after the proper modifications due to the different number of lithological classes), showing more clearly the areas with the highest agreement and disagreement.

Changes were made to section 4.4 Spatial distribution of deep-sea sediments, Figures 6, 7 and A2, and Table 4.
It was not the intention of this contribution, and might go beyond the scope of a data description paper, to compare the final map with that of Dutkiewicz et al. (2015). I would prefer to leave such an analysis to whoever is interested in it. The map products of both publications are readily available.
No changes were made to the manuscript.

11. In Figures 1 & 6a, the use of purple colour for the Mixed Ooze is not ideal, as it has limited contrast with the background map and its surrounding classes. An essential part of this study and its results is related to map creation and interpretation. Consequently, the use of a colour-blind-friendly palette is recommended, making the manuscript more accessible to a broader audience.

I agree that the choice of the colour scheme should be inclusive, but when consulting colorbrewer2.org, I could not find a colour-blind safe option for qualitative data with 5 classes. (NB: The classification has been reduced to the five classes used by Dutkiewicz et al. (2016), as suggested by reviewer 1.) The purple colour has, nevertheless, now been removed, as the respective class is no longer included. I also removed the hillshade bathymetry to make the map clearer.

Abstract. Although the deep-sea floor accounts for more than 70 % of the Earth's surface, there has been little progress in relation to deriving maps of seafloor sediment distribution based on transparent, repeatable, and automated methods such as machine learning. A new digital map of the spatial distribution of seafloor lithologies in the deep sea below 500 m water depth is presented to address this shortcoming. The lithology map is accompanied by estimates of the probability of the most probable class, which may be interpreted as a spatially-explicit measure of confidence in the predictions, and probabilities for the occurrence of five lithology classes (Calcareous sediment, Clay, Diatom ooze, Lithogenous sediment, and Radiolarian ooze). These map products were derived by the application of the Random Forest machine learning algorithm to a homogenised dataset of seafloor lithology samples and global environmental predictor variables that were selected based on the current understanding of the controls on the spatial distribution of deep-sea sediments. The overall accuracy of the lithology map is 69.5 %, with 95 % confidence limits of 67.9 % and 71.1 %. It is expected that the map products are useful for various purposes including, but not limited to, teaching, management, spatial planning, design of marine protected areas and as input for global spatial predictions of marine species distributions and seafloor sediment properties.
The map products are available at https://doi.pangaea.de/10.1594/PANGAEA.911692 (Diesing, 2020).

Introduction
The deep-sea floor accounts for >90 % of seafloor area (Harris et al., 2014) and >70 % of the Earth's surface. It acts as a receptor of the particle flux from the surface layers of the global ocean, is a place of biogeochemical cycling (Snelgrove et al., 2018), records environmental and climate conditions through time and provides habitat for benthic organisms (Danovaro et al., 2014). Being able to map the spatial patterns of deep-sea sediments is therefore a major prerequisite for many studies addressing aspects of marine biogeochemistry, deep-sea ecology, and palaeo-environmental reconstructions.

Until recently, maps of global deep-sea sediments were essentially variants of a hand-drawn map presented by Berger (1974) and typically depicted five to six sediment types, namely calcareous ooze, siliceous ooze (sometimes split into diatom ooze and radiolarian ooze), deep-sea (abyssal) clay, terrigenous sediment and glacial sediment. Since then, Dutkiewicz et al. (2015) collated and homogenised approximately 14,500 samples from original cruise reports and interpolated them using a support vector machine algorithm (Cortes and Vapnik, 1995). Their map displayed the spatial distribution of 13 lithologies across the world ocean and exhibited some marked differences from earlier maps.
The controls on the distribution of deep-sea sediments have long been discussed (e.g. Seibold and Berger, 1996): Biogenous oozes (>30 % microscopic skeletal material by weight) dominate on the deep-sea floor and their composition is controlled by productivity in overlying surface ocean waters, dissolution during sinking and sedimentation and dilution with other materials.
The ocean is undersaturated with silica. Preservation of siliceous shells is therefore a function of shell thickness, sinking time (water depth) and water temperature, as siliceous shells dissolve more slowly in colder water. The dissolution of calcareous shells increases with increasing pressure (water depth) and CO2 content of the water (decreasing temperature). The water depth at which the rate of supply of calcium carbonate to the sea floor equals the rate of dissolution (calcite compensation depth; CCD) varies across ocean basins. Deep-sea clays dominate in the deepest parts of ocean basins below the CCD. Deposition of terrigenous material is thought to be a function of proximity to land (distance to shore).

Dutkiewicz et al. (2016) investigated the bathymetric and oceanographic controls on the distribution of deep-sea sediments with a quantitative machine-learning approach. The influence of temperature, salinity, dissolved oxygen, productivity, nitrate, phosphate and silicate at the sea surface and of bathymetry on lithogenous sediment, clay, calcareous sediment, radiolarian ooze and diatom ooze was quantified. They found that bathymetry, sea surface temperature and sea surface salinity had the largest control on the distribution of deep-sea sediments. Calcareous and siliceous oozes were not linked to high surface productivity according to their analysis. Diatom and radiolarian oozes were associated with low sea surface salinities and discrete sea surface temperature ranges.
The aim of this study is to derive a map of deep-sea sediments of the global ocean by utilising environmental predictor variables for the development and application of a machine-learning spatial prediction model. Besides a categorical map giving the spatial representation of seafloor types in the deep sea, probability surfaces for individual sediment classes and a map displaying the probability of the most probable class in the final prediction will also be provided.

Predictor variables
The initial choice of the predictor variables was informed by the current understanding of the controls on the distribution of deep-sea sediments and the availability of data with full coverage of the deep sea at a reasonable resolution. We chose predictor variables mentioned above, but also included sea-surface iron concentration, which was not available to Dutkiewicz et al. (2016) but is an important nutrient for phytoplankton (Table 1). The predictor variable raster layers from Bio-ORACLE (Assis et al., 2018; Tyberghein et al., 2012) and MARSPEC (Sbrocco and Barber, 2013) were utilised. Whenever available, statistics of the variable other than the mean were downloaded. These included the minimum, maximum and the range (maximum minus minimum).

Response variable
The response variable is seafloor lithology, a qualitative multinomial variable. The seafloor sediment sample data (seafloor_data.npz) from Dutkiewicz et al. (2015) were downloaded from ftp://ftp.earthbyte.org/papers/Dutkiewicz_etal_seafloor_lithology/iPython_notebook_and_input_data/. The original dataset consisted of 13 seafloor lithology classes, while Dutkiewicz et al. (2016) simplified these to five major classes. The latter scheme was chosen here (Table 2), as the five major classes agree well with lithologies typically depicted in hand-drawn maps.

Methods
The general workflow for building a predictive spatial model was outlined by Guisan and Zimmermann (2000). This involves five main steps: (1) Development of a conceptual model, (2)

Data pre-processing
The raster layers (predictor variables) were stacked, limited to water depths below 500 m, projected to Wagner IV global equal-area projection with a pixel resolution of 10 km by 10 km, and scaled.

The sample data (response variable) were pre-processed in the following way: Only samples of the five major lithologies (Table 2) deeper than 500 m were used, and a minimum distance between sample locations of 14.5 km (the diagonal of a 10 km pixel) was enforced to limit the number of samples per pixel to one. This also had the effect of removing duplicates from the original sample dataset. The number of records was therefore reduced from 14,400 to 10,438. The data were projected to Wagner IV. Sample locations and their respective lithology class are shown in Fig. 1.
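The minimum-distance filter can be illustrated with a simple greedy routine. This is a Python sketch with toy coordinates (the function name thin_by_distance and the example points are invented; the actual pre-processing was done in R on the projected sample data):

```python
import numpy as np

def thin_by_distance(xy, min_dist):
    """Greedy thinning: keep a sample only if it is at least min_dist
    from every previously kept sample (projected coordinates)."""
    kept_idx = []
    for i, p in enumerate(xy):
        if all(np.linalg.norm(p - xy[j]) >= min_dist for j in kept_idx):
            kept_idx.append(i)
    return kept_idx

# Toy example in km: two near-duplicates 5 km apart, one distant point.
pts = np.array([[0.0, 0.0], [5.0, 0.0], [100.0, 100.0]])
kept = thin_by_distance(pts, min_dist=14.5)   # 14.5 km = 10 km pixel diagonal
# kept → [0, 2]; the sample 5 km from the first one is discarded
```

With a 14.5 km threshold on a 10 km grid, no two retained samples can fall into the same pixel, which is exactly the stated purpose of the filter.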
Predictor variable values were extracted for every sample location. The resulting data frame was subsequently split into a training and a test set with a ratio of 2:1 using a stratified random approach. The stratification was based on the lithology class.
The class frequencies are shown in Table 2.

Predictor variable selection
Variable selection reduces the number of predictor variables to a subset that is relevant to the problem. The aims of variable selection are three-fold: (1) to improve the prediction performance, (2) to enable faster predictions and (3) to increase the interpretability of the model (Guyon and Elisseeff, 2003). It is generally advisable to reduce high-dimension datasets to uncorrelated important variables. Here, a two-step approach was utilised to achieve this goal. The first step identifies those variables that are relevant to the problem. The second step minimises redundancy in the remaining predictor variables.
Predictor variables were selected in a two-step approach: Initially, the Boruta variable selection wrapper algorithm (Kursa and Rudnicki, 2010) was employed to identify all potentially important predictor variables. Wrapper algorithms identify relevant features by performing multiple runs of predictive models, testing the performance of different subsets (Guyon and Elisseeff, 2003). The Boruta algorithm creates so-called shadow variables by copying and randomising predictor variables. Variable importance scores for predictor and shadow variables are subsequently computed with the Random Forest algorithm (see below). The maximum importance score among the shadow variables (MZSA) is determined, and for every predictor variable a two-sided test of equality is performed with the MZSA. Predictor variables that have a variable importance score significantly higher than the MZSA are deemed important, while those with a variable importance score significantly lower than the MZSA are deemed unimportant. Tentative variables have a variable importance score that is not significantly different from the MZSA. Increasing the maximum number of iterations (maxRuns) might resolve tentative variables (Kursa and Rudnicki, 2010). Only important variables were retained for further analysis.
The Boruta algorithm is an "all-relevant" feature selection method (Nilsson et al., 2007), which identifies all predictors that might be relevant for classification (Kursa and Rudnicki, 2010). It does not address the question of redundancy in the predictor variable data, which would be required for "minimal optimal" feature selection (Nilsson et al., 2007), usually preferred for model building. To limit redundancy, a second step seeks to identify predictor variables that are correlated with other predictors of higher importance. To achieve this, the Boruta importance score was used to rank the remaining predictor variables. Beginning with the most important variable, correlated variables (|r| > 0.5) with lower importance were subsequently removed. Values of the correlation coefficient r were trialled between 0.1 and 1 with a step size of 0.01 to find an appropriate r value that strikes a balance between prediction performance and model interpretability. As a result, only important and uncorrelated predictor variables are retained.
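The shadow-variable idea at the core of Boruta can be demonstrated in a self-contained way. Note the caveats: Boruta derives its importance scores from Random Forest runs, whereas this Python sketch substitutes a simple stand-in score (the absolute correlation of a predictor with the response) so that the example stays short; the data are synthetic and the 0.9 hit-rate cut-off is an illustrative choice, not Boruta's statistical test.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
y = rng.integers(0, 2, n)                         # binary response
informative = y + rng.normal(scale=0.5, size=n)   # related to y
noise = rng.normal(size=n)                        # unrelated to y
X = np.column_stack([informative, noise])

def score(col, y):
    """Stand-in importance: |correlation| of a predictor with the response."""
    return abs(np.corrcoef(col, y)[0, 1])

# Shadow variables: shuffled copies of each predictor, uninformative by
# construction. A predictor counts a "hit" whenever it beats the best
# shadow score (the analogue of the MZSA).
runs = 50
hits = np.zeros(X.shape[1])
for _ in range(runs):
    shadows = rng.permuted(X, axis=0)             # randomise each column
    mzsa = max(score(shadows[:, j], y) for j in range(X.shape[1]))
    hits += [score(X[:, j], y) > mzsa for j in range(X.shape[1])]

important = hits / runs > 0.9                     # the informative predictor passes
```

The informative predictor beats the best shadow in essentially every run, which is the behaviour Boruta formalises with its two-sided test against the MZSA.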

Environmental space
It is generally preferable to apply a suitable sampling design for model calibration and evaluation. This would ensure that the environmental variable space is sampled in a representative way. Various methods have been proposed to optimise sampling effort, including stratified random, generalised random tessellation stratified (Stevens Jr and Olsen, 2003) and conditioned Latin hypercube sampling (Minasny and McBratney, 2006), among others. However, such approaches are not feasible here due to time and financial constraints. Instead, we utilised available (legacy) sampling data. It might nevertheless be prudent to assess to what extent the selected samples cover the environmental space of the predictor variables. This was achieved by creating a random subsample (n = 10,000) of the selected environmental predictor variables and displaying the density distribution of the random subsample together with the density distribution of environmental variables based on the observations. This allows for a qualitative check to what degree the environmental space is sampled in a representative way.

Random Forest classification model
The Random Forest (RF) prediction algorithm (Breiman, 2001) was chosen for the analysis due to its high predictive performance in a number of domains (Che Hasan et al., 2012; Cutler et al., 2007; Diesing et al., 2017; Diesing and Thorsnes, 2018; Huang et al., 2014; Prasad et al., 2006). The RF is an ensemble technique based on classification trees (Breiman, 1984).
Randomness is introduced in two ways: by constructing each tree from a bootstrapped sample of the training data, and by using a random subset of the predictor variables at each split in the tree growing process. As a result, every tree in the forest is unique. By aggregating the predictions over a large number of uncorrelated trees, prediction variance is reduced and accuracy improved (James et al., 2013, p. 316). The 'votes' for a specific class can be interpreted as a measure of probability for that class occurring in a specific location. The final prediction is determined by the class with the highest probability (vote count) to occur in a specific location. The randomForest package (Liaw and Wiener, 2002) was used to perform the analysis.
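The voting logic can be made concrete with a toy example (Python; the vote counts below are invented for illustration and are not taken from the study):

```python
import numpy as np

# Each of 500 hypothetical trees casts one vote per location; the vote
# fraction per class is the RF probability, and the predicted class is
# the one with the most votes.
classes = np.array(["Calcareous sediment", "Clay", "Diatom ooze",
                    "Lithogenous sediment", "Radiolarian ooze"])
votes = np.array([310, 95, 40, 45, 10])           # votes at one grid cell
prob = votes / votes.sum()                        # per-class probabilities
predicted = classes[np.argmax(votes)]             # most probable class
# predicted → "Calcareous sediment", with probability 310/500 = 0.62
```

The probability of the winning class (0.62 here) is exactly the quantity mapped as the spatially-explicit confidence measure, and the full vector prob corresponds to the per-class probability surfaces.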
RF generally performs well with default settings, i.e. without the tuning of parameters. However, it might be advisable to tune those parameters that have the largest impact on predictive accuracy. Initial tuning of the number of trees in the forest (ntree) and the number of variables to consider at any given split (mtry), performed in a grid search with the package e1071 (Meyer et al., 2019) using a 10-fold cross-validation scheme with three repeats on the training dataset, showed a very limited impact on model performance, while at the same time the tuning process was very time-consuming. It was therefore decided to use the default parameter values.
The response variable is highly imbalanced (Table 2). This was accounted for by utilising a balanced version of RF (Chen et al., 2004). This is achieved by specifying the strata and sampsize arguments of the randomForest() function. The strata are the lithology classes and the sample size is determined by c · nmin, where c is the number of lithology classes (five) and nmin is the number of samples in the least frequent class. Hence, downsampling is applied when growing individual trees. However, each sample is drawn from all available observations as many trees are grown, making this scheme likely more effective than downsampling the dataset prior to model building.
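The sampsize computation is small enough to spell out. In this Python sketch the class counts are hypothetical, chosen only to mirror the structure of the balanced scheme (they are not the study's actual frequencies):

```python
# Hypothetical class frequencies (illustrative values only).
counts = {"Calcareous sediment": 4500, "Clay": 2800, "Diatom ooze": 1400,
          "Lithogenous sediment": 1200, "Radiolarian ooze": 300}

n_min = min(counts.values())              # size of the least frequent class
sampsize = {k: n_min for k in counts}     # n_min observations per stratum
total_per_tree = len(counts) * n_min      # c * n_min observations per tree
# Each tree is grown on 5 * 300 = 1500 observations, 300 per class,
# drawn anew for every tree from all available observations.
```

Because a fresh stratified sample is drawn for every tree, the majority classes are still fully exploited across the forest even though each individual tree sees only n_min observations per class.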
RF also provides a relative estimate of predictor variable importance. The importance() function of the randomForest package allows variable importance to be assessed as the mean decrease in either accuracy or node purity. However, the latter approach might be biased when predictor variables vary in their scale of measurement or their number of categories and was not used here. Variable importance is therefore measured as the mean decrease in accuracy associated with each variable when it is assigned random but realistic values and the rest of the variables are left unchanged. The worse a model performs when a predictor is randomised, the more important that predictor is in predicting the response variable. The mean decrease in accuracy was left unscaled as recommended by , and is reported as a fraction ranging from 0 to 1. Per default, individual trees of the forest are built using sampling with replacement (replace=TRUE). However, it has been shown that this choice might lead to bias in predictor variable importance measures. It was therefore opted to use sampling without replacement.
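The permutation-based mean decrease in accuracy can be reproduced with any fitted model. In the sketch below (Python, synthetic data), a trivial stand-in classifier replaces the Random Forest so the example stays self-contained: permuting the predictor the model relies on degrades accuracy sharply, while permuting an ignored predictor changes nothing.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
x_useful = rng.normal(size=n)
x_noise = rng.normal(size=n)
y = (x_useful > 0).astype(int)                    # depends only on x_useful

def model(X):
    """Stand-in for a fitted classifier: thresholds the first predictor."""
    return (X[:, 0] > 0).astype(int)

X = np.column_stack([x_useful, x_noise])
base_acc = np.mean(model(X) == y)                 # 1.0 by construction

importance = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])          # randomise one predictor
    importance.append(base_acc - np.mean(model(Xp) == y))
# Unscaled mean decrease in accuracy: large for x_useful, 0.0 for x_noise.
```

The values in importance are fractions between 0 and 1, matching the unscaled reporting used in the manuscript.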

Spatial cross-validation 160
Detailed guidelines for optimising sampling design for accuracy assessment have been developed (Olofsson et al., 2014; Stehman and Foody, 2019). However, this would require collecting new samples after modelling, which was not feasible given the geographic scope. Cross-validation schemes are frequently used to deal with such situations. It can be assumed that the response variable is spatially structured to some extent, and cross-validation therefore requires accounting for the spatial structure (Roberts et al., 2017). Here, a spatial leave-one-out cross-validation (S-LOO CV) scheme was applied. In a conventional LOO CV, a single observation is removed from the dataset and all other observations (n - 1) are used to train the model. The class of the withheld observation is then predicted using the n - 1 model. This is repeated for every observation in the dataset, producing observed and predicted classes at every location. In a S-LOO CV scheme, a buffer is placed around the withheld observation, and training data from within this buffer are omitted from both model training and testing so that there are no training data proximal to the test. The S-LOO CV scheme used here was adapted from Misiuk et al. (2019). The buffer size was estimated with the spatialAutoRange() function of the blockCV package (Valavi et al., 2018).
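The buffered leave-one-out scheme can be sketched as follows (Python; the function name spatial_loo_folds and the toy coordinates are invented for illustration, and the study estimated the buffer size with the blockCV package in R rather than fixing it by hand):

```python
import numpy as np

def spatial_loo_folds(xy, buffer):
    """For each observation i, return (i, train_indices) where every
    observation within `buffer` of point i (including i itself) is
    excluded from training: spatial leave-one-out with an exclusion buffer."""
    folds = []
    for i in range(len(xy)):
        d = np.linalg.norm(xy - xy[i], axis=1)
        train = np.where(d > buffer)[0]       # drop test point and its buffer
        folds.append((i, train))
    return folds

# Three points: two close together (3 units apart), one far away.
pts = np.array([[0.0, 0.0], [3.0, 0.0], [50.0, 0.0]])
folds = spatial_loo_folds(pts, buffer=10.0)
# Testing on point 0 excludes its close neighbour (point 1) from training:
# folds[0][1] → array([2])
```

Excluding the buffered neighbours prevents spatial autocorrelation from leaking information between training and test data, which is what inflates accuracy in a conventional (non-spatial) cross-validation.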

Accuracy assessment
The accuracy of the model was assessed based on a confusion matrix that was derived from the S-LOO CV. Overall accuracy and the balanced error rate (BER) were used to evaluate the global accuracy of the model, while error of omission and error of commission were selected as class-specific metrics of accuracy. The overall accuracy gives the percentage of cases correctly allocated and is calculated by dividing the total number of correct allocations by the total number of samples (Congalton, 1991). The BER is the average of the error rate for each class (Luts et al., 2010).
The error of omission is the number of incorrectly classified samples of one class divided by the total number of reference samples of that class. The error of commission is the number of incorrectly classified samples of one class divided by the total number of samples that were classified as that class (Story and Congalton, 1986). The overall accuracy, its 95% confidence 180 intervals and a one-sided test to evaluate whether the overall accuracy was significantly higher than the no information rate (NIR) were calculated by applying the confusionMatrix() function of the caret package (Kuhn, 2008). The confidence interval is estimated using a binomial test. The NIR is taken to be the proportion of the most frequent class. Errors of omission and commission are not provided by the function but can be calculated from the confusion matrix. The BER was calculated with the BER() function of the package measures (Probst, 2018).
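The metric definitions above can be condensed into a short sketch. The study computed these with the R packages caret and measures; the Python function below is an illustrative stand-in, and the matrix orientation (rows = reference classes, columns = predicted classes) is an assumption.

```python
import numpy as np

def accuracy_metrics(cm):
    """Derive the reported metrics from a confusion matrix. Assumed
    orientation: rows = reference (observed) classes, columns = predicted."""
    cm = np.asarray(cm, dtype=float)
    overall = np.trace(cm) / cm.sum()               # overall accuracy
    omission = 1 - np.diag(cm) / cm.sum(axis=1)     # per-class omission error
    commission = 1 - np.diag(cm) / cm.sum(axis=0)   # per-class commission error
    ber = omission.mean()                           # balanced error rate
    nir = cm.sum(axis=1).max() / cm.sum()           # no information rate
    return overall, ber, omission, commission, nir
```

For a two-class matrix [[8, 2], [1, 9]], for example, the overall accuracy is 17/20 = 0.85 while the BER is the mean of the two class error rates (0.2 and 0.1), i.e. 0.15.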

Environmental space
It is generally preferable to apply a suitable sampling design for model calibration and evaluation. This would ensure that the environmental variable space is sampled in a representative way. Various methods have been proposed to optimise sampling effort, including stratified random, generalised random tessellation stratified (Stevens Jr and Olsen, 2003) and conditioned Latin hypercube sampling (Minasny and McBratney, 2006), among others. Guidelines for optimising sampling design for accuracy assessment have also been developed (Olofsson et al., 2014; Stehman and Foody, 2019). However, such approaches are not feasible here due to time and financial constraints. Instead, we utilised available (legacy) sampling data. It might nevertheless be prudent to assess to what extent the selected samples cover the environmental space of the predictor variables. This was achieved by creating a random subsample (n = 10,000) of the selected environmental predictor variables and displaying the density distribution of the random subsample together with the density distribution of environmental variables based on the training data. This allows for a qualitative check of the degree to which the environmental space is sampled in a representative way.
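The qualitative coverage check described above amounts to comparing two distributions of the same variable. A minimal sketch (in Python rather than the R used in the study; histograms stand in for the kernel density estimates) might look like this:

```python
import numpy as np

def coverage_check(background, training, bins=20):
    """Normalised histograms of one environmental variable over a random
    background subsample of the prediction grid and over the training
    locations, computed on shared bins for a qualitative comparison."""
    edges = np.histogram_bin_edges(np.concatenate([background, training]), bins=bins)
    bg_density, _ = np.histogram(background, bins=edges, density=True)
    tr_density, _ = np.histogram(training, bins=edges, density=True)
    return edges, bg_density, tr_density
```

Plotting the two density curves on top of each other reveals, for instance, whether the training samples over-represent part of the variable's range.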

Variable selection
The Boruta algorithm was run with maxRuns = 500 iterations and a p-value of 0.05, leaving no variables unresolved (i.e. tentative). All 38 predictor variables initially included in the model were deemed important according to the Boruta analysis (Fig. 2). Based on a plot of the RF out-of-bag (OOB) error estimates over the correlation coefficient r, a value of 0.5 was selected (Fig. 3). This selection ensured high model performance while at the same time minimising the number of predictor variables. Subsequent correlation analysis reduced the number of retained predictor variables to nine. These were bathymetry (MS_bathy_5m), distance to shore (MS_biogeo5_dist_shore_5m), sea-floor maximum temperature (BO2_tempmax_bdmean), sea-surface temperature range (BO2_temprange_ss), sea-surface maximum primary productivity (BO2_ppmax_ss), sea-floor minimum temperature (BO2_tempmin_bdmean), sea-surface maximum salinity (BO2_salinitymax_ss), sea-surface salinity range (BO2_salinityrange_ss) and sea-surface minimum silicate (BO2_silicatemin_ss). The strongest correlations between the remaining predictor variables (Fig. 4) were found between bathymetry and sea-floor minimum temperature (r = 0.38), sea-surface salinity range and sea-surface minimum silicate (r = -0.35) and bathymetry and distance to shore (r = -0.34). Maps of the selected predictor variables are shown in Fig. A1.
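The second step of the variable selection, removing predictors that are correlated above the critical r with a more important predictor, can be sketched as follows. This is an illustrative Python version of the general procedure, not the R code used in the study; the importance vector is assumed to come from a preceding all-relevant step such as Boruta.

```python
import numpy as np

def drop_correlated(X, importance, r_max=0.5):
    """Walk through predictors in decreasing order of importance and keep
    each one only if its absolute Pearson correlation with every
    already-kept predictor does not exceed r_max."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in np.argsort(importance)[::-1]:  # most important first
        if all(corr[j, k] <= r_max for k in kept):
            kept.append(j)
    return sorted(kept)
```

Because the result depends on r_max, trialling a range of values and plotting the resulting OOB error, as done here, makes the otherwise subjective choice of the threshold transparent.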

Environmental space
The environmental space (Fig. 5) is generally sampled adequately, although there is a tendency for an over-representation of shallower water depths (above 3,000 m) and areas closer to land (less than 500 km). Sea-floor maximum temperature, sea-surface temperature range, sea-surface maximum primary productivity, sea-floor minimum temperature, and sea-surface maximum salinity are all slightly biased towards higher values. Sea-surface salinity range and sea-surface minimum silicate are the environmental variables that are most closely represented by the samples.

Model tuning
The RF model was tuned with mtry ranging between 2 and 9 (the maximum number of predictors) and ntree ranging between 100 and 2000 in steps of 100. The covered range was based on previous experience. The optimum parameter combination, based on the classification error, was mtry = 2 and ntree = 600. The differences in classification error were however small, with a maximum difference of 0.95 % between the best and the worst performing combination (Fig. 4).
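The tuning procedure is an exhaustive grid search over the two parameters. A minimal sketch (the `oob_error` callback is a hypothetical stand-in for fitting one Random Forest, e.g. with the randomForest package in R, and returning its OOB classification error):

```python
from itertools import product

def tune_rf(oob_error, mtry_values, ntree_values):
    """Exhaustive grid search over the two Random Forest parameters.
    oob_error(mtry, ntree) fits one forest for that combination and
    returns its out-of-bag classification error."""
    errors = {(m, n): oob_error(m, n) for m, n in product(mtry_values, ntree_values)}
    best = min(errors, key=errors.get)
    return best, errors
```

With mtry in 2-9 and ntree in 100-2000 (step 100), the grid comprises 8 × 20 = 160 parameter combinations.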

Model accuracy
The confusion matrix based on the S-LOO CV is shown in Table 3. Accuracy was high for the frequent classes (Calcareous sediment and Clay), while considerably lower for the rare classes (Diatom ooze, Lithogenous sediment, and Radiolarian ooze).

Spatial distribution of deep-sea sediments
Probability surfaces of individual sediment classes with verbal descriptions of likelihood (Mastrandrea et al., 2011) based on the estimated probabilities are displayed in Fig. 6. For any given pixel in the map, the final lithology class is the one with the highest probability. The probability of the most probable class might be interpreted as a spatially explicit measure of map confidence. The resulting maps of the spatial distribution of deep-sea sediments and their associated confidence are shown in Fig. 7. Calcareous sediment and Clay dominate throughout the Pacific, Atlantic and Indian Oceans, whereby Clay occupies the deep basins and Calcareous sediment is found in shallower parts of the ocean basins. In the Southern Ocean, seafloor sediments are arranged in a banded pattern around Antarctica, with Clay and Lithogenous sediment forming an inner ring closest to land (Fig. A2). An outer ring of siliceous oozes (Diatom ooze and Radiolarian ooze) dominates in the Southern Ocean. The width of this "opal belt" (Lisitzin, 1971) varies and in places, most notably south of South America, it is discontinuous. The Arctic Ocean appears to be dominated by Clay; however, confidence is generally low here, most likely due to the absence of samples north of 75° N (Fig. 1). Overall, map confidence varies between 0.218 and 1. It is generally lower in the vicinity of class boundaries and higher in the geographic centre of a class.
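Deriving the categorical map and the confidence surface from the per-class probability surfaces is a per-pixel argmax, which can be sketched as (illustrative Python, with a stack of probability rasters assumed to be loaded as an array):

```python
import numpy as np

def most_probable_class(prob_stack, class_names):
    """prob_stack has shape (n_classes, rows, cols): one probability
    surface per lithology class. Returns the most probable class per
    pixel plus its probability, which serves as a confidence surface."""
    winner = prob_stack.argmax(axis=0)          # index of the winning class
    confidence = prob_stack.max(axis=0)         # probability of that class
    return np.asarray(class_names)[winner], confidence
```

A pixel where the winning class only narrowly beats the runner-up thus receives a low confidence value, which is why confidence tends to drop near class boundaries.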

The sea-floor lithology map bears a notable resemblance to previously published hand-drawn maps (e.g. Berger, 1974) in that the distributions of Calcareous sediment, Clay and Diatom ooze are very similar. Clear differences were however also noted: most strikingly, our map does not display a band of Radiolarian ooze in the equatorial Pacific. This was also noted by  and is likely related to the sample data, which do not show a predominance of Radiolarian ooze in the equatorial Pacific. Whether the band of Radiolarian ooze is under-represented in the sample data or in fact less prominent than previously depicted remains unresolved. Based on the predicted distribution of lithology classes, Calcareous sediments cover approximately 121 million km² of seabed below 500 m water depth, equivalent to 36.8 % of the total area (Table 4).

Predictor variable importance
The three most important predictor variables were sea-surface maximum salinity, sea-floor maximum temperature and bathymetry, with mean decreases in accuracy between 10 % and 12 % (Fig. 8).
These findings are similar to results from Dutkiewicz et al. (2016), who determined sea-surface salinity, sea-surface temperature and bathymetry as the most important controls on the distribution of deep-sea sediments. Sea-surface minimum silicate was of medium importance (8.6 % decrease in accuracy), while sea-surface temperature range, sea-surface maximum primary productivity, distance to shore and sea-surface salinity range were of lower importance (<5 % decrease in accuracy).

Limitations of the approach
This study utilises legacy sampling data to make predictions of the spatial distribution of seafloor lithologies in the deep sea. This is the only viable approach, as it is unrealistic to finance and execute a survey programme that samples the global ocean with adequate density within a reasonable timeframe. However, this approach also has some drawbacks: The presented spatial predictions were based on forming relationships between lithology classes and environmental predictor variables. For such a task, it would be desirable to cover the range of values of each of the predictor variables used in the model (Minasny and McBratney, 2006). Although it was not possible to design a sampling survey, it became nevertheless obvious that the environmental space is reasonably well covered, presumably because of the relatively large number of observations, which was achievable as there was virtually no cost associated with "collecting" the samples. However, it might not always be the case that a large sample dataset leads to adequate coverage of the environmental space. In such a case, it might be desirable to draw a suitable sub-sample that approximates the distribution of the environmental variables.
Data originating from many cruises over long time periods are most likely heterogeneous, which might lead to increased uncertainty in the predictions. Sources of uncertainty might relate to sampling gear type, vintage and timing of sampling, representativeness of subsampling, analytical pre-treatment, inconsistency of classification standards and more (van Heteren and Van Lancker, 2015). However,  made efforts to homogenise the data. From a total number of more than 200,000 samples, they selected 14,400 based on strict quality-control criteria. Only surface and near-surface samples that were collected using coring, drilling, or grabbing methods were included.
Furthermore, only samples whose descriptions could be verified using original cruise reports, cruise proceedings and core logs were retained. Their classification scheme is deliberately generalised in order to successfully depict the main types of sediments found in the global ocean and to overcome shortcomings of inconsistent, poorly defined and obsolete classification schemes and terminologies .
Additional uncertainty might be introduced through imprecise positioning of the samples, which might lead to incorrect relations between the response variable and the predictor variables. No metadata exist on the positioning accuracy or even the method of determining the position, which might give some clues on the error associated with the recorded positions. However, the chances that this shortcoming leads to significant problems when making associations between target and predictor variables are relatively low, as the chosen model resolution of 10 km is relatively coarse when compared with positioning accuracy. The initial choice of predictor variables was informed by the current understanding of the controls on deep-sea sedimentation (Dutkiewicz et al., 2016; Seibold and Berger, 1996). Consequently, all selected predictor variables were deemed important (Fig. 2). The three most important predictor variables (Fig. 8) are also in good agreement with Dutkiewicz et al. (2016).
However, the large errors of omission and especially commission for the rare lithologies Diatom ooze, Lithogenous sediment, and Radiolarian ooze might indicate that the environmental controls are less well represented for these sediment types. Lithogenous sediment comprises a wide range of grain sizes (silt, sand, gravel and coarser) and proximity to land might be an insufficient predictor. In fact, distance to shore had the second lowest variable importance (Fig. 8).
Sedimentation rates in the deep sea typically range on the order of 1-100 mm per 1,000 yrs (Seibold, 1975). The sample depths in the dataset used here might have ranged from core top to a few dm. The lithologic signal might therefore be integrated over timescales of approximately 100 yrs to a few 100,000 yrs. The model hindcasts to derive the oceanographic predictors typically cover approximately 25 yrs, while bathymetry and distance to coast might be nearly constant since global sea-level rise ceased approximately 6,700 yrs ago (Lambeck et al., 2014). Hence, there likely exists a mismatch between the time intervals, although oceanographic variables might not have changed dramatically over much longer timescales than a few decades.
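The quoted integration timescales follow directly from dividing the sampled depth interval by the sedimentation rate. As a worked check (illustrative values taken from the ranges in the text):

```python
def integration_time_yr(sample_depth_m, rate_mm_per_kyr):
    """Years of deposition represented by a sample of the given depth,
    assuming a constant sedimentation rate."""
    depth_mm = sample_depth_m * 1000.0          # metres -> millimetres
    return depth_mm / rate_mm_per_kyr * 1000.0  # kyr -> yr
```

A core top of about 1 cm at the fastest rate (100 mm per 1,000 yrs) integrates roughly 100 yrs, whereas a few dm (e.g. 0.2 m) at the slowest rate (1 mm per 1,000 yrs) integrates a few 100,000 yrs, matching the range stated above.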

Potential usage
Despite the good agreement with previously published maps and a reasonable overall map accuracy of 60 %, there is large variation in the class-specific error as well as the spatial distribution of map confidence. It is therefore recommended to always consult the information on map confidence along with the map of seafloor lithologies.
The probability surfaces of the five lithologies might be used as input for spatial prediction and modelling, e.g. marine species distribution modelling on a global scale, which typically lacks information on seafloor sediments, although substrate type is assumed to be an important environmental predictor. Additionally, the presented data layers might be useful for the spatial prediction of sediment properties (e.g. carbonate and organic carbon content).
The categorical map might serve as a resource for education and teaching, provide context for research pertaining to the global seafloor, support marine planning, management and decision-making and underpin the design of marine protected areas globally. Additionally, the provided lithology map might be useful for survey planning, especially in conjunction with confidence information to target areas where a certain lithology is most likely to occur. Conversely, areas of low confidence could be targeted to further improve the accuracy of and confidence in the global map of deep-sea sediments.

Conclusions
Based on a homogenised dataset of seafloor lithology samples  and global environmental predictor variables from Bio-ORACLE (Assis et al., 2018; Tyberghein et al., 2012) and MARSPEC (Sbrocco and Barber, 2013), it was possible to spatially predict the distribution of deep-sea sediments globally. The general understanding of the controls on deep-sea sedimentation helped to build a spatial model that gives a good representation of the main lithologies Calcareous sediment, Clay, Diatom ooze, Lithogenous sediment, and Radiolarian ooze, which collectively cover nearly 95 % of the mapped seafloor. Further improvements should be directed towards the controls on the distribution of the rarer lithologies (Diatom ooze, Lithogenous sediment, and Radiolarian ooze).

Author contribution
MD designed the study, developed the model code, executed the analysis, and wrote the manuscript.

Competing interests
The author declares no competing interests.