Introduction

A map of deep-sea sediments of the global ocean is derived from seafloor samples, categorised into lithology types, and predictor variables that are known to influence spatial patterns of seafloor lithology. After an initial preparation of the required input data, a selection of important and uncorrelated predictor variables is carried out. A Random Forest model is fitted and predicted. Outputs are a map of global dea-sea sediments, an associated map of the probability of the predicted class and individual probability maps of the predicted lithology classes.

Preparation

Load required R packages

The required R packages are loaded.

#options(warn = -1)
library(geosphere)
library(sdmpredictors)
## Loading required package: raster
## Loading required package: sp
## Loading required package: rgdal
## rgdal: version: 1.4-7, (SVN revision 845)
##  Geospatial Data Abstraction Library extensions to R successfully loaded
##  Loaded GDAL runtime: GDAL 2.2.3, released 2017/11/20
##  Path to GDAL shared files: C:/Users/diesing_markus/Documents/R/R-3.6.1/library/rgdal/gdal
##  GDAL binary built with GEOS: TRUE 
##  Loaded PROJ.4 runtime: Rel. 4.9.3, 15 August 2016, [PJ_VERSION: 493]
##  Path to PROJ.4 shared files: C:/Users/diesing_markus/Documents/R/R-3.6.1/library/rgdal/proj
##  Linking to sp version: 1.3-2
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
library(summarytools)
## Registered S3 method overwritten by 'pryr':
##   method      from
##   print.bytes Rcpp
## 
## Attaching package: 'summarytools'
## The following object is masked from 'package:raster':
## 
##     freq
library(Boruta)
## Loading required package: ranger
## 
## Attaching package: 'ranger'
## The following object is masked from 'package:randomForest':
## 
##     importance
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
library(raster)
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(rgdal)
library(corrplot)
## corrplot 0.84 loaded
library(blockCV)
## Warning: package 'blockCV' was built under R version 3.6.3
library(measures)
## Warning: package 'measures' was built under R version 3.6.3
## 
## Attaching package: 'measures'
## The following objects are masked from 'package:caret':
## 
##     MAE, RMSE

Define projection of the project

All data will be projected to Wagner IV, a global equal-area projection. See https://projectionwizard.org/ for selecting suitable projections.

crs <- "+proj=wag4 +lon_0=0"

List files

Shapefiles in the input folder are listed.

list.files(path = "../input",pattern = ".shp")
## [1] "IntClass_DuplicatesRemoved_deepsea500.shp"     
## [2] "IntClass_DuplicatesRemoved_deepsea500.shp.xml" 
## [3] "IntClass_mindist14500m_deepsea500.shp"         
## [4] "IntClass_mindist14500m_deepsea500.shp.xml"     
## [5] "SimpClass_DuplicatesRemoved_deepsea500.shp"    
## [6] "SimpClass_DuplicatesRemoved_deepsea500.shp.xml"

Response variable

Read in sample data

Load a shapefile with class labels. The sample data are derived from Dutkiewicz et al. (2015): https://doi.org/10.1130/G36883.1. Sample data can be downloaded from: ftp://ftp.earthbyte.org/papers/Dutkiewicz_etal_seafloor_lithology/iPython_notebook_and_input_data/seafloor_data.npz Data were initially converted into a csv file containing latitude, longitude and class (number code between 1 and 13). The csv file was converted into a shp file and the following pre-processing operations were carried out in a GIS:

  • Removal of duplicates

  • Restricting to samples in the deep sea: Select by location with deepsea500.shp.

The original classification with 13 classes (Dutkiewicz et al., 2015) was reduced to 5 classes as per Dutkiewizc et al. (2016):

New class Original class(es)
Calc.sed Calcareous ooze, Fine-grained calcareous sediment
Clay Clay
Dia.Ooze Diatom ooze
Lith.Sed Gravel and coarser, Sand, Silt
Rad.Ooze Radiolarian ooze

Mixed (Mixed calcareous-siliceous ooze and siliceous mud) and rare classes (Ash and volcanic sand/gravel, Sponge spicules and Shells and coral fragments) were removed.

samples <- readOGR(dsn = "../input",layer = "SimpClass_DuplicatesRemoved_deepsea500")
## OGR data source with driver: ESRI Shapefile 
## Source: "C:\Users\diesing_markus\OneDrive - Norges Geologiske Undersøkelse\R\Global deep-sea sediments\input", layer: "SimpClass_DuplicatesRemoved_deepsea500"
## with 10484 features
## It has 4 fields
## Integer64 fields read as strings:  Class

Project sample data

The SpatialPointsDataFrame samples is projected from unprojected lat/lon (WGS84) to Wagner IV.

samples <- spTransform(samples, crs)
samples
## class       : SpatialPointsDataFrame 
## features    : 10484 
## extent      : -17236964, 17255812, -8321706, 8171254  (xmin, xmax, ymin, ymax)
## crs         : +proj=wag4 +lon_0=0 +ellps=WGS84 
## variables   : 4
## names       : Latitude, Longitude, Class, SimpClass 
## min values  :  -77.633,  -179.967,     1,  Calc.Sed 
## max values  :   74.866,  179.9965,     7,  Rad.Ooze

Plot samples

plot(samples, pch = 16, cex = 0.3)

Predictor variables

List available Bio-ORACLE and MARSPEC data layers

Displays a table of available predictor variables.

References:

Assis, J, Tyberghein, L, Bosch, S, Verbruggen, H, Serrão, EA, De Clerck, O. Bio‐ORACLE v2.0: Extending marine data layers for bioclimatic modelling. Global Ecol Biogeogr. 2018; 27: 277– 284. https://doi.org/10.1111/geb.12693

Sbrocco, E. J. and Barber, P. H. (2013) MARSPEC: ocean climate layers for marine spatial ecology. Ecology 94: 979. https://doi.org/10.1890/12-1358.1

Tyberghein, L. , Verbruggen, H. , Pauly, K. , Troupin, C. , Mineur, F. and De Clerck, O. (2012), Bio‐ORACLE: a global environmental dataset for marine species distribution modelling. Global Ecology and Biogeography, 21: 272-281. https://10.1111/j.1466-8238.2011.00656.x

datasets <- list_datasets(terrestrial = F, marine = T)
layers <- list_layers(datasets)
layers

Load predictor variables

Predictor variables were selected based on previous knowledge from literature, e.g. Dutkiewicz et al. (2016) and Seibold and Berger (1996) and availability. Layers are downloaded from the Bio-ORACLE 2 database, which also includes MAESPEC data. For local storage of layers define datadir = “”.

predictors <- load_layers(
  layercodes =
    c("MS_bathy_5m", "MS_biogeo05_dist_shore_5m",
      "BO2_tempmean_ss", "BO2_tempmax_ss", "BO2_tempmin_ss", "BO2_temprange_ss",
      "BO2_dissoxmean_ss", "BO2_dissoxmax_ss", "BO2_dissoxmin_ss", "BO2_dissoxrange_ss",
      "BO2_ironmean_ss", "BO2_ironmax_ss", "BO2_ironmin_ss", "BO2_ironrange_ss",
      "BO2_phosphatemean_ss", "BO2_phosphatemax_ss", "BO2_phosphatemin_ss", "BO2_phosphaterange_ss",
      "BO2_nitratemean_ss", "BO2_nitratemax_ss", "BO2_nitratemin_ss", "BO2_nitraterange_ss",
      "BO2_ppmean_ss", "BO2_ppmax_ss", "BO2_ppmin_ss", "BO2_pprange_ss",
      "BO2_salinitymean_ss", "BO2_salinitymax_ss", "BO2_salinitymin_ss", "BO2_salinityrange_ss",
      "BO2_silicatemean_ss", "BO2_silicatemax_ss", "BO2_silicatemin_ss", "BO2_silicaterange_ss",
      "BO2_tempmean_bdmean", "BO2_tempmax_bdmean", "BO2_tempmin_bdmean", "BO2_temprange_bdmean"
      ),
  equalarea = FALSE, rasterstack = TRUE)
## Warning in get_datadir(datadir): file.path(tempdir(), "sdmpredictors") will
## be used as datadir, set options(sdmpredictors_datadir="<directory>") to avoid
## re-downloading the data in every session or set the datadir parameter in
## load_layers
predictors
## class      : RasterStack 
## dimensions : 2160, 4320, 9331200, 38  (nrow, ncol, ncell, nlayers)
## resolution : 0.08333333, 0.08333333  (x, y)
## extent     : -180, 180, -90, 90  (xmin, xmax, ymin, ymax)
## crs        : +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0 
## names      :   MS_bathy_5m, MS_biogeo05_dist_shore_5m, BO2_tempmean_ss, BO2_tempmax_ss, BO2_tempmin_ss, BO2_temprange_ss, BO2_dissoxmean_ss, BO2_dissoxmax_ss, BO2_dissoxmin_ss, BO2_dissoxrange_ss, BO2_ironmean_ss, BO2_ironmax_ss, BO2_ironmin_ss, BO2_ironrange_ss, BO2_phosphatemean_ss, ... 
## min values : -1.049300e+04,              1.000000e+00,   -1.797733e+00,  -1.740006e+00,  -1.940000e+00,     7.999992e-02,      1.280844e+02,     1.987810e+02,     7.664922e+01,       3.239714e+00,    5.003627e-06,   5.079128e-06,   4.983647e-06,     5.124374e-08,         1.429982e-05, ... 
## max values :   -1.00000000,             2778.00000000,     30.17862892,    34.75999451,    29.35999298,      28.94999869,      400.86618443,     453.65198161,     381.97638152,       300.49582837,      0.01973090,     0.04008445,     0.01348529,       0.03305599,           2.23390642, ...

Subset to deep sea

The predictors are subset to the deep-sea area below 500 m water depth.

predictors$MS_bathy_5m[predictors$MS_bathy_5m > -500] <- NA
predictors <- mask(crop(predictors, predictors$MS_bathy_5m), predictors$MS_bathy_5m)

Project raster stack

The raster stack is then projected to Wagner IV with a pixel resolution of 10 km.

predictors <- projectRaster(from = predictors, res = 10000, crs = crs, method = "bilinear")
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 47709
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 36202
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 33524
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 31062
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 28781
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 26663
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 24692
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 22845
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 21113
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 19485
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 17957
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 16518
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 15159
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 13885
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 12683
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 11549
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 10488
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 9483
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 8541
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 7661
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 6836
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 6067
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 5348
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 4684
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 4069
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 3502
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 2986
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 2513
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 2088
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 1710
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 1373
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 1081
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 834
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 630
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 467
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 350
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 271
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 230
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 241
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 295
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 387
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 523
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 698
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 922
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 1185
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 1492
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 1844
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 2241
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 2683
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 3170
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 3708
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 4293
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 4925
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 5610
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 6346
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 7137
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 7982
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 8885
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 9851
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 10863
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 11966
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 13121
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 14350
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 15656
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 17043
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 18516
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 20078
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 21746
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 23520
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 25412
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 27433
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 29614
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 31956
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 34496
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 37264
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 32737
## projected point(s) not finite
predictors
## class      : RasterBrick 
## dimensions : 1739, 3469, 6032591, 38  (nrow, ncol, ncell, nlayers)
## resolution : 10000, 10000  (x, y)
## extent     : -17344073, 17345927, -8692862, 8697138  (xmin, xmax, ymin, ymax)
## crs        : +proj=wag4 +lon_0=0 +ellps=WGS84 
## source     : C:/Users/diesing_markus/AppData/Local/Temp/RtmpkbYcaP/raster/r_tmp_2020-07-08_153341_24480_97490.grd 
## names      :   MS_bathy_5m, MS_biogeo05_dist_shore_5m, BO2_tempmean_ss, BO2_tempmax_ss, BO2_tempmin_ss, BO2_temprange_ss, BO2_dissoxmean_ss, BO2_dissoxmax_ss, BO2_dissoxmin_ss, BO2_dissoxrange_ss, BO2_ironmean_ss, BO2_ironmax_ss, BO2_ironmin_ss, BO2_ironrange_ss, BO2_phosphatemean_ss, ... 
## min values : -1.043116e+04,              1.000000e+00,   -1.797137e+00,  -1.740006e+00,  -1.940000e+00,     7.999992e-02,      1.496089e+02,     1.987813e+02,     1.036297e+02,       3.243045e+00,    5.003856e-06,   5.079576e-06,   4.983790e-06,     5.174933e-08,         1.444217e-05, ... 
## max values : -5.000000e+02,              2.777553e+03,    3.003049e+01,   3.327256e+01,   2.934584e+01,     2.397654e+01,      3.855367e+02,     4.054187e+02,     3.765262e+02,       1.531691e+02,    6.892411e-03,   9.712671e-03,   3.843763e-03,     8.999437e-03,         2.233885e+00, ...
plot(predictors)

Scale predictor variables

Predictors are scaled for further analysis.

sc_preds <- scale(predictors)

Extract predictor variable values to observations

Values of predictor variables are extracted from the raster stack and attached to every sample point.

sdata <- data.frame(samples[c(4,1,2)],extract(sc_preds,samples))
names(sdata)[1] <- "class" #Change header name of first column to "class"
sdata <- sdata[-c(4:6)] #Removes unnecessary information
sdata <- subset(x = sdata, subset = !is.na(sdata$MS_bathy_5m)) #Removes NAs

Distance matrix

A distance matrix between all points is computed. This will be used for spatial cross validation.

sdat <- sdata
coordinates(sdat) = ~ Longitude + Latitude
proj4string(sdat)<- CRS("+proj=longlat +datum=WGS84")
dist.mat <- distm(sdat)

sdata <- sdata[-c(2:3)] # Remove coordinates from sdata

Define variables

Some variables that will be used subsequently are defined here.

seed <- 42 #Set a seed for reproducibility
preds <- sdata[2:ncol(sdata)] #Predictor variables without response variable
res <- sdata[,1] #Response variable
classes <- levels(res) #classes to predict
Nclasses <- nlevels(res) #Number of class labels

Npreds <- ncol(preds) #Number of predictor variables
Nobs <- nrow(preds) #Number of observations
nmin <- min(table(sdata$class)) #Number of observations in the minority class

#Colours for the sediment classes (in alphabetical order)
col.pal <- c(rgb(81,141,203, maxColorValue = 255), rgb(111,78,45, maxColorValue = 255), rgb(202,212,79, maxColorValue = 255), rgb(245,245,122, maxColorValue = 255), rgb(78,140,63, maxColorValue = 255))

Summary of sdata as html file

Summary of the data frame “sdata”. This is especially useful to check whether there are NA values. Note that the Boruta algorithm (see below) will not execute in such a case.

view(dfSummary(sdata), method = "browser")
## Output file written: C:\Users\DIESIN~1\AppData\Local\Temp\RtmpkbYcaP\file5fa02cfd1b98.html

Predictor variable selection

Predictor variable selection is carried out to reduce the dimensionality of the dataset. This consists of two steps: Boruta analysis to remove unimportant variables and correlation analysis to remove correlated variables.

Boruta variable selection

First, the Boruta algorithm is used to identify important variables and remove tentative or unimportant variables.

set.seed(seed) #Set seed for reproducibility
B1 <- Boruta(as.factor(sdata[[1]]) ~ .,data=preds, pValue = 0.05,
             maxRuns = 500)

Export Boruta figure

Export of the Boruta figure as a jpg file for use in publication.

jpeg("../figs/VariableSelection/Boruta.jpeg", width = 25, height = 15, res = 500, units = "cm", pointsize = 8, quality = 100)
par(mar=c(12.5,4,1,1))
par(mfrow=c(1,1))
plot(B1, las=2, colCode = c("greenyellow", "yellow2", "red3", "cadetblue"), xlab = "")
dev.off()
## png 
##   2

Correlation analysis

Second, colinearity is reduced through a correlation analysis. Of correlated variables only the one with the highest Boruta importance score is retained.

Variable importance

The predictor variables are sorted in decreasing order of Boruta variable importance. The function attStats() extracts importance statistics from the Boruta object (B1). The mean importance [1] is used here. As we want to exclude all variables that were unimportant or undecided, we subset the data to those that were “Confirmed”.

BorImp <- attStats(B1)[,c(1,6)]
BorImp <- subset(BorImp, decision == "Confirmed")
BorImp$Var <- rownames(BorImp)
rownames(BorImp) <- NULL
BorList <- list(BorImp[order(BorImp[1],decreasing=T),c(3,1)])

BorList
## [[1]]
##                          Var  meanImp
## 1                MS_bathy_5m 54.10350
## 38      BO2_temprange_bdmean 43.91780
## 2  MS_biogeo05_dist_shore_5m 39.37547
## 6           BO2_temprange_ss 35.62872
## 24              BO2_ppmax_ss 34.85403
## 26            BO2_pprange_ss 34.32361
## 37        BO2_tempmin_bdmean 32.26956
## 10        BO2_dissoxrange_ss 31.90227
## 36        BO2_tempmax_bdmean 31.81061
## 28        BO2_salinitymax_ss 31.59613
## 27       BO2_salinitymean_ss 31.30111
## 35       BO2_tempmean_bdmean 30.49724
## 30      BO2_salinityrange_ss 30.32951
## 23             BO2_ppmean_ss 30.14413
## 25              BO2_ppmin_ss 29.66990
## 33        BO2_silicatemin_ss 27.04216
## 18     BO2_phosphaterange_ss 27.01037
## 29        BO2_salinitymin_ss 26.97217
## 14          BO2_ironrange_ss 25.93805
## 9           BO2_dissoxmin_ss 25.59439
## 13            BO2_ironmin_ss 25.49440
## 34      BO2_silicaterange_ss 25.39390
## 22       BO2_nitraterange_ss 25.18493
## 4             BO2_tempmax_ss 24.97719
## 11           BO2_ironmean_ss 24.89616
## 12            BO2_ironmax_ss 24.56195
## 17       BO2_phosphatemin_ss 24.25184
## 8           BO2_dissoxmax_ss 23.62954
## 3            BO2_tempmean_ss 22.90907
## 15      BO2_phosphatemean_ss 22.43548
## 5             BO2_tempmin_ss 22.40179
## 16       BO2_phosphatemax_ss 22.22922
## 31       BO2_silicatemean_ss 22.02477
## 32        BO2_silicatemax_ss 21.93855
## 19        BO2_nitratemean_ss 21.78520
## 21         BO2_nitratemin_ss 21.64518
## 20         BO2_nitratemax_ss 21.23472
## 7          BO2_dissoxmean_ss 19.97032

Influence of choice of r on model performance

The effect of variable selection through the choice of the cutoff value for r on prediction performance is evaluated. For every r value between 0.1 and 1 (step 0.01), a random forest model is built with default parameters and the out of bag error is plotted over r.

df_total <- data.frame()

for (i in c(10:100)*0.01) {
r <- i

cr <- cor(preds[BorList[[1]][[1]]])

for(j in 1:length(cr[1,])){
  if (j == 1){
    pl <- c(names(cr[j,][1]),names( cr[j,][sqrt((cr[j,])^2)<r]))
    pl1 <- pl
  } else if (names(cr[j,])[j] %in% pl1){
    rem <- names(cr[j,-c(1:j)][sqrt((cr[j,-c(1:j)])^2)>r])
    if (length(rem) != 0L){  
      pl <- pl[!pl %in% rem]
    }
  }
  next
}

clms <- c(names(sdata[1]),pl)
seldata <- sdata[clms]

set.seed(seed)
fit <- randomForest(class ~ ., data=seldata)
df <- data.frame(i, length(pl), fit$err.rate[nrow(fit$err.rate),1])
df_total <- rbind(df_total,df)
}

names(df_total)[2] <- "Npreds"
names(df_total)[3] <- "OOBerror"

max(df_total$OOBerror) - min(df_total$OOBerror)
## [1] 0.05633263
ggplot(df_total,aes(i,OOBerror)) + geom_point(aes(size=Npreds)) +
  xlab("r") + ylab("OOB error ")

Set maximum allowable r-value for correlation analysis

A maximum allowable r-value needs to be set as threshold to remove correlated variables.

r <- 0.5

Remove correlated variables

A correlation analysis is carried out. Of correlated variables, only the one with the highest variable importance is retained.

cr <- cor(preds[BorList[[1]][[1]]])

for(j in 1:length(cr[1,])){
  if (j == 1){
    pl <- c(names(cr[j,][1]),names( cr[j,][sqrt((cr[j,])^2)<r]))
    pl1 <- pl
  } else if (names(cr[j,])[j] %in% pl1){
    rem <- names(cr[j,-c(1:j)][sqrt((cr[j,-c(1:j)])^2)>r])
    if (length(rem) != 0L){  
      pl <- pl[!pl %in% rem]
    }
  }
  next
}

Npreds <- nrow(as.data.frame(pl))
clms <- c(names(sdata[1]),pl)
seldata <- sdata[clms]
sel.preds <- subset(sc_preds,pl)

Correlation plot for important and uncorrelated variables

A visual representation of the correlatedness of the selected variables.

z <- cor(preds[,pl])
corrplot.mixed(z, lower.col =  "black", tl.pos = "lt", number.cex = 0.6)

jpeg("../figs/VariableSelection/corrplot.jpeg", width = 10, height = 10, res = 500, units = "cm", pointsize = 8, quality = 100)
corrplot.mixed(z, lower.col =  "black", tl.pos = "lt", number.cex = 0.6)
dev.off()
## png 
##   2

Data exploration

Summary of data as html file

view(dfSummary(seldata), file = "../figs/DataExploration/TrainSummary.html")
## Output file written: C:\Users\diesing_markus\OneDrive - Norges Geologiske Undersøkelse\R\Global deep-sea sediments\figs\DataExploration\TrainSummary.html

Box plots

for (i in 2:ncol(seldata)) {
  
  print(ggplot(seldata, aes(x = class, y = seldata[,i],fill = class)) +
          geom_boxplot() +
          scale_fill_manual(values = col.pal) +
          scale_y_continuous(name = names(seldata[i])) +
          theme(axis.text.x = element_blank(),
                axis.title.x = element_blank(),
                axis.ticks.x = element_blank()))
  
  }

Density curves

for (i in 2:ncol(seldata)) {
    
  print(ggplot(seldata, aes(x= seldata[,i], fill = class)) +
          geom_density(position="identity", alpha=0.6)+
          scale_fill_manual(values = col.pal) +
          scale_x_continuous(name = names(seldata[i])))
  }

Environmental space

A visual check to what extent the samples cover the environmental space. This is useful as legacy data were used and no formal sampling design was applied in the analysis.

  • Blue: Samples

  • Red: Environmental data (based on random subsample)

smp <- as.data.frame(sampleRandom(x = sel.preds, size = 10000))

for (i in 2:ncol(seldata)) {
    
  print(ggplot() +
          geom_density(data = seldata, aes(x=seldata[,i]),colour="darkblue",fill="darkblue", alpha=0.1,size=1)+
          geom_density(data = smp, aes(x=smp[,i-1]), colour="red",fill="red", alpha=0.1, size=1)+
          scale_x_continuous(name = names(seldata[i])))
        
    }

Building a Random Forest model

A Random Forest (Breiman, 2001) model is built, based on the response variable (lithology class) and selected predictor variables.

Evaluation of model performance

The response data is likely spatially structured. Spatial autocorrelation might lead to over-optimistic estimates of performance metrics. To account for spatial autocorrelation, a spatial leave one out (LOO) cross validation approach is taken.

Defining buffer size based on spatial autocorrelation of predictors

First, the buffer size is determined by estimating the spatial autocorrelation range of the selected predictor variables.

sac <- spatialAutoRange(sel.preds, sampleNumber = 10000, doParallel = T, showPlots = F)
## There are 8 raster layers
## Loading required namespace: rgeos
print(paste("Buffer size =", round(sac$range/1000, 2), "km"))
## [1] "Buffer size = 5299.09 km"

Spatial leave one out CV

A spatial LOO CV is performed.

t <- data.frame() #initialize a results df and run spatial LOO CV using RF

for(i in 1:nrow(seldata)){
  
  test <- seldata[i, ] #leave out one point at a time as test
  
  #select the points from the distance matrix that are > some distance
  pt.buffer <- which(dist.mat[ ,i] > sac$range)
  buffered.pts <- seldata[pt.buffer, ]
  
  nmin <- min(table(buffered.pts$class))
  
  #run rf using all points but i and points within the buffer
  rf <- randomForest(class ~ .,data=buffered.pts, replace = FALSE, strata = buffered.pts$class, 
                                   sampsize = rep(nmin, Nclasses))
  
  #prefict the test point and bind it to results
  pred <- predict(rf, test, type="response")
  t <- rbind(t, data.frame(test[1], pred, nmin)) #t is a df of predicted and observed values
  
}

summary(t)
##       class            pred           nmin      
##  Calc.Sed:5251   Calc.Sed:4554   Min.   :46.00  
##  Clay    :3714   Clay    :2973   1st Qu.:78.00  
##  Dia.Ooze: 623   Dia.Ooze:1079   Median :89.00  
##  Lith.Sed: 751   Lith.Sed: 975   Mean   :85.43  
##  Rad.Ooze:  99   Rad.Ooze: 857   3rd Qu.:96.00  
##                                  Max.   :99.00

Model accuracy

The accuracy of the model is assessed based on the spatial LOO CV.

acc <- confusionMatrix(data=t$pred, reference=t$class) 
acc
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Calc.Sed Clay Dia.Ooze Lith.Sed Rad.Ooze
##   Calc.Sed     3735  660       10      139       10
##   Clay          760 2001       44      151       17
##   Dia.Ooze      234  215      299      294       37
##   Lith.Sed      277  456      111      128        3
##   Rad.Ooze      245  382      159       39       32
## 
## Overall Statistics
##                                          
##                Accuracy : 0.5935         
##                  95% CI : (0.584, 0.6029)
##     No Information Rate : 0.5031         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.3892         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
## 
## Statistics by Class:
## 
##                      Class: Calc.Sed Class: Clay Class: Dia.Ooze
## Sensitivity                   0.7113      0.5388         0.47994
## Specificity                   0.8421      0.8554         0.92053
## Pos Pred Value                0.8202      0.6731         0.27711
## Neg Pred Value                0.7424      0.7705         0.96538
## Prevalence                    0.5031      0.3558         0.05969
## Detection Rate                0.3578      0.1917         0.02865
## Detection Prevalence          0.4363      0.2848         0.10337
## Balanced Accuracy             0.7767      0.6971         0.70023
##                      Class: Lith.Sed Class: Rad.Ooze
## Sensitivity                  0.17044        0.323232
## Specificity                  0.91256        0.920205
## Pos Pred Value               0.13128        0.037340
## Neg Pred Value               0.93416        0.993007
## Prevalence                   0.07195        0.009485
## Detection Rate               0.01226        0.003066
## Detection Prevalence         0.09341        0.082104
## Balanced Accuracy            0.54150        0.621719
ber <- BER(t$class, t$pred)
print(paste("BER = ", round(ber, 2)))
## [1] "BER =  0.56"
write.table(acc$table, file = "../output/ContingencyTable.txt")

Fit the model

The model is now fit using all observations.

nmin <- min(table(seldata$class))
set.seed(seed) #Set seed for reproducibility
fit <- randomForest(class ~ ., data=seldata, replace = FALSE, strata = seldata$class, sampsize = rep(nmin, Nclasses), importance = TRUE)

Variable importance

RF also provides a relative estimate of predictor variable importance. This is measured as the mean decrease in accuracy associated with each variable when it is assigned random but realistic values and the rest of the variables are left unchanged. type = 1: mean decrease in accuracy, scale = FALSE: unscaled

randomForest::importance(fit, type=1, scale = FALSE) #table
##                           MeanDecreaseAccuracy
## MS_bathy_5m                         0.06004868
## MS_biogeo05_dist_shore_5m           0.01831600
## BO2_temprange_ss                    0.02225988
## BO2_ppmax_ss                        0.02243810
## BO2_tempmin_bdmean                  0.05228442
## BO2_salinitymax_ss                  0.07163794
## BO2_salinityrange_ss                0.01670140
## BO2_silicatemin_ss                  0.04281686
varImpPlot(fit, type=1, scale = FALSE, main="Variable Importance") #plot

Save variable importance plot as file

Output as a jpg file for use in publication.

jpeg(filename = "../figs/Model/VarImp.jpg", width = 100, height = 100, units = "mm", pointsize = 8, bg = "white", res = 500)
par(mar=c(5,4,1,1))
varImpPlot(fit, type = 1, scale = FALSE, main="") #plot
dev.off()
## png 
##   2

Predicting a Random Forest model

Finally, an RF model with the tuned parameters is applied to the whole area of interest.

Plot the selected predictor variables

Showing the selected (scaled) predictor variables.

sel.preds
## class      : RasterStack 
## dimensions : 1739, 3469, 6032591, 8  (nrow, ncol, ncell, nlayers)
## resolution : 10000, 10000  (x, y)
## extent     : -17344073, 17345927, -8692862, 8697138  (xmin, xmax, ymin, ymax)
## crs        : +proj=wag4 +lon_0=0 +ellps=WGS84 
## names      : MS_bathy_5m, MS_biogeo05_dist_shore_5m, BO2_temprange_ss, BO2_ppmax_ss, BO2_tempmin_bdmean, BO2_salinitymax_ss, BO2_salinityrange_ss, BO2_silicatemin_ss 
## min values :   -5.314264,                 -1.353266,        -1.966612,    -1.035104,          -2.093250,         -13.208592,            -1.028287,          -0.541951 
## max values :    2.959405,                  3.715002,         6.204855,    19.724588,          17.286411,           3.945611,            20.407955,           7.308377
plot(sel.preds)

Plot of selected predictor variables

jpeg(filename = "../figs/VariableSelection/Predictors.jpg", width = 180, height = 125, units = "mm", pointsize = 10, bg = "white", res = 300)
plot(sel.preds)
dev.off()
## png 
##   2

Predict Random Forest

The probabilities of individual lithologies are predicted. Then, maximum probabilities are derived for every pixel in the map. This gives the probability associated with the class in the categorical prediction, which is finally predicted.

Predicted probabilities

type=“prob”

rfprob <- predict(sel.preds, fit, type="prob", index=classes)
names(rfprob) <- classes
rfprob
## class      : RasterBrick 
## dimensions : 1739, 3469, 6032591, 5  (nrow, ncol, ncell, nlayers)
## resolution : 10000, 10000  (x, y)
## extent     : -17344073, 17345927, -8692862, 8697138  (xmin, xmax, ymin, ymax)
## crs        : +proj=wag4 +lon_0=0 +ellps=WGS84 
## source     : memory
## names      : Calc.Sed,  Clay, Dia.Ooze, Lith.Sed, Rad.Ooze 
## min values :        0,     0,        0,        0,        0 
## max values :    0.932, 0.892,    0.932,    0.924,    1.000
plot(rfprob)

Calculate maximum probability

max.prob <- max(rfprob)
max.prob
## class      : RasterLayer 
## dimensions : 1739, 3469, 6032591  (nrow, ncol, ncell)
## resolution : 10000, 10000  (x, y)
## extent     : -17344073, 17345927, -8692862, 8697138  (xmin, xmax, ymin, ymax)
## crs        : +proj=wag4 +lon_0=0 +ellps=WGS84 
## source     : memory
## names      : layer 
## values     : 0.206, 1  (min, max)
plot(max.prob)

Predicted classes

type=“response”

rfres <- predict(sel.preds, fit, type="response")
rfres
## class      : RasterLayer 
## dimensions : 1739, 3469, 6032591  (nrow, ncol, ncell)
## resolution : 10000, 10000  (x, y)
## extent     : -17344073, 17345927, -8692862, 8697138  (xmin, xmax, ymin, ymax)
## crs        : +proj=wag4 +lon_0=0 +ellps=WGS84 
## source     : memory
## names      : layer 
## values     : 1, 5  (min, max)
## attributes :
##  ID    value
##   1 Calc.Sed
##   2     Clay
##   3 Dia.Ooze
##   4 Lith.Sed
##   5 Rad.Ooze
plot(rfres, col = col.pal)

Export Rasters

The results are exported as GeoTiffs for further analysis.

writeRaster(rfres, "../output/classes_v2a.tif", bylayer=TRUE, format="GTiff",overwrite=T)
writeRaster(rfprob, "../output/probabilities_v2a.tif", bylayer=TRUE, suffix=classes, format="GTiff",overwrite=T)
writeRaster(max.prob, "../output/max_probabilities_v2a.tif", bylayer=TRUE, suffix=classes, format="GTiff",overwrite=T)