A map of deep-sea sediments of the global ocean is derived from seafloor samples, categorised into lithology types, and predictor variables that are known to influence spatial patterns of seafloor lithology. After an initial preparation of the required input data, a selection of important and uncorrelated predictor variables is carried out. A Random Forest model is fitted and predicted. Outputs are a map of global dea-sea sediments, an associated map of the probability of the predicted class and individual probability maps of the predicted lithology classes.
The required R packages are loaded.
#options(warn = -1)
library(geosphere)
library(sdmpredictors)
## Loading required package: raster
## Loading required package: sp
## Loading required package: rgdal
## rgdal: version: 1.4-7, (SVN revision 845)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 2.2.3, released 2017/11/20
## Path to GDAL shared files: C:/Users/diesing_markus/Documents/R/R-3.6.1/library/rgdal/gdal
## GDAL binary built with GEOS: TRUE
## Loaded PROJ.4 runtime: Rel. 4.9.3, 15 August 2016, [PJ_VERSION: 493]
## Path to PROJ.4 shared files: C:/Users/diesing_markus/Documents/R/R-3.6.1/library/rgdal/proj
## Linking to sp version: 1.3-2
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
library(summarytools)
## Registered S3 method overwritten by 'pryr':
## method from
## print.bytes Rcpp
##
## Attaching package: 'summarytools'
## The following object is masked from 'package:raster':
##
## freq
library(Boruta)
## Loading required package: ranger
##
## Attaching package: 'ranger'
## The following object is masked from 'package:randomForest':
##
## importance
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
library(raster)
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(rgdal)
library(corrplot)
## corrplot 0.84 loaded
library(blockCV)
## Warning: package 'blockCV' was built under R version 3.6.3
library(measures)
## Warning: package 'measures' was built under R version 3.6.3
##
## Attaching package: 'measures'
## The following objects are masked from 'package:caret':
##
## MAE, RMSE
All data will be projected to Wagner IV, a global equal-area projection. See https://projectionwizard.org/ for selecting suitable projections.
crs <- "+proj=wag4 +lon_0=0"
Shapefiles in the input folder are listed.
list.files(path = "../input",pattern = ".shp")
## [1] "IntClass_DuplicatesRemoved_deepsea500.shp"
## [2] "IntClass_DuplicatesRemoved_deepsea500.shp.xml"
## [3] "IntClass_mindist14500m_deepsea500.shp"
## [4] "IntClass_mindist14500m_deepsea500.shp.xml"
## [5] "SimpClass_DuplicatesRemoved_deepsea500.shp"
## [6] "SimpClass_DuplicatesRemoved_deepsea500.shp.xml"
Load a shapefile with class labels. The sample data are derived from Dutkiewicz et al. (2015): https://doi.org/10.1130/G36883.1. Sample data can be downloaded from: ftp://ftp.earthbyte.org/papers/Dutkiewicz_etal_seafloor_lithology/iPython_notebook_and_input_data/seafloor_data.npz Data were initially converted into a csv file containing latitude, longitude and class (number code between 1 and 13). The csv file was converted into a shp file and the following pre-processing operations were carried out in a GIS:
Removal of duplicates
Restricting to samples in the deep sea: Select by location with deepsea500.shp.
The original classification with 13 classes (Dutkiewicz et al., 2015) was reduced to 5 classes as per Dutkiewizc et al. (2016):
New class | Original class(es) |
---|---|
Calc.sed | Calcareous ooze, Fine-grained calcareous sediment |
Clay | Clay |
Dia.Ooze | Diatom ooze |
Lith.Sed | Gravel and coarser, Sand, Silt |
Rad.Ooze | Radiolarian ooze |
Mixed (Mixed calcareous-siliceous ooze and siliceous mud) and rare classes (Ash and volcanic sand/gravel, Sponge spicules and Shells and coral fragments) were removed.
samples <- readOGR(dsn = "../input",layer = "SimpClass_DuplicatesRemoved_deepsea500")
## OGR data source with driver: ESRI Shapefile
## Source: "C:\Users\diesing_markus\OneDrive - Norges Geologiske Undersøkelse\R\Global deep-sea sediments\input", layer: "SimpClass_DuplicatesRemoved_deepsea500"
## with 10484 features
## It has 4 fields
## Integer64 fields read as strings: Class
The SpatialPointsDataFrame samples is projected from unprojected lat/lon (WGS84) to Wagner IV.
samples <- spTransform(samples, crs)
samples
## class : SpatialPointsDataFrame
## features : 10484
## extent : -17236964, 17255812, -8321706, 8171254 (xmin, xmax, ymin, ymax)
## crs : +proj=wag4 +lon_0=0 +ellps=WGS84
## variables : 4
## names : Latitude, Longitude, Class, SimpClass
## min values : -77.633, -179.967, 1, Calc.Sed
## max values : 74.866, 179.9965, 7, Rad.Ooze
plot(samples, pch = 16, cex = 0.3)
Displays a table of available predictor variables.
References:
Assis, J, Tyberghein, L, Bosch, S, Verbruggen, H, Serrão, EA, De Clerck, O. Bio‐ORACLE v2.0: Extending marine data layers for bioclimatic modelling. Global Ecol Biogeogr. 2018; 27: 277– 284. https://doi.org/10.1111/geb.12693
Sbrocco, E. J. and Barber, P. H. (2013) MARSPEC: ocean climate layers for marine spatial ecology. Ecology 94: 979. https://doi.org/10.1890/12-1358.1
Tyberghein, L. , Verbruggen, H. , Pauly, K. , Troupin, C. , Mineur, F. and De Clerck, O. (2012), Bio‐ORACLE: a global environmental dataset for marine species distribution modelling. Global Ecology and Biogeography, 21: 272-281. https://10.1111/j.1466-8238.2011.00656.x
datasets <- list_datasets(terrestrial = F, marine = T)
layers <- list_layers(datasets)
layers
Predictor variables were selected based on previous knowledge from literature, e.g. Dutkiewicz et al. (2016) and Seibold and Berger (1996) and availability. Layers are downloaded from the Bio-ORACLE 2 database, which also includes MAESPEC data. For local storage of layers define datadir = “
predictors <- load_layers(
layercodes =
c("MS_bathy_5m", "MS_biogeo05_dist_shore_5m",
"BO2_tempmean_ss", "BO2_tempmax_ss", "BO2_tempmin_ss", "BO2_temprange_ss",
"BO2_dissoxmean_ss", "BO2_dissoxmax_ss", "BO2_dissoxmin_ss", "BO2_dissoxrange_ss",
"BO2_ironmean_ss", "BO2_ironmax_ss", "BO2_ironmin_ss", "BO2_ironrange_ss",
"BO2_phosphatemean_ss", "BO2_phosphatemax_ss", "BO2_phosphatemin_ss", "BO2_phosphaterange_ss",
"BO2_nitratemean_ss", "BO2_nitratemax_ss", "BO2_nitratemin_ss", "BO2_nitraterange_ss",
"BO2_ppmean_ss", "BO2_ppmax_ss", "BO2_ppmin_ss", "BO2_pprange_ss",
"BO2_salinitymean_ss", "BO2_salinitymax_ss", "BO2_salinitymin_ss", "BO2_salinityrange_ss",
"BO2_silicatemean_ss", "BO2_silicatemax_ss", "BO2_silicatemin_ss", "BO2_silicaterange_ss",
"BO2_tempmean_bdmean", "BO2_tempmax_bdmean", "BO2_tempmin_bdmean", "BO2_temprange_bdmean"
),
equalarea = FALSE, rasterstack = TRUE)
## Warning in get_datadir(datadir): file.path(tempdir(), "sdmpredictors") will
## be used as datadir, set options(sdmpredictors_datadir="<directory>") to avoid
## re-downloading the data in every session or set the datadir parameter in
## load_layers
predictors
## class : RasterStack
## dimensions : 2160, 4320, 9331200, 38 (nrow, ncol, ncell, nlayers)
## resolution : 0.08333333, 0.08333333 (x, y)
## extent : -180, 180, -90, 90 (xmin, xmax, ymin, ymax)
## crs : +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## names : MS_bathy_5m, MS_biogeo05_dist_shore_5m, BO2_tempmean_ss, BO2_tempmax_ss, BO2_tempmin_ss, BO2_temprange_ss, BO2_dissoxmean_ss, BO2_dissoxmax_ss, BO2_dissoxmin_ss, BO2_dissoxrange_ss, BO2_ironmean_ss, BO2_ironmax_ss, BO2_ironmin_ss, BO2_ironrange_ss, BO2_phosphatemean_ss, ...
## min values : -1.049300e+04, 1.000000e+00, -1.797733e+00, -1.740006e+00, -1.940000e+00, 7.999992e-02, 1.280844e+02, 1.987810e+02, 7.664922e+01, 3.239714e+00, 5.003627e-06, 5.079128e-06, 4.983647e-06, 5.124374e-08, 1.429982e-05, ...
## max values : -1.00000000, 2778.00000000, 30.17862892, 34.75999451, 29.35999298, 28.94999869, 400.86618443, 453.65198161, 381.97638152, 300.49582837, 0.01973090, 0.04008445, 0.01348529, 0.03305599, 2.23390642, ...
The predictors are subset to the deep-sea area below 500 m water depth.
predictors$MS_bathy_5m[predictors$MS_bathy_5m > -500] <- NA
predictors <- mask(crop(predictors, predictors$MS_bathy_5m), predictors$MS_bathy_5m)
The raster stack is then projected to Wagner IV with a pixel resolution of 10 km.
predictors <- projectRaster(from = predictors, res = 10000, crs = crs, method = "bilinear")
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 47709
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 36202
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 33524
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 31062
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 28781
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 26663
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 24692
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 22845
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 21113
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 19485
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 17957
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 16518
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 15159
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 13885
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 12683
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 11549
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 10488
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 9483
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 8541
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 7661
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 6836
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 6067
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 5348
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 4684
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 4069
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 3502
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 2986
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 2513
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 2088
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 1710
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 1373
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 1081
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 834
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 630
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 467
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 350
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 271
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 230
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 241
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 295
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 387
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 523
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 698
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 922
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 1185
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 1492
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 1844
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 2241
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 2683
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 3170
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 3708
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 4293
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 4925
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 5610
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 6346
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 7137
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 7982
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 8885
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 9851
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 10863
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 11966
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 13121
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 14350
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 15656
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 17043
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 18516
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 20078
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 21746
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 23520
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 25412
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 27433
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 29614
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 31956
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 34496
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 37264
## projected point(s) not finite
## Warning in rgdal::rawTransform(projto_int, projfrom, nrow(xy), xy[, 1], : 32737
## projected point(s) not finite
predictors
## class : RasterBrick
## dimensions : 1739, 3469, 6032591, 38 (nrow, ncol, ncell, nlayers)
## resolution : 10000, 10000 (x, y)
## extent : -17344073, 17345927, -8692862, 8697138 (xmin, xmax, ymin, ymax)
## crs : +proj=wag4 +lon_0=0 +ellps=WGS84
## source : C:/Users/diesing_markus/AppData/Local/Temp/RtmpkbYcaP/raster/r_tmp_2020-07-08_153341_24480_97490.grd
## names : MS_bathy_5m, MS_biogeo05_dist_shore_5m, BO2_tempmean_ss, BO2_tempmax_ss, BO2_tempmin_ss, BO2_temprange_ss, BO2_dissoxmean_ss, BO2_dissoxmax_ss, BO2_dissoxmin_ss, BO2_dissoxrange_ss, BO2_ironmean_ss, BO2_ironmax_ss, BO2_ironmin_ss, BO2_ironrange_ss, BO2_phosphatemean_ss, ...
## min values : -1.043116e+04, 1.000000e+00, -1.797137e+00, -1.740006e+00, -1.940000e+00, 7.999992e-02, 1.496089e+02, 1.987813e+02, 1.036297e+02, 3.243045e+00, 5.003856e-06, 5.079576e-06, 4.983790e-06, 5.174933e-08, 1.444217e-05, ...
## max values : -5.000000e+02, 2.777553e+03, 3.003049e+01, 3.327256e+01, 2.934584e+01, 2.397654e+01, 3.855367e+02, 4.054187e+02, 3.765262e+02, 1.531691e+02, 6.892411e-03, 9.712671e-03, 3.843763e-03, 8.999437e-03, 2.233885e+00, ...
plot(predictors)
Predictors are scaled for further analysis.
sc_preds <- scale(predictors)
Values of predictor variables are extracted from the raster stack and attached to every sample point.
sdata <- data.frame(samples[c(4,1,2)],extract(sc_preds,samples))
names(sdata)[1] <- "class" #Change header name of first column to "class"
sdata <- sdata[-c(4:6)] #Removes unnecessary information
sdata <- subset(x = sdata, subset = !is.na(sdata$MS_bathy_5m)) #Removes NAs
A distance matrix between all points is computed. This will be used for spatial cross validation.
sdat <- sdata
coordinates(sdat) = ~ Longitude + Latitude
proj4string(sdat)<- CRS("+proj=longlat +datum=WGS84")
dist.mat <- distm(sdat)
sdata <- sdata[-c(2:3)] # Remove coordinates from sdata
Some variables that will be used subsequently are defined here.
seed <- 42 #Set a seed for reproducibility
preds <- sdata[2:ncol(sdata)] #Predictor variables without response variable
res <- sdata[,1] #Response variable
classes <- levels(res) #classes to predict
Nclasses <- nlevels(res) #Number of class labels
Npreds <- ncol(preds) #Number of predictor variables
Nobs <- nrow(preds) #Number of observations
nmin <- min(table(sdata$class)) #Number of observations in the minority class
#Colours for the sediment classes (in alphabetical order)
col.pal <- c(rgb(81,141,203, maxColorValue = 255), rgb(111,78,45, maxColorValue = 255), rgb(202,212,79, maxColorValue = 255), rgb(245,245,122, maxColorValue = 255), rgb(78,140,63, maxColorValue = 255))
Summary of the data frame “sdata”. This is especially useful to check whether there are NA values. Note that the Boruta algorithm (see below) will not execute in such a case.
view(dfSummary(sdata), method = "browser")
## Output file written: C:\Users\DIESIN~1\AppData\Local\Temp\RtmpkbYcaP\file5fa02cfd1b98.html
Predictor variable selection is carried out to reduce the dimensionality of the dataset. This consists of two steps: Boruta analysis to remove unimportant variables and correlation analysis to remove correlated variables.
First, the Boruta algorithm is used to identify important variables and remove tentative or unimportant variables.
set.seed(seed) #Set seed for reproducibility
B1 <- Boruta(as.factor(sdata[[1]]) ~ .,data=preds, pValue = 0.05,
maxRuns = 500)
Results of the Boruta algorithm.
print(B1)
## Boruta performed 10 iterations in 2.723114 mins.
## 38 attributes confirmed important: BO2_dissoxmax_ss,
## BO2_dissoxmean_ss, BO2_dissoxmin_ss, BO2_dissoxrange_ss, BO2_ironmax_ss
## and 33 more;
## No attributes deemed unimportant.
par(mar=c(13,4,1,1), cex = 0.6)
plot(B1, las=2, colCode = c("greenyellow", "yellow2", "red3", "cadetblue"), xlab = "")
Export of the Boruta figure as a jpg file for use in publication.
jpeg("../figs/VariableSelection/Boruta.jpeg", width = 25, height = 15, res = 500, units = "cm", pointsize = 8, quality = 100)
par(mar=c(12.5,4,1,1))
par(mfrow=c(1,1))
plot(B1, las=2, colCode = c("greenyellow", "yellow2", "red3", "cadetblue"), xlab = "")
dev.off()
## png
## 2
Second, colinearity is reduced through a correlation analysis. Of correlated variables only the one with the highest Boruta importance score is retained.
The predictor variables are sorted in decreasing order of Boruta variable importance. The function attStats() extracts importance statistics from the Boruta object (B1). The mean importance [1] is used here. As we want to exclude all variables that were unimportant or undecided, we subset the data to those that were “Confirmed”.
BorImp <- attStats(B1)[,c(1,6)]
BorImp <- subset(BorImp, decision == "Confirmed")
BorImp$Var <- rownames(BorImp)
rownames(BorImp) <- NULL
BorList <- list(BorImp[order(BorImp[1],decreasing=T),c(3,1)])
BorList
## [[1]]
## Var meanImp
## 1 MS_bathy_5m 54.10350
## 38 BO2_temprange_bdmean 43.91780
## 2 MS_biogeo05_dist_shore_5m 39.37547
## 6 BO2_temprange_ss 35.62872
## 24 BO2_ppmax_ss 34.85403
## 26 BO2_pprange_ss 34.32361
## 37 BO2_tempmin_bdmean 32.26956
## 10 BO2_dissoxrange_ss 31.90227
## 36 BO2_tempmax_bdmean 31.81061
## 28 BO2_salinitymax_ss 31.59613
## 27 BO2_salinitymean_ss 31.30111
## 35 BO2_tempmean_bdmean 30.49724
## 30 BO2_salinityrange_ss 30.32951
## 23 BO2_ppmean_ss 30.14413
## 25 BO2_ppmin_ss 29.66990
## 33 BO2_silicatemin_ss 27.04216
## 18 BO2_phosphaterange_ss 27.01037
## 29 BO2_salinitymin_ss 26.97217
## 14 BO2_ironrange_ss 25.93805
## 9 BO2_dissoxmin_ss 25.59439
## 13 BO2_ironmin_ss 25.49440
## 34 BO2_silicaterange_ss 25.39390
## 22 BO2_nitraterange_ss 25.18493
## 4 BO2_tempmax_ss 24.97719
## 11 BO2_ironmean_ss 24.89616
## 12 BO2_ironmax_ss 24.56195
## 17 BO2_phosphatemin_ss 24.25184
## 8 BO2_dissoxmax_ss 23.62954
## 3 BO2_tempmean_ss 22.90907
## 15 BO2_phosphatemean_ss 22.43548
## 5 BO2_tempmin_ss 22.40179
## 16 BO2_phosphatemax_ss 22.22922
## 31 BO2_silicatemean_ss 22.02477
## 32 BO2_silicatemax_ss 21.93855
## 19 BO2_nitratemean_ss 21.78520
## 21 BO2_nitratemin_ss 21.64518
## 20 BO2_nitratemax_ss 21.23472
## 7 BO2_dissoxmean_ss 19.97032
The effect of variable selection through the choice of the cutoff value for r on prediction performance is evaluated. For every r value between 0.1 and 1 (step 0.01), a random forest model is built with default parameters and the out of bag error is plotted over r.
df_total <- data.frame()
for (i in c(10:100)*0.01) {
r <- i
cr <- cor(preds[BorList[[1]][[1]]])
for(j in 1:length(cr[1,])){
if (j == 1){
pl <- c(names(cr[j,][1]),names( cr[j,][sqrt((cr[j,])^2)<r]))
pl1 <- pl
} else if (names(cr[j,])[j] %in% pl1){
rem <- names(cr[j,-c(1:j)][sqrt((cr[j,-c(1:j)])^2)>r])
if (length(rem) != 0L){
pl <- pl[!pl %in% rem]
}
}
next
}
clms <- c(names(sdata[1]),pl)
seldata <- sdata[clms]
set.seed(seed)
fit <- randomForest(class ~ ., data=seldata)
df <- data.frame(i, length(pl), fit$err.rate[nrow(fit$err.rate),1])
df_total <- rbind(df_total,df)
}
names(df_total)[2] <- "Npreds"
names(df_total)[3] <- "OOBerror"
max(df_total$OOBerror) - min(df_total$OOBerror)
## [1] 0.05633263
ggplot(df_total,aes(i,OOBerror)) + geom_point(aes(size=Npreds)) +
xlab("r") + ylab("OOB error ")
A maximum allowable r-value needs to be set as threshold to remove correlated variables.
r <- 0.5
view(dfSummary(seldata), file = "../figs/DataExploration/TrainSummary.html")
## Output file written: C:\Users\diesing_markus\OneDrive - Norges Geologiske Undersøkelse\R\Global deep-sea sediments\figs\DataExploration\TrainSummary.html
for (i in 2:ncol(seldata)) {
print(ggplot(seldata, aes(x = class, y = seldata[,i],fill = class)) +
geom_boxplot() +
scale_fill_manual(values = col.pal) +
scale_y_continuous(name = names(seldata[i])) +
theme(axis.text.x = element_blank(),
axis.title.x = element_blank(),
axis.ticks.x = element_blank()))
}
for (i in 2:ncol(seldata)) {
print(ggplot(seldata, aes(x= seldata[,i], fill = class)) +
geom_density(position="identity", alpha=0.6)+
scale_fill_manual(values = col.pal) +
scale_x_continuous(name = names(seldata[i])))
}
A visual check to what extent the samples cover the environmental space. This is useful as legacy data were used and no formal sampling design was applied in the analysis.
Blue: Samples
Red: Environmental data (based on random subsample)
smp <- as.data.frame(sampleRandom(x = sel.preds, size = 10000))
for (i in 2:ncol(seldata)) {
print(ggplot() +
geom_density(data = seldata, aes(x=seldata[,i]),colour="darkblue",fill="darkblue", alpha=0.1,size=1)+
geom_density(data = smp, aes(x=smp[,i-1]), colour="red",fill="red", alpha=0.1, size=1)+
scale_x_continuous(name = names(seldata[i])))
}
A Random Forest (Breiman, 2001) model is built, based on the response variable (lithology class) and selected predictor variables.
The response data is likely spatially structured. Spatial autocorrelation might lead to over-optimistic estimates of performance metrics. To account for spatial autocorrelation, a spatial leave one out (LOO) cross validation approach is taken.
First, the buffer size is determined by estimating the spatial autocorrelation range of the selected predictor variables.
sac <- spatialAutoRange(sel.preds, sampleNumber = 10000, doParallel = T, showPlots = F)
## There are 8 raster layers
## Loading required namespace: rgeos
print(paste("Buffer size =", round(sac$range/1000, 2), "km"))
## [1] "Buffer size = 5299.09 km"
A spatial LOO CV is performed.
t <- data.frame() #initialize a results df and run spatial LOO CV using RF
for(i in 1:nrow(seldata)){
test <- seldata[i, ] #leave out one point at a time as test
#select the points from the distance matrix that are > some distance
pt.buffer <- which(dist.mat[ ,i] > sac$range)
buffered.pts <- seldata[pt.buffer, ]
nmin <- min(table(buffered.pts$class))
#run rf using all points but i and points within the buffer
rf <- randomForest(class ~ .,data=buffered.pts, replace = FALSE, strata = buffered.pts$class,
sampsize = rep(nmin, Nclasses))
#prefict the test point and bind it to results
pred <- predict(rf, test, type="response")
t <- rbind(t, data.frame(test[1], pred, nmin)) #t is a df of predicted and observed values
}
summary(t)
## class pred nmin
## Calc.Sed:5251 Calc.Sed:4554 Min. :46.00
## Clay :3714 Clay :2973 1st Qu.:78.00
## Dia.Ooze: 623 Dia.Ooze:1079 Median :89.00
## Lith.Sed: 751 Lith.Sed: 975 Mean :85.43
## Rad.Ooze: 99 Rad.Ooze: 857 3rd Qu.:96.00
## Max. :99.00
The accuracy of the model is assessed based on the spatial LOO CV.
acc <- confusionMatrix(data=t$pred, reference=t$class)
acc
## Confusion Matrix and Statistics
##
## Reference
## Prediction Calc.Sed Clay Dia.Ooze Lith.Sed Rad.Ooze
## Calc.Sed 3735 660 10 139 10
## Clay 760 2001 44 151 17
## Dia.Ooze 234 215 299 294 37
## Lith.Sed 277 456 111 128 3
## Rad.Ooze 245 382 159 39 32
##
## Overall Statistics
##
## Accuracy : 0.5935
## 95% CI : (0.584, 0.6029)
## No Information Rate : 0.5031
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3892
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: Calc.Sed Class: Clay Class: Dia.Ooze
## Sensitivity 0.7113 0.5388 0.47994
## Specificity 0.8421 0.8554 0.92053
## Pos Pred Value 0.8202 0.6731 0.27711
## Neg Pred Value 0.7424 0.7705 0.96538
## Prevalence 0.5031 0.3558 0.05969
## Detection Rate 0.3578 0.1917 0.02865
## Detection Prevalence 0.4363 0.2848 0.10337
## Balanced Accuracy 0.7767 0.6971 0.70023
## Class: Lith.Sed Class: Rad.Ooze
## Sensitivity 0.17044 0.323232
## Specificity 0.91256 0.920205
## Pos Pred Value 0.13128 0.037340
## Neg Pred Value 0.93416 0.993007
## Prevalence 0.07195 0.009485
## Detection Rate 0.01226 0.003066
## Detection Prevalence 0.09341 0.082104
## Balanced Accuracy 0.54150 0.621719
ber <- BER(t$class, t$pred)
print(paste("BER = ", round(ber, 2)))
## [1] "BER = 0.56"
write.table(acc$table, file = "../output/ContingencyTable.txt")
The model is now fit using all observations.
nmin <- min(table(seldata$class))
set.seed(seed) #Set seed for reproducibility
fit <- randomForest(class ~ ., data=seldata, replace = FALSE, strata = seldata$class, sampsize = rep(nmin, Nclasses), importance = TRUE)
RF also provides a relative estimate of predictor variable importance. This is measured as the mean decrease in accuracy associated with each variable when it is assigned random but realistic values and the rest of the variables are left unchanged. type = 1: mean decrease in accuracy, scale = FALSE: unscaled
randomForest::importance(fit, type=1, scale = FALSE) #table
## MeanDecreaseAccuracy
## MS_bathy_5m 0.06004868
## MS_biogeo05_dist_shore_5m 0.01831600
## BO2_temprange_ss 0.02225988
## BO2_ppmax_ss 0.02243810
## BO2_tempmin_bdmean 0.05228442
## BO2_salinitymax_ss 0.07163794
## BO2_salinityrange_ss 0.01670140
## BO2_silicatemin_ss 0.04281686
varImpPlot(fit, type=1, scale = FALSE, main="Variable Importance") #plot
Output as a jpg file for use in publication.
jpeg(filename = "../figs/Model/VarImp.jpg", width = 100, height = 100, units = "mm", pointsize = 8, bg = "white", res = 500)
par(mar=c(5,4,1,1))
varImpPlot(fit, type = 1, scale = FALSE, main="") #plot
dev.off()
## png
## 2
Finally, an RF model with the tuned parameters is applied to the whole area of interest.
Showing the selected (scaled) predictor variables.
sel.preds
## class : RasterStack
## dimensions : 1739, 3469, 6032591, 8 (nrow, ncol, ncell, nlayers)
## resolution : 10000, 10000 (x, y)
## extent : -17344073, 17345927, -8692862, 8697138 (xmin, xmax, ymin, ymax)
## crs : +proj=wag4 +lon_0=0 +ellps=WGS84
## names : MS_bathy_5m, MS_biogeo05_dist_shore_5m, BO2_temprange_ss, BO2_ppmax_ss, BO2_tempmin_bdmean, BO2_salinitymax_ss, BO2_salinityrange_ss, BO2_silicatemin_ss
## min values : -5.314264, -1.353266, -1.966612, -1.035104, -2.093250, -13.208592, -1.028287, -0.541951
## max values : 2.959405, 3.715002, 6.204855, 19.724588, 17.286411, 3.945611, 20.407955, 7.308377
plot(sel.preds)
jpeg(filename = "../figs/VariableSelection/Predictors.jpg", width = 180, height = 125, units = "mm", pointsize = 10, bg = "white", res = 300)
plot(sel.preds)
dev.off()
## png
## 2
The probabilities of individual lithologies are predicted. Then, maximum probabilities are derived for every pixel in the map. This gives the probability associated with the class in the categorical prediction, which is finally predicted.
type=“prob”
rfprob <- predict(sel.preds, fit, type="prob", index=classes)
names(rfprob) <- classes
rfprob
## class : RasterBrick
## dimensions : 1739, 3469, 6032591, 5 (nrow, ncol, ncell, nlayers)
## resolution : 10000, 10000 (x, y)
## extent : -17344073, 17345927, -8692862, 8697138 (xmin, xmax, ymin, ymax)
## crs : +proj=wag4 +lon_0=0 +ellps=WGS84
## source : memory
## names : Calc.Sed, Clay, Dia.Ooze, Lith.Sed, Rad.Ooze
## min values : 0, 0, 0, 0, 0
## max values : 0.932, 0.892, 0.932, 0.924, 1.000
plot(rfprob)
max.prob <- max(rfprob)
max.prob
## class : RasterLayer
## dimensions : 1739, 3469, 6032591 (nrow, ncol, ncell)
## resolution : 10000, 10000 (x, y)
## extent : -17344073, 17345927, -8692862, 8697138 (xmin, xmax, ymin, ymax)
## crs : +proj=wag4 +lon_0=0 +ellps=WGS84
## source : memory
## names : layer
## values : 0.206, 1 (min, max)
plot(max.prob)
type=“response”
rfres <- predict(sel.preds, fit, type="response")
rfres
## class : RasterLayer
## dimensions : 1739, 3469, 6032591 (nrow, ncol, ncell)
## resolution : 10000, 10000 (x, y)
## extent : -17344073, 17345927, -8692862, 8697138 (xmin, xmax, ymin, ymax)
## crs : +proj=wag4 +lon_0=0 +ellps=WGS84
## source : memory
## names : layer
## values : 1, 5 (min, max)
## attributes :
## ID value
## 1 Calc.Sed
## 2 Clay
## 3 Dia.Ooze
## 4 Lith.Sed
## 5 Rad.Ooze
plot(rfres, col = col.pal)
The results are exported as GeoTiffs for further analysis.
writeRaster(rfres, "../output/classes_v2a.tif", bylayer=TRUE, format="GTiff",overwrite=T)
writeRaster(rfprob, "../output/probabilities_v2a.tif", bylayer=TRUE, suffix=classes, format="GTiff",overwrite=T)
writeRaster(max.prob, "../output/max_probabilities_v2a.tif", bylayer=TRUE, suffix=classes, format="GTiff",overwrite=T)