A machine learning based global sea-surface iodide distribution

Abstract. Iodide in the sea-surface plays an important role in the Earth system. It modulates the oxidising capacity of the troposphere and provides iodine to terrestrial ecosystems. However, our understanding of its distribution is limited due to a paucity of observations. Previous efforts to generate global distributions have generally fitted sea-surface iodide observations to relatively simple functions of sea-surface temperature (Chance et al., 2014; MacDonald et al., 2014). This approach fails to account for coastal influences and variation in the bio-geochemical environment. Here we use a machine learning regression approach (Random Forest Regression) to generate a high resolution (0.125° x 0.125°, ∼ 12.5 km), monthly dataset of present-day global sea-surface iodide. We use a compilation of iodide observations (Chance et al., 2019b) that is 45 % larger than has been used previously (Chance et al., 2014) as the dependent variable and co-located ancillary parameters (temperature, nitrate, phosphate, salinity, shortwave radiation, topographic depth, mixed layer depth, and chlorophyll-a) from global climatologies as the independent variables. We investigate the regression models generated using different combinations of ancillary parameters and select the ten best-performing models to be included in an ensemble prediction. We then use this ensemble of models, combined with global fields of the ancillary parameters, to predict a new high resolution global sea-surface iodide field. Sea-surface temperature is the most important variable in all of the top ten models. We estimate a global average sea-surface iodide concentration of 106 nM (with an uncertainty of ∼ 20 %), which is within the range of previous estimates (60–130 nM). Similar to previous work, higher concentrations are predicted for the tropics than for the extra-tropics. Unlike the previous parameterisations, higher concentrations are also predicted for shallow areas such as coastal regions and the South China Sea. Compared to previous work, the new parameterisation better captures observed variability. The iodide concentrations calculated here are significantly higher (40 % on a global basis) than the commonly used MacDonald et al. (2014) parameterisation, with implications for our understanding of iodine in the atmosphere. The global iodide dataset is made freely available to the community (DOI: https://doi.org/10/gfv5v3) and as new observations are made, we will update the global dataset through a "living data" model.



Introduction
Iodine in seawater exists in two major forms, iodide (I − ) and iodate (IO − 3 ). Total inorganic iodine (I − + IO − 3 ) remains approximately constant across most of the oceans, but the ratio of iodide to iodate varies has been shown to vary by Chance et al. (2014) with latitude, depth and oxygen level . A small amount of iodine (<10%) is thought to be present in organic forms in the open ocean (e.g. Wong (1991)), however, this may be a larger fraction in coastal waters (e.g. Wong and Cheng (1998)). The 5 processes controlling the distribution of the ratio between iodide and iodate remain poorly understood .
A reason for gaps in our understanding is that the observational dataset of iodide and iodate remains relatively sparse (Chance et al., , 2019b. Despite this paucity in observations, iodine's role in the Earth system has driven multidisciplinary interest in the distribution of iodine compounds in seawater from a number of different research communities, including paleoceanography (Lu et al., 2016(Lu et al., , 2018Zhou et al., 2015), atmospheric composition Sherwen et al., 2016a), and 10 air-quality prediction (Sarwar et al., 2015;Luhar et al., 2017Luhar et al., , 2018. The atmospheric science community has seen a particularly large growth in interest in iodine chemistry in the atmosphere and at the sea-surface, as sea-surface I − is believed to be the main driver of atmospheric iodine emissions. The reaction of I − with ozone in the sea-surface micro-layer removes ozone from the atmosphere (dry deposition) (Ganzeveld et al., 2009) and results in the emission of inorganic iodine (HOI and I 2 ) into the atmosphere (Carpenter et al., 2013), which can subsequently 15 catalytically destroy ozone (Chameides and Davis, 1980). A number of model studies have discussed the impact of oceansourced iodine on atmosphere composition in the context of air quality (Gantt et al., 2017;Sarwar et al., 2016;Sherwen et al., 2017b), climate (Sherwen et al., 2017b;Saiz-Lopez et al., 2012), aerosols (Sherwen et al., 2017a), and stratospheric ozone . These atmospheric modelling studies have used relatively simple parameterisations for predictions of sea-surface iodide. 20 Early parameterisations for sea-surface iodide were based on limited datasets, and used either an observed range of iodide concentrations (Coleman et al., 2010;Chang et al., 2004), or a reported relationship with biogeochemical parameters (e.g. chlorophyll (Oh et al., 2008) or nitrate (Ganzeveld et al., 2009)). However, more recent attempts MacDonald et al., 2014) have focused on using correlation analysis to fit compilations of observed iodide concentrations to a variety of commonly measured sea-surface variables, notably sea-surface temperature, but also chlorophyll, salinity, and ni- 25 trate. A summary of parameterisations that have been used in previous studies is given in Appendix Table A1. Compilation of all available observations confirmed a strong latitudinal gradient, and identified sea-surface temperature as the strongest single predictor of iodide concentration . This approach has led to the equation Eqn. 1 from Chance et al. (2014) and Eqn. 2 from MacDonald et al. (2014).
(1) 30 I − aq (nM ) = 1.46 × 10 6 · exp( −9134 T ( • K) ) · 1 × 10 9 (2) Fig 1 shows the global annual mean distribution of sea-surface iodide calculated using these parameterisations (Eqn 1 and 2) and sea-surface temperature fields . Although both equations predict a similar distribution (higher concentrations in tropical waters and lower in polar waters), Eqn 1 generally predicts iodide concentrations 2-4 times higher than Eqn. 2. In developing Eqn. 1, Chance et al. (2014) compiled iodide observations from both coastal and non coastal sites. However, Eqn. 2 used a relatively small subset (14%) of these observations, which did not include coastal sites, which 5 may explain the lower concentrations. Eqn. 2 also has an Arrhenius form, which could be interpreted to suggest that iodide concentrations are controlled by abiotic reaction kinetics. However, this has not been demonstrated, and Chance et al. (2014) discussed how microbiological activity and oceanic mixing are currently thought to be the primary controls. The choice of different parameterisation (Eqn. 2 versus Eqn. 1) results in a difference of 50% in the calculated global emissions of iodine into the atmosphere (Sherwen et al., 2016a). 10 Considering the need for spatially-resolved sea-surface iodide field by models and the paucity of observations, parameterisations are required that can yield predictions from ancillary variables. This is a regression problem and a number of approaches are available. Conventional linear and linear multi-variant approaches have been used in the past (e.g. see summary in Appendix   Table A1). However, they need to assume a functional relationship between the dependent and independent variables. Another approach is machine learning, which uses algorithms to build predictive models. These algorithms take a different approach 15 and use a non-parametric formulations. Machine learning approaches range from interpretable options such as the "Random Forest" algorithm (Breiman, 2001) to less interpretable ones such as artificial neural networks (Gardner and Dorling, 1998).
On the more interpretable end, machine learning algorithms are being used increasingly within environmental sciences, with recent examples including linear Ridge Regression and Random Forest models to replace computationally-expensive processes Nowack et al., 2018) and Gaussian Process emulation to explore model biases on a global scale (Lee 20 et al., 2011;Revell et al., 2018).
Here, we use a recently expanded compilation of sea-surface iodide observations (Chance et al., 2019b) to build a new seasurface iodide parameterisation using a data-driven machine learning approach. We choose to use the Random Forest Regressor (RFR) algorithm (Breiman, 2001;Pedregosa et al., 2011), which is relatively simple and produces results that are also easy to understand. We aim to be able to predict global sea-surface iodide based on observations and ancillary physical and chemical 25 variables (e.g. sea-surface temperature, depth, and salinity etc.) from a number of publicly available sources. We first describe the input datasets we use (Sect. 2), then we explain the methodology taken (Sect. 3), and finally present the predictions at observational locations and globally (Sect. 4). We make the resulting high resolution, global, monthly dataset of predicted iodide available to the community (Sherwen et al. (2019); DOI:https://doi.org/10/gfv5v3). When new observations become available, they will be incorporated into the model and updated versions will be provided through a "living data" model.  It includes 45 % more data points, and has greater spatial coverage, than the previous compilation of 925 observations . Observations are categorised in Chance et al. (2019b) as "coastal" or "non-coastal", according to the designation of their static Longhurst biogeochemical province (Longhurst, 1998). We adopt the same categorisation here. This sea-surface iodide dataset then forms the dependent variable for our regression.
We require a number of physical, chemical and biological parameters as the independent variables in our regression models.

5
Consistent in-situ measurement of these parameters are not available for the iodide observations. Thus we have used a number of ancillary datasets (Table 1) to provide this information. There are a number of criteria for these datasets: they need to be available at an appropriate resolution as a gridded product; they need to represent potential processes that could control iodide concentrations and they need to be in some way orthogonal to the other independent variables. Gridded datasets of dissolved organic carbon (e.g. Roshan and DeVries (2017)) and phytoplankton primary productivity (e.g. Behrenfeld and Falkowski 10 (1997)) may have some usefulness, but they themselves are built using statistical models with other variables and thus we do not use those here. The selected ancillary variables (Table 1) were first extracted from their native resolution using the nearestneighbour method, onto a consistent high-resolution monthly grid (0.125 • x0.125 • , ∼12.5km). This horizontal resolution was used as this is the highest resolution of the current generation of global atmospheric chemistry simulations (Hu et al., 2018). We calculate monthly means because the chemical lifetime of iodide in the surface oceans is thought to be at least several months 15 (Campos et al., 1996;Žic et al., 2013), and possibly years (Edwards and Truesdale, 1997;Tsunogai Shizuo and Henmi, 1971).
Indeed, the lifetime of iodide is thought to be sufficiently long that, where deep vertical mixing occurs on a seasonal timescale, this may be the dominant loss process from surface waters (e.g. Chance et al. (2010)). The values for bathymetric ocean depth we set to a minimum depth of 2 metres, to remove terrestrial locations, and the same value was used for all months.
For each iodide observation, the nearest point in space and time was extracted from the high resolution gridded ancillary data. 20 For the 31 iodide observations where a month was not available (Luther and Cole, 1988;Tsunogai Shizuo and Henmi, 1971;Wong and Cheng, 1998), an arbitrary month was chosen (of March for Northern hemispheric observations and September for Southern hemispheric observations). Outliers within the observations are removed as described in Sect. 2. A further single dataset (Truesdale et al., 2003) was also excluded from this analysis. This is discussed in Appendix Sect. A1.

25
Here we first explain the way in which we use the machine learning algorithm (Sect.

Random Forest Regressor algorithm
As the aim here is to predict a continuous numerical value for sea-surface iodide, a regression approach is taken. As discussed in the introduction, previous approaches have been made to parameterise sea-surface iodide, and the most commonly used relationships employ sea-surface temperature as the predictor variable. Here we take a different multivariate and non-parametric approach, using the computationally cheap and interpretable Random Forest Regressor (RFR) algorithm (Breiman, 2001;5 Pedregosa et al., 2011).
Random forest regression is based on finding a number of decisions trees, which predict the dependent variable. As all of the trees contribute to the prediction and they are collectively referred to as a "forest". These trees can be explained as a record of the way the algorithm has linearly traversed a subset of the training data, splitting the data into two parts at each decision point or "node" in a way that minimised the internal differences of the parts. The best split is chosen between the available 10 variables based on an error metric (e.g. mean square error) and this process is continued until a criterion of purity is reached or a minimum number of data points are left from a split. This is essentially a classification problem. The prediction of the forest is the mean value of the prediction of all of the different decision trees, which attempts to make the results more of a regression problem. More details of this approach can be found in Friedman et al. (2009).
This approach differs to previous approaches which have individually tested proposed relationships and selecting the best 15 performing model(s) as a parameterisation (e.g. Table A1). Here, an algorithm uses the data it is provided to build a model that gives a prediction and therefore it is the data itself that defines the model that is used to predict new values. A key difference of this approach is also that only a subset, the "training" set, is used to build the model and the rest (or "withheld" set) is then used to test the performance of the model. Here we use 80 % of the data for the "training" set and use the remaining 20 % as the "withheld" set (also commonly referred to as the "testing set"). 20 To ensure that the models built are generalisable and mitigate overfitting, the Random Forest approach used here artificially increases the randomness within the forest (Pedregosa et al., 2011). This is done by randomly combining single decision trees by an approach referred to as "bootstrap aggregation" or "bagging" (Breiman, 2001;Tong et al., 2003). This additional "bagging" approach randomly samples observations within the training dataset and so mitigates over-fitting of the trees to the dataset (Friedman et al., 2009). Furthermore, to maintain the statistical distribution between the training and withheld datasets 25 and the dataset as a whole, a stratified sampling approach is used to randomly select data within the quartiles of the dataset.
Machine learning algorithms can generally be tuned to increase performance using settings called hyperparameters. However, Random Forests are known to generally perform well without tuning. The default hyperparameters therefore were used here (Pedregosa et al., 2011), except for increasing the number of trees ("n_estimatators") from 10 to 500. Mean Square Error (MSE) was used as the criterion for evaluating each split (also referred to as a "node"). The maximum number of "features" 30 (the ancillary variables provided to the algorithm, such as temperature or nitrate concentration) considered when looking for the best split is set to the number provided to the algorithm. The number of splits a tree is allowed to make ("max_depth") is not restricted and further nodes are made until leaves contain less than two samples ("min_samples_split") and a minimum of one ("min_samples_leaf"). All forests are built here use bootstrapping.

Error and uncertainty calculations
Understanding the errors and uncertainties in the global iodide distribution is important due to any sensitivities to this value within the modelled Earth system. We consider three sources of error in our predictions: the "dataset selection" error due to the splitting of the dataset into training and withheld parts; the "model selection error" due to the choice of dependent variables; and the "observational error" on the iodide measurements.

5
To quantify the "dataset selection" error, we construct models from 20 pseudo-random splits of the dataset into training and withheld parts. The hyperparameters and input ancillary variables are kept the same for the generation of the 20 models, so that the only difference between the models is the training dataset. These 20 models are then used to predict the withheld data. Performance metrics (Root Mean Square Error (RMSE) and average absolute prediction etc.) can then be calculated for each model. This gives a range of 20 values, which can then be converted to a percentage range as the error. This is done by 10 dividing the maximum within the range over the maximum value to give a maximum value and minimum within the range over the maximum value to give minimum value. Significant differences between model's performance metrics would suggest important sensitivity to the training / withheld dataset splits.
We define the "model selection" error as the uncertainty resulting from the choice of input ancillary variables. A number of combination of input variables are possible in generating the models, and each will generate a different prediction. We 15 quantify this error as the difference in performance against the withheld dataset and prediction value (e.g. average global value). Similarly to our calculation of "dataset selection" error, this can be converted to percentage error by considering the range in these values and dividing them by minimum and maximum values.
For the "observational error" we refer to Chance et al. (2019b), who provide individual error estimates for each of the iodide observations in the data compilation. Over half (51 %) of the data points have an error of 5 % or less, and a further ∼25% have 20 an uncertainty in the range of 5-10 %.We therefore use a value of 10 % as a conservative estimate of the "observational error".

Outlier identification and removal
Our dataset consists of values for ancillary variables and iodide concentration for all of the 1342 measurement locations in the observational dataset (Sect 2). As discussed in Sect 3.1, we split this dataset into two parts: (i) a training set for use in building and optimising models, and (ii) a withheld set to evaluate the models built. Particular care was taken to ensure the withheld and 25 training datasets were representative of the entire dataset in the way the models built, therefore improving performance and "generalisability" to unseen data (See Sect. 3.1).
We take a Random Forest Regressor (RFR) model built with variables that were intuitively assumed to give a reasonable ability to differentiate the observations (using depth, temperature, and salinity as the independant variables -abbreviated to "RFR(DEPTH+TEMP+SAL)" following Table 1). The "RFR(DEPTH+TEMP+SAL)" model was then used to explore the in the predicted iodide for withheld data as summarised in the final column of Table 2 and shown graphically in Appendix Fig.   A1.
We define outliers here as values greater than the 3 rd quartile plus 1.5 times the interquartile range (Frigge et al., 1989).
Removing these forty nine values categorised as outliers (>309.5 nM) leads to a vast improvement in the RMSE error in the ensemble prediction from 95.1 nM to 37.6 nM ( Table 2). This is shown graphically in Appendix Fig A1, with the other subsets 5 of the data explored (Table 2). This demonstrates that the high values are not well enough represented by the dataset to be able to be captured by the RFR approach. The removal of these high values from the dataset can also be justified as the driver for these concentrations is not yet well understood (Chance et al., , 2019cCutter et al., 2018).
Removing these outliers reduces RMSE in the prediction with the twenty independent model builds from 48.2 nM to 2.3 nM (3 rd quartile -1 st quartile). Once these outliers are excluded, more modest changes in average RMSE are then seen if models 10 are built only using coastal or non-coastal data. Fig. A1 also shows this is seen when removing lower salinity data ('Salinity ≥30 PSU & no outliers'), which is indicative of estuarine water. This highlights the strength in this approach's ability to predict iodide in different biogeochemical regions (i.e. not just coastal or non-coastal locations).
An additional removal of a single dataset of nineteen observations from the Skagerrak strait (Truesdale et al., 2003) was made due to it exerting a disproportionate influence on iodide prediction in high Northern latitudes (>=65 • N), an area that is 15 almost entirely unconstrained by local observations. We note that the Skagerrak is relatively unusual oceanographically, being an estuarine location with high ship traffic, and is considered unlikely to be an analogue for iodine speciation in the Arctic. This is decision is discussed further in Appendix A1 and the predictions made including this dataset are also included in the shared output (Sect. 5).
From here, only the 1293 observational points excluding outliers and the data from the Skagerrak strait (Truesdale et al.,20 2003) are used.

Selection of ancillary variables and building an ensemble prediction
To decide which ancillary variables (temperature, salinity, etc, -see Table 1 and Sect. 2) should be used to predict sea-surface iodide concentration, RFR models were built and evaluated with different combinations of variables. Thirty eight combinations were considered (see 1 st column of Appendix Table A2) . 25 The top twenty performing models, based on their Root Mean Square Error (RMSE) against the withheld data, are plotted in Fig. 2, alongside existing parameterisations. The standard deviation for all predicted values is also shown to illustrate variation in the predictions. A complete list of the performance and of all models built here and their performance is given in the appendix (Table. A2).
The RMSE values in Fig. 2 shows the increased skill present in the new predictions compared to the existing parameter- parameterisations, respectively, to 33.2-37.4 nM for the top ten models created here. Only modest gains are seen in RMSE between models with three variables or more. The best-performing model in the list is only marginally better than the 10 th best performing, therefore there is not an obvious "best" performing set of ancillary variables. Thus going forwards we use an ensemble prediction approach based on the mean value from an ensemble of the 10 top-performing models.

Model Descriptions
Unlike many machine learning approaches, the Random Forest Regressor algorithm is interpretable. The decision trees can be 5 visualised to explain the main features driving the splits. Figure 3 shows schematically the whole regression approach taken here. Panel (a) shows single trees, of which 500 are built with the same input variables and then combined into forest (b). Then this forest is combined with the nine other top-performing models (made from different combinations of ancillary variables) to make an ensemble (c). The ten predictions of (c) are then arithmetically averaged into a single prediction, which thus includes the predictions of 5000 trees with 10 different combinations of input variables. In Fig 3a, the colour of a limb or "branch" 10 following a node is given by the variable driving that split within the training dataset. For Fig. 3b and 3c it shows the percent of that times that a variable drives that node within the forest. The value of the ancillary variable that sets the split is shown inside the circle (a,b,c). The thickness of the branch scales to the throughput of training dataset samples contained within that split. The trees are shown to a depth of five nodes for aesthetic reasons and due to increased divergence of the trees within a forest the deeper you go. However the trees themselves are unlimited in the depth they can reach. 15 The first and larger splits in the data at decision "nodes" in the models can be simply read, which can provide understanding of the main variables driving the initial and largest splits in the prediction. For all models in the ensemble, the initial split is driven by temperature, with a split occurring at around 21.1 • C (with a standard deviation of 1.2 • C). The data is then split by two further nodes from this, a left and right hand split (e.g. Fig. 3b). If depth or temperature is present as a variable, then they drive the majority of the next splits. If depth is not present as a variable, then either nitrate or mixed layer depth (MLD) is the 20 most common variable to dictate the split in the data at the next node in the tree. Thus a qualitative way of interpreting the initial splits of the dataset would be to say that the model is primarily differentiating between warmer and shallower locations.

Results
Here we evaluate the performance of the ensemble prediction against the observational dataset (Sect. 4.1) and then we explore the predicted global monthly surface concentrations (Sect. 4.2).  Table 3.  The "dataset selection" error, which shows the the influence of the choice of how the dataset is split into training and with data on model prediction, is described in Section 3.2. Within the 20 member ensemble of different testing/withdrawn choices, 15 the average variation in RMSE was 8.4 nM (5.9-11.02 nM) and in the range of average predicted values was 6.1 nM (5.4-6.6 nM). This translates to a percentage error of 16.1-29.5 % on the RMSE and 5.6-7.0 % on the average predicted value.
The "model selection" error, which is the influence of the different independent variables used, is described in Section 3.2.
The difference in the average prediction of the 10 members of the ensemble is 1.8 nM (with a range of average prediction from 96.0 to 97.8 nM) and the range of the difference in model performance is 3.9 nM (33.2-37.2 nM). As a percentage this "model 20 selection" translates to a percentage uncertainty on the the RMSE of 10.6-11.9 % and on the average of 1.8-1.9 %.
The "dataset selection" and "model selection" compares to an error on the observations of ∼10 %. Uncertainty from "dataset selection" has a far greater effect on the prediction error than "model selection". This is can be expected due to the small dataset size. The combined error in the prediction ("dataset selection"+"model selection" error) is either comparable to (7.4-8.9 % in terms of average prediction) or greater (27-41 % in terms of RMSE) than the observational error. 25 From this analysis we have shown that the new ensemble RFR model performs significantly better than those currently in the literature. We now turn to explore the predicted global distribution of sea-surface iodide using our ensemble model.

Global sea-surface iodide distribution
From the ensemble prediction system we calculate monthly global grids (0.125 • ×0.125 • , ∼12.5 km) of sea-surface iodide using the gridded ancillary data (Sect. 2). The annual average spatial predictions are shown in Fig. 6 with the observations South China sea, a region without any observations (Fig. 6). Some features are visible in the concentration field appear to be associated with deep bathymetric features (e.g. the higher concentrations over the mid Atlantic ridge - Fig. 6)), even though a physical explanation for such a link seems unlikely.
Summary statistics on the global predictions are shown in Table 4. These show that, as for comparisons at the observed locations (Section 4.1), the ensemble prediction is broadly in between the two existing parameters. The new ensemble model The "model selection" error due to variability within ensembles' 10 members, generated with different independent variables, gives a global average surface concentration between 102.3 to 108.8 nM. This range in prediction gives a "model selection" error of 6.45 nM, which equates to 6.0-6.3 %. Like with the global uncertainty from "dataset selection", the global value would be expected to be lower than the uncertainty at the specific locations of the observations (Sect. 4.1) due to the 25 more homogeneous nature of the predicted areas. However, a greater variation is seen from different model predictions than within predictions for the observation locations. This highlights the importance of the different ancillary variables considered here and also therefore the strength gained from the ensemble approach taken here.
Within members of the ensemble, variation is modest except for two ensemble members which divergence north of >=65 • N (Appendix Fig A2). As noted earlier (Sect 2), the values in this region are very poorly constrained by the observational dataset 30 (Fig 6).
In addition to the three errors we described above, we also attempt to gain to an understanding of the spatial uncertainty in the ensemble prediction. We do this via calculating the differences in the predicted spatial fields from the 10 ensemble members.

Data availability
The monthly ensemble mean and standard deviation between ensemble members for the main prediction presented here  Table A5) using the open-source Python xESMF package (Zhuang, 2018). We recommended use of the standard output provided, but have also provided the predictions made by the model with the Skagerrak dataset (Truesdale 10 et al., 2003) included (which was excluded from the analysis presented here, as discussed further in Appendix Sect A1).
Ancillary data extracted for observation locations and used to predict spatial fields is available from sources stated in Table   1. Iodide observations are described by Chance et al. (2019b) and made available by the British Oceanographic Data Centre (BODC, Chance et al. (2019a); DOI:https://doi.org/10/czhx).
6 Discussions and conclusions 15 Here we have explored the ability of an algorithmic approach combined with various physical and chemical variables to predict sea-surface iodide, without aiming to represent the biogeochemical or abiotic processes occurring. This approach instead gives a data-driven "best guess" at concentrations and an ability to quantify where the greatest uncertainty lies. However, certain features such as prediction of an apparent relationship between ocean bathymetry and sea-surface iodide concentrations, where the ocean is very deep (e.g. over the Mid-Atlantic Ridge) are unlikely to have a plausible physical explanation (Fig 6). 20 The new spatial prediction presented here differs from what has been used previously in atmospheric models (e.g.  Table A1) to calculate ocean iodine emissions, a higher emission would therefore now be expected. This would result in larger decreases in tropospheric ozone burden than 25 previously suggested (Sherwen et al., 2016a). A higher iodide sea-surface concentration would also result in a greater calculated ozone deposition (Luhar et al., 2017;Sarwar et al., 2016).
We have calculated the errors in sea-surface iodide concentrations at observational locations due to the "dataset selection" of 16.1-29.5 %, and due to "model selection" of 1.8-1.9 % (Sect 4.1 and 4.2). These error estimates can be compared to an approximated error in the observations of ∼10 % (Chance et al., 2019b). Considering the average predicted concentration 30 globally here is 106 nM (Sect. 4.2), these errors are notable. The greatest driver in error is the "dataset selection". More observations, and particularly observations representative of under-sampled areas and seasons, will be required to reduce this error. The error caused by "dataset selection" is also reduced when the predictions are considered spatially over the global sea-surface.
The choice of the algorithm used here is subjective and numerous other options are available. The Random Forest Regressor was chosen due to appropriateness for the continuous regression task performed here, its relatively cheap computation cost, 5 and its interpretability. Considering the greatest uncertainty is driven by the paucity and sparsity of observations, using more complex techniques would not be expected to yield particularly different or drastically better results, considering other tradeoffs.
We have developed a new way to build a spatially and temporally resolved dataset from a spatially and temporally sparse input of observations. This has allowed for use of use more of the observations, than traditional approaches, which is partic- . Spatial re-gridding used the xESMF package (Zhuang, 2018). Plots presented here were created using the Matplotlib (Hunter, 2007) and Seaborn (Waskom et al., 2017)  Ideally with sparse datasets, as much data as possible would be included for training the regression models used. If a feature in the data is different enough to the rest of the dataset and not sufficiently be represented for the regressor model to characterise it, then it has potential to introduce a large "dataset error" (See Sect. 3.2 for details). This was shown when the iodide values 5 above the outlier threshold were included (Sect. 3.2). There could be many other affects of including data that is significantly different to the rest of the dataset.
The data from the Skagerrak strait (Truesdale et al., 2003), which is included in the (Chance et al., 2019b) compilation of iodide data, was excluded from this analysis. This is because upon inclusion, high iodide at high latitudes (>= 65 • N) are calcuated (Appendix Fig. A6). An increasing trend is seen with latitude, reaching values comparable to the highest predicted The Skagerrak strait data (Truesdale et al., 2003) is also from region where the observed ancillary variables compare poorly with those extracted from ancillary datasets. Observed salinity is between 24.0 and 33.5 PSU, whereas the climatological value 15 is 31.7 to 35.8 PSU. This equates to a bias of the climatology versus the in-situ observations of up to 9.6 PSU or 40 %. The Skagerrak is biogeochemically different from the Arctic, and its large influence on predicted values in the Arctic may arise simply from its latitudinal proximity, given the lack of observations from the regions itself.
The area this dataset is sampling in is also unusual in the Chance et al. (2019b) compilation due to its estuarine nature.
However, this cannot entirely explain its behaviour as their are other estuarine datasets included (such as those from around the 20 Chesapeake Bay (Luther and Cole, 1988;Wong andCheng, 1998, 2008)) which do not cause the same issue.
As the feature of high predicted Arctic iodide is driven by a single dataset of 19 samples (of which 4 would be removed as outliers) from a different region, it is highly uncertain. Not only do the in-situ salinity observations compare poorly to the extracted ancillary ones, but the location itself represents a heterogeneity within the Chance et al. (2019b) compilation as it has relatively high observed iodide concentrations. It was therefore omitted from the analysis presented within this paper. However          Figure A1. Combined kernel density and boxplots ("violin plots") showing the distribution of Root Mean Square Error (RMSE) for 20 different models built from 20 different pseudo-random initialisations for different selection of the dataset as described in Table 2 and Sect 3.2. Models built using the whole dataset ("all"), including outliers, show a significantly higher RMSE due to observations with higher iodide concentrations. The model used here includes ancillary variables of temperature, depth and salinity which were thought to intuitively give a reasonable result.