The Ocean Carbon States Database: a proof-of-concept application of cluster analysis in the ocean carbon cycle

Latto, Rebecca; Romanou, Anastasia

doi:https://doi.org/10.5194/essd-10-609-2018

Articles | Volume 10, issue 1

27 Mar 2018

Abstract

In this paper, we present a database of the basic regimes of the carbon cycle in the ocean, the “ocean carbon states”, as obtained using a data mining/pattern recognition technique in observation-based as well as model data. The goal of this study is to establish a new data analysis methodology, test it and assess its utility in providing more insights into the regional and temporal variability of the marine carbon cycle. This is important as advanced data mining techniques are becoming widely used in climate and Earth sciences and in particular in studies of the global carbon cycle, where the interaction of physical and biogeochemical drivers confounds our ability to accurately describe, understand, and predict CO₂ concentrations and their changes in the major planetary carbon reservoirs. In this proof-of-concept study, we focus on using well-understood data that are based on observations, as well as model results from the NASA Goddard Institute for Space Studies (GISS) climate model. Our analysis shows that ocean carbon states are associated with the subtropical–subpolar gyre during the colder months of the year and the tropics during the warmer season in the North Atlantic basin. Conversely, in the Southern Ocean, the ocean carbon states can be associated with the subtropical and Antarctic convergence zones in the warmer season and the coastal Antarctic divergence zone in the colder season. With respect to model evaluation, we find that the GISS model reproduces the cold and warm season regimes more skillfully in the North Atlantic than in the Southern Ocean and matches the observed seasonality better than the spatial distribution of the regimes. Finally, the ocean carbon states provide useful information in the model error attribution. Model air–sea CO₂ flux biases in the North Atlantic stem from wind speed and salinity biases in the subpolar region and nutrient and wind speed biases in the subtropics and tropics. 
Nutrient biases are shown to be most important in the Southern Ocean flux bias. All data and analysis scripts are available at https://data.giss.nasa.gov/oceans/carbonstates/ (DOI: https://doi.org/10.5281/zenodo.996891).

Received: 29 Sep 2017 – Discussion started: 17 Oct 2017 – Revised: 06 Mar 2018 – Accepted: 06 Mar 2018 – Published: 27 Mar 2018

1 Introduction

The ocean carbon cycle plays an important role in controlling the airborne fraction of CO₂ in the atmosphere, thereby regulating the rate of global warming, i.e., the rising temperatures in the Earth's troposphere. However, the ocean carbon cycle is controlled by a plethora of physical, biological and biogeochemical processes over a broad range of temporal and spatial scales. In this paper, we seek to present and assess a data mining/pattern recognition technique, namely cluster analysis, for the purpose of defining the basic regimes, or “ocean carbon states”, that describe the oceanic carbon cycle variability. The goal is to increase our understanding of the marine carbon cycle by revealing patterns and information that other techniques do not provide.

For geophysical applications, climate datasets have inherent complexities that are not easily identifiable in the age of “big” data. Cluster analysis is a highly effective uni- or multi-variate classification method for large, high frequency datasets because it can find structure in a body of complex, geophysical data (Anderberg, 1973; Peron et al., 2014). Clustering seeks to identify the critical modes and natural patterns of a dataset without any training or predetermined spatial–temporal guidelines; therefore, it is an “unsupervised” graph theory method. The merit of a novel, unsupervised method such as clustering is that it can recognize connectivity between multiple variables. This can be understood as connectivity in a temporal sense where cluster analysis can identify joint interannual or seasonal patterns and in a spatial sense where clustering has the power to identify patterns that relate different regions or basins (Jain, 2010; Phillips et al., 2015).

Traditional methods of univariate analysis, such as principal component analysis or spectral decomposition, cannot fully describe important physical states of the climate system or adequately detect change (Hoffman et al., 2011) because these methods neglect interactions between state variables as well as spatial and temporal co-variability. In contrast, cluster analysis has been successfully applied to various dynamical systems in order to extract the organized states and detect change as well as in novel applications of model–data intercomparison (Hoffman et al., 2008). For example, this technique has been used to define atmospheric weather states by identifying cloud regimes (Jakob and Tselioudis, 2003; Rossow et al., 2005; Williams and Webb, 2009; Tselioudis et al., 2013; Bodas-Salcedo et al., 2014; Oreopoulos et al., 2016). Bankert and Solbrig (2015) were able to extract a 3-D cloud representation using cluster analysis. This technique has also been used to characterize water types in lakes (Trochta et al., 2015), hydraulic habitat composition in rivers (Hugue et al., 2016), phenology patterns in forests (Trans Mills et al., 2011), solar variability (Zagouras et al., 2013), ENSO phenomena (Radebach et al., 2013), and regions with characteristic hydrological responses (Halverson and Fleming, 2015), among many other applications.

Beyond identifying regimes, cluster analysis can be useful in model assessment applications, like that of Wood et al. (2015), which used weather states derived from cluster analysis for process studies, satellite calibration, and model evaluation. Both regime identification and model evaluation are the focus of the cluster analysis presented in this paper as well.

Elsewhere in ocean carbon cycle science, clustering-type methods (self-organizing maps and neural networks) have been used to build reconstructions or as regression analysis alternatives for surface ocean pCO₂ (Lefèvre et al., 2005; Telszewski et al., 2009; Sasse et al., 2013; Landschützer et al., 2013, 2014; Nakaoka et al., 2013). Unlike these studies, here we seek to obtain the co-variability maps and conditions of different ocean-related variables and understand where, why and how they change.

Other studies, non-statistical but similar in concept to multivariate regime identification, have focused more on larger-scale geographic variations (Fay and McKinley, 2014; Trochta et al., 2015) than on the regional aspects of ocean biogeochemistry and its interaction with the physical circulation, such as in the western boundary current regions, in the upwelling zones on the eastern boundaries, and in the eddying field.

The structure of the paper is as follows. Section 2 describes the datasets used in this study, both the observation-based sources as well as the model experiments. Section 3 presents the k-means cluster analysis methodology and application, including discussion of the k-means clustering technique and sensitivity to number of clusters chosen, to binning, and to data normalization. The results of the methodology are provided in Sect. 4. Section 4.1 focuses on how the methodology is applied in observations from the North Atlantic basin. The observed ocean carbon states are then characterized temporally and spatially in order to reveal their physical meaning. Next, the model carbon states are computed and characterized in a similar way to the observations. Using the ocean carbon states, model biases are also discussed and evaluated. Section 4.2 repeats the analysis presented in Sect. 4.1, but now applied to the Southern Ocean. Finally, general discussion and conclusions are provided in Sect. 5. A note about the figures in the paper: some interesting but non-critical figures are offered in the Supplement and are denoted as Fig. S#. All data and analysis scripts are available at the https://data.giss.nasa.gov/oceans/carbonstates/ website (DOI: https://doi.org/10.5281/zenodo.996891).

2 Data

2.1 Choice of variables to represent ocean carbon regimes

One critical question to answer at the outset of any clustering analysis is which key geophysical variables the analysis should be based on. For the purposes of this study, we picked sea surface temperature (SST) and the partial pressure of CO₂ in the ocean surface water ($p\mathrm{CO}_{2\,\mathrm{sw}}$). The rationale for this choice is as follows. There are two main pathways that determine the ability of the ocean to take up CO₂ (Sarmiento and Gruber, 2006): the chemical disequilibrium, expressed by pCO₂, dissolved inorganic carbon (DIC, the sum of all inorganic carbon species) and nutrients, and the physical processes, such as air–sea interaction (expressed by the wind speed) and ocean circulation (expressed by sea surface temperature and salinity). Greater insight into the ocean's biogeochemical processes that control these pathways can inform the improved use of field measurements, the development of better metrics for model evaluation, and the selection of more suitable parameterizations in climate models in order to provide more accurate predictions. We select $p\mathrm{CO}_{2\,\mathrm{sw}}$ and SST because they are able to represent a broad range of biogeochemical and physical processes. We use them in cluster analysis to find temporal and spatial patterns in their joint parameter space that can be used to understand CO₂ flux distributions and their fluctuations. Other variable pairs could be used instead; a comparison between choices is set aside for future work.

This study will focus on two oceanic basins, namely the North Atlantic (defined as 80° W to 45° E, 0 to 90° N) and the Southern Ocean (defined as 180° W to 180° E, 90 to 40° S), because of their importance in the global carbon cycle (Takahashi et al., 2009).
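The two basin definitions above can be written as simple latitude–longitude tests. The following is a minimal sketch, not code from the database: the function names are hypothetical, longitudes are assumed to be given in degrees east, and the boxes are purely geometric, so land points and any neighbouring-basin points inside them would still have to be removed with an ocean mask.

```python
def normalize_lon(lon_deg):
    """Map a longitude in degrees east to the interval [-180, 180)."""
    return (lon_deg + 180.0) % 360.0 - 180.0


def in_north_atlantic(lat_deg, lon_deg):
    """True inside the box 80 W to 45 E, 0 to 90 N quoted in the text."""
    lon = normalize_lon(lon_deg)
    return 0.0 <= lat_deg <= 90.0 and -80.0 <= lon <= 45.0


def in_southern_ocean(lat_deg, lon_deg):
    """True inside the band 90 to 40 S; the text places no limit on longitude."""
    return -90.0 <= lat_deg <= -40.0
```

For example, `in_north_atlantic(40.0, 330.0)` treats 330° E as 30° W and therefore falls inside the North Atlantic box.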

2.2 Observation-based data

2.2.1 Air–sea flux of CO₂ and pCO₂, surface wind speed, sea surface temperature and salinity

The 12-month climatology of the air–sea flux is obtained from the Carbon Dioxide Information Analysis Center (LDEO database, NDP-088; Takahashi et al., 2009). It is derived from the difference between surface water pCO₂ ($p\mathrm{CO}_{2\,\mathrm{sw}}$) and air pCO₂, together with the air–sea gas transfer rate. The climatological mean distribution of surface water pCO₂ was obtained from 3 million measurements made from 1970 to 2007 and normalized to the reference year 2000. The pCO₂ of the air is computed from the GlobalView CO₂ concentration zonal mean, NCAR monthly mean barometric pressure, SST, and salinity. Other variables in the dataset pertinent to this analysis are wind speed (derived from the 1979–2005 climatological mean NCEP-DOE AMIP-II Reanalysis wind speed field), climatological sea surface temperature (from NOAA Climate Diagnostic Center Objective Interpolation), and salinity (from the NODC World Ocean Database 1998). All variables are available as a 12-month climatology at a $4^{\circ} \times 5^{\circ}$ resolution.

2.2.2 Nitrate

The nitrate monthly climatology at 1° horizontal resolution is obtained from the World Ocean Atlas 2013 version 2 (Boyer et al., 2013). It is compiled from in situ measurements at standard depth levels and is available as annual, seasonal, and monthly climatologies. Nitrate is an essential nutrient that limits the growth of phytoplankton, which are responsible for fixing carbon dioxide from the atmosphere. Therefore, pCO₂ levels in the surface ocean depend partially on the abundance of nitrate.

2.3 Numerical simulations

The NASA-GISS modelE2.1 output used for this analysis comes from five ensemble coupled model simulations of the 20th century with realistic greenhouse gas, aerosol, land use and solar forcing, as used in CMIP5 experiments. The model physics is somewhat different than the modelE2 used in the CMIP5 experiments, mostly due to improved representation of the ocean mesoscale mixing. The physical ocean and the biogeochemistry modules are described in detail in Romanou et al. (2013, 2017). Briefly, here we note that the ocean model is a non-Boussinesq mass-conserving ocean model with 32 vertical levels and $1^{\circ} \times 1.25^{\circ}$ horizontal resolution. The vertical coordinate is a stretched z-level coordinate and has a free surface and natural surface boundary fluxes of freshwater and heat that are obtained by the atmospheric model. In addition to advection and turbulent mixing, it also includes a scheme for isopycnal eddy fluxes and isopycnal thickness diffusion. The interactive ocean carbon cycle model consists of a biogeochemical model (NASA Ocean Biogeochemistry Model, NOBM; Gregg and Casey, 2007) and a gas exchange parameterization for the computation of the CO₂ flux between the ocean and the atmosphere (Romanou et al., 2013). Specifically, the air–sea exchange of CO₂ (Sarmiento and Gruber, 2006; Takahashi et al., 2009) is described by Eq. (1):

$$F = k_w K_{0} \left(p\mathrm{CO}_{2\,\mathrm{atm}} - p\mathrm{CO}_{2\,\mathrm{sw}}\right), \tag{1}$$

where $k_w$ is the piston velocity for CO₂ (in m s⁻¹), which depends on the wind speed; K₀ is the solubility coefficient (expressed in mol CO₂ kg⁻¹ atm⁻¹), which depends on sea surface temperature (SST) and sea surface salinity (SSS); and pCO₂ is the partial pressure of CO₂ (Wanninkhof et al., 2013) in the atmosphere (atm) and the surface ocean (sw). Equation (1) describes the chemical disequilibrium of CO₂ in the oceanic and atmospheric reservoirs due to the solubility and biological pumps. As discussed in Sarmiento and Gruber (2006), the $p\mathrm{CO}_{2\,\mathrm{sw}}$ in Eq. (1) is a function of temperature, salinity, wind speed, DIC, nutrients, and alkalinity (a measure of the excess of bases over acids), which can be expressed as follows:

$$p\mathrm{CO}_{2\,\mathrm{sw}} = f(\mathrm{SST}, \mathrm{SSS}, \mathrm{DIC}, \text{wind speed}, \text{nutrients}, \text{alkalinity}). \tag{2}$$

NOBM utilizes ocean temperature and salinity, mixed layer depth and the ocean circulation fields, and the horizontal advection and vertical mixing schemes obtained from the host ocean model as well as shortwave radiation (direct and diffuse) and surface wind speed obtained from the atmospheric model to produce horizontal and vertical distributions of several biogeochemical constituents. The carbon submodel parameterizes the cycling of carbon through the phytoplankton, herbivore and detrital components, affecting the dissolved inorganic and organic carbon in the ocean and interacting with the atmosphere. Alkalinity is assumed analogous to surface salinity, which is an acceptable approximation for the sea surface but does not take into account changes in the carbonate pump. Temperature and salinity are affected only by physical processes such as circulation, advection, eddy mixing and stirring, and local upwelling/downwelling, while DIC distributions are influenced by all these physical processes and also several biogeochemical processes such as air–sea gas exchange, production by organisms, biological export to depth and remineralization there and nutrient availability in the water column. Atmospheric pCO₂ ($p\mathrm{CO}_{2\,\mathrm{atm}}$) is the saturation concentration of CO₂ in equilibrium with a water-vapor-saturated atmosphere at a total atmospheric pressure P and a given atmospheric pCO₂ level:

$$p\mathrm{CO}_{2\,\mathrm{atm}} = \frac{P}{P^{0}} [\mathrm{CO}_{2}]^{0}, \tag{3}$$

where P⁰ = 1 atm and [CO₂]⁰ is the saturation concentration at 1 atm total pressure.

The gas transfer velocity is given by

$$k_w = c \left(\frac{Sc}{660}\right)^{-1/2} u^{2}, \tag{4}$$

where u is the surface wind speed and c is the piston velocity coefficient, taken here equal to $0.337 / (3.6 \times 10^{5})$. The value of c has been agreed upon by the Ocean Carbon Model Intercomparison Project, phase II (OCMIP-II) so that the global, annual mean gas transfer coefficient for carbon dioxide ($k_w K_0$) is equal to 0.061 mol m⁻² yr⁻¹ (µatm)⁻¹ for preindustrial times. Sc, the Schmidt number, is computed using the temperature of the host ocean model following Wanninkhof (1992). The gas transfer velocity $k_w$ is computed only over open water. The solubility of CO₂ in the water, K₀, is also parameterized based on OCMIP using prognostic temperature, salinity and sea level pressure. In these model runs, the global average of the atmospheric concentration of CO₂ follows the Mauna Loa measurements (Dlugokencky and Tans, 2014), although regionally atmospheric CO₂ is allowed to vary due to the distributions of the ocean sources and sinks.
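To make Eqs. (1) and (4) concrete, the following is a minimal sketch of the flux calculation; it is an illustration, not the model code. The function names are hypothetical. The Schmidt number uses the cubic temperature fit for CO₂ in seawater that we attribute to Wanninkhof (1992), which should be checked against that reference before any quantitative use. The solubility K₀ is passed in by the caller rather than computed, and the units of the resulting flux follow from the units chosen for K₀ and pCO₂.

```python
# Piston velocity coefficient c from the text; with u in m/s, kw comes out in m/s.
C_PISTON = 0.337 / 3.6e5


def schmidt_number_co2(sst_celsius):
    """Schmidt number of CO2 in seawater (cubic fit assumed from Wanninkhof, 1992)."""
    t = sst_celsius
    return 2073.1 - 125.62 * t + 3.6276 * t**2 - 0.043219 * t**3


def gas_transfer_velocity(wind_speed, schmidt):
    """Eq. (4): kw = c * (Sc / 660)^(-1/2) * u^2."""
    return C_PISTON * (schmidt / 660.0) ** -0.5 * wind_speed**2


def air_sea_flux(kw, k0, pco2_atm, pco2_sw):
    """Eq. (1): F = kw * K0 * (pCO2_atm - pCO2_sw)."""
    return kw * k0 * (pco2_atm - pco2_sw)
```

With the sign convention of Eq. (1), F is positive when atmospheric pCO₂ exceeds the surface ocean value. Because the wind speed enters quadratically, a wind speed bias affects the flux more strongly than a proportional bias in K₀ would.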

The five ensemble member runs were averaged into one ensemble mean to account for the intrinsic climate variability that is not adequately resolved in climate models of low spatial resolution. The model output for the years 1995–2005 was then averaged again to produce a 12-month climatology for the purpose of direct comparison with the observationally based data in the Takahashi database.
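The two averaging steps described above can be sketched with NumPy as follows. This is only an illustration under stated assumptions: the array layout `(member, year, month, lat, lon)`, the function name and the argument names are hypothetical and are not taken from the analysis scripts.

```python
import numpy as np


def ensemble_climatology(monthly, years, first_year=1995, last_year=2005):
    """Average over ensemble members, then over the selected years.

    monthly: array with assumed shape (member, year, month, lat, lon)
    years:   1-D array of calendar years matching the year axis
    Returns a 12-month climatology with shape (month, lat, lon).
    """
    ensemble_mean = monthly.mean(axis=0)                   # (year, month, lat, lon)
    keep = (years >= first_year) & (years <= last_year)    # 1995-2005 inclusive
    return ensemble_mean[keep].mean(axis=0)                # (month, lat, lon)
```

Missing values coded as NaN (for example over land) simply propagate through both means in this sketch.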

The model output and the observational data were interpolated onto the same grid, which is the Takahashi ocean grid at $4^{\circ} \times 5^{\circ}$ resolution, with no Arctic Ocean, and the ocean mask was conformed across all observational and model datasets.
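Conforming the ocean mask can be illustrated as follows, assuming all fields have already been interpolated onto the same grid and that missing or land points are coded as NaN. The function name is hypothetical, and the regridding step itself is not shown.

```python
import numpy as np


def conform_mask(*fields):
    """Return copies of the fields with NaN wherever any one of them is missing."""
    valid = np.ones(fields[0].shape, dtype=bool)
    for field in fields:
        valid &= np.isfinite(field)          # keep only points valid in every field
    return [np.where(valid, field, np.nan) for field in fields]
```

After this step every dataset covers exactly the same set of grid points, so that differences between model and observations are not affected by differing coverage.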

In the rest of the paper, some conventions with regard to nomenclature should be noted. Firstly, the Takahashi carbon flux, pCO₂ and ancillary data as well as the nitrate climatology will be referred to as “observations”, for brevity, keeping in mind that they are really observation-based estimates and not direct observations. Secondly, “model” will exclusively refer to the numerical simulations using the NASA-GISS climate model, and by “algorithm”, “method” or “technique” we will refer to the clustering technique.

All data products are available in the Ocean Carbon States Database (https://data.giss.nasa.gov/oceans/carbonstates/).

Figure 1. Schematic diagram of the clustering methodology used in this paper: (a) 12-monthly mean climatological year data of two variables, pCO₂ and SST, (b) monthly 2-D histograms, (c) clustering of the 2-D histograms into groups by similarity in the bivariate distributions, and (d) clusters resulting when k=3 is assumed.
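The steps in the Figure 1 caption can be sketched in NumPy as follows. This is one possible reading of the caption, not the authors' implementation: the function names, the bin edges, and the deterministic farthest-point initialisation (used here only so that the result is reproducible) are all assumptions, and the k-means loop is a plain Lloyd iteration written out for illustration.

```python
import numpy as np


def monthly_histograms(pco2, sst, pco2_edges, sst_edges):
    """Build one flattened 2-D relative-frequency histogram per month.

    pco2, sst: arrays of assumed shape (12, lat, lon); NaN marks missing points.
    """
    rows = []
    for month in range(pco2.shape[0]):
        ok = np.isfinite(pco2[month]) & np.isfinite(sst[month])
        counts, _, _ = np.histogram2d(pco2[month][ok], sst[month][ok],
                                      bins=[pco2_edges, sst_edges])
        rows.append(counts.ravel() / max(counts.sum(), 1.0))
    return np.array(rows)


def kmeans(samples, k=3, n_iter=100):
    """Plain Lloyd iterations; returns (labels, centroids)."""
    # Farthest-point initialisation: start from the first sample, then repeatedly
    # add the sample that is farthest from all centroids chosen so far.
    centroids = [samples[0]]
    while len(centroids) < k:
        d = np.min([np.linalg.norm(samples - c, axis=1) for c in centroids], axis=0)
        centroids.append(samples[d.argmax()])
    centroids = np.array(centroids)
    for _ in range(n_iter):
        dist = np.linalg.norm(samples[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)             # assign each sample to nearest centroid
        new = np.array([samples[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):          # stop once the centroids no longer move
            break
        centroids = new
    return labels, centroids
```

Each centroid returned by `kmeans` is itself a flattened pCO₂–SST histogram, and can be reshaped to `(len(pco2_edges) - 1, len(sst_edges) - 1)` for plotting.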
