Description of a global marine particulate organic carbon-13 isotope data set

Abstract. Marine particulate organic carbon stable isotope ratios (δ13CPOC) provide insights into carbon cycling through the atmosphere, ocean and biosphere. They have, for example, been used to trace the input of anthropogenic carbon in the marine ecosystem due to the distinct isotopically light signature of anthropogenic emissions. However, δ13CPOC is also significantly altered during photosynthesis by phytoplankton, which complicates its interpretation. For such purposes, robust spatio-temporal coverage of δ13CPOC observations is essential. We collected all such available data sets and merged and homogenized them to provide the largest available marine δ13CPOC data set (https://doi.org/10.1594/PANGAEA.929931; Verwega et al., 2021). The data set consists of 4732 data points covering all major ocean basins beginning in the 1960s. We describe the compiled raw data, compare different observational methods, and provide key insights into the temporal and spatial distribution, which is consistent with previously observed large-scale patterns. The main sample collection methods (bottle, intake, net, trap) are generally consistent with each other when compared within regions. An analysis of 1990s median δ13CPOC values in a meridional section across the best-covered Atlantic Ocean shows relatively high values (≥ −22 ‰) in the low latitudes (< 30°) trending towards lower values in the Arctic Ocean (∼ −24 ‰) and Southern Ocean (≤ −28 ‰). The temporal trend since the 1960s shows a decrease in the median δ13CPOC by more than 3 ‰ in all basins except for the Southern Ocean, which shows a weaker trend but has relatively poor multi-decadal coverage.


their source values.
Averaging was applied only to depth ranges, which were included as their arithmetic mean. Sample timeframes were included only when they lay completely within one month and year. A sample depth given as "surface" was recorded as 1 m.
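The depth conventions above can be sketched as a small helper. This is only an illustration under assumptions: the function name `normalize_depth` and the accepted input formats (a single number, a `(top, bottom)` pair, or the string "surface") are hypothetical and are not taken from the scripts used for the data set.

```python
def normalize_depth(depth):
    """Return one depth value in metres, following the conventions in the text.

    Hypothetical input formats (assumptions for illustration only):
    - the string "surface"   -> 1.0 m
    - a (top, bottom) pair   -> arithmetic mean of the depth range
    - a single number        -> returned unchanged as float
    """
    if isinstance(depth, str):
        if depth.strip().lower() == "surface":
            return 1.0
        raise ValueError(f"unrecognized depth label: {depth!r}")
    if isinstance(depth, (tuple, list)):
        top, bottom = depth
        return (float(top) + float(bottom)) / 2.0
    return float(depth)
```

For instance, a sample reported for the range 0 to 10 m would be stored at 5 m under this convention.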
Wherever multiple types of δ13CPOC, e.g. similar measurements based on different methods, were given within one source, we chose only one type. In Westerhausen and Sarnthein (2003), we chose the "mass spectrometer" data set because this was the originally measured one. In Trull and Armand (2013a) and Trull and Armand (2013b), we used the "blanc corrections" data set of δ13C, since this set of δ13Corg values is the recommended one (Trull and Armand, 2001).
The primary source of the Tuerena and Lorrain data is given in our data set in the "Project/cruise" column. In the data set from Tuerena et al. (2019), this was originally labeled as "source", in the Lorrain data set as "campaign". In both data sets
We disregarded the trap duration given in Voss and von Bodungen (2003), which was reported as the negative value −1.

Content and structure of the data set
The data collection is made available in files of raw and interpolated values, respectively (Verwega et al., 2021). The raw data is a csv file that includes the anomalies of the δ13CPOC with respect to their mean and all available meta information. The interpolated data is provided as NetCDF files on two different global grids: a 1.8° × 3.6° resolution grid from a model that simulates δ13CPOC and the 1° × 1° resolution grid of the World Ocean Atlas.

Raw data file
The csv-format data file includes δ13CPOC anomalies and meta information in its columns. A full description of the content, value range and coverage of the individual columns is given in Table 3. Anomalies of δ13CPOC were calculated relative to the arithmetic mean of the full data collection, which is −23.955615278114315 ‰. Anomalies contain all relevant information with respect to the variability of the δ13CPOC data in space and time. This makes it easier to analyze bias information separately, e.g. during the first steps of model calibration. The reference includes the citation in as much detail as possible. Wherever available, this is taken from the original source.
Otherwise, we tried to include author, title, publication year, platform and DOI. For unpublished data, like Harrison's from Goericke's data set or those contributed by the coauthors, we noted where we took the data from. Anomalies of δ13CPOC are given in the δ-ratio described in Equation 1. A sample method was added wherever available.
Any special sampling circumstances are given in the "Note" column. Activity duration of sediment traps is denoted in the last column.
The "Origin" column lists the associated project, cruise or author's note. Some samples were associated with multiple projects; all of them are given in this column.
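Because the raw data file stores anomalies rather than absolute values, users typically need to convert between the two. The following is a minimal sketch, assuming the sign convention anomaly = value − mean (the text states only that anomalies are taken with respect to the mean). The function names are hypothetical; only the mean value is taken from the text.

```python
import numpy as np

# Mean of the full data collection as reported in the text (in per mil).
MEAN_D13C_POC = -23.955615278114315

def to_anomaly(d13c_poc):
    """Absolute values -> anomalies (assumed convention: value minus mean)."""
    return np.asarray(d13c_poc, dtype=float) - MEAN_D13C_POC

def to_absolute(anomaly):
    """Anomalies -> absolute values (inverse of to_anomaly)."""
    return np.asarray(anomaly, dtype=float) + MEAN_D13C_POC
```

Keeping the mean as a single named constant makes it straightforward to treat the offset separately from the variability, e.g. when comparing with model output.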

Interpolated data sets
The interpolated δ13CPOC data are available as Network Common Data Form (NetCDF) files on two global grids with different resolutions. NetCDF files are machine-independent and support the creation, access and sharing of array-oriented scientific data. On the coarser grid, we provide seven files, six of which each contain the data of an individual decade (from the 1960s through the 2010s). The seventh file comprises a combined set of all interpolated δ13CPOC data. On the finer grid, we provide one file including all δ13CPOC measurements with complete spatio-temporal information.
For the coarser interpolation, we chose the grid of the version 2.9 UVic model, as used e.g. in Schmittner and Somes (2016). The interpolation functions are based on work by Kessler and McCreary (1992) and can be summarized as follows: Let (x_1, y_1), ..., (x_n, y_n) ⊆ ℝ² be an equidistant grid and (x̃_1, ỹ_1), ..., (x̃_m, ỹ_m) ⊆ ℝ² be irregular measurement locations of a real tracer D_j, j ∈ {1, ..., m}. The measurements are weighted by a Gaussian weight function, where X, Y ∈ ℝ are scaling arguments and C ∈ ℝ is the cut-off parameter. We set X = 1.8, Y = 0.9 and C = 1 in our script.
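To illustrate how a Gaussian-weighted mapping of irregular measurements onto a grid can work, the following sketch may be helpful. It is not the script used for the data set. The exact weight function and the way the cut-off enters are assumptions: here the weight is exp(−((Δx/X)² + (Δy/Y)²)), and measurements whose scaled distance exceeds C are ignored. The function name `gaussian_weighted_grid` is hypothetical.

```python
import numpy as np

def gaussian_weighted_grid(grid_x, grid_y, obs_x, obs_y, obs_val,
                           X=1.8, Y=0.9, C=1.0):
    """Weighted mean of observations at each grid point.

    ASSUMED form (not confirmed by the source):
      r2 = ((x - x_obs) / X)**2 + ((y - y_obs) / Y)**2
      w  = exp(-r2) if r2 <= C**2 else 0
    Grid points without any observation inside the cut-off are set to NaN.
    """
    gx = np.asarray(grid_x, dtype=float)[:, None]   # shape (n, 1)
    gy = np.asarray(grid_y, dtype=float)[:, None]
    ox = np.asarray(obs_x, dtype=float)[None, :]    # shape (1, m)
    oy = np.asarray(obs_y, dtype=float)[None, :]
    vals = np.asarray(obs_val, dtype=float)         # shape (m,)

    r2 = ((gx - ox) / X) ** 2 + ((gy - oy) / Y) ** 2
    w = np.where(r2 <= C ** 2, np.exp(-r2), 0.0)    # apply cut-off
    wsum = w.sum(axis=1)

    out = np.full(wsum.shape, np.nan)
    ok = wsum > 0
    out[ok] = (w[ok] @ vals) / wsum[ok]             # normalized weighted mean
    return out
```

With this normalization, a grid point influenced by a single measurement simply takes that measurement's value, and grid points far from all measurements remain empty.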

Since the interpolation onto the finer grid excluded all data without full spatio-temporal metadata coverage, we focus the following descriptions of interpolated data on the coarse-grid interpolations.

Main dataset characteristics
The final data set includes 4732 individual δ13CPOC measurements of seawater samples. We show the distribution of δ13CPOC values by Gaussian kernel density estimation (KDE). KDE is a non-parametric method (Silverman, 1986) for approximating probability density functions; it is conceptually similar to a histogram but yields a continuous curve that does not depend on rigid intervals. We applied the Python implementation from the SciPy stats package (Virtanen et al., 2020) to create the results presented here. Likewise, we derived conditional probability densities of δ13CPOC values, given the measurement method applied.
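A minimal sketch of such a density estimate with `scipy.stats.gaussian_kde` is given below. The values and method labels are synthetic stand-ins generated for illustration, not the actual data set, and the default bandwidth selection of SciPy is an assumption, since the bandwidth used for the figures is not stated here. Conditional densities are approximated by fitting a separate KDE to the subset of each method.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Synthetic stand-in data (per mil) with hypothetical method labels.
values = np.concatenate([rng.normal(-23.0, 1.5, 300),
                         rng.normal(-27.5, 1.0, 100)])
methods = np.array(["bottle"] * 300 + ["trap"] * 100)

# Evaluate the densities on a regular grid of delta values.
grid = np.linspace(-35.0, -15.0, 401)

# Density of all values together.
density_all = gaussian_kde(values)(grid)

# Conditional densities: one KDE per sampling method.
density_by_method = {
    m: gaussian_kde(values[methods == m])(grid) for m in np.unique(methods)
}
```

Each conditional density integrates to one on its own, so the curves compare the shape of the distributions rather than the number of samples per method.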

Range and outlier values
The data distribution is presented by its KDE in Figure 1. The δ13CPOC values range over [−55.15, −4.5] ‰ with a mostly smooth distribution. Most of our data exhibit values around δ13CPOC ≈ −24 ‰, which is clearly identifiable as a single maximum in the KDE. Two smaller modes are visible at around δ13CPOC ≈ −27.5 ‰ and δ13CPOC ≈ −22 ‰. Beyond the two outer modes, the KDE declines steeply towards zero; this decline ends at around δ13CPOC ≈ −37 ‰ and δ13CPOC ≈ −14 ‰. Between δ13CPOC ≈ −37 ‰ and δ13CPOC ≈ −55.15 ‰ as well as between δ13CPOC ≈ −14 ‰ and δ13CPOC ≈ −4.5 ‰, only isolated data points remain. Below δ13CPOC = −37 ‰ we find 17 data points ranging down to δ13CPOC = −55.15 ‰. Down to δ13CPOC = −48 ‰ these were all taken from Lein and Ivanov (2009). Since more than 98 % of the data (4668 of the 4732 data points) have values that lie between δ13CPOC = −35 ‰ and δ13CPOC = −15 ‰, we focus on this range in the following analyses.
We tested the robustness of our KDE approach in a subsampling experiment. We considered 500 random subsets, each containing 20 % of the original data, over the range with the highest data density, [−35, −15] ‰, and visualized their KDEs in Figure 2. They show peaks at δ13CPOC ≈ −23 ‰, matching the maximum and the second, smaller mode to the right of it, and at δ13CPOC ≈ −27.5 ‰. The figure also shows the mean and the variance of the full ensemble of densities as a line and the shaded area around it, respectively.
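The subsampling experiment described above can be sketched as follows. This is an illustration under assumptions, not the original script: the input is synthetic, subsets are drawn without replacement (the text does not specify this), and the function name `subsample_kdes` is hypothetical.

```python
import numpy as np
from scipy.stats import gaussian_kde

def subsample_kdes(values, grid, n_subsets=500, fraction=0.2, seed=0):
    """Fit a KDE to each random subset; return ensemble mean and variance.

    Assumption: subsets are drawn without replacement.
    """
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    size = max(2, int(round(fraction * values.size)))
    densities = np.empty((n_subsets, grid.size))
    for i in range(n_subsets):
        subset = rng.choice(values, size=size, replace=False)
        densities[i] = gaussian_kde(subset)(grid)
    return densities.mean(axis=0), densities.var(axis=0)

# Synthetic stand-in values (per mil), restricted to the range [-35, -15].
rng = np.random.default_rng(1)
data = rng.normal(-23.0, 2.0, 1000)
data = data[(data >= -35.0) & (data <= -15.0)]

grid = np.linspace(-35.0, -15.0, 201)
mean_density, var_density = subsample_kdes(data, grid)
```

A small ensemble variance relative to the mean density indicates that the position of the modes does not depend strongly on which fifth of the data is used.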

Sampling methods
Various sampling methods were involved in obtaining the δ13CPOC data. We identified eighteen different sampling methods in the meta information, which could be attributed to 67 % of the data. In principle, all eighteen methods can be grouped into five groups. Data that could not be classified into any of these groups was assigned to a cluster that we refer to as "diverse". All sample devices provide data over all sample depths. Deeper samples were mainly taken by traps and pump systems, shallower ones by bottles and nets. Most data sampled deeper than 2600 m was collected by sediment traps. At 3800 m there were several trap contributions by Calvert (e.g. Calvert, 2002), mostly from the late 1980s. Data sampled by a manned deep-sea submersible is given at locations down to 2520 m (Lein and Ivanov, 2009).
We resolved differences between sampling methods in the Atlantic Ocean by comparing the KDE of all δ13CPOC data with conditional probability densities of the same data, separated by the four major methods, in Figure 3. In this context, the Atlantic Ocean covers the area between 45° S and 80° N and between 70° W and 20° E. Overall, after accounting for spatial sampling bias by comparing within regions, the different methods are generally consistent with each other (Figure 3).
In the full Atlantic Ocean, the densities of intake and net data are most representative of the maximum of the full δ13CPOC sample.
Of the intake data shown here, ≈ 80 % were sampled between 30° S and 30° N. When restricting to this area, net data resembles the full data better. Unlike the intake data, ≈ 80 % of the net samples were collected between 30° N and 60° N, where they also fit the overall δ13CPOC density best, followed by trap data. Trap and bottle data deliver the lowest δ13CPOC measurements in the Atlantic Ocean. Between ≈ 74 % and 85 % of both data types were sampled north of 60° N. A restriction to this area shows trap and bottle samples being closely aligned with the full data in this region.
The variance of the intake and trap data is, at ≈ 3 ‰, somewhat lower than the variance of all δ13CPOC data together, which at ≈ 5 ‰ is the highest presented here. Bottle and net data both show a variance of less than 2 ‰. Furthermore, trap, net and full δ13CPOC show a clearly pronounced second mode in their densities, while bottle and net data show a mostly clear individual maximum.

Spatial distribution
We show how the measurements are distributed over ocean depth and across the ocean surface. Most δ13CPOC data has been measured in the uppermost few meters of the ocean, and the best surface coverage is available for the Atlantic Ocean. Changes in δ13CPOC at the ocean surface were evaluated based on the coarse-resolution gridded NetCDF data.

Vertical distribution of the data set
Depth values are available for more than 80 % of the sample data, locating most of them in the upper ocean. This makes depth one of the least well covered metadata after temperature and sample method. The distribution of depth values is shown in Table 4, and an approximation by Gaussian KDE is visualized in Figure 4. The KDE shows the best data coverage for the uppermost ≈ 500 m of the oceans and a second, far smaller maximum at ≈ 3800 m. The depth ranges presented in Table 4

Horizontal distribution of the data set
All global oceans are covered with δ13CPOC data. Figure 5 depicts the horizontal distribution of available data for both interpolations. Here, the coarse-resolution interpolation is independent of time and the fine-resolution interpolation is averaged over all included times. A similar plot, but with a different purpose, is given later in this work in Figure 9, showing only surface data locations.
Many cruises are visible in Figure 5 as lines formed by connected grid cells, especially in the Atlantic and Indian oceans and, as shorter lines, in the Southern Ocean. Smaller groups of individual or connected grid cells also occur, mainly in the Pacific, Arctic and Southern oceans. The Atlantic Ocean provides the best data coverage, followed by the Southern and Indian oceans, while the northern Pacific has the sparsest coverage.
Table 4. Vertical data coverage in depth layers inspired by the coarse interpolation grid: The first column lists the observed depth layers. Below 50 m they are as defined by the coarse grid used for interpolation of the δ13CPOC data. The second column gives the explicit number of δ13CPOC data points available in this depth range.

A biome classification according to Fay and McKinley (2014) was applied to the gridded data, thereby defining latitudinal zones in the entire Atlantic Ocean. Distributions of δ13CPOC within the biomes are shown in Figure 6. Different colors mark the individual biomes and a black line shows the general global δ13CPOC distribution.

The distribution of δ13CPOC samples over the years is resolved in Table 5 and visually approximated by Gaussian KDE in Figure 5. The 1990s show the best data coverage. More than half of the data points are associated with a year in this decade, which is visible as a pronounced maximum in the estimated density. The sparsest data is found in the 1960s, when only 74 data points were sampled. All other decades contain between around 300 and 600 δ13CPOC data points. The latest data is mostly from

Seasonal trends
Monthly clustered data of the northern and southern hemispheres show seasonal variations, but more observations are required to demonstrate robust seasonality within different regions. Since most of the available δ13CPOC data originates from the 1990s, we selected only data from this decade to exclude longer-term changes. In Figure 8 we display all months with enough data points for the construction of a comparable KDE and indicate the same months by the same colors.

We show the changes in δ13CPOC values over the available decades in Figure 10. The plot includes approximated densities of the δ13CPOC measurements for each decade and median-vs-years graphs. The Southern Ocean was excluded from the main analysis due to its sparse coverage outside of the 1990s; its few available results are shown in the two lower separate panels.

to δ13CPOC ≈ −20 ‰ with a pronounced dip in the 1980s down to less than δ13CPOC ≈ −30 ‰. The densities support this observation in the 1980s, where the maximum is below δ13CPOC ≈ −30 ‰. Nevertheless, we need to take into account that most Southern Ocean data were sampled in the 1990s, while the 1970s and 2000s provide only few data and might not deliver comparable results.
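The median-vs-years graphs mentioned above rest on grouping the samples by decade and taking the median of each group. A minimal sketch of this step, assuming each sample carries a calendar year, is shown here. The function name `decadal_medians` and the example values are hypothetical and do not reproduce results from the data set.

```python
import numpy as np

def decadal_medians(years, values):
    """Median value per decade, keyed by the first year of the decade."""
    years = np.asarray(years, dtype=int)
    values = np.asarray(values, dtype=float)
    decades = (years // 10) * 10          # e.g. 1994 -> 1990
    return {
        int(d): float(np.median(values[decades == d]))
        for d in np.unique(decades)
    }

# Made-up example values (per mil), for illustration only.
example = decadal_medians(
    [1964, 1968, 1991, 1995, 1999],
    [-21.0, -23.0, -24.0, -26.0, -25.0],
)
```

The median is less sensitive than the mean to the isolated very low and very high values described in the section on range and outlier values, which is one reason it suits decades with few samples.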

The aim of this work was to construct the largest publicly accessible δ13CPOC data set. We tackled this by merging all known data sets and collecting all available additional seawater samples from a free data distribution platform (PANGAEA). This newly constructed δ13CPOC data set currently contains 4732 data points, with the potential to grow in the future. It is provided as a csv file and interpolated onto two global grids of different resolution in NetCDF format. The csv file contains the δ13CPOC anomalies with respect to their mean and all available meta information. The interpolations are provided on a coarse 1.8° × 3.6° grid of a δ13CPOC-simulating model and a finer 1° × 1° grid of the World Ocean Atlas. We provided a detailed description of our data collection procedure, of all added meta information and their coverage, and of the interpolation procedure carried out. We took the greatest care to make all data coherent, comparable and traceable and all adjustments transparent. Assumptions, changes and deletions applied to the data sets used are described in detail.
We described the general spatial and temporal trends of the sampled δ13CPOC data based on the raw data file. Distributions were always approximated by Gaussian kernel density estimators. The data ranges from 1964 to 2015, with by far the best coverage in the 1990s. Sample depths reach down to nearly 5000 m, with the best coverage in the uppermost 10 m in some areas. We were able to show that our δ13CPOC values are mostly located between δ13CPOC = −15 ‰ and δ13CPOC = −35 ‰, with two maxima at around δ13CPOC = −27 ‰ and δ13CPOC = −23 ‰, the latter being the more pronounced. A comparison of the main sample methods showed consistent results when compared within regions. δ13CPOC data separated by months indicate counteracting seasonal trends in the two hemispheres, but more data is required to demonstrate robust seasonality.