Collection and analysis of a global marine phytoplankton primary-production dataset

Mattei, Francesco; Scardi, Michele

doi:https://doi.org/10.5194/essd-13-4967-2021

Articles | Volume 13, issue 10

Data description paper

27 Oct 2021

Abstract

Phytoplankton primary production is a key oceanographic process. It has relationships with marine-food-web dynamics, the global carbon cycle and Earth's climate. The study of phytoplankton production on a global scale relies on indirect approaches due to the difficulties of field campaigns. Modeling approaches require in situ data for calibration and validation. In fact, the need for more phytoplankton primary-production data was highlighted several times during the last decades.

Most of the available primary-production datasets are scattered in various repositories, reporting heterogeneous information and missing records. We decided to retrieve field measurements of marine phytoplankton production from several sources and create a homogeneous and ready-to-use dataset. We handled missing data and added variables related to primary production which were not present in the original datasets. Subsequently, we performed a general analysis highlighting the relationships between the variables from a numerical and an ecological perspective.

Data paucity is one of the main issues hindering the comprehension of complex natural processes. We believe that an updated and improved global dataset, complemented by an analysis of its characteristics, can be of interest to anyone studying marine phytoplankton production and the processes related to it. The dataset described in this work is published in the PANGAEA repository (https://doi.org/10.1594/PANGAEA.932417) (Mattei and Scardi, 2021).

Dates

Received: 02 Jul 2021 – Discussion started: 07 Jul 2021 – Revised: 17 Sep 2021 – Accepted: 30 Sep 2021 – Published: 27 Oct 2021

1 Introduction

Phytoplankton primary production is a pivotal process in biological oceanography. It accounts for roughly 98 % of marine-system autotrophic production and 50 % of global productivity (Carvalho et al., 2017; Field et al., 1998). Accordingly, this process provides the main source of energy for structuring marine food webs (Duarte and Cebrián, 1996; Kwak and Park, 2020a). Furthermore, it influences the absorption of carbon dioxide from the atmosphere and the flux of carbon to the deep ocean, generating a process known as the biological pump (Giering et al., 2014; Longhurst and Glen Harrison, 1989). Global phytoplankton production is estimated to lie between 30 and 70 Gt C yr⁻¹ (Carr et al., 2006; Friedrichs et al., 2009; Saba et al., 2010; Siegel et al., 2013), i.e., most probably still larger than global anthropogenic CO₂ emissions (roughly 37 Gt CO₂ yr⁻¹, equivalent to about 10 Gt C yr⁻¹) (Caldeira and Duffy, 2000; Falkowski and Wilson, 1992; Jackson et al., 2019; Peters et al., 2020; Sabine et al., 2004).

These features highlight the strong link between phytoplankton production and both ecosystem services and Earth's climate (Barange et al., 2014; Behrenfeld et al., 2006; Blanchard et al., 2012a; Blythe et al., 2020). This link in turn reflects the central role of this biological process not only in the oceans' dynamics but also in those of the whole geobiosphere.

The availability of remotely sensed information allowed for the study of phytoplankton production at a global scale, providing a synoptic view of several ocean features, such as chlorophyll a surface concentration, sea surface temperature (SST) and photosynthetically active radiation (PAR) (Groom et al., 2019; Platt and Sathyendranath, 1988; Sammartino et al., 2018; Westberry and Behrenfeld, 2014). Several models which exploit satellite information to estimate primary production have been proposed (e.g., Behrenfeld and Falkowski, 1997; Friedrichs et al., 2009; Mattei and Scardi, 2020; Westberry and Behrenfeld, 2014). In fact, estimators of this process provide valuable tools to assess characteristics and patterns of global phytoplankton production, which in turn could provide insights into the dynamics of several phenomena, e.g., fishery yields and climate change effects (Fox et al., 2020; Richardson and Schoeman, 2004; Russo et al., 2019).

Nevertheless, the lack of field data negatively affects the power and the reliability of both satellite information and model estimates. In fact, these data are essential to calibrate satellite sensors and develop primary-production estimators.

The most complete and freely accessible phytoplankton production dataset is available at http://sites.science.oregonstate.edu/ocean.productivity/field.data.c14.online.php (last access: 26 January 2020). From now on we will refer to these data as the Ocean Productivity dataset. This dataset contains data from several oceanographic cruises accounting for roughly 3000 production profiles. Accordingly, this dataset has been widely used to develop several models (Behrenfeld and Falkowski, 1997; Scardi, 2001), since it contains depth-resolved ¹⁴C phytoplankton production estimates coupled with chlorophyll a profiles, SST and PAR measurements. Such data are crucial for both studying phytoplankton production and developing models for estimating this process. Despite being a precious source of information, these field data cover only some ocean basins, are affected by missing values and have not been updated since 1994 (orange dots in Fig. 1). As the amount and quality of field data are paramount characteristics to understanding the dynamics of natural processes, we wanted to create a new global dataset expanding both the temporal and the spatial coverage of the previously cited one. Moreover, we decided to associate more production-related information to each record, e.g., production-to-biomass ratio, bottom depth of the sampling station, distance from the coastline, etc. The extra information could be extremely valuable for analysis and modeling purposes, especially when machine-learning techniques come into play (Peters et al., 2014; Recknagel, 2001).

In order to retrieve phytoplankton production data, we consulted several sources which provide freely accessible information such as PANGAEA, the Biological and Chemical Oceanography Data Management Office and the National Centers for Environmental Information (a complete list of the exploited datasets with their respective references can be found in the Supplement).

To select suitable data, we adopted four compulsory criteria that the newly found information had to meet. The first two criteria were related to the spatiotemporal context of the observations. Accordingly, we kept only the data for which the date (yyyy-mm-dd) and the geographical coordinates (latitude and longitude) of the field measurements were recorded. The third fundamental requirement was the presence of depth-resolved ¹⁴C measurements, i.e., phytoplankton production profiles. Depth-resolved data are more informative than depth-integrated ones, since they provide information not only on the production magnitude but also on its vertical distribution. The final requirement was the measurement of chlorophyll a profiles associated with the production data. Chlorophyll a is the most abundant pigment in photosynthetic organisms, and it is responsible for light energy absorption. The concentration of this pigment is intimately related to phytoplankton productivity, i.e., the production of organic matter. In fact, the energy gathered from sunlight allows for fixing carbon dioxide into organic matter. Even if several studies suggest that the chlorophyll-to-carbon ratio could be extremely variable depending on physical forces and phytoplankton physiological adaptation (Huot et al., 2007; Westberry et al., 2008), chlorophyll a is one of the most commonly used proxies for phytoplankton biomass, which in turn is a key parameter for studying phytoplankton production. This is especially true when the relationship between the pigment and the biomass is not explicitly formulated, as in machine-learning applications. Furthermore, chlorophyll a can be easily measured with probes during sampling cruises, and its surface concentration has also been estimated from remote-sensing platforms since 1978 thanks to the Coastal Zone Color Scanner (CZCS). The former feature is important for exploiting these measurements to develop production models, while the latter is crucial in a synoptic application of these estimators.

On the other hand, we did not discard records lacking other variables, such as SST or PAR. In fact, if these measurements were not available, we filled the gaps by using interpolation techniques or retrieving the missing information from satellite platforms (see Sect. 2.1).

Retrieving phytoplankton production data that were not present in the Ocean Productivity dataset and the gap-filling operation allowed for expanding both the spatial and the temporal coverage of this dataset. Spatial and temporal variability are important features in dealing with global assessments of natural processes such as phytoplankton primary production. The new dataset comprised 6084 production profiles collected between 1958 and 2017, 2214 of which were derived from the Ocean Productivity dataset. The need for a larger amount of data related to the phytoplankton production process was already highlighted by several studies which either developed or compared primary-production models (Campbell et al., 2002; Carr et al., 2006; Friedrichs et al., 2009; Lee et al., 2015; Saba et al., 2010; Scardi, 1996). In fact, the model comparison studies revealed a high level of uncertainty in determining global phytoplankton production. The range of estimated global production which resulted from these comparisons was extremely large, highlighting how challenging modeling this process on a large scale could be.

Additionally, we enriched the new dataset with several qualitative and quantitative variables. These variables were either derived from the existing ones, retrieved from satellite platforms or extracted from freely accessible datasets (see Sect. 2.2).

Once the new dataset was structured, we highlighted its characteristics using several descriptive techniques and analyzed the results from an ecological perspective (Sect. 3).

2 Materials and methods

2.1 Data merging and reconstruction

As stated in the previous section, the most complete dataset of phytoplankton primary production was freely downloadable from the Ocean Productivity website. It contained roughly 3000 production profiles (Fig. 1, orange dots) associated with ancillary information such as chlorophyll profiles, SST and PAR measurements. We used this dataset as a starting point and searched for data that could improve its spatial and temporal coverage. We conducted our searches mainly on the PANGAEA and NOAA websites, which are freely accessible data repositories. Each dataset that we used in this work has been cited as specified by the repository or the owners (see Supplement). We limited our search to datasets which contained depth-resolved measurements of net phytoplankton production, such as ¹⁴C-based ones, associated with the respective chlorophyll a concentration. The main reasons for this choice were the additional vertical distribution information provided by phytoplankton profiles with respect to depth-integrated estimates and the biomass proxy provided by the chlorophyll concentration. These features allowed for analyzing several characteristics of phytoplankton production, thus contributing to a deeper understanding of the whole process.

The retrieved data were incorporated into the new dataset only if the geographic coordinates and the sampling date had been recorded. These data allowed for accounting for both the spatial and temporal variability of phytoplankton production in the analysis.

For each retrieved dataset that met our requirements, the first step was to merge it with the Ocean Productivity one. From the latter dataset we kept the following variables: date of the sampling (yyyy-mm-dd), geographical coordinates of the sampling station (latitude and longitude in degrees), day length as hours of the photoperiod (h), sampling depth (m), Pb_opt (mg C mg Chl a⁻¹ h⁻¹), SST (°C), surface PAR (Einstein m⁻² d⁻¹), sampling depth chlorophyll a concentration (mg m⁻³), sampling depth daily primary production (mg C m⁻³ d⁻¹) and integrated daily primary production (mg C m⁻² d⁻¹). To perform the merging procedure, we filled all the gaps in the newly retrieved data relative to the abovementioned variables. We computed the day length from the latitude and the day of the year of the sampling. SST missing values were filled using MODIS daily data for observations from 2003 to the present (MODIS Aqua Mapped Daily 4 km; https://doi.org/10.5067/MODSA-1D4D9, NASA OBPG, 2020), multiple-sensor daily data for records from 1981 to 2003 and the 1981–1990 mean for the data prior to 1981 (Copernicus SST) (Merchant et al., 2019). We also used the MODIS values for filling the PAR gaps from 2003 to the present. Profiles collected before 2003 that lacked PAR measurements were discarded, since daily PAR estimates are available only through the MODIS platform (late 2002 to the present). After discarding these data, the Ocean Productivity dataset dropped from roughly 3000 profiles to 2214. We estimated the Pb_opt parameter using the procedure proposed by Behrenfeld and Falkowski (1997a). Finally, we estimated the missing values in chlorophyll a and primary-production profiles with a depth-weighted average of adjacent values. Once the merging procedure was finished, the new dataset contained 37 722 records from 6084 profiles, compared with the 14 300 records and 2214 profiles of the old one (Fig. 1).
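Two of the gap-filling steps described above can be illustrated with a minimal Python sketch. The text does not state which day-length formula was used, so the version below assumes a common geometric approximation (Cooper's solar declination combined with the sunset hour angle). The gap filling is interpreted here as linear interpolation in depth between the nearest measured values above and below. The function names `day_length_hours` and `fill_gaps` are hypothetical.

```python
import math


def day_length_hours(latitude_deg, day_of_year):
    """Approximate photoperiod (h) from latitude and day of the year.

    Assumption: Cooper's declination approximation; the formula actually
    used for the dataset is not specified in the text.
    """
    declination = math.radians(23.45) * math.sin(
        2.0 * math.pi * (284 + day_of_year) / 365.0
    )
    lat = math.radians(latitude_deg)
    # Cosine of the sunset hour angle, clamped to handle polar day and night.
    cos_h = -math.tan(lat) * math.tan(declination)
    cos_h = max(-1.0, min(1.0, cos_h))
    return 24.0 / math.pi * math.acos(cos_h)


def fill_gaps(depths, values):
    """Fill None entries with a depth-weighted average of adjacent values.

    Assumption: "adjacent" means the nearest measured value above and below,
    which makes this equivalent to linear interpolation in depth. Gaps at the
    top or bottom of a profile are left untouched in this sketch.
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for i, v in enumerate(values):
        if v is not None:
            continue
        above = [k for k in known if k < i]
        below = [k for k in known if k > i]
        if not above or not below:
            continue
        a, b = above[-1], below[0]
        weight = (depths[i] - depths[a]) / (depths[b] - depths[a])
        filled[i] = (1.0 - weight) * values[a] + weight * values[b]
    return filled


if __name__ == "__main__":
    print(day_length_hours(45.0, 172))
    print(fill_gaps([0, 10, 30], [1.0, None, 4.0]))
```

The weight grows with the distance from the shallower neighbor, so a missing value close to one measurement is dominated by that measurement.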

https://essd.copernicus.org/articles/13/4967/2021/essd-13-4967-2021-f01

Figure 1. Map of the 6084 phytoplankton production profiles comprised in the new dataset. In orange are the profiles derived from the Ocean Productivity dataset (2214), and in red are the newly retrieved ones (3870).

2.2 Ancillary data association

We added several variables related to phytoplankton primary production to the dataset (Tables 1 and 2). These variables can be divided into three groups: (i) data extracted from freely available datasets, (ii) numerical measures computed from the existing ones and (iii) categorical data derived from the previous two groups.

Among the variables that belong to the first group, we list the bottom depth (m) and the statistics related to it. We retrieved the bathymetry information from the GEBCO website (General Bathymetric Chart of the Oceans; https://doi.org/10.5285/c6612cbe-50b3-0cff-e053-6c86abc09f8f, GEBCO Compilation Group, 2021). We queried the GEBCO dataset using the geographic coordinates of the sampling stations to extract the bottom depth data. We also exploited up to eight neighbor pixels to compute the bottom depth variance of the sampling-point neighborhood.
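The neighborhood statistic can be sketched as follows. This is an illustrative reading of the description, not the authors' code: it assumes the bathymetry is already available as a plain 2-D grid indexed by row and column, that the central pixel is included together with its neighbors, that missing pixels are stored as `None`, and that the population variance is used. The function name `neighborhood_depth_stats` is hypothetical.

```python
def neighborhood_depth_stats(grid, row, col):
    """Return (bottom depth at the pixel, variance over its 3x3 neighborhood).

    Assumptions: the central pixel is part of the neighborhood, pixels outside
    the grid or set to None are skipped (hence "up to eight" neighbors), and
    the population variance is computed.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    cells = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            r, c = row + dr, col + dc
            if 0 <= r < n_rows and 0 <= c < n_cols and grid[r][c] is not None:
                cells.append(grid[r][c])
    mean = sum(cells) / len(cells)
    variance = sum((z - mean) ** 2 for z in cells) / len(cells)
    return grid[row][col], variance


if __name__ == "__main__":
    # Toy bathymetry (m); values are invented for illustration only.
    toy_grid = [
        [120.0, 150.0, 200.0],
        [130.0, 160.0, 240.0],
        [None, 180.0, 300.0],
    ]
    print(neighborhood_depth_stats(toy_grid, 1, 1))
```

A large variance flags stations located over steep bathymetry, such as shelf breaks, whereas values near zero indicate flat bottom around the sampling point.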

We retrieved information about the mixed-layer depth (MLD) using the Levitus model datasets (Levitus et al., 1994; Levitus and Boyer, 1994), which are freely available on the Levitus web page (https://psl.noaa.gov/data/gridded/data.nodc.woa94.html, last access: 26 January 2020, NOAA/OAR/ESRL PSL, 2021).

The last data that we gathered from an external dataset were the distance from the coastline (km). The 0.04° distance dataset was downloaded from the NASA website (https://oceancolor.gsfc.nasa.gov/docs/distfromcoast/, NASA Ocean Biology Processing Group (OBPG) and Stumpf, 2012).

The second group of new variables was computed from information already present in the new production dataset at this stage. We computed the day of the year from the date, i.e., the first day of January and the last day of December were represented by 1 and 365 respectively.
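A minimal sketch of the day-of-year computation, using only the Python standard library. The function name `day_of_year` is hypothetical. Note that with this approach 31 December of a leap year maps to 366; how leap years were handled in the dataset is not stated in the text.

```python
from datetime import date


def day_of_year(iso_date):
    """Map a yyyy-mm-dd string to the ordinal day of the year (1 = 1 January)."""
    return date.fromisoformat(iso_date).timetuple().tm_yday


if __name__ == "__main__":
    print(day_of_year("1994-01-01"))
    print(day_of_year("2017-12-31"))
```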

Table 1. Production dataset numerical variables.

* Last access: 26 January 2020.

We also estimated the euphotic-zone depth and the total chlorophyll a in the euphotic zone (mg Chl a m⁻²) using a model developed by Morel and Berthon (1989).

Moreover, we extracted both the max sampling depth of non-null production values (m) and the depth at which maximum production occurred for each profile (m), thus creating two new variables.

We estimated depth-integrated chlorophyll a and depth-integrated primary production (IPP) by the trapezoidal integration of in situ measurements (mg Chl a m⁻² and mg C m⁻² d⁻¹ respectively). Subsequently, we estimated the production-to-biomass ratio by dividing the depth-integrated phytoplankton production by the depth-integrated chlorophyll a concentration (mg C mg Chl a⁻¹ d⁻¹).
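The integration and the ratio can be sketched in a few lines of Python. The profile values below are invented for illustration, and the function name `trapezoid` is hypothetical. The sketch integrates only between the shallowest and deepest sampled depths; whether the original procedure extrapolated to the surface or below the last sample is not stated.

```python
def trapezoid(depths, values):
    """Trapezoidal integral of a profile over depth (depths in m, increasing)."""
    total = 0.0
    for i in range(1, len(depths)):
        total += 0.5 * (values[i] + values[i - 1]) * (depths[i] - depths[i - 1])
    return total


if __name__ == "__main__":
    # Toy profile, invented for illustration only.
    depths = [0.0, 10.0, 20.0, 50.0]  # m
    chl = [0.5, 0.6, 0.4, 0.1]        # mg Chl a m^-3
    pp = [10.0, 12.0, 6.0, 1.0]       # mg C m^-3 d^-1

    integrated_chl = trapezoid(depths, chl)  # mg Chl a m^-2
    ipp = trapezoid(depths, pp)              # mg C m^-2 d^-1
    pb_ratio = ipp / integrated_chl          # mg C mg Chl a^-1 d^-1
    print(integrated_chl, ipp, pb_ratio)
```

Because both integrals share the same depth axis, the square meters cancel in the ratio, which leaves carbon fixed per unit of chlorophyll a per day.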

The last group of variables was generated by dividing the production profiles into classes on the basis of the previously computed variables. We created the hemisphere variable by assigning each profile to the Northern Hemisphere, Southern Hemisphere or Equator on the basis of the sampling latitude. We also created a season variable on the basis of the date and the Northern Hemisphere season. We divided the year into four groups of 3 months each starting from January and tagged them as winter, spring, summer and fall respectively.
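The two categorical variables can be sketched as below. The month-to-season mapping follows the description above. The hemisphere rule is an assumption: the text does not say whether "Equator" means a latitude of exactly 0° or a narrow band around it, so the sketch exposes a hypothetical `equator_tolerance_deg` parameter that defaults to zero. The function names are hypothetical as well.

```python
from datetime import date

SEASONS = ("winter", "spring", "summer", "fall")


def hemisphere(latitude_deg, equator_tolerance_deg=0.0):
    """Classify a station by latitude.

    Assumption: the width of the "Equator" class is not given in the text,
    so it is controlled here by a hypothetical tolerance parameter.
    """
    if abs(latitude_deg) <= equator_tolerance_deg:
        return "Equator"
    return "Northern Hemisphere" if latitude_deg > 0 else "Southern Hemisphere"


def northern_hemisphere_season(iso_date):
    """Map a yyyy-mm-dd date to its 3-month group, starting from January."""
    month = date.fromisoformat(iso_date).month
    return SEASONS[(month - 1) // 3]


if __name__ == "__main__":
    print(hemisphere(41.9), northern_hemisphere_season("2005-07-14"))
    print(hemisphere(-33.0), northern_hemisphere_season("2005-12-01"))
```

Since the season label always follows the Northern Hemisphere calendar, a Southern Hemisphere profile tagged "winter" was actually sampled in austral summer; combining the two variables recovers the local season.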

For numerical data, we applied the Jenks optimization algorithm (Jenks, 1967) to define the boundaries of six classes from very low to huge (very low, low, moderate, high, very high and huge). Then we used these boundaries to assign each pattern to one of the six classes. It is important to note that this class segmentation is relative to our data rather than an absolute classification criterion. Finally, we added two columns to provide information about the nature of the SST and PAR measures. These flag columns specify whether the variable's value is in situ (flag value = 0) or reconstructed (flag value = 1), and each is placed next to the variable it flags.
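The idea behind the Jenks classification is to choose class boundaries that minimize the sum of squared deviations within classes. The sketch below is not the implementation used for the dataset, which the text does not specify. It is an exact dynamic-programming variant (often called Fisher-Jenks) that is only practical for small samples. The function names `jenks_upper_bounds` and `assign_class` are hypothetical.

```python
import bisect

LABELS = ("very low", "low", "moderate", "high", "very high", "huge")


def jenks_upper_bounds(values, n_classes):
    """Upper boundary of each class, minimizing the within-class squared deviations.

    Exact dynamic programming over the sorted data, O(n_classes * n^2).
    Requires len(values) >= n_classes.
    """
    data = sorted(values)
    n = len(data)
    # Prefix sums give the squared deviation of any sorted slice in O(1).
    s1, s2 = [0.0], [0.0]
    for x in data:
        s1.append(s1[-1] + x)
        s2.append(s2[-1] + x * x)

    def ssd(i, j):
        """Sum of squared deviations from the mean of data[i:j]."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # cost[k][j]: best total deviation when the first j values form k classes.
    cost = [[inf] * (n + 1) for _ in range(n_classes + 1)]
    split = [[0] * (n + 1) for _ in range(n_classes + 1)]
    cost[0][0] = 0.0
    for k in range(1, n_classes + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                candidate = cost[k - 1][i] + ssd(i, j)
                if candidate < cost[k][j]:
                    cost[k][j] = candidate
                    split[k][j] = i
    # Walk back through the stored split points to recover the boundaries.
    bounds, j = [], n
    for k in range(n_classes, 0, -1):
        bounds.append(data[j - 1])
        j = split[k][j]
    return bounds[::-1]


def assign_class(value, upper_bounds, labels=LABELS):
    """Label a value with the first class whose upper boundary is >= value."""
    index = bisect.bisect_left(upper_bounds, value)
    return labels[min(index, len(labels) - 1)]


if __name__ == "__main__":
    # Toy sample, invented for illustration only.
    sample = [1, 2, 10, 11, 30, 31, 60, 62, 100, 103, 200, 210]
    bounds = jenks_upper_bounds(sample, 6)
    print(bounds, assign_class(61, bounds))
```

Because the boundaries depend entirely on the sample, rerunning the procedure on a different subset of profiles would generally yield different classes, which is why the segmentation should be read as relative to the data.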

Finally, we investigated the relationship between the variables which are more intimately related to phytoplankton primary production, i.e., SST, PAR, chlorophyll a, max sampling depth, max production depth and the production-to-biomass ratio. We produced heatmaps to provide an insight into the categorical variables and performed a principal-component analysis (PCA) for their numerical counterparts.

Table 2. Production dataset categorical variables.

3 Results and discussion

With this work we aimed at building a global phytoplankton production dataset updating the Ocean Productivity one. Moreover, we wanted to expand the available information by associating several variables related to primary production. The data underlying this article are available in the article's online Supplement.

The comprehension of natural phenomena deeply relies on available data. These complex processes often involve nonlinear and not well-known relationships among their components. Accordingly, we believe that one crucial way to enhance our understanding of natural systems is provided by gathering information and then analyzing it.

In this framework, we extended both the spatial and the temporal coverage of the Ocean Productivity dataset. These two features are paramount to boosting our knowledge about the spatiotemporal distribution of phytoplankton production. In fact, the latter allows for taking temporal trends into account in the processes which are linked with climate-related issues and food web dynamics. The Ocean Productivity dataset contained data from cruises carried out between 1958 and 1994, which is a large span of time, but it has not been updated since then. Our data retrieval added 23 422 new patterns from 3870 production profiles which in most cases do not overlap with the Ocean Productivity dataset's temporal coverage. In fact, 2210 of the 3870 new phytoplankton profiles, i.e., roughly 57 % of the total, were collected between 1995 and 2017. Even if roughly 43 % of the new profiles share the time coverage with the Ocean Productivity ones, the majority of these data do not overlap with the spatial coverage of the older dataset, thus enhancing the heterogeneity of the data.

Although the Ocean Productivity dataset was the most comprehensive source of information about phytoplankton primary production, the bulk of its data were restricted to three main regions. These areas were the northwestern Atlantic, the eastern equatorial Pacific and the northeastern Pacific along the western coast of the United States. The other ocean basins were undersampled or not sampled at all (Fig. 1, orange markers). The new data improved the global coverage of the previous dataset. Several profiles were added in the Arctic Ocean, specifically in the Chukchi Sea, the Beaufort Sea, the Greenland Sea, the North Sea, the Norwegian Sea, the Barents Sea and the Kara Sea. In the Pacific Ocean the newly represented areas were the Bering Sea, the Gulf of Alaska, and the areas off of the Oregon and California coasts in addition to a few production profiles gathered off the eastern coast of New Zealand. In the western Atlantic, new information was available for the Gulf of St. Lawrence, the Florida coast and the Caribbean Sea. In the central Atlantic the newly represented areas were located southeast of Ireland, south of Cabo Verde and off of the Gulf of Guinea with a few records in the Bay of Biscay, off of the coast of Morocco and in the Mediterranean Sea. Few of the data in the Indian Ocean were present in the old dataset, but we reconstructed missing information and added new profiles from different datasets. The Southern Ocean remains strongly undersampled, with the addition of a few production profiles.

The temporal and spatial coverage of a dataset are crucial features. The first one allows for taking the evolution of the studied process into account. This aspect is important in any type of assessment work, especially in a climate change context. In the phytoplankton production framework, the temporal span covered by the available in situ data could be used to study several aspects. For example, repeated observations through the years for the same area could highlight temporal patterns of the investigated region. Moreover, this feature could be used to investigate the relationships between phytoplankton production and large-scale phenomena, e.g., El Niño–Southern Oscillation (ENSO). From a spatial perspective, the larger the ocean area represented in the dataset, the more of the spatial variability of the phytoplankton production process is taken into account. This feature is crucial, since both depth-resolved and depth-integrated phytoplankton production estimates are deeply influenced by the geographic characteristics of the investigated area, e.g., latitude, distance from the coastline and bottom depth. Therefore, to deepen the understanding of this biological process, we need to gather and analyze information from different areas. Finally, if we want to exploit a dataset to perform any global assessment of the phytoplankton primary production or tackle production-climate-related issues, we need an information pool that takes into account as much variability of the process as possible (Behrenfeld et al., 2016; Gibert et al., 2018; Hays et al., 2005).

One of the fields which heavily relies upon the amount and quality of the data is modeling. Several studies stressed how most of the limits in modeling phytoplankton production depend upon the data availability (Campbell et al., 2002; Carr et al., 2006; Mattei and Scardi, 2020; Scardi, 2001). For these reasons, we believe that the enhancement of both spatial and temporal coverage of a freely available production dataset is an important contribution to modern oceanography.

We did not limit our work to homogenizing several data sources into a single one; we also increased the amount of available phytoplankton-related information. This type of information could be useful for boosting our understanding of primary production. Moreover, the ancillary data could be extremely valuable to model development, especially when machine-learning techniques come into play. In fact, these approaches allow for the use of variables as predictors even if the relationship with the target variable (primary production here) is not known (Catucci and Scardi, 2020; Franceschini et al., 2019; Olden et al., 2008; Peters et al., 2014; Recknagel, 2001).

The first two descriptors added to the new dataset were the hemisphere of the sampling station and sampling season, indicated as the Northern Hemisphere season. These two variables provided an insight into the global temporal and spatial distribution of the data (Fig. 2).

https://essd.copernicus.org/articles/13/4967/2021/essd-13-4967-2021-f02

Figure 2. (a) Number of profiles gathered in the Northern Hemisphere, in the Southern Hemisphere and at the Equator (5578, 478 and 28 respectively). (b) Number of profiles sampled in Northern Hemisphere winter (January to March), spring (April to June), summer (July to September) and fall (October to December).