First release of the Pelagic Size Structure database: global datasets of marine size spectra obtained from plankton imaging devices

In marine ecosystems, most physiological, ecological, or physical processes are size dependent. These include metabolic rates, the uptake of carbon and other nutrients, swimming and sinking velocities, and trophic interactions, which eventually determine the stocks of commercial species, as well as biogeochemical cycles and carbon sequestration. As such, broad-scale observations of plankton size distribution are important indicators of the general functioning and state of pelagic ecosystems under anthropogenic pressures. Here, we present the first global datasets of the Pelagic Size Structure database (PSSdb), generated from plankton imaging devices. This release includes the bulk particle normalized biovolume size spectrum (NBSS) and the bulk particle size distribution (PSD), along with their related parameters (slope, intercept, and R²) measured within the epipelagic layer (0–200 m) by three imaging sensors: the Imaging FlowCytobot (IFCB), the Underwater Vision Profiler (UVP), and benchtop scanners. Collectively, these instruments effectively image organisms and detrital material in the 7–10 000 µm size range. A total of 92 472 IFCB samples, 3068 UVP profiles, and 2411 scans passed our quality control and were standardized to produce consistent instrument-specific size spectra averaged to 1° × 1° latitude and longitude and by year and month. Our instrument-specific datasets span most major ocean basins, except for the IFCB datasets we have ingested, which were exclusively collected in northern latitudes, and cover decadal time periods (2013–2022 for IFCB, 2008–2021 for UVP, and 1996–2022


The relevance of plankton size in approximating ecological processes
Plankton size structure observations are essential to bridge the gap between marine biogeochemical processes and biological stock assessments, including those of important commercial species (Boyd and Newton, 1999; Armstrong et al., 2001; Finkel et al., 2009; Guidi et al., 2009; Taniguchi et al., 2014; Hillebrand et al., 2022). Historically, ecosystems dominated by small phytoplankton were thought to support regenerated production, being rapidly recycled in the epipelagic layer and contributing little to carbon export. Conversely, larger phytoplankton were thought to fuel higher trophic levels and contribute, to a large extent, to carbon sequestration by sinking relatively fast to the mesopelagic (200-1000 m) layers (Legendre and Le Fèvre, 1995; Wassmann, 1997; Durkin et al., 2015). Although this paradigm has shaped almost all current biogeochemical models and their projections of marine ecosystem services under climate change, recent studies have challenged this concept. Indeed, plankton of intermediate size and/or trophic levels have increasingly been shown to contribute significantly to biogeochemical functioning and carbon export (Lomas and Moran, 2011; Choi et al., 2014; Durkin et al., 2015; Guidi et al., 2016; Ward and Follows, 2016; Biard et al., 2016; Leblanc et al., 2018; Richardson, 2019; Juranek et al., 2020; Schvarcz et al., 2022). These studies call for a global assessment of the plankton size continuum rather than of the discrete size categories defined by Sieburth et al. (1978) (i.e., picoplankton: 0.2-2 µm, nanoplankton: 2-20 µm, microplankton: 20-200 µm, mesoplankton: 200-20 000 µm, and nekton: 2000-20 000 000 µm) to study ecosystem functioning or to model ecosystem services under current and future anthropogenic pressures (Lombard et al., 2019; Ljungström et al., 2020; Atkinson et al., 2021).
The first estimates of plankton and particle size spectra across several orders of magnitude yielded global and robust patterns of roughly equal amounts of biomass distributed across particle sizes (Sheldon et al., 1972). Since this seminal study, there has been increasing recognition that plankton size structure is an effective way to summarize the inherent complexity of community structure (Stemmann and Boss, 2012) and how it relates to key ecosystem processes such as primary productivity (Marañón et al., 2001), fishery yields (Sheldon et al., 1977), and sequestration of carbon dioxide (CO2) from the atmosphere (Basu and Mackey, 2018). This is possible because organism body size serves as a "master trait" from which other biological properties are derived, such as metabolism (Huete-Ortega et al., 2012; Ikeda, 2014; Kiørboe and Hirst, 2014; Maas et al., 2021), growth rates (Hopcroft et al., 1998; Chen and Liu, 2010; Edwards et al., 2012), consumption rates (Hansen et al., 1994; Kiørboe and Hirst, 2014), predator-prey size ratios (Hansen et al., 1994; Hauss et al., 2023), mortality (Hirst and Kiørboe, 2002), active transport through diel vertical migration (Ohman and Romagnan, 2016), and sinking (Smayda, 1971; Cael et al., 2021). These size-dependent processes have been historically represented by allometric relationships, also referred to as power-law functions, whose parameters were derived empirically (see reviews from Chisholm, 1992, and Hillebrand et al., 2022) or mechanistically (see review from Andersen et al., 2016). Given the use of plankton and particle size structure as a proxy for complex ecological processes, estimates of pelagic size structure, with large spatial and temporal coverage, are essential to assess ecological trends across space and time.

The emergence of marine imaging devices and size structure observations
The need to capture pelagic size spectra at unprecedented scales has sparked the emergence of a multitude of in situ and laboratory-based plankton imaging systems in the past 20 years, with individual instruments designed to capture the continuous size distribution of organisms and detrital particles in a specific size range (Davis et al., 2005; Olson and Sosik, 2007; Gorsky et al., 2010; Picheral et al., 2010; Sieracki et al., 2010; Ohman et al., 2019). Plankton large enough to be identified and sized at the resolution of commercially available imaging systems include (1) nano- and microplanktonic protists (comprising photoautotrophs, mixotrophs, and heterotrophs), typically imaged by the FlowCam (Sieracki et al., 1998) or the IFCB flow cytometer (Sosik and Olson, 2007); (2) micro- and mesoplankton (comprising large chain-forming photoautotrophs, mixotrophs, and heterotrophs), routinely imaged in situ by UVPs (Picheral et al., 2010; Stemmann et al., 2012), CPICSs (Gallager, 2016), or VPRs (Davis et al., 2005) or collected with nets and later imaged on board with a ZooCAM (Colas et al., 2018) or in the lab with benchtop scanners like the ZooScan (Gorsky et al., 2010; Lehette and Hernández-León, 2009; Kiko et al., 2020); and (3) micronekton, which can complement the size range of mesoplankton that are well detected by ISIIS instruments (Cowen and Guigand, 2008).
Collectively, these imaging systems can capture a wide size range of marine plankton, spanning a few micrometers to tens of centimeters (Lombard et al., 2019), providing accurate estimates of plankton community structure and trophic dynamics (Atkinson et al., 2021). More recently, they also provided insight into diverse detrital pools, which comprise fecal pellets, deadfalls, or marine snow aggregates linked to specific biogeochemical properties (Kiko et al., 2017; Trudnowska et al., 2021). Such particles generally dominate UVP images across all size classes (Stemmann and Boss, 2012; Kiko et al., 2022), highlighting yet another continuum in particle transformation and degradation (Durkin et al., 2021).
As part of the digital revolution, these advancements in new technologies have been matched with an equally rapid diversification in sampling strategies (e.g., towed-, net-, moored-, or profiling-based sampling), available platforms (e.g., floats, gliders, buoys, moorings, ships of opportunity, and research vessels), data processing and management tools (e.g., collaborative platforms for image classification like EcoTaxa), or automated taxonomic (Luo et al., 2018; Irisson et al., 2022) and functional (Schröder et al., 2020; Orenstein et al., 2022) classification schemes, such that plankton imaging systems have become widespread for research and monitoring applications alike.
Phytoplankton and zooplankton biomass and diversity, as well as bulk particulate matter, were identified as essential ocean, biodiversity, and climate variables by the Global Observing Systems (Miloslavich et al., 2018; Chiba et al., 2018; Batten et al., 2019), and imaging systems offer a unique opportunity to accurately measure these variables at multiple spatial and temporal scales. Thus, plans are now underway to use plankton imaging systems in observing programs with large spatial and temporal scales. For example, the IFCB will be routinely deployed in the Bio-GO-SHIP program (Clayton et al., 2022), and the UVP6 (Picheral et al., 2022) will be included in the BGC-Argo floats (Claustre et al., 2020; Picheral et al., 2022). Long-term time series such as the California Ocean Observing System (CalOOS, https://data.caloos.org/, last access: October 2023) and the Northeast US Shelf Long-Term Ecological Research (NESLTER, https://nes-lter.whoi.edu/, last access: October 2023) rely mostly on IFCB data, and Point B in the Bay of Villefranche has already generated a ZooScan dataset that spans over 30 years. More recently, the combination of ZooScan and ZooCAM (Grandremy et al., 2024) has enabled the analysis of a regional-scale, long-term zooplankton survey (2004-2019, ongoing) on a temperate European continental shelf (Grandrémy et al., 2023a, b, c). Overall, sustained observations from IFCBs and UVPs have been ongoing since 2006 and 2008, respectively, and even track back to 1966 for laboratory-based ZooScan observations from preserved samples (García-Comas et al., 2011). Despite the large temporal and spatial coverage, it was not until recently that the first curated global dataset of particle sizes between 64 and 50 000 µm from UVP5 measurements was published (Kiko et al., 2022). This release was facilitated by a collaborative management platform, EcoPart (https://ecopart.obs-vlfr.fr, last access: October 2023), which enables the collection of the count and size information of bulk particles detected by the UVP. This unique platform, along with other collaborative platforms such as EcoTaxa (https://ecotaxa.obs-vlfr.fr, last access: October 2023) and the IFCB dashboards (https://ifcb-data.whoi.edu/dashboard, last access: October 2023, https://ifcb.caloos.org/dashboard, last access: October 2023) and their corresponding application programming interfaces (APIs), allow us to find and access size structure estimates easily and repeatedly, which satisfies two of the FAIR (findable, accessible, interoperable, reusable) data principles (Wilkinson et al., 2016) guiding current data management strategies (Lombard et al., 2019).
https://doi.org/10.5194/essd-16-2971-2024 Earth Syst. Sci. Data, 16, 2971-2999, 2024

The Pelagic Size Structure database project
With the support of many international data providers, we developed the Pelagic Size Structure database (PSSdb, https://PSSdb.net, last access: October 2023) to provide global datasets of particle and plankton size distributions.
Our project capitalizes on largely untapped size structure observations from plankton imaging devices, which can image plankton and particles across the 7-10 000 µm size range (Romagnan et al., 2015; Lombard et al., 2019), and aims to become a global data source like the NOAA World Ocean Database (https://www.ncei.noaa.gov/products/world-ocean-database, last access: October 2023) and COPEPOD (https://www.st.nmfs.noaa.gov/copepod, last access: October 2023). The objectives for PSSdb were both (1) to implement a workflow able to retrieve counts, sizes, and taxonomic information from online imaging data streams to calculate particle size spectra and (2) to provide multilevel, harmonized products matching the spatio-temporal resolution of current biogeochemical models. Our workflow is programmed in Python and can be fully tuned to specific instruments; spatio-temporal resolutions; and research questions regarding mesoscale plankton distribution, patchiness, short-term trophic dynamics, or diel vertical migration, with little modification. To achieve this, we favored a general framework to estimate size spectra from existing data sources that can also be updated with new data from current and new technologies. Expected products will range from low (bulk particles and planktonic size spectra, presented in this paper) to high taxonomic resolutions, matching the functional groups in biogeochemical models.
Currently, our pipeline includes size spectra estimates from two widespread, synoptic approaches, namely the particle size distribution (PSD) and the normalized biovolume size spectrum (NBSS), developed by ecologists and optic scientists in the mid-1960s and 1990s to summarize and link size structures to ecosystem properties, communities, and ecological processes (Sheldon et al., 1972; Jonasz and Fournier, 1996; Kostadinov et al., 2009; Stemmann et al., 2012; Sprules and Barth, 2016). Both metrics have been adopted to represent the exponential decrease in particle abundance typically observed as size increases, with abundance traditionally expressed as either normalized particle number or biovolume or biomass. This exponential decrease in abundance with size is mostly linear when transformed to a logarithmic scale (Sheldon et al., 1977) unless abiotic or biotic perturbations lead to local peaks of intermediate-sized organisms (Moscoso et al., 2022). Both the slope and intercept of the log-linear regression between particle abundance and size are important indicators of pelagic ecosystem changes (Sprules and Munawar, 1986). They represent the equilibrium between lower and upper trophic levels, which can be indicative of trophic transfer efficiency and the ecosystem carrying capacity, respectively (Zhou, 2006).
In this paper, we present the first version of PSSdb instrument-specific datasets, consisting of bulk size spectra and derived parameters (slope, intercept, and R²) measured by the IFCB, the UVP, and benchtop scanners (e.g., ZooScan) within the epipelagic layer. First, we highlight the large spatio-temporal coverage of our observations before describing the shape of the size spectra and the patterns of their derived parameters. Finally, we discuss how PSSdb provides a way to study the links between plankton community structure and global biogeochemical fluxes and thus to inform the development of biogeochemical and data-driven models.
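The log-linear regression mentioned above can be illustrated with a minimal sketch. It assumes an ordinary least-squares fit on log10-transformed values; the function name, the choice of log base, and the synthetic spectrum are illustrative assumptions, not the fitting procedure used in the PSSdb pipeline.

```python
import math

def fit_log_linear(sizes, abundances):
    """Least-squares fit of log10(abundance) against log10(size).

    Returns (slope, intercept, r_squared). Hypothetical helper for illustration.
    """
    x = [math.log10(s) for s in sizes]
    y = [math.log10(a) for a in abundances]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Coefficient of determination of the simple linear regression
    r_squared = (sxy ** 2) / (sxx * syy) if syy > 0 else 1.0
    return slope, intercept, r_squared

# Synthetic spectrum (made-up values): abundance = 1000 * size**-1
sizes = [10.0, 100.0, 1000.0, 10000.0]
abundances = [1000.0 / s for s in sizes]
slope, intercept, r2 = fit_log_linear(sizes, abundances)
```

With this synthetic input the fitted slope is close to -1 and the intercept close to 3, since the data follow an exact power law.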

Materials and methods
In the following sections, we first highlight the key aspects of data acquisition and pre-processing by the three imaging instruments considered in PSSdb (Sect. 2.1). Then we provide details on the current pipeline for PSSdb ingestion that enables the computation of instrument-specific size spectra, currently available at https://doi.org/10.5281/zenodo.11050013 (Dugenne et al., 2024b) (Sect. 2.2).

Acquisition and pre-processing of imaging datasets
Datasets from several plankton imaging systems were included in PSSdb: the IFCB (Olson and Sosik, 2007), the UVP (UVP5, Picheral et al., 2010, and UVP6, Picheral et al., 2022), and benchtop scanner systems such as the ZooScan (Gorsky et al., 2010) and other generic scanners (Gislason and Silva, 2009). In addition to the detailed description provided in their associated publications, further considerations of these instruments' deployments and operational specifications relevant to the generation of the database are provided in Appendix A. Here, we provide a brief overview of the main principles guiding image acquisition and pre-processing steps, leading to the incorporation of the mentioned imaging datasets in PSSdb.
All instruments were designed to image plankton or particles in situ or in the laboratory based on user-defined thresholds (e.g., minimum size for all instruments, laser-induced fluorescence or scattering for the IFCB, or pixel intensity for the UVP and scanners). Prior to their use, instruments are generally calibrated to ensure that particles detected can be effectively sized (by measuring the pixel size) and counted in a quantitative volume (e.g., calibrated syringe for the IFCB, dimensions of the illuminated frame for the UVP, and flow meters mounted on nets for scanners). Particles that pass these thresholds are then segmented (i.e., the process of identifying target particles from background pixels) in near-real time to produce cropped thumbnails of the regions of interest (ROIs). These thumbnails are automatically saved on the computer piloting the instrument for further processing. Notably, common processing steps across all imaging instruments include the automated identification of pixels enclosing these ROIs (with instrument-specific algorithms) to compute morphometric features, including area or ellipsoidal axis, as well as pixel intensity descriptors. These can be used to train machine learning algorithms which predict taxonomic annotations of the entire set of ROIs, although new classifiers now directly use the thumbnails and extract their own "features". Thumbnails, morphometric features, and potential taxonomic annotations are then all uploaded to online platforms (such as EcoTaxa or EcoPart for scanners and UVPs or dashboards for IFCBs) that are not long-term storage repositories per se but that help to visualize and check incoming datasets or to curate the classifier predictions by taxonomic experts (in the case of EcoTaxa). Importantly, all datasets are typically uploaded with sufficient metadata, comprising the GPS coordinates, sampling time, and camera pixel size and calibrated volume, to support their ingestion in large data aggregation projects like PSSdb. We only selected datasets with taxonomic annotations for the generation of PSSdb to ensure that bulk size spectra did not include methodological artifacts like bubbles or calibration beads and for further work on taxon-specific data products.

Selection of imaging data streams
The first objective of PSSdb is to estimate particle size spectra from plankton imaging devices following the FAIR principles. We thus relied primarily on online and accessible platforms created by the instrument developers to manage their datasets: IFCB dashboards (of generation 2 exclusively, as generation 1 does not include metadata like longitude and latitude) and EcoTaxa or EcoPart, developed for ZooScan and UVP but also for the IFCB and other imaging systems since a few IFCB datasets ingested in PSSdb were available on EcoTaxa. IFCB dashboards are deployed by individual labs or regional networks with specific URLs and are publicly accessible. Conversely, EcoTaxa datasets are not accessible by default; thus, data providers who wanted to contribute to the PSSdb were asked to provide access to their projects.
Both IFCB dashboards and EcoTaxa contain sample metadata; raw images; their related morphometric measurements; and, optionally, their taxonomic annotation. To ensure that size distributions were representative of living (i.e., planktonic and micro-nektonic organisms) and non-living particles (i.e., marine snow, fecal pellets) only, we selected datasets with predicted and/or curated image classification. This allowed for the exclusion of particles labeled as artifacts, bubbles, calibration beads, microplastics, and others. Of the 37 datasets on the IFCB dashboards and the 3290 UVP, scanner, and IFCB datasets on EcoTaxa (last checked in October 2023), only 6 projects from IFCB dashboards and 250 projects from EcoTaxa were downloaded and integrated into the first PSSdb release products. The list of the datasets (and their URLs) that are ingested in the first PSSdb release can be downloaded at https://doi.org/10.5281/zenodo.11050013 (Dugenne et al., 2024b) from the "data sources" spreadsheet included in the compressed release data files. The dataset list was generated automatically using the EcoTaxa and IFCB dashboards' application programming interface (API), which also provides fast and automatic access to both data (morphometric measurements and taxonomic annotation) and metadata.

Extraction of imaging data streams
All functions to list (funcs_list_projects.py) and export (funcs_export_projects.py) datasets from IFCB dashboards and EcoTaxa or EcoPart automatically are available at https://github.com/jessluo/PSSdb/tree/main/scripts (last access: April 2024). To export IFCB datasets, sample-specific queries to the IFCB dashboards are executed sequentially to retrieve sample metadata such as location, time, and depth plus the morphometric measurements of individual ROIs stored in the "features" files and the top five taxonomic predictions stored in the "autoclass" files. Metadata, feature, and autoclass files are then combined in a single master table, with a row for each ROI, and are saved into multiple files comprising ∼ 500 000 rows to limit the size of the exports and the processing time.
Scanner and UVP datasets were automatically exported from EcoTaxa using the API with the default option. This option retrieves all the information relative to individual ROIs (e.g., area, taxonomic annotation) and samples (e.g., location, depth, time), as well as specific acquisition (e.g., size fraction scanned and associated volume) and processing (e.g., pixel calibration factor) steps.
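The combination of metadata, feature, and autoclass records into one master table split over several files can be sketched as follows. This is only an illustration under stated assumptions: the column names, the dictionary layout of the inputs, and both helper functions are hypothetical and do not reproduce funcs_export_projects.py, and the demonstration uses a chunk size of 2 rows instead of ∼ 500 000.

```python
import csv
import os
import tempfile

HEADER = ["sample_id", "latitude", "longitude", "roi_id", "area", "top_class"]

def build_master_rows(metadata, features, autoclass):
    """Build one row per ROI from sample metadata, features, and predictions."""
    rows = []
    for roi_id, feature in features.items():
        rows.append([
            metadata["sample_id"], metadata["latitude"], metadata["longitude"],
            roi_id, feature["area"], autoclass.get(roi_id, ""),
        ])
    return rows

def write_in_chunks(rows, header, out_dir, prefix, max_rows=500_000):
    """Write rows to numbered CSV files holding at most max_rows rows each."""
    paths, handle, writer, count = [], None, None, 0
    for row in rows:
        if writer is None or count == max_rows:
            if handle is not None:
                handle.close()
            path = os.path.join(out_dir, f"{prefix}_{len(paths) + 1}.csv")
            handle = open(path, "w", newline="")
            writer = csv.writer(handle)
            writer.writerow(header)
            paths.append(path)
            count = 0
        writer.writerow(row)
        count += 1
    if handle is not None:
        handle.close()
    return paths

# Made-up sample with five ROIs, written in chunks of two rows
metadata = {"sample_id": "demo_sample", "latitude": 41.3, "longitude": -70.6}
features = {f"roi_{i}": {"area": 100.0 + i} for i in range(5)}
autoclass = {"roi_0": "diatom", "roi_3": "ciliate"}
rows = build_master_rows(metadata, features, autoclass)
paths = write_in_chunks(rows, HEADER, tempfile.mkdtemp(), "ifcb_demo", max_rows=2)
```

Each output file repeats the header, so any chunk can be read independently of the others.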
To further retrieve the size and count information of small particles processed by UVPs in real time, which are only uploaded to EcoPart, we wrote a custom script based on existing web-scraping Python modules (function "EcoPart_export" at https://github.com/jessluo/PSSdb/tree/main/scripts/funcs_export_projects.py, last access: April 2024). We selected the "raw" export option for all datasets hosted on EcoPart rather than the default export option, which provides summary statistics consisting of the summed particle counts and biovolume in individual size bins, computed in 5 m depth bins. With the raw export option, we were able to retrieve the number of particles (column "nbr") of a given pixel-based size measurement (column "area"), as well as the number of individual image frames (column "imgcount", used to calculate the cumulative volume) in 1 m depth bins. This strategy has multiple advantages as it allows the conversion of pixels into metric area estimates using either the power-law function described in Appendix A2 or a fixed pixel size. It also allows for the construction of size spectra using custom size bins and for an assessment of the uncertainty of the size spectra estimates using the bootstrap approach published by Schartau et al. (2010).
Using a pair of identifiers allowing us to link each UVP dataset uploaded on EcoPart to its corresponding EcoTaxa ID, the datasets on both platforms were consolidated into a single table to account for all particles detected by the UVP. Since EcoPart raw data files are summarized in 1 m depth bins, it is impossible to link a specific area estimate to the corresponding EcoTaxa vignette and, thus, its taxonomic annotation. To consolidate data for all particles in 1 m depth bins without losing further information and without including the same particle twice, we used the threshold for vignette generation to select particles with and without a taxonomic annotation (particles larger than ∼ 910 or 690 µm in equivalent circular diameter (ECD) for the UVP5 and UVP6, respectively). The consolidated UVP data files thus include the area estimates from all particles smaller than this threshold, which are assigned an empty taxonomic annotation, along with the area and taxonomic annotations of each ROI stored in EcoTaxa data files, whose sampling depth precision is reduced to the resolution of EcoPart data files (i.e., 1 m bin levels). All metadata for the sampling locations, depth range, and pixel size were merged into this unique table using the metadata file exported from EcoPart.
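The consolidation logic described above can be sketched in a few lines. This is a simplified illustration, assuming a fixed pixel size (rather than the power-law conversion of Appendix A2), integer 1 m depth bins obtained by flooring, and hypothetical record layouts; only the column names "area" and "nbr" come from the text.

```python
import math

def ecd_um(area_px, pixel_size_um):
    """Equivalent circular diameter (µm) from a pixel area and a fixed pixel size."""
    return 2.0 * math.sqrt(area_px * pixel_size_um ** 2 / math.pi)

def consolidate_uvp(ecopart_rows, ecotaxa_rois, pixel_size_um, threshold_um):
    """Merge small EcoPart particles with annotated EcoTaxa ROIs into one table."""
    table = []
    # EcoPart: keep only particles below the vignette threshold, without annotation
    for row in ecopart_rows:
        if ecd_um(row["area"], pixel_size_um) < threshold_um:
            table.append({"depth_bin": row["depth_bin"], "area": row["area"],
                          "count": row["nbr"], "annotation": ""})
    # EcoTaxa: keep every ROI, reducing depth precision to the 1 m bin
    for roi in ecotaxa_rois:
        table.append({"depth_bin": math.floor(roi["depth"]), "area": roi["area"],
                      "count": 1, "annotation": roi["annotation"]})
    return table

# Made-up records: pixel size of 100 µm and a 910 µm threshold (UVP5 value from the text)
ecopart_rows = [
    {"depth_bin": 12, "area": 10, "nbr": 5},   # ECD of about 357 µm, kept
    {"depth_bin": 12, "area": 100, "nbr": 1},  # ECD of about 1128 µm, dropped
]
ecotaxa_rois = [{"depth": 12.7, "area": 100, "annotation": "Copepoda"}]
table = consolidate_uvp(ecopart_rows, ecotaxa_rois, 100.0, 910.0)
```

Dropping the large EcoPart particle and keeping the matching EcoTaxa ROI is what prevents the same particle from being counted twice.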

Standardization of imaging datasets
Since raw datasets exported from the API queries are generated with different formats, with specific headers and units, we developed instrument-specific "standardizer" spreadsheets in order to re-format all datasets to the same standard. Each spreadsheet contains the dataset IDs for a given instrument, including the pair of IDs required to consolidate UVP datasets (see Sect. 2.2.2) and the information required for the standardization and quality controlling of these datasets. The dataset ID lists are generated automatically, but the data information (headers and units) is manually filled to map the native headers and units of the data files to standard names (following the variablename_field nomenclature) and units (following the variablename_unit nomenclature). After listing and exporting all datasets from EcoTaxa, EcoPart, or IFCB dashboards, a member of PSSdb thus enters the name and corresponding unit found in the native export files into each variable needed in future steps of the pipeline so that they can be mapped and converted into the standards defined in the product documentation. This mapping and conversion is then done automatically using the script developed for the standardization (https://github.com/jessluo/PSSdb/blob/main/scripts/2_standardize_projects.py, last access: April 2024). The spreadsheets can be downloaded at https://github.com/jessluo/PSSdb/tree/main/raw (last access: April 2024) under project_Instrument_standardizer.xslx.
The mapped variables include longitude, latitude; sampling time (with time format); minimum and maximum sampling depth; volume sampled and potential dilution or concentration factors; the lower and upper sample size limit; and optional additional metadata describing the sampling effort, protocol or downstream processing, the pixel size, and the ROI size estimates with taxonomic annotation. In the case of size-fractionated samples, the sampling size limits were determined by the mesh or filter sizes. Otherwise, the dimensions of the imaging frame are used to specify the theoretical upper size range imaged by the device. ROI size estimates may include biovolume, area, or ellipsoidal axis for comparison. However, the size spectra for PSSdb were all computed using ROI area for consistency across devices since not all imaging instruments provide biovolume estimates and the derived equivalent spherical diameter (see Sect. 2.2.6 for more details). In addition, the value(s) for "not available" or NA were specified, if necessary, since we found some inconsistencies in the values reported, particularly for datasets generated by Zooprocess (i.e., UVP and scanner datasets), depending on the software version used, but also across variables for the same dataset. While the standardizer spreadsheet needs to be filled manually, we found this approach to be optimal to account for the variable formats of existing and future datasets, both accessible online or directly sent to us.
Native units, defined in the standardizer spreadsheets, are converted to standard units using the Python package Pint (https://pypi.org/project/Pint/, last access: October 2023), designed to define, operate, and manipulate physical quantities based on units from the International System or defined in a custom text file. Custom units defined for PSSdb included the pixels per micrometer and pixels per millimeter used to convert pixel-based size estimates to metric-based estimates (https://github.com/jessluo/PSSdb/blob/main/scripts/units_def.txt, last access: April 2024). After standardization, an interactive report is generated to check that units were correctly assigned by displaying the NBSS computed according to Sect. 2.2.6 and the average particle size and/or concentration for individual samples. PSSdb developers can then check that both the size range and the overall concentration recorded are consistent with the particle size targeted by specific instruments (Lombard et al., 2019). This step ensures that the file format and units in all data files are consistent, enabling the further merging of the data in the following PSSdb workflow steps.
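The header mapping and NA handling described above can be illustrated with a small sketch. The mapping dictionary stands in for one row of a standardizer spreadsheet; the native header names, the standard names, and the NA markers used here are hypothetical examples, and the function is not the one implemented in 2_standardize_projects.py.

```python
def standardize_rows(native_rows, field_map, na_values):
    """Rename native headers to standard names and turn NA markers into None."""
    standardized = []
    for row in native_rows:
        out = {}
        for standard_name, native_name in field_map.items():
            value = row.get(native_name)
            out[standard_name] = None if value in na_values else value
        standardized.append(out)
    return standardized

# Hypothetical mapping from standard names to the headers of a native export
field_map = {
    "Latitude": "object_lat",
    "Longitude": "object_lon",
    "Area": "object_area",
}
# NA markers can differ between datasets and even between variables
na_values = {"", "nan", "99999"}

native_rows = [
    {"object_lat": "43.68", "object_lon": "7.31", "object_area": "1520"},
    {"object_lat": "43.68", "object_lon": "7.31", "object_area": "99999"},
]
standardized = standardize_rows(native_rows, field_map, na_values)
```

Keeping the mapping in data rather than in code is what lets a new dataset format be supported by filling in a spreadsheet row instead of editing the script.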

Quality controlling of imaging datasets
After morphometric measurements, taxonomic annotations, and metadata from the imaging data streams are downloaded (see Sect. 2.2.2); the standardizer spreadsheets are filled (see Sect. 2.2.3); and all datasets are standardized, a quality control (QC) check is performed on individual IFCB, UVP, and scanner samples. The objective of this step is to ensure the good quality of the datasets ingested in PSSdb by automatically flagging individual samples whose size spectrum computation was either impossible (missing required information) or biased (incorrect GPS coordinates, pixel size, or low ROI number). We used a Boolean factor to characterize each flag, assigning 0 (false) to non-flagged samples that passed the quality control and 1 (true) to flagged samples. Currently, seven criteria are checked during the QC, and the overall flag is assigned 0 if the sum of the individual flags equals 0; otherwise, it is assigned a value of 1.
The first flagging criterion stands for missing required data or metadata, as specified in the standardizer spreadsheets. Second and third, GPS coordinates are checked to verify whether they are located on land, according to the georeferenced Global Oceans and Seas dataset (version 1 automatically downloaded from https://www.marineregions.org/, last access: October 2023), or located at 0° × 0° latitude and longitude, which sometimes indicates that this information has not been filled correctly. Fourth, to determine whether the number of ROIs (n) in a sample was sufficient to accurately estimate a size spectrum, we estimated count uncertainty assuming that particle detection followed a Poisson distribution (Schartau et al., 2010; Bisson et al., 2022; Haëntjens et al., 2022). According to this distribution, the accuracy of ROI counts decreases significantly with lower count numbers n. We could thus estimate the probability of effectively observing n ROIs given that the mean occurrence (the main parameter of the Poisson distribution) was equal to n, and we assigned a flag to samples whose ROI counts yielded more than 5 % uncertainty. Fifth, the percentage of manual taxonomic annotations (verified by a human expert) is calculated in order to flag samples that are less than 95 % validated. This criterion is only applied to scanners and UVP datasets as the larger number of IFCB images per sample makes it more difficult to manually validate automated classifications. Sixth, the percentage of artifacts per sample is evaluated using the predicted or validated annotations so that any sample with 20 % or more artifacts is flagged. Finally, samples with multiple pixel size factors are also flagged since we do not expect the camera to be re-calibrated or replaced during deployment.
After the completion of the QC, a table summarizing individual samples with their flags and an interactive report providing an overview of the samples flagged for each dataset are automatically generated. The interactive report is checked by PSSdb developers and sent to the data providers for an overview of the dataset sample locations, the number of ROIs, the percentages of validations and/or artifacts per sample, and the overall percentage of flagged samples. Hyperlinks are inserted in the interactive report to verify the sample information directly from the data source. Flags may be overruled by the data provider if they consider a sample to be suitable (or not) for incorporation into PSSdb. For example, samples that have been size-fractionated could record a low ROI number, samples with a high percentage of artifacts may not necessarily be completely biased, and low validation may be acceptable if all artifacts have been correctly identified.
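A simplified sketch of the flagging logic follows. Several choices here are assumptions made for illustration only: the land check is omitted because it requires the georeferenced dataset, the count criterion uses the relative standard deviation of a Poisson count (1/sqrt(n)) as a stand-in for the uncertainty estimate described above, and the field names of the sample record are hypothetical.

```python
import math

REQUIRED = ("latitude", "longitude", "volume", "pixel_sizes", "roi_count")

def relative_count_uncertainty(n):
    """Relative standard deviation of a Poisson count with mean n (assumed proxy)."""
    return float("inf") if n <= 0 else 1.0 / math.sqrt(n)

def qc_sample(sample, check_validation=True):
    """Return a dict of 0/1 flags; 'overall' is 1 as soon as any flag is raised."""
    missing = any(sample.get(key) is None for key in REQUIRED)
    flags = {"missing": int(missing)}
    if missing:
        # The other criteria cannot be evaluated without the required fields
        flags["overall"] = 1
        return flags
    flags["zero_coordinates"] = int(sample["latitude"] == 0 and sample["longitude"] == 0)
    flags["low_count"] = int(relative_count_uncertainty(sample["roi_count"]) > 0.05)
    # Validation is only checked for scanner and UVP samples
    flags["low_validation"] = int(check_validation
                                  and sample.get("validated_percent", 0.0) < 95.0)
    flags["artifacts"] = int(sample.get("artifact_percent", 0.0) >= 20.0)
    flags["multiple_pixel_sizes"] = int(len(set(sample["pixel_sizes"])) > 1)
    flags["overall"] = int(sum(flags.values()) > 0)
    return flags

# Made-up samples
good = {"latitude": 43.7, "longitude": 7.3, "volume": 10.0, "pixel_sizes": [10.6],
        "roi_count": 1000, "validated_percent": 100.0, "artifact_percent": 5.0}
bad = dict(good, roi_count=100, artifact_percent=25.0)
good_flags = qc_sample(good)
bad_flags = qc_sample(bad)
```

Under the 1/sqrt(n) assumption, the 5 % limit corresponds to a minimum of 400 ROIs per sample; the threshold implied by the actual pipeline may differ.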

Binning of imaging datasets
After standardization and QC, we first selected datasets where the sampling depth was between 0 and 200 m (Fig. A1). Samples were aggregated spatially in cells of 0.5° × 0.5° latitude and longitude and temporally per week. This data aggregation approach allowed us (1) to increase the overall volume analyzed per sample, which increases the number of particles observed and decreases the instrumental detection limit, and (2) to avoid the over-representation of data from fixed time series stations with high temporal sampling compared to co-located "snapshot" samples in a given grid cell. The size spectra calculations described in the next section were performed on these weekly, 0.5° × 0.5° samples. Since, unavoidably, some weeks of a year might be shared between 2 months, we assigned each such week to the month that counted the most samples. This approach prevented the creation of duplicate weekly samples per year. The final data products included in the first release (1a: bulk normalized biovolume or abundance per size bin; 1b: slopes, intercepts, and determination coefficients of the size spectra) are reported as monthly, 1° × 1° grid averages, such that each mean size spectrum, slope, and intercept had a maximum sample size of 16, the product of four 0.5° × 0.5° subcells in a 1° × 1° cell and 4 weeks per month. As mentioned above, reporting monthly, 1° × 1° grid parameter averages from the subgrid values instead of calculating directly the size spectra for these larger bins prevents a certain location or time series with a higher number of samples from skewing the size spectra estimate, especially in a 1° × 1° cell that contains both open-ocean sites sampled during research cruise(s) and coastal time series sites.
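The two-level aggregation described above can be sketched as follows. The sketch assumes that cells are labeled by their lower-left corner and that weeks follow the ISO calendar; both are illustrative conventions, and all function names are hypothetical.

```python
import math
from collections import defaultdict
from datetime import date

def half_degree_cell(lat, lon):
    """Lower-left corner of the 0.5° × 0.5° cell containing (lat, lon)."""
    return (math.floor(lat * 2) / 2, math.floor(lon * 2) / 2)

def weekly_bin_key(lat, lon, day):
    """Key of the weekly, 0.5° × 0.5° bin (ISO year and week assumed)."""
    iso_year, iso_week, _ = day.isocalendar()
    return half_degree_cell(lat, lon) + (iso_year, iso_week)

def month_of_week(sample_dates):
    """Assign a week shared between 2 months to the month with the most samples."""
    counts = defaultdict(int)
    for d in sample_dates:
        counts[(d.year, d.month)] += 1
    return max(counts, key=counts.get)

def monthly_mean(weekly_values):
    """Average weekly, 0.5° values (e.g., slopes) per 1° × 1° cell and month."""
    groups = defaultdict(list)
    for (lat, lon, year, month), value in weekly_values:
        groups[(math.floor(lat), math.floor(lon), year, month)].append(value)
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}

# Made-up samples from a week straddling January and February 2020
dates = [date(2020, 1, 30), date(2020, 1, 31), date(2020, 2, 1)]
key = weekly_bin_key(43.68, 7.31, dates[0])
month = month_of_week(dates)
means = monthly_mean([((43.5, 7.0, 2020, 1), -1.0), ((43.0, 7.5, 2020, 1), -1.2)])
```

Averaging the weekly subcell parameters, rather than pooling all particles of the 1° × 1° cell, gives each subcell and week equal weight regardless of how many samples it contains.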

Computation of bulk particle size spectra and regression parameters from binned, instrument-specific datasets
The particle size classes used in PSSdb were previously defined in Kiko et al. (2022). These are logarithmically spaced using a base of 2 and an increment of 1/3 so that a doubling in equivalent circular diameter (ECD) is observed in every third bin (equivalent to a doubling in biovolume observed in every bin), with a range of 1-50 000 000 µm.
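As a rough illustration of this spacing, the sketch below generates ECD class edges as powers of 2 with an exponent step of 1/3. The function name `ecd_bin_edges` and its parameters are hypothetical, and only the first few classes are generated.

```python
import math


def ecd_bin_edges(min_um=1.0, n_bins=9):
    """Edges (µm) of ECD classes spaced as powers of 2 with an exponent step of 1/3."""
    return [min_um * 2 ** (k / 3) for k in range(n_bins + 1)]


def sphere_volume(ecd_um):
    """Biovolume (µm^3) of a sphere with the given diameter."""
    return math.pi / 6.0 * ecd_um ** 3


edges = ecd_bin_edges()
# ECD doubles every third edge, and the equivalent spherical biovolume
# doubles from one edge to the next.
ecd_ratio = edges[3] / edges[0]
volume_ratio = sphere_volume(edges[1]) / sphere_volume(edges[0])
```

Because biovolume scales with the cube of the diameter, a factor of 2^(1/3) in ECD corresponds to a factor of 2 in biovolume, which is why the two statements in the text are equivalent.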
The diameter of each particle, with the exception of artifacts, which are excluded from the size spectra computation, was estimated from its area according to Eq. (1) and then converted into biovolume assuming a sphere of that diameter, following Eq. (2).
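Equations (1) and (2) are not reproduced in this text. A minimal sketch of the conversion, assuming the standard geometric definitions (the diameter of the circle with the same area as the particle, then the volume of a sphere of that diameter), could look as follows; the function names are hypothetical.

```python
import math


def ecd_from_area(area_um2):
    """Equivalent circular diameter (µm) of a particle with the given area (µm^2)."""
    return 2.0 * math.sqrt(area_um2 / math.pi)


def sphere_biovolume(ecd_um):
    """Biovolume (µm^3) of a sphere whose diameter is the ECD (µm)."""
    return math.pi / 6.0 * ecd_um ** 3


# A particle whose imaged area equals that of a circle of radius 5 µm.
area = math.pi * 5.0 ** 2
ecd = ecd_from_area(area)
biovolume = sphere_biovolume(ecd)
```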
Area-based biovolume, rather than the more widely used distance map estimates for IFCB datasets (Moberg and Sosik, 2012) and ellipsoidal fits for scanner and UVP datasets (Dubois et al., 2022), was selected to keep the size spectra calculations consistent across instruments. However, a sensitivity analysis of the slopes and intercepts as a function of the different size proxies (ellipsoidal, distance map, and area-based biovolume) is presented in Fig. A2. Despite some differences in size spectra thresholding, likely due to elongated particles being assigned to different size classes (Fig. A2a, b and c), our sensitivity analysis does not show any substantial differences between size spectra parameters from different biovolume estimates (Fig. A2d, e and f). This aligns with previous comparisons of elliptical or spherical biovolume-derived size spectra, which found little or no statistical difference between these estimates (Vandromme et al., 2012; Dubois et al., 2022).
The database includes size spectra calculated by two widely used methods: the normalized biovolume size spectrum (NBSS), routinely reported in zooplankton studies (e.g., Zhang et al., 2019; Grandrémy et al., 2023c), and the particle size distribution (PSD), calculated from particle counters (broadly) (Kiko et al., 2022) or derived from satellite algorithms (Kostadinov et al., 2009). For NBSS, the normalized biovolume (NB) (µm³ L⁻¹ µm⁻³) for each biovolume size class (i) in a sample (0.5° × 0.5° grid cell, grouped by week) was calculated as the summed biovolume (µm³), normalized by the cumulative volume sampled (L) and the biovolume bin width (µm³), as in Eq. (3):
$$\mathrm{NB}_i = \frac{\sum \mathrm{Biovolume}_i}{\mathrm{Volume} \cdot \Delta \mathrm{Biovolume}_i} \tag{3}$$
For PSD, the normalized abundance (NA) (number of particles L⁻¹ µm⁻¹) for each size class (i) in a sample was calculated as the total number of particles (N) in ECD size class i, normalized by the cumulative volume sampled (L) and the ECD bin width (µm), as in Eq. (4):
$$\mathrm{NA}_i = \frac{N_i}{\mathrm{Volume} \cdot \Delta \mathrm{ECD}_i} \tag{4}$$
Retrieved size spectra were generally biased at the lower and upper size limits (Fig. 1e). At the lower end, the main bias is due to the sample collection method (e.g., mesh of the net) or the segmentation threshold (e.g., minimum area or mean gray level), which randomly excludes small particles, such that the closer the particles are to the camera resolution, the less likely they are to be imaged and segmented. At the higher end, imaging systems overestimate larger, rarer particles whose concentrations are close to the instrument detection limit, as determined by the imaging volume. As a result, size spectra would typically display an inflection point at the lower size limit and remain quasi-constant (i.e., flatter) at the upper size limit. The unbiased portion of the size spectrum was identified before computing the size spectra slopes and intercepts by log-linear regression. To do so, we first excluded data from size classes with either a size measurement or particle count uncertainty greater than 20 %, assuming Gaussian and Poisson error distributions, respectively. These distributions are based on the statistical analysis developed by Schartau et al. (2010) to quantify the size spectrum uncertainties, which assumes that size measurement uncertainties follow a Gaussian distribution with a variance equal to the camera resolution and that the uncertainty of effectively observing ROIs given a similar occurrence of particles within the volume sampled follows a Poisson distribution. According to this distribution, counting four or fewer ROIs in each size bin would yield an uncertainty greater than 20 %. We thus reset the normalized biovolume or abundance of size classes with four or fewer ROIs, mainly larger size classes, as empty size classes and selected the upper size limit as the largest size class before observing three consecutive empty size classes. Our choice for the upper size limit definition was a compromise between unnecessarily excluding large organisms and including too many large bin values that would bias the size spectra calculation towards flatter slopes. Next, we selected the size bin of the maximum normalized biovolume or abundance value as the lower size limit. It is important to clarify that this thresholding is applied to the weekly, 0.5° × 0.5° bins; thus, it is possible that 1a products present low normalized biovolume or abundance values at the lower end if the smallest size class is present in only some sub-bins. After selection, size spectra followed a power-law function in the form of Eq. (5), with a log-transformation resulting in a linear equation of the form described in Eq. (6):
$$\mathrm{NB}_i = I \cdot \mathrm{Biovolume}_i^{\,b} \tag{5}$$
$$\log_{10}(\mathrm{NB}_i) = \log_{10}(I) + b \cdot \log_{10}(\mathrm{Biovolume}_i) \tag{6}$$
The slope (b, L⁻¹ µm⁻³ for NBSS and L⁻¹ µm⁻² for PSD), intercept (I, µm³ L⁻¹ µm⁻³ for NBSS and # L⁻¹ µm⁻¹ for PSD), and coefficient of determination (R²) of the size spectra were computed by log-linear regression following Eq. (6).
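The normalization and the selection of the unbiased size range described above can be sketched as follows. This is an illustrative reading of the text, not the PSSdb implementation: the function names are hypothetical, the scan for the upper limit is assumed to start at the first non-empty class, and the lower limit is assumed to be searched only among the classes at or below the upper limit.

```python
def normalized_biovolume(biovolumes_um3, bin_edges_um3, volume_l):
    """Return (NB per class, particle count per class).

    NB is the summed biovolume of a class divided by the sampled volume (L)
    and by the biovolume width of the class (µm^3).
    """
    n = len(bin_edges_um3) - 1
    sums, counts = [0.0] * n, [0] * n
    for v in biovolumes_um3:
        for i in range(n):
            if bin_edges_um3[i] <= v < bin_edges_um3[i + 1]:
                sums[i] += v
                counts[i] += 1
                break
    widths = [bin_edges_um3[i + 1] - bin_edges_um3[i] for i in range(n)]
    nb = [s / (volume_l * w) for s, w in zip(sums, widths)]
    return nb, counts


def select_unbiased_range(nb, counts, min_count=5, max_gap=3):
    """Return (lower, upper) class indices kept for the regression, or None."""
    # Classes with four or fewer ROIs are treated as empty.
    values = [x if c >= min_count else None for x, c in zip(nb, counts)]
    filled = [i for i, x in enumerate(values) if x is not None]
    if not filled:
        return None
    # Upper limit: last non-empty class before three consecutive empty classes.
    upper, gap = filled[0], 0
    for i in range(filled[0] + 1, len(values)):
        if values[i] is None:
            gap += 1
            if gap >= max_gap:
                break
        else:
            upper, gap = i, 0
    # Lower limit: class holding the maximum value at or below the upper limit.
    kept = [i for i in filled if i <= upper]
    lower = max(kept, key=lambda i: values[i])
    return lower, upper


nb_demo, counts_demo = normalized_biovolume([1.5, 1.5, 3.0], [1.0, 2.0, 4.0], 2.0)
limits = select_unbiased_range(
    [1.0, 5.0, 3.0, 2.0, 1.0, 0.5, 0.4, 0.3, 0.2],
    [10, 50, 30, 20, 3, 2, 1, 6, 7],
)
```

In the second example, the three sparsely populated classes after index 3 form the gap that sets the upper limit, so the two populated classes beyond the gap are left out of the fit.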
An easy way to interpret the intercept values is that they refer to the normalized biovolume and abundance predicted for standard 1 µm³ and 1 µm particles, respectively.
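The log-linear regression of Eq. (6) and this reading of the intercept can be illustrated with an ordinary least-squares fit on log10-transformed values. The sketch assumes ordinary least squares (the text only says "log-linear regression"), and the function name `fit_size_spectrum` is hypothetical.

```python
import math


def fit_size_spectrum(sizes, normalized_values):
    """Least-squares fit of log10(value) = log10(I) + b * log10(size).

    Returns (slope b, log10 of the intercept I, R^2).
    """
    x = [math.log10(s) for s in sizes]
    y = [math.log10(v) for v in normalized_values]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    log_intercept = my - slope * mx
    # R^2 is undefined when all values are identical.
    r2 = sxy ** 2 / (sxx * syy) if syy > 0 else float("nan")
    return slope, log_intercept, r2


# Synthetic spectrum following NB = 4e7 * Biovolume^-1 exactly.
sizes = [10.0, 100.0, 1000.0, 10000.0]
values = [4e7 / s for s in sizes]
slope, log_intercept, r2 = fit_size_spectrum(sizes, values)
```

Since log10(1) = 0, the fitted line evaluated at a size of 1 µm³ (or 1 µm for PSD) returns the intercept itself, which is why the intercept can be read as the predicted value for a unit-size particle.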
https://doi.org/10.5194/essd-16-2971-2024 Earth Syst. Sci. Data, 16, 2971-2999, 2024
All products (1a: size spectra, 1b: regression parameters) generated are subjected to an additional QC to provide a flag (0 if a spatio-temporal bin passed the QC, 1 otherwise) that can help data users filter out questionable data. The current QC is based on three criteria, whereby a positive flag is assigned to (1) slope values outside the mean ± 3 standard deviations of each instrument-specific product, (2) a spectrum that records only four or fewer non-empty size classes, and (3) a log-linear fit with R² ≤ 0.8.
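The three criteria can be combined into a single flag as sketched below. The record layout (`slope`, `n_classes`, `r2`) and the function name are hypothetical, and the sample standard deviation is assumed, since the text does not say which estimator is used.

```python
from statistics import mean, stdev


def flag_products(products):
    """Return one flag per product: 0 if it passes the QC, 1 otherwise.

    `products` holds the records of a single instrument-specific product,
    each a dict with keys 'slope', 'n_classes' and 'r2' (hypothetical layout).
    """
    slopes = [p["slope"] for p in products]
    mu, sd = mean(slopes), stdev(slopes)  # sample standard deviation (assumption)
    flags = []
    for p in products:
        outlier = abs(p["slope"] - mu) > 3 * sd  # criterion (1)
        too_few = p["n_classes"] <= 4            # criterion (2)
        poor_fit = p["r2"] <= 0.8                # criterion (3)
        flags.append(int(outlier or too_few or poor_fit))
    return flags


products = [
    {"slope": -1.00, "n_classes": 10, "r2": 0.95},
    {"slope": -1.10, "n_classes": 3, "r2": 0.90},   # too few size classes
    {"slope": -0.90, "n_classes": 12, "r2": 0.70},  # poor linear fit
    {"slope": -1.05, "n_classes": 8, "r2": 0.99},
]
flags = flag_products(products)
```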

Spatio-temporal coverage of imaging datasets
A total of 92 472 individual IFCB samples are included in the first release of PSSdb, which benefited from long-term time series collected at a 20 min frequency (Table 1). In comparison, the UVP and scanner datasets comprise fewer profiles and nets, with a total of 3068 profiles and 2411 net samples, respectively. These datasets span most major ocean basins, although all basins are undersampled in the Southern Hemisphere. IFCB datasets that have been ingested in our database were all restricted to the middle to high latitudes of the Northern Hemisphere (Fig. 2, Table 1). Further, the majority of IFCB samples are located on the shelf of the eastern and western United States due to the presence of long-term time series sites of the California Ocean Observing System and the Northeast US Shelf Long-Term Ecological Research programs (Fig. 2a). UVP and scanner datasets are distributed more evenly across the ocean basins, mostly due to the Tara Ocean (2009-2012) and Tara Polar Circle (2013) global expeditions, even though specific monitoring programs increased the density of samples in the tropical Atlantic, the eastern temperate Atlantic, and the Mediterranean Sea (Fig. 2b, c). These monitoring programs resulted in a large temporal coverage of the three instrument-specific datasets, with repeated observations sustained for periods of 10-25 years (Fig. 2d, e, f; Table 1). Notably, the scanners show the largest temporal coverage, from 1996 to 2022, by including samples collected at the long-term monitoring sites located in the Bay of Villefranche-sur-Mer and the Bay of Biscay (France). The gap observed between 1998 and 2003 is caused by the exclusion of samples that had not been validated to at least 95 %. This high-frequency dataset affected the monthly variability of the scanners' sample density, shown in Fig. 2e, since the Bay of Biscay monitoring program only takes place in May (Grandrémy et al., 2023a). UVP datasets have the second longest coverage, with observations collected between 2008 and 2021 (Fig. 2f). In PSSdb's first release, the climatology of UVP sampling density is slightly biased towards spring months (March, April, and May); however, this may not reflect actual sampling efforts as UVP images also need to be more than 95 % validated to be ingested in PSSdb. This threshold is not applied to IFCB datasets, which comprise too many images to be manually curated, yet the datasets also show a strong bias towards the summer months (June, July, August). This bias reflects the sampling strategy of both the NESLTER broad-scale program, limited to the summer months, and the CalOOS sampling program, both of which partially operate with IFCBs serviced during the wintertime to avoid damage. IFCB has been routinely deployed at the Martha's Vineyard Coastal Observatory since 2006; however, only samples from 2013 and after were included in PSSdb as previous observations did not include taxonomic predictions, which were required to filter artifacts out of the data products (Table 1; Fig. 2d).

Size spectra obtained from individual imaging devices
The IFCB effectively detects and images plankton and detrital particles in the nano and micro size fractions. This size range is supplemented by UVP and scanner datasets, which include predominantly living microplankton and mesoplankton (Table 2). We used two metrics to evaluate pelagic size structure from plankton imaging devices: the NBSS, computed with normalized biovolume (Eq. 3), and the PSD, computed with normalized abundance (Eq. 4). Both metrics showed similar patterns, resulting in high correlations between the fitted parameters, namely the NBSS and PSD slopes (r = 0.99), intercepts (r = 0.99), and determination coefficient R² (r = 0.99) (Fig. 3). For simplicity, we further describe observed patterns of the instrument-specific size spectra parameters derived from NBSS only since both PSD and NBSS co-vary. However, all patterns and trends described in the following sections, including in the discussion, hold for the PSD releases.
Global size spectra slopes and intercepts were relatively consistent between instruments, with average values of ∼ −1 L⁻¹ µm⁻³ and ∼ 10^7.6 µm³ L⁻¹ µm⁻³ (corresponding to an approximate concentration of 4 × 10⁷ µm³ L⁻¹ µm⁻³ for particles of 1 µm³), respectively (Table 2, Fig. 3). The UVP's size spectra presented an intercept slightly above that of the IFCB and scanners, given the additional particles the UVP can detect in situ, with overall higher R² estimates, although relatively large R² values were observed across all instruments (Table 2, Fig. 3).

Size spectra regression fit, slopes, and intercepts
In addition to the average, instrument-specific differences reported above, PSSdb allows for an exploration of the spatial and temporal variations in the NBSS and PSD (not shown since they co-vary with NBSS) for individual instruments. Figure 4 shows the average NBSS slopes, intercepts, and R² values obtained for each grid cell in the global ocean. Despite their similar size targets, there were substantial differences in the global distribution of NBSS slopes and intercepts derived from the three imaging approaches. Indeed, while the majority of the slopes were around −1 L⁻¹ µm⁻³, the scanner slopes showed no clear variation with space. Meanwhile, the UVP slopes tended to show steeper size spectra within oligotrophic gyres and flatter size spectra in the northernmost latitudes or by the coasts (Fig. 4c). This pattern was inverted with regard to the intercepts as the abundance of 1 µm³ particles was lower in the Arctic and increased near the shore (Fig. 4f). Likewise, the IFCB NBSS slopes were indicative of flatter size spectra, with lower intercepts, in the northernmost latitudes and along the eastern coast of the United States compared to the western coast (Fig. 4a, d). The determination coefficients seemed to follow an inverse relationship with the slope for the IFCB NBSS as flatter NBSSs were also marked by lower R² (Fig. 4g). The scanner data did not follow such clear trends and seemed less variable than the UVP and IFCB (Fig. 4b, e), although there seemed to be a clear decrease in NBSS linearity, or R², towards the pole (Fig. 4h).
To check whether these trends were specifically linked to sampled latitude, we looked at the latitudinal variability of the NBSS parameters (Fig. 5). IFCB measurements were all restricted to a small latitudinal range; however, we observed a notable decrease in the linearity of the size spectra with latitude. Higher latitudes (> 50° N and S) also showed higher variation in both slope and intercept estimates compared to lower latitudes, as well as lower coefficients of determination for scanner and UVP size spectra. Both instruments showed reduced variability in derived slopes and intercepts within the tropics, with flatter slopes and increased intercepts notably being located at 0° N, near the equatorial current system.
Since latitudinal trends can be impacted by different dynamics in specific regions but also by differences across seasons, we computed the instrument-specific monthly climatologies of NBSS parameters in ocean regions where there were at least 10 months of data (Fig. 6). This excludes the Arctic Ocean, Red Sea, South Atlantic, Southern Ocean, and Baltic Sea, which are represented in PSSdb but do not have enough data to resolve seasonal cycles. Time series analysis of the instrument-specific NBSS showed pronounced seasonal cycles but high variability by region and, in some cases, between instruments. Seasonal variations in NBSS parameters were apparent for most ocean basins, as well as in the Mediterranean Sea (Romagnan, 2013), which showed high variation in terms of the scanner mean slope and intercept throughout the year (Fig. 6b, g), with rather constant R² values (Fig. 6l). Stable R² values were generally observed across all instruments and ocean basins, with the exception of the scanner datasets located in the Indian Ocean, which presented a large dip in NBSS linearity in October (Fig. 6k). Interestingly, the North Atlantic presented opposite trends between the IFCB and the scanner, whose NBSS slopes indicated a steepening of the size spectra and whose intercepts increased during the spring (Fig. 6c, h), and the UVP datasets, for which the NBSS flattened and intercepts decreased during spring and summer (Fig. 6c, h). In the Southern Hemisphere, UVP slopes were at a minimum by the end of austral summer (January-May), with a concurrent increase in NBSS intercepts only observed by the end of this period (Fig. 6e, j). Scanner datasets showed similar trends to those of the UVP in the South Pacific, except that the minimum slope and maximum intercept were observed by March, earlier in the year. The Indian Ocean followed the same seasonal cycle, with large differences between seasons as spectra steepened and intercepts increased during the spring-summer transition while remaining relatively stable at high slope and low intercept values from September through November (Fig. 6a, f). Lastly, the IFCB datasets collected in the North Pacific presented two peaks in NBSS intercepts, with concurrent dips in the slopes, indicative of a steeper NBSS, by spring (April) and fall (October) (Fig. 6d, i).
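A monthly climatology with the coverage criterion mentioned above could be computed along these lines. The sketch is only illustrative: the record layout and the function name are hypothetical, and "at least 10 months of data" is read here as at least ten distinct calendar months with observations in a region.

```python
from collections import defaultdict


def monthly_climatology(records, min_months=10):
    """Average a parameter per region and calendar month.

    `records` is an iterable of (region, month, value) tuples. Regions with
    fewer than `min_months` distinct months of data are left out.
    """
    by_region = defaultdict(lambda: defaultdict(list))
    for region, month, value in records:
        by_region[region][month].append(value)
    climatology = {}
    for region, months in by_region.items():
        if len(months) >= min_months:
            climatology[region] = {
                m: sum(v) / len(v) for m, v in sorted(months.items())
            }
    return climatology


# Hypothetical slopes: region "A" is sampled in 10 months, region "B" in 3.
records = [("A", m, s) for m in range(1, 11) for s in (-1.0, -1.2)]
records += [("B", m, -0.9) for m in (5, 6, 7)]
clim = monthly_climatology(records)
```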
Given the opposite spatial (Figs. 4, 5) and temporal (Fig. 6) patterns typically observed between the size spectra slopes and intercepts (Sprules and Barth, 2016) across instruments and oceanic regions, we used the yearly time series correlation of these two parameters in any given grid cell within the same oceanic region as a way to detect potential decoupling, lag, or feedback between the two. The (de-)coupling between NBSS slopes, which represent the balance between relatively small and large particles and plankton, and intercepts, which approximate the carrying capacity of a given ecosystem, across the years is presented in Fig. 7. As expected, the majority of PSSdb grid cells were strongly anti-correlated, with coefficients close to −1, since steeper size spectra tend to indicate an increased proportion of smaller particles, which are generally more abundant. Noticeably, though, there are also areas of a strong positive relationship between the two parameters, especially within the IFCB datasets located in the North Atlantic. Flatter NBSS values were thus associated with increased abundances of 1 µm³ organisms, which could be indicative of the relief of nutrient stress, allowing for multiple phytoplankton size groups to co-exist (Armstrong and McGehee, 1980); other complex interactions between primary producers dictated by resource competition; or trophic shunt between small and large plankton for zooplankton. In this region, we also observed a decoupling between the NBSS parameters for 2-3 years, as indicated by grid cells with low absolute correlation coefficients. A decoupling between size spectra parameters could arise from temporal lag in trophic transfer and complex trophic cascading, similarly to that mentioned above. Care should be taken when testing for significant long-term trends in the coupling of the NBSS parameters and when detecting yearly perturbations; however, we expect such analyses to become more robust as more datasets are ingested into the future releases of PSSdb.
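The yearly coupling between slopes and intercepts in a grid cell can be illustrated with a plain Pearson correlation. This is a sketch under assumptions: the text does not state which correlation coefficient was used, the function names and input layout are hypothetical, and the minimum of three values per year is an arbitrary choice made here.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences.

    Raises ZeroDivisionError if either sequence has zero variance.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def yearly_coupling(series, min_points=3):
    """Correlate slopes and log10 intercepts per year for one grid cell.

    `series` maps a year to a list of (slope, log10_intercept) pairs, e.g. one
    pair per month. Years with fewer than `min_points` pairs are skipped.
    """
    out = {}
    for year, pairs in series.items():
        if len(pairs) >= min_points:
            slopes, intercepts = zip(*pairs)
            out[year] = pearson_r(slopes, intercepts)
    return out


# Hypothetical cell: steeper slopes coincide with higher intercepts in 2019.
series = {
    2019: [(-0.9, 7.0), (-1.0, 7.5), (-1.1, 8.0)],
    2020: [(-1.0, 7.5)],
}
coupling = yearly_coupling(series)
```

A coefficient near −1 corresponds to the expected anti-correlation, whereas values near 0 or positive values would point to the decoupling discussed above.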

Discussion
Workflows that provide estimates of plankton size distributions with an extensive spatial and temporal coverage will greatly accelerate efforts to characterize and understand ecological plankton dynamics at a global scale. With this goal, the first PSSdb data products were generated to determine patterns in trophic transfer efficiency across plankton sizes and ecosystem carrying capacity through consistent measurements of particle sizes analyzed by three state-of-the-art plankton imaging devices. In this section, we discuss how the spatial and temporal coverage of these instrument-specific datasets effectively reduces the gap in available size structure observations before presenting potential uses of the datasets and future directions for the database.

The contribution of PSSdb and other data compilations to reducing the gap in available size structure estimates
A global compilation of particle size distribution has recently been published by Kiko et al. (2022) using the UVP5 bulk particle size distribution accessible from EcoPart. Other recent studies (e.g., Hatton et al., 2021) have also constructed global estimates of size distribution in marine organisms using indirect biomass estimates of arbitrary plankton size classes derived from satellite proxies, models, or data compilations like COPEPOD (Moriarty and O'Brien, 2013) and the MARine Ecosystem biomass DATa (MAREDAT, Buitenhuis et al., 2013) without relying on direct size estimates. More recently, Atkinson et al. (2024) compiled estimates of spectral slopes measured in 41 sites, mostly located in the Atlantic Ocean and a number of lakes, displaying important characteristics relevant to studying the impact of climate change. Such databases and compilation efforts have benefited from exponentially growing sampling efforts during the past decades, with hundreds to thousands of new UVP profiles generated each year (Kiko et al., 2022). Yet, to our knowledge, our workflow is the first attempt to compile the counts, size measurements, and taxonomic information of individual particles from multiple imaging devices to generate global particle and planktonic size spectra datasets, which are intended to be accessible to the broad scientific community. Similarly to the COPEPOD database, we have focused our effort on compiling data from different instruments, sampling regimes, and data collection methodologies in a self-consistent and cross-calibrated manner, enabling ease of comparison between all the major ocean basins and across sampling systems. So far, size spectrum studies have been restricted to accessible areas and clement weather conditions (Hatton et al., 2021), leading to fewer sampling efforts at high latitudes, specifically in the Southern Ocean and the South Pacific Ocean. Similarly, the sampling density is skewed towards the continental shelves as opposed to open-ocean stations. Like other global compilations, our datasets of marine pelagic size structure highlight multiple regions that are undersampled by plankton imaging systems. All imaging sensors were mostly deployed in the Northern Hemisphere in contrast to the fewer deployments in the South Pacific, the western Indian Ocean, and the Southern Ocean. While the latter is covered in the UVP-based compilation of particle size structure (Kiko et al., 2022), the absence of the Southern Ocean in our database results from the need for the manual validation of taxonomic annotations to pass our current quality control. However, as more autonomous vehicles are equipped with UVP6 and its embedded classifier (Ricour, 2023), notably in the Bio-Argo program (Claustre et al., 2020), we anticipate that UVP6-derived datasets will grow substantially in the upcoming decade. To accommodate the growth of datasets derived from large-scale surveys, we could relax such criteria to generate specific data products in near-real time. Since a few UVP6 datasets are already incorporated in this initial release, we expect further ingestion of additional UVP6 data to be relatively straightforward. Unlike the UVP datasets, the IFCB and scanner datasets are more difficult to compile due to the lack of a common platform to manage incoming datasets and due to the increased efforts needed during the sampling (e.g., net deployments and recoveries), pre-processing (e.g., concentration or size fractionation of particles before imaging), and post-processing steps. Notably, the large number of images collected at an hourly or sub-hourly frequency by the IFCB devices and their classification constitute a veritable bottleneck in producing near-real-time datasets. To generate smaller, more manageable datasets, user-specific settings that trigger the image acquisition based on a specific size or fluorescence value may help reduce the total number of images to be classified and the presence of smaller cells (4-7 µm) that are harder to identify, even manually. Alternatively, newer, more efficient, automated classifiers can also help manage upcoming observations (Kraft et al., 2022).

Global patterns and trends in plankton size spectrum: insights from the PSSdb first release and potential future uses
Parameters derived from the plankton and particle size spectrum (slope, intercept, and determination coefficients) are all important indicators of ecological processes (Sprules and Munawar, 1986; Trudnowska et al., 2021). As such, they can inform us about the general functioning and state of pelagic ecosystems and the eventual perturbations or shifts in plankton community structure. The compilations from Hatton et al. (2021), Atkinson et al. (2024), and Kiko et al. (2022) seem to support the presence of an equal stock of living biomass across increasing size classes (slope of the biomass spectrum is equal to ∼ 0), driven by the log-linear decline in particle abundance with increasing size and/or biomass (slopes of the normalized biovolume and abundance spectrum are equal to ∼ −1 and −4, respectively), as postulated by Sheldon et al. (1972). Small differences across instruments can be attributed to certain plankton groups being measured with more accuracy by one instrument. For instance, the UVP additionally samples fragile organisms and non-living particles, which are disrupted by the net collection (e.g., Biard et al., 2016; Soviadan et al., 2023). As a result, the UVP-specific spectra were consistently higher than the scanner spectra. As in these compilations, the majority of PSSdb NBSS slopes are relatively close to −1 (equivalent of −4 for PSD), indicating a stable equilibrium between small and large particles and a similar trophic transfer efficiency (Fig. 4, Table 2). Nevertheless, substantial divergences from the canonical slope were observed for all the instruments used in this release, notably in the northernmost latitudes and close to the coasts. Size spectra have been shown to flatten with increasing nutrient supply (e.g., upwelling, coastal, and polar systems), as observed by other data compilations (see Atkinson et al., 2021, for freshwater ecosystems), modeled by size-structured plankton systems (Barton et al., 2013; Hatton et al., 2021; Serra-Pompei et al., 2022), or approximated from satellite data (Kostadinov et al., 2009; Hirata et al., 2011; Roy et al., 2013). Interestingly, we did not observe flatter size spectra in stable upwelling ecosystems located along the Californian, Peruvian, Namibian, and northwestern African coasts (Fig. 4). The shallowing of size spectra slopes with increasing nutrient supply is not a universal pattern since flatter size spectra have also been reported in stable, oligotrophic ecosystems compared to in more productive ecosystems (Marcolin et al., 2013; Atkinson et al., 2021). The former are typically considered to be at steady state, as reflected in the stable daily oscillations of total particulate organic carbon, yet significant variability in time and space raises substantial concerns regarding our ability to extrapolate plankton size spectra and their slopes from crude or fragmented observations (Rodriguez and Mullin, 1986). A simple explanation for this lack of consistency is that all spatial patterns are effectively impacted by sampling time. Notably, our extended temporal coverage in the Indian, Pacific, and Atlantic Oceans, as well as in the Mediterranean Sea, has highlighted that there is significant variability in size spectra slopes and intercepts from month to month (Fig. 6). Most temperate regions presented a trend consistent with the formation of a spring bloom, indicated by a flattening of the size spectra, and its progression towards a more stratified environment, marked by steeper size spectra due to the predominance of smaller plankton; this is in agreement with other regional and global studies (Clements et al., 2022; Haëntjens et al., 2022). However, coastal regions sampled by the IFCB showed an opposite progression, with steeper size spectra during the spring and fall seasons, which is consistent with a shift of the phytoplankton community towards smaller dinoflagellates compared to larger diatom chains, as described in Fischer et al. (2020). Seasonal plankton dynamics in coastal systems are much harder to predict given the large number of variables that determine plankton blooms. For this reason, high-frequency monitoring with imaging systems like the IFCB is valuable: it can quickly detect changes in size spectra and related slope and intercept anomalies indicating shifts in the plankton community, such as the occurrence of harmful algal blooms, which represent an important threat to human health around the globe (Glibert, 2020). Temporal variations in the coefficient of determination might also be relevant in detecting community shifts. For instance, the appearance of small dinoflagellates (Fischer et al., 2020) was also linked to a lower coefficient of determination. This parameter decreases with the non-linearity of particle size spectra and, as such, can be an important indicator of ecosystem perturbations and non-steady-state conditions.
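The correspondence quoted above between an NBSS slope of about −1, a PSD slope of about −4, and a flat biomass spectrum can be checked with a short derivation. The sketch below is not taken from the source: it assumes spherical particles, narrow size classes, and a continuous number density n(V) per unit biovolume, a symbol introduced here only for illustration.

```latex
\begin{align}
\mathrm{NB}(V) &\approx V\,n(V) \propto V^{b}
  \;\Rightarrow\; n(V) \propto V^{\,b-1}, \\
V &= \tfrac{\pi}{6}\,\mathrm{ECD}^{3}
  \;\Rightarrow\; \frac{\mathrm{d}V}{\mathrm{d}\,\mathrm{ECD}} = \tfrac{\pi}{2}\,\mathrm{ECD}^{2}, \\
\mathrm{NA}(\mathrm{ECD}) &= n(V)\,\frac{\mathrm{d}V}{\mathrm{d}\,\mathrm{ECD}}
  \propto \mathrm{ECD}^{3(b-1)}\,\mathrm{ECD}^{2} = \mathrm{ECD}^{\,3b-1}.
\end{align}
```

With b = −1 the PSD exponent is 3b − 1 = −4, and the biovolume held in logarithmically spaced classes scales as V · NB(V) ∝ V^(b+1) = V^0, i.e., a biomass spectrum slope of about 0.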
Most studies assessing marine plankton size structure have focused largely on analyzing the slope and, to a lesser extent, the intercept of pelagic size spectra, with much less interest given to the coefficient of determination (R²). However, differences in size spectrum linearity can arise from abiotic or biotic perturbations, leading to local peak(s) of intermediate-sized organisms (Moscoso et al., 2022). "Bumps" in the plankton size spectrum have been reported or modeled under harmful algal blooms (Harred and Campbell, 2014) and transient trophic interactions (Schartau et al., 2010; Banas, 2011; Rossberg et al., 2019) and as the result of mesoscale circulation (Noyon et al., 2022) or the omission of specific groups in the observed size range (e.g., heterotrophic nanoflagellates not detected by most imaging flow cytometers targeting fluorescing organisms; see Chisholm, 1992). Non-steady-state conditions are increasingly observed, particularly in nutrient-rich systems (Cavender-Bares et al., 2001), and represent a considerable topic of interest for environmental policies. For this reason, we carefully assessed and reported size spectra non-linearity in our database, along with the other, widely analyzed parameters. Our first-release products show that regions with lower R² were mostly located towards the North Pole and were linked, in particular, to flatter size spectra in these regions (Figs. 4, 5). Like a lower R², a decoupling between size spectra parameters is also indicative of important perturbations or, inversely, of the resilience of a given ecosystem as a result of complex trophic interactions (e.g., temporal lag, resource competition, grazing cascades). We suggest following the yearly correlation between slopes and intercepts, as presented in Fig. 7, to detect potential deviation from the expected seasonal trends, showing anti-correlation between size spectrum slopes and intercepts (Fig. 6). More data will greatly improve the accuracy of such an analysis and will potentially help inform policy stakeholders by revealing significant, climate-driven trends in size spectra decoupling.
A more detailed interpretation of our observed patterns and trends is out of the scope of this paper. However, we hope PSSdb will be further exploited by individual research groups or stakeholders to contextualize their studies or policies. In addition, current modeled (Serra-Pompei et al., 2022) and satellite-derived (Hirata et al., 2011; Roy et al., 2013; Kostadinov et al., 2023) plankton size distributions have yet to be compared to extensive size structure observations. PSSdb could represent a potential avenue to assess the performance of models and satellite proxies, especially as new and future model outputs (Negrete-García et al., 2022) and satellite datasets (e.g., NASA "Plankton, Aerosol, Cloud, ocean Ecosystem" mission) will provide biomass measurements for an ever increasing number of plankton functional groups. Such validation is key to constraining some of their uncertainties and to gaining a mechanistic understanding of how physiological and ecological processes structure current and future marine ecosystems (Menden-Deuer et al., 2021).
In addition, PSSdb users could investigate important factors driving the observed spatial patterns and temporal trends of plankton size spectra.PSSdb products could thus improve our understanding of the temporal and spatial variability of particle size spectra in specific regions, as well as provide a broader context to case studies, as showcased in Figs. 4 to 7, and support global data-driven interpolation, similarly to Hatton et al. (2021) or Clements et al. (2022).
4.3 PSSdb successes, challenges, and further considerations to maintain and expand the database In our effort to access and compile imaging datasets from multiple devices, we found the open-source platforms (and associated APIs) developed for the IFCB, UVP, and scanner users to manage their incoming datasets to be instrumental.For example, the online dashboards are a useful tool for the IFCB data generators to assess image quality during and following deployment by quickly checking the raw images and monitoring the number of ROIs per sample and can be used to alert potential stakeholders when a species of interest is detected.However, the possibility of linking a set of metadata and a tag (e.g., in the case of suspicion of any bias) for each sample was only added recently to second-generation dashboards.As a result, a significant number of datasets accessible from first-generation IFCB dashboards were not ingested in this initial release.It is difficult to assess how many IFCB samples were not ingested due to such a lack of metadata as an exhaustive list of IFCB dashboards, which would enable better data traceability, is still missing.Similarly, a portion of scanner and other net-collected imaging datasets is not easily traceable or usable for PSSdb as some data collectors still use early tools (Zooprocess and PlanktonIdentifier, the latter of which is no longer supported) to manage their datasets.Even though our pipeline is able to ingest datasets directly sent to us, these datasets eventually become harder to trace and compile compared to the UVP datasets, which are, to our knowledge, all uploaded on EcoTaxa and EcoPart.Both web platforms offer secure, easy, and reproducible access to nu-merous datasets, and EcoTaxa provides access to image annotations, a key feature to follow the status of the UVP and scanner datasets that should be validated to at least 95 % to be ingested in PSSdb.These open-source management platforms have been available to the scientific 
community for a decade but still suffer from a general lack of funding to support their development and maintenance. This is in contrast to the increasing funding to develop new imaging prototypes and commercial instruments (Lombard et al., 2019; Martin-Cabrera et al., 2022). Examples of imaging instruments that were not ingested in the PSSdb initial release include the Planktoscope (Pollina et al., 2022), the CytoSense (Dubelaar and Gerritzen, 2000), the FlowCam (Sieracki et al., 1998), the ZooGlider (Ohman et al., 2019), the ISIIS (Cowen and Guigand, 2008), the CPICS (Gallager, 2016), the VPR (Davis et al., 2005), and the LOKI (Schulz et al., 2010). From their associated publications, it is unclear how these datasets are archived in long-term repositories, although a few datasets collected with Planktoscope, ZooCAM, CytoSense, and FlowCam instruments have already been uploaded on EcoTaxa. Ingesting such datasets in PSSdb would be extremely valuable to assess extended plankton size spectra in the millimeter-centimeter size range and to bridge some of the gaps introduced by specific instrument operational ranges while providing overlapping size bins (Haëntjens et al., 2022). The latter are key for pooling datasets obtained from multiple imaging devices deployed in spatial and temporal proximity. In some cases, merging imaging datasets integrated over specific depth layers (e.g., net-collected datasets) with profiling or towed datasets is facilitated by simply integrating observations using the lowest sampling resolution (Soviadan et al., 2023); however, merging discrete (e.g., surface only) and integrated observations is more problematic without a good understanding of how the discrete measurements might change with depth. Despite such challenges, the relatively small differences between the overall intercepts and slopes of the PSSdb first-release products are greatly encouraging (Table 2). Prior to PSSdb, efforts to set guidelines and best practices for
obtaining plankton observations with imaging instruments (see Lombard et al., 2019; Neeley et al., 2021) had yet to establish protocols for harmonizing these datasets across platforms given the large variability between sampling strategies, instrument detection limits, size estimates, organisms targeted, and classification schemes. We hope to build upon this first data release and recent work from Soviadan et al. (2023) to provide merged data products that will effectively span the 5 orders of magnitude that can be captured by commercially available plankton imagers (Lombard et al., 2019).
Further, we planned to release taxonomically resolved PSSdb products, which will allow for the analysis of temporal and spatial shifts in plankton community composition since individual size observations collected from imaging devices are mostly paired with taxonomic annotations. Thus, it will be possible to assess taxon-specific size spectra using the same pipeline that we developed for the raw particle products with only minor modifications. These products, now available at https://doi.org/10.5281/zenodo.11059581 and described in Dugenne et al. (2024a), incorporate different levels of taxonomic resolution, allowing a global assessment of group-specific size structures and derived biomass based on published relationships linking biovolume to carbon content (Menden-Deuer and Lessard, 2000; Lehette and Hernández-León, 2009; McConville et al., 2017). The lack of standardization across classification schemes and taxonomic experts was a challenge as both lead to disparate rankings of taxonomic annotations across imaging datasets, which are harder to homogenize. In the future, a fine taxonomic resolution could be achieved by following the recent guidelines and standards for image annotation published by Neeley et al. (2021) (Kraft et al., 2022; Eerola et al., 2023); these could greatly accelerate the turnover for data processing and availability to reach operational plankton monitoring. More practically, given the current heterogeneity of image classification schemes, annotations have been grouped into broad categories, like the plankton functional groups used in current ocean biogeochemical (OBGC) models.

Data availability
The first-release datasets for the Pelagic Size Structure database project are available at https://doi.org/10.5281/zenodo.11050013 (Dugenne et al., 2024b). Future updates to these data products can be found at https://doi.org/10.5281/zenodo.7998790. Further information about PSSdb can be found on the project's web page (https://pssdb.net/, last access: October 2023).

Summary and conclusion
In this paper, we present a first compilation of pelagic size spectra obtained from three imaging systems: the IFCB, the UVP, and scanners. They represent state-of-the-art technologies to count, size, and identify living and non-living marine particles in the 7-10 000 µm size range, but their datasets have not been accessed, compiled, and shared in a consistent and interoperable manner so far. To facilitate a global compilation of size observations obtained with imaging instruments and to promote near-real-time assessments of plankton size distributions, we thus developed an open-source pipeline, available at https://github.com/jessluo/PSSdb (last access: April 2024). Using this pipeline, we gathered hundreds of specific datasets spanning most of the global ocean, with the exception of the Southern Ocean and the South Pacific.
Our first-release products, available at https://doi.org/10.5281/zenodo.11050013 (Dugenne et al., 2024b), show a consistent decline in raw particle numbers with increasing sizes across the 7-10 000 µm size range, with a slope close to −1 L⁻¹ µm⁻³ (for NBSS, and −4 L⁻¹ µm⁻² for PSD), in agreement with other size structure compilations, and an average intercept of 4 × 10⁷ µm³ L⁻¹ µm⁻³. Substantial divergences were observed in space and time for both parameters, which could point toward changes in the trophic efficiency and overall carrying capacity of marine ecosystems, especially in regions of increased nutrient supply. Those changes were sometimes linked to a change in size spectrum linearity and in the coupling between size spectra parameters, which can be driven by specific processes and perturbations, such as blooms. Targeted analysis of the spatio-temporal variations and perturbations of the plankton size spectra will improve our understanding of important processes and feedbacks governing marine ecosystems and will help constrain the uncertainty around future projections of marine diversity, services, and biogeochemistry from data-driven and mechanistic models.
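For readers unfamiliar with how such slopes and intercepts are obtained, the following is a minimal sketch of the standard NBSS calculation: biovolume is summed within logarithmic size classes, divided by the sampled volume and by the width of each class, and a log-log linear regression is fitted. It is an illustration only and not the PSSdb pipeline code; the function names, the choice of octave (base 2) bins, and the use of the geometric bin midpoint in the regression are assumptions made here for simplicity.

```python
import math
from collections import defaultdict


def nbss(biovolumes_um3, volume_l):
    """Normalized biovolume size spectrum on octave (base-2) biovolume bins.

    Returns {lower bin edge (um^3): normalized biovolume (um^3 L^-1 um^-3)}.
    Bin definition is an assumption of this sketch, not the PSSdb convention.
    """
    totals = defaultdict(float)
    for v in biovolumes_um3:
        totals[math.floor(math.log2(v))] += v  # sum biovolume per size class
    spectrum = {}
    for k in sorted(totals):
        lower = 2.0 ** k
        width = 2.0 ** (k + 1) - lower  # bin width in um^3
        spectrum[lower] = totals[k] / volume_l / width
    return spectrum


def fit_loglog(spectrum):
    """Ordinary least squares of log10(NBSS) against log10(bin midpoint)."""
    xs = [math.log10(lower * math.sqrt(2.0)) for lower in spectrum]  # geometric midpoint
    ys = [math.log10(value) for value in spectrum.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Synthetic sample: equal total biovolume in each octave bin gives a slope of -1.
particles = [1.5 * 2.0 ** k for k in range(10, 16) for _ in range(2 ** (15 - k))]
slope, intercept, r2 = fit_loglog(nbss(particles, volume_l=1.0))
```

In this synthetic case every size class holds the same total biovolume, so the normalized spectrum declines in proportion to bin width and the fitted slope is −1, the value around which the first-release NBSS slopes cluster.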
We plan on adding datasets to PSSdb and, to this end, encourage all research groups that generate plankton imaging data to support this development by contributing datasets from the currently supported instruments.Our pipeline is easily transferable in that other imaging instruments and datasets, either new or unpublished, can be ingested into PSSdb; we, hence, also invite users of other imaging devices to contact us (info available at https://pssdb.net/, last access: April 2024) to discuss options.

A1 Imaging FlowCytobot (IFCB)
According to the camera resolution, IFCB instruments may detect particles in the 4-420 µm size range (Olson and Sosik, 2007). In continuous mode, individual samples with a 5 mL maximum volume are automatically drawn by a syringe approximately every 20 min. Instruments can be deployed on underwater moorings (down to 40 m depth), on land-based piers and wharves, or on research vessels, where they can be connected to the flow-through system of the vessel to automatically collect new samples throughout the cruise. Alternatively, they may also be used to analyze discrete samples obtained from Niskin bottles from the CTD-rosette, though, in general, most IFCB sampling efforts included in PSSdb are limited to a single depth, located within the mixed layer (Fig. A1). In this instrument, a sheath fluid is recycled continuously through a set of two cartridge filters to align single, colonial, or chain-forming particles and to drive them through the flow cell, where they are intercepted by a red laser beam (630 nm). The resulting scattering and fluorescence emissions are captured and transformed by photomultipliers (PMTs), whose function is to amplify (depending on the PMT-relative gain set) and convert the emitted photons into an electronic signal. Image acquisition may be triggered by either scattering or fluorescence, given the individual gain and threshold set by the instrument user prior to sampling, if the particle size exceeds a minimum area threshold (> 160 pixels or ∼ 4 µm in equivalent circular diameter). Raw IFCB data include the individual images detected in real time (.roi files), the summary statistics of the electronic PMT signals (.adc files), and the configuration settings (.hdr files). The morphometric measurements, including image area, Feret diameter, and biovolume estimates based on distance map matrices (Moberg and Sosik, 2012), of individual or multiple (in the case of chain-forming or colonial organisms) ROIs are extracted from the masked images (also referred
to as blobs) using custom feature extraction MATLAB code (code and documentation available at https://github.com/hsosik/IFCB-analysis/, last access: October 2023) and can be further used to predict taxonomic annotations (Sosik and Olson, 2007).

A2 Underwater Vision Profiler (UVP)
The fifth generation of the UVP (hereafter UVP5) consists of a system of two red LED lights (625 nm) that illuminate a 22 × 18 cm frame, which is imaged by a camera with a resolution of ∼ 8 pixels per mm facing the illuminated plane. This system was routinely mounted on CTD-rosettes (Picheral et al., 2010) before its miniaturization led to the next generation of UVPs (UVP6, Picheral et al., 2022). Both UVP5 and UVP6 are rated to 6000 m depth. UVP6 instruments only have one red LED light and image a smaller frame (15 × 18 cm) with a higher resolution (∼ 12 pixels per mm). As a result of its miniaturization, the UVP6 can be mounted on autonomous platforms like gliders, floats, or moorings to record images at a preset time interval, although acquisitions have mostly been done in profiling mode so far (Fig. A1). Upon descent, pressure sensor readings and images are recorded at a frequency of 6 to 20 Hz, depending on the configuration setting and the in situ concentration of particles, whereby low concentrations require less buffering time before each new acquisition and hence allow a higher acquisition frequency. The configuration setting allows users to record the raw image frames, the vignettes of particles larger than a fixed size threshold generated after segmentation (i.e., the process of extracting individual ROIs from the initial image), or a combination of both (full process mode). The size threshold is typically set to 44 ± 22 pixels (∼ 910 ± 80 µm in equivalent circular diameter or ECD) and 70 ± 15 pixels (∼ 690 ± 120 µm in ECD) for the UVP5 and UVP6, respectively. In mixed acquisition mode (the recommended setting to limit processing time during and following deployment), image frames are segmented in real time to extract the individual area and the mean gray-level estimates for each particle larger than 1 pixel (∼ 150 ± 30 and ∼ 80 ± 10 µm in ECD for UVP5 and UVP6, respectively), and vignettes of larger particles are saved as BMP thumbnails. Post-recovery, the metadata
are manually filled, and the vignettes' BMP files are converted to binary masks whose morphometric features, including area and ellipsoidal axis, are extracted by a custom ImageJ toolbox named Zooprocess (Gorsky et al., 2010) for the UVP5 or via the UVP app for the UVP6 (Picheral et al., 2022). Size estimates for all particles can be further stored in EcoPart (https://ecopart.obs-vlfr.fr, last access: October 2023), while vignettes can be uploaded to the collaborative platform EcoTaxa (https://ecotaxa.obs-vlfr.fr, last access: October 2023) for automatic class predictions and manual validation. Prior to instrument shipping, both the effective volume (0.98 ± 0.18 L for UVP5 and 0.6 ± 0.02 L for UVP6) of the image frame and the two size conversion factors, Aa (the intercept) and Exp (the slope), linking metric-based area estimates to pixel-based area estimates by a power-law function, are calibrated against the unique reference unit (Picheral et al., 2010, 2022). However, the size conversion factors are used to account for light scattering around small particles only and are not required for size estimates of large particles, and the use of these factors can result in larger error propagation compared to a fixed pixel size conversion factor (data not shown). Therefore, all pixel-based area estimates were converted to metric area using a fixed pixel size factor (corresponding to the camera resolution reported above) for the UVP data included in the current PSSdb version. For further details regarding UVP data processing, see Kiko et al. (2022).
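The two conversion options described above can be sketched as follows. This is an illustrative example rather than the PSSdb or Zooprocess code: the function names are hypothetical, the power law is written as metric area = Aa × (pixel area)^Exp with the metric area assumed to be in mm², and no calibrated Aa or Exp values are given because they are instrument specific.

```python
import math


def area_fixed_um2(area_px, pixels_per_mm):
    """Fixed pixel size conversion: each pixel covers (1000 / pixels_per_mm)^2 um^2."""
    pixel_um = 1000.0 / pixels_per_mm
    return area_px * pixel_um ** 2


def area_power_law_mm2(area_px, aa, exp):
    """Calibrated power law, metric area = Aa * (pixel area) ** Exp.

    Units of mm^2 are an assumption of this sketch; Aa and Exp must come
    from the calibration of each individual instrument.
    """
    return aa * area_px ** exp


def ecd_um(area_um2):
    """Equivalent circular diameter of a particle with the given area."""
    return 2.0 * math.sqrt(area_um2 / math.pi)


# Nominal UVP5 resolution (~8 pixels per mm) applied to 1 pixel and to the
# typical 44 pixel vignette threshold.
ecd_1px_uvp5 = ecd_um(area_fixed_um2(1, 8.0))
ecd_44px_uvp5 = ecd_um(area_fixed_um2(44, 8.0))
```

With the nominal UVP5 resolution, a single pixel corresponds to an ECD of about 141 µm and 44 pixels to about 936 µm, which falls within the ∼ 150 ± 30 µm and ∼ 910 ± 80 µm ranges reported above.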

A3 Net sampling and benchtop scanners
Traditionally, zooplankton samples are collected via a wide range of net systems (reviewed by Wiebe and Benfield, 2003), preserved with a fixative reagent (mostly a buffered formaldehyde seawater solution), and processed in the laboratory. Benchtop flatbed scanning systems allow for a relatively high sample throughput compared to the traditional microscopic approach. PSSdb currently includes data collected from vertical or oblique tows with nets of various mesh sizes and aperture diameters (Fig. A1), mostly equipped with flow meters, and analyzed with the ZooScan system (Gorsky et al., 2010) or alternative generic scanners (Gislason and Silva, 2009; Lehette and Hernández-León, 2009; Kiko et al., 2020). [...] resolution of ∼ 96 pixels per mm, with the frame being illuminated from above and scanned from below. These scanners are typically used to scan and digitize preserved zooplankton samples as the organisms must be immobile during scanning. Prior to scanning, a background image of the frame filled with distilled water is scanned to facilitate ROI segmentation. The samples are typically rinsed to remove the fixative and the seawater, size-fractionated using sieves of various mesh sizes, and subsampled into aliquots to reduce the number of organisms per scan and to avoid overlapping objects in the image (Jalabert et al., 2022). Similarly to UVP5 profiles, Zooprocess is used to save the scanner frame and to manually fill the metadata of each sample, including the GPS coordinates, sampling depth range, sampling time, volume of filtered seawater, and dilution factor of the scanned subsamples. Each scan will generate three files, containing the log, the metadata, and the overall scan saved as TIFF files. A first segmentation is performed to separate the ROIs from the background and to extract their morphometric features (see supplemental material of Gorsky et al., 2010), depending on a lower size threshold
(370 ± 360 µm in ECD on average) and the mean gray-level intensity (default is 243). If necessary, a second segmentation may be done after manually separating overlapping ROIs (Vandromme et al., 2012). Once the separation of ROIs is optimal, their corresponding vignettes, along with the automatically generated EcoTaxa table, may be uploaded to EcoTaxa to predict and validate the taxonomic annotations. As a starting point and for reproducibility, we only ingested datasets uploaded on EcoTaxa as they can be repeatedly accessed and shared amongst collaborators, notably to assess the annotation status, which is important for ingestion into PSSdb (see Sect. 2.2.4). Once datasets are exported from EcoTaxa, we consider the reported size-based fractionation of the net tow sample: if the sample was sieved into separate size fractions after collection (e.g., a sample collected with a 333 µm mesh net that was subsequently sieved through 150 µm, 500 µm, and 1 mm meshes), the size spectra are first calculated for each size fraction based on the dilution factor of the aliquots taken for each sieved sample ("acq_sub_part" column in EcoTaxa) and the volume of filtered seawater of the net (as determined by the flowmeter; "sample_tot_vol" column in EcoTaxa) to account for the volume effectively scanned within a size fraction. The total size spectrum is then obtained by summing the fraction-specific spectra since size-fractionated scans originate from the same volume.
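The fraction-wise calculation described above can be sketched as follows. This is a simplified illustration, not the PSSdb code: the function name and the input layout are hypothetical, the size bins are arbitrary labels, and "acq_sub_part" is assumed here to be the inverse of the scanned fraction (for example, 4 when one quarter of the sieved fraction was scanned), with "sample_tot_vol" in cubic meters.

```python
from collections import defaultdict


def total_spectrum(fractions, sample_tot_vol_m3):
    """Sum fraction-specific abundances into one spectrum for a net sample.

    fractions: list of (counts_by_bin, acq_sub_part), one entry per sieved
    size fraction, where counts_by_bin maps a size bin label to the number
    of objects counted in the scanned aliquot.
    Returns {size bin: abundance in objects per m^3}.
    """
    total = defaultdict(float)
    for counts_by_bin, acq_sub_part in fractions:
        for size_bin, count in counts_by_bin.items():
            # Scale the aliquot count back to the whole fraction, then
            # divide by the volume filtered by the net.
            total[size_bin] += count * acq_sub_part / sample_tot_vol_m3
    return dict(total)


# Hypothetical sample: two sieved fractions from a tow that filtered 100 m^3.
spectrum = total_spectrum(
    [
        ({"200-400 um": 10}, 4.0),                   # small fraction, 1/4 scanned
        ({"200-400 um": 5, "400-800 um": 8}, 2.0),   # large fraction, 1/2 scanned
    ],
    sample_tot_vol_m3=100.0,
)
```

Because all fractions come from the same filtered volume, the fraction-specific abundances in a shared size bin simply add up, here 0.4 + 0.1 = 0.5 objects per m³ in the smaller bin.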

Figure 1. Schematic of the PSSdb processing pipeline. The main steps of the pipeline include (a) the selection and automatic download of imaging datasets that include predicted or validated taxonomic annotations (to ensure that bulk datasets do not include artifacts and to generate taxa-specific products), (b) the standardization of their native formats and units, (c) quality control involving an exchange between PSSdb developers and the concerned principal investigators, (d) the binning of samples in spatio-temporal proximity to match the current resolution of other databases and biogeochemical models, and (e) the computation of size spectra and the generation of the data products released at https://dx.doi.org/10.5281/zenodo.11050013 (Dugenne et al., 2024b).
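Step (d) can be illustrated with a minimal sketch that assigns each sample to a 1° × 1° cell and a year-month bin and then averages a parameter within each bin. The function names and the convention of labelling a cell by its south-west corner are assumptions made for this example; the sample values are hypothetical.

```python
import math
from collections import defaultdict
from datetime import datetime


def bin_key(lat, lon, when):
    """1 degree x 1 degree cell (labelled by its south-west corner) and year-month."""
    return (math.floor(lat), math.floor(lon), when.year, when.month)


def bin_means(samples):
    """samples: iterable of (lat, lon, datetime, value); returns the mean value per bin."""
    groups = defaultdict(list)
    for lat, lon, when, value in samples:
        groups[bin_key(lat, lon, when)].append(value)
    return {key: sum(values) / len(values) for key, values in groups.items()}


# Hypothetical slopes: two samples share a cell and month, the third falls in July.
samples = [
    (43.68, 7.31, datetime(2015, 6, 3), -1.1),
    (43.20, 7.90, datetime(2015, 6, 20), -0.9),
    (43.68, 7.31, datetime(2015, 7, 1), -1.3),
]
means = bin_means(samples)
```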

Figure 2. Spatio-temporal coverage of PSSdb first-release datasets obtained from the IFCB (a, d), scanners (b, e), and the UVP (c, f). Maps and Hovmöller diagrams are color-coded according to the density of temporal bins (a-c), corresponding to the year and month, and spatial bins (d-f), corresponding to 1° × 1° grid cells, respectively. The sizes of the grid cells are expanded (∼ ×2) in panels (a), (b), and (c) to help visualize the color scale and represent a coarser spatial coverage of the dataset.

Figure 3. First release of PSSdb: pelagic size spectra (product 1a) approximated from normalized biovolume (a) and normalized abundance (b) and comparison between fitted (product 1b) slopes (c) and intercepts (d) for the three plankton imaging systems included in the first release. Solid lines in panels (a) and (b) represent the median spectrum, restricted to size classes that were observed in at least 50 % of the samples to avoid misalignment due to different sampling efforts (e.g., different mesh sizes for scanners, different photomultiplier (PMT) settings for IFCB).

Figure 4. Average NBSS parameters in 1° × 1° latitude-longitude cells (product 1b) from imaging data obtained by the IFCB (a, d, g), scanners (b, e, h), and the UVP (c, f, i). Slopes correspond to panels (a), (b), and (c); intercepts correspond to panels (d), (e), and (f); and determination coefficients correspond to panels (g), (h), and (i). The sizes of the grid cells are expanded (∼ ×2) in all panels to help visualize the color scale and represent a coarser spatial coverage of the dataset.

Figure 5. Latitudinal variability of NBSS slopes, intercepts, and determination coefficients for the IFCB (a), scanners (b), and the UVP (c). Dots represent the mean parameter value per 1° latitudinal bin, and the horizontal bars represent the standard deviation.

Figure 7. Pearson correlation coefficients between NBSS slopes and intercepts across ocean basins (columns) and years for the IFCB (a-e), scanners (f-j), and the UVP (k-o).Each data point represents an individual grid cell within the five major ocean basins, with enough data to show seasonal fluctuations: Indian Ocean (a, f, k), Mediterranean Sea (b, g, l), North Atlantic (c, h, m), North Pacific (d, i, n), and South Pacific (e, j, o).

Figure A1. Distribution of the sampling depth ranges of accessible (all bars) and ingested (black bars) IFCB (a), scanners (b), and UVP (c) datasets. Note that depth limits were rounded to a 10, 50, and 100 m resolution to reduce the number of ranges reported.

Figure A2. Normalized biovolume size spectra (a-c) and associated linear regression parameters (d-f) calculated from three methods: area-based biovolume (this study, black); distance-map-based biovolume (Moberg and Sosik, 2012, light purple), which is part of the processing pipeline of the IFCB images; and ellipsoidal biovolume (aquamarine), which is more commonly used for processing ZooScan and UVP datasets. Dots represent individual samples (defined by temporal and spatial bins); solid lines in panels (a), (b), and (c) represent the median spectrum for the size classes that were present in at least 50 % of the samples (to avoid misalignments due to different sampling efforts). Violin plots in panels (d), (e), and (f) represent data density on the y axis, and horizontal lines represent the median. The data included in this analysis are restricted to particles that have length estimates for both the major and minor axes, resulting in only large particles being uploaded to EcoTaxa for the UVP datasets.

Table 2. Size spectra description for each instrument included in the first release of PSSdb. Parameters are reported as mean (± standard deviation), with the exception of the untranslated intercept, which is reported as a geometric mean, with the range of observed values given as the first and third quartiles in the parentheses.