Database of diazotrophs in global ocean: abundances, biomass and nitrogen fixation rates

Marine N2-fixing microorganisms, termed diazotrophs, are a key functional group in marine pelagic ecosystems. The biological fixation of dinitrogen (N2) to bioavailable nitrogen provides an important new source of nitrogen for pelagic marine ecosystems and influences primary productivity and organic matter export to the deep ocean. As one of a series of efforts to collect biomass and rates specific to different phytoplankton functional groups, we have constructed a database on diazotrophic organisms in the global pelagic upper ocean by compiling about 12 000 direct field measurements of cyanobacterial diazotroph abundances (based on microscopic cell counts or qPCR assays targeting the nifH genes) and N2 fixation rates. Biomass conversion factors are estimated based on cell sizes to convert abundance data to diazotrophic biomass. The database is limited spatially, lacking large regions of the ocean, especially in the Indian Ocean. The data are approximately log-normally distributed, and large variances exist in most sub-databases, with non-zero values differing by 5 to 8 orders of magnitude.


Introduction
N2 fixation is the biological conversion of dinitrogen (N2) gas into two molecules of ammonia by diazotrophic organisms. Over geological time scales, N2 fixation is important for regulating fixed N concentrations in the ocean and thereby sustaining ocean fertility (Tyrrell, 1999). The rate of pelagic N2 fixation in the contemporary ocean has been estimated to be 100-200 Tg nitrogen (N) yr−1, which constitutes about half of the total external source of bioavailable N to the ocean (Gruber and Sarmiento, 1997, 2002; Karl et al., 2002; Galloway et al., 2004; Deutsch et al., 2007; Gruber, 2008). It is generally accepted that cyanobacteria are the major N2-fixing microorganisms in the ocean (Karl et al., 2002; Zehr, 2011). However, non-cyanobacterial prokaryotic plankton may also conduct N2 fixation in the ocean, as revealed by the presence and transcription of nifH genes (encoding the iron protein component of the nitrogenase enzyme) (Zehr et al., 1998; Riemann et al., 2010; Farnelid et al., 2011; Fernández et al., 2011), although their relative contribution to global N2 fixation remains to be determined.
The most recently characterized diazotrophic phylotypes are the unicellular cyanobacteria (UCYN). Zehr et al. (2001) first successfully amplified a fragment of the nifH gene and nifH transcripts from the < 10 µm size fraction of whole water samples, which demonstrated the presence of unicellular diazotrophs. Subsequently, Montoya et al. (2004) measured high rates of N2 fixation by UCYN in the Pacific Ocean. Three distinct phylogenetic groups of UCYN have been identified: Crocosphaera watsonii (sometimes referred to as Group B, UCYN-B), uncultivated Group A (UCYN-A) (Zehr et al., 2001), and Group C (UCYN-C) (Langlois et al., 2005; Foster et al., 2007), which has only recently been cultured (Taniuchi et al., 2012).
Although marine cyanobacterial diazotrophs play a critical role in the oceanic N cycle, primary productivity and organic matter export (e.g., Karl et al., 1997, 2002; Capone, 2000; Gruber, 2008), there is still no database synthesizing the many field measurements of diazotrophic abundances and N2 fixation rates in the global ocean. Such a database is fundamental to understanding the spatial distribution and temporal variability of diazotrophic biomass and activity. Moreover, a more comprehensive set of direct measurements may be useful in evaluating basin- and global-scale geochemical estimates of diazotrophic N inputs, which have generally found the global N budget to be in deficit, with total N sources significantly lower than the N sinks (Gruber, 2008). The database can also be expected to provide useful information with which to investigate the mechanisms controlling marine diazotrophic distribution and activities.
In this paper we present a database compiling data on the abundance, biomass and N2 fixation rate of diazotrophs in the global ocean. This effort is part of the Marine Ecosystem Model Intercomparison Project (MAREMIP), in which field measurement-based databases are constructed for biomass and related process rates for phytoplankton functional types (PFTs) (Buitenhuis et al., 2012). The databases, named "MARine Ecosystem DATa" (MAREDAT), include nine PFTs: diatoms, Phaeocystis, coccolithophores, diazotrophs, picophytoplankton, bacterioplankton, mesozooplankton, macrozooplankton and pteropods, and also a database for dissolved organic carbon. In addition to the database for diazotrophs presented in this paper, other databases are presented in other papers of this special volume. The MAREDAT databases will be used for future model intercomparison studies in the MAREMIP projects, and are also available for public use.
In Sect. 2, the database is described, including information about database construction, data measurement methods, data quality control and conversion from diazotroph abundances to carbon (C) biomass. In Sect. 3, we present and discuss synthesized results from the database, including (1) the results of quality control, (2) spatial and temporal distribution of cell counts of diazotrophs, N2 fixation rates and nifH-based abundances, (3) general characteristics of the datasets including mean N2 fixation rates and diazotrophic C biomass as a function of geographical location and depth, (4) estimates for global N2 fixation rates and diazotrophic C biomass and (5) limitations of the database and the global estimates. In Sect. 4, we draw conclusions and provide recommendations for appropriate use of this database.

Data and methods
The database is available at PANGAEA (doi:10.1594/PANGAEA.774851).

Database summary
Three types of direct measurements (cell counts of diazotrophs, N2 fixation rates, and nifH-based abundances via quantitative polymerase chain reaction (qPCR) assays) were compiled from the scientific literature and personal communication with researchers working in the field. The database contains a total of 12 151 data points in three sub-databases: (1) cell counts of diazotrophs with 5326 data points (Table 1a), (2) N2 fixation rates with 3624 data points (Table 1b) and (3) diazotrophic abundances estimated from nifH copy abundances (referred to as nifH-based abundances hereafter) with 3201 data points (Table 1c). Note that the counts for Trichodesmium were reported in number of colonies, trichomes or cells, yet we maintain usage of the term "cell counts" to distinguish count-based methods from nifH-based abundances. The diazotroph abundances based on cell counts and nifH genes were converted into C biomass using conversion factors (discussed below). In each sub-database, the data were grouped into three taxonomic types: Trichodesmium, UCYN and heterocystous cyanobacteria (Table 1). A separate grouping is maintained for those N2 fixation rates measured from whole seawater samples (Table 1b).
Each volumetric data point is identified by its sampling date, geographic location (latitude and longitude) and depth. Depth-integrated values were calculated for those vertical profiles with measurements available at three or more depths. For this calculation, the profiles were linearly interpolated from the surface to the bottom sampling depth, i.e., N2 fixation rates and diazotrophic abundances are considered to be negligible below the bottom depth. These calculated depth-integrated data points were not counted in the total points in Table 1 as they are derived values. An additional 709 data points of diazotroph cell counts and N2 fixation rates that were originally reported as depth-integrated values were also included. These 709 depth-integrated data points are counted in the total in Table 1 as they are independent from other data points.
Each depth-integrated data point is identified by its sampling date, geographic location and integral depth.
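The depth integration described above can be sketched as a trapezoidal sum over a linearly interpolated profile. This is a minimal illustration, not the code used to build the database: the function name is hypothetical, and extending the shallowest measurement unchanged up to the surface is an assumption made here for profiles that do not start at 0 m.

```python
def depth_integrate(depths_m, values_per_m3):
    """Trapezoidal integral of a volumetric profile (per m3) to a per-m2 value.

    Integration runs from the surface to the deepest sample; anything below
    the deepest sample is treated as negligible, as stated in the text.
    """
    if len(depths_m) != len(values_per_m3):
        raise ValueError("depths and values must have the same length")
    if len(depths_m) < 3:
        raise ValueError("at least three sampling depths are required")
    profile = sorted(zip(depths_m, values_per_m3))
    # Assumption: hold the shallowest value constant up to the surface (0 m).
    if profile[0][0] > 0:
        profile.insert(0, (0.0, profile[0][1]))
    total = 0.0
    for (z0, v0), (z1, v1) in zip(profile, profile[1:]):
        total += 0.5 * (v0 + v1) * (z1 - z0)  # linear interpolation between samples
    return total
```

For example, a rate profile of 4, 2 and 0 µmol N m−3 d−1 at 0, 10 and 30 m integrates to 50 µmol N m−2 d−1.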
The database also provides total diazotrophic C biomass (from cell counts and nifH-based abundances) for each sample by summing values from the three defined diazotrophic types: Trichodesmium, unicellular and heterocystous cyanobacteria. Total N2 fixation rates are also provided: when whole seawater N2 fixation rates are available, they are used as total N2 fixation rates. Otherwise, the total N2 fixation rates are calculated by summing values from the three defined diazotroph types. In many samples, measurements were not available for all three defined diazotrophic types. Also, these three defined types do not represent the full diazotrophic community. Thus, the derived totals (via summation) can be considered as lower limits of diazotrophic biomass and activity.
Accessory data (including temperature, salinity and concentrations of nitrate, phosphate, iron and chlorophyll) are also provided if available.

Cell counts
Cell counts for diazotrophs (Table 1a) were largely performed by standard light microscopy, whilst a number of samples were counted using epifluorescence microscopy with blue or green excitation (Orcutt et al., 2001; Chen et al., 2003, 2008, 2011; Carpenter et al., 2004; Sohm et al., 2011a; Villareal et al., unpublished North Pacific data). The cell counts are limited to Trichodesmium and heterocystous cyanobacteria; no cell counts are available for UCYN (Table 1a), as UCYN-B can only be directly identified by epifluorescence microscopy and UCYN-A have not been microscopically identified. UCYN-C has only recently been microscopically identified (Taniuchi et al., 2012).
Most counts for Trichodesmium were reported in number of colonies or trichomes per volume, and in a few datasets as cell densities. In order to use a unified biomass conversion factor for Trichodesmium (discussed later), all the Trichodesmium counts were converted to number of trichomes assuming commonly used conversion factors of 200 trichomes colony−1 and 100 cells trichome−1 (Carpenter, 1983; Letelier and Karl, 1996; LaRoche and Breitbarth, 2005; Benavides et al., 2011). An exception is the dataset of Carpenter et al. (2004), where conversion factors were measured in selected vertical profiles on three cruises in the tropical North Atlantic, with averages of 137 (71-267), 224 (89-411) and 148 (56-384) trichomes per colony, respectively. In this case, the measured conversion factors are used for this specific dataset. Notably, the assumed conversion factor of 200 trichomes colony−1 is consistent with values reported in Carpenter et al. (2004).
The cell counts for heterocystous cyanobacteria were grouped into two major genera: Richelia and Calothrix. Counts for Richelia and Calothrix are provided as heterocyst abundances. There are several datasets (Brzezinski et al., 1998; Gómez et al., 2005; Poulton et al., 2009; Villareal et al., 2011; Villareal et al., unpublished Gulf of Mexico data) in which abundances of the host diatoms Hemiaulus and Rhizosolenia were reported while Richelia heterocysts were not counted. As a Hemiaulus diatom typically contains 2 Richelia filaments whereas a Rhizosolenia diatom can contain 1-32 Richelia filaments (Sundström, 1984; Villareal, 1989, 1990; Foster and O'Mullan, 2008), the abundances of heterocysts for these datasets were derived from cell counts of their host diatoms by assuming that each Hemiaulus or Rhizosolenia cell contains 2 or 5 Richelia filaments, respectively. In one dataset (Gómez et al., 2005), abundances of Chaetoceros were counted but the associated heterocystous cyanobacteria were found to be Richelia. An average ratio of 0.5 Richelia heterocyst per Chaetoceros cell was reported for this dataset and was used to calculate the Richelia abundance. The trichomes or filaments of Richelia or Calothrix are typically composed of 3-4 vegetative cells and 1 terminal heterocyst (Foster and Zehr, 2006). Thus, abundances of cells within these genera are estimated by multiplying the heterocyst abundances by 5, i.e., assuming 5 cells per filament. Note that this may underestimate heterocystous cell abundances, as the Richelia symbionts of Rhizosolenia can in some cases contain more vegetative cells (near 10) (Villareal, 1989, 1992; Janson et al., 1999).
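The unit harmonization for Trichodesmium counts can be written as a small helper. This is a sketch under the conversion factors quoted in the text (200 trichomes per colony, 100 cells per trichome); the function and unit names are hypothetical, and the optional argument illustrates how a measured factor, such as those from Carpenter et al. (2004), could replace the default.

```python
TRICHOMES_PER_COLONY = 200  # default assumed in the text
CELLS_PER_TRICHOME = 100    # default assumed in the text


def to_trichomes(count, unit, trichomes_per_colony=TRICHOMES_PER_COLONY):
    """Convert a Trichodesmium count reported in any unit to trichomes."""
    if unit == "trichomes":
        return float(count)
    if unit == "colonies":
        return float(count * trichomes_per_colony)
    if unit == "cells":
        return count / CELLS_PER_TRICHOME
    raise ValueError(f"unknown unit: {unit!r}")
```

With the defaults, 3 colonies correspond to 600 trichomes; with a measured factor of 137 trichomes per colony, 2 colonies correspond to 274 trichomes.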
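The derivation of heterocyst and cell abundances from host diatom counts amounts to two multiplications. The sketch below uses the ratios stated in the text and assumes one terminal heterocyst per filament; the dictionary and function names are hypothetical, and the Chaetoceros ratio applies only to the single dataset that reported it.

```python
# Heterocysts per host diatom cell (one terminal heterocyst per filament).
HETEROCYSTS_PER_HOST_CELL = {
    "Hemiaulus": 2,      # 2 Richelia filaments per cell
    "Rhizosolenia": 5,   # assumed 5 filaments per cell (reported range 1-32)
    "Chaetoceros": 0.5,  # ratio reported for the Gomez et al. (2005) dataset
}
CELLS_PER_FILAMENT = 5   # vegetative cells plus one heterocyst, as assumed in the text


def heterocysts_from_hosts(host_genus, host_cells):
    """Estimate Richelia heterocyst abundance from host diatom abundance."""
    return host_cells * HETEROCYSTS_PER_HOST_CELL[host_genus]


def cells_from_heterocysts(heterocysts):
    """Estimate total cyanobacterial cells from heterocyst abundance."""
    return heterocysts * CELLS_PER_FILAMENT
```

For instance, 10 Hemiaulus cells imply 20 heterocysts and therefore 100 Richelia cells under these assumptions.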

N2 fixation rates
N2 fixation rates were measured directly by 15N-labeled N2 gas (15N2) assimilation (Montoya et al., 1996) or indirectly by the acetylene (C2H2) reduction assay (Capone, 1993) (Table 1b). The 15N2 assimilation method tracks the conversion of 15N2 to particulate N. The 15N2 tracer is added into the ambient pool of N2; the 15N/14N ratio is measured in the particulate N after incubation and compared to the natural abundance of N isotopes in unlabeled particulate material. The C2H2 method estimates the N2 fixation rate indirectly by measuring the reduction of C2H2 (a competitive inhibitor of N2) to ethylene (C2H4), which is then converted to a N2 fixation rate assuming 3 or 4 moles C2H2 reduced per 1 mole N2 fixed, depending on the extent of nitrogenase-linked hydrogen production (Postgate, 1998). Generally, direct 15N2 assimilation is a precise and sensitive method; hence, it has been used to generate the majority of rates (Table 1b). Direct comparison of these two methods showed that the 15N2 assimilation method generally yields lower rates than those estimated from the C2H2 reduction assay (see summary in Mulholland, 2007). Discrepancies between these measures could arise because the 15N2 assimilation method measures the net rate of conversion of reduced N to cellular N, or net N2 fixation, while the C2H2 reduction method measures gross N2 fixation, which includes the reduced N both stored in cells and excreted as ammonium or dissolved organic N during incubation (Mulholland et al., 2004; Mulholland, 2007). More recently, it has been suggested that the direct 15N2 assimilation method significantly underestimates N2 fixation rates because the 15N2 bubbles injected in seawater do not attain equilibrium with the surrounding water (Mohr et al., 2010), which will be discussed later. We include the N2 fixation rates acquired by either the 15N2 assimilation method or the C2H2 reduction assay in the database, considering that they are distributed within a similar range of magnitude (Fig. 1). This comparison, however, does not account for differences among sampling sites. Further investigations using pair-wise comparison of the methods are needed to evaluate the effects of these two methods on N2 fixation rate measurement. Users should take care when using the database to study N2 fixation rates aggregated from the two different methods.
The collected N2 fixation rates were mostly measured for whole seawater samples (Table 1b). Some samples were filtered and N2 fixation rates were measured for organisms in the < 10 µm size fraction, which we have assigned to unicellular diazotrophs (Table 1b). Note that UCYN-B can form colonies and may not be included in this size fraction. It is also possible that some diatoms with associated heterocystous cyanobacteria may be included in the < 10 µm fractions. N2 fixation rates were also measured in some datasets specifically for Trichodesmium and heterocystous cyanobacteria. Most N2 fixation data were reported as daily rates, except for 11 datasets that were reported as hourly rates. Trichodesmium fixes N2 exclusively during the light period, while the diel patterns of N2 fixation are unclear for other diazotrophs (Carpenter and Capone, 2008 and references therein). Thus, those hourly N2 fixation rates were converted to daily rates by multiplying by 12 h, which, however, could be conservative if diazotrophs other than Trichodesmium fix N2 during the night (e.g., Montoya et al., 2004; Zehr et al., 2007).
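The conversion from acetylene reduction to N2 fixation is a simple stoichiometric division. The sketch below is illustrative only: the function name is hypothetical, the ratio of 3 or 4 comes from the text, and the final factor of 2 (two N atoms per N2 molecule) is included here on the assumption that rates are wanted in moles of N rather than moles of N2.

```python
def n_fixed_from_acetylene(c2h2_reduced_mol, c2h2_per_n2=3.0):
    """Convert moles of C2H2 reduced into moles of N fixed.

    c2h2_per_n2 is the assumed conversion ratio (3 or 4 in the text,
    depending on nitrogenase-linked hydrogen production).
    """
    if c2h2_per_n2 <= 0:
        raise ValueError("conversion ratio must be positive")
    n2_fixed_mol = c2h2_reduced_mol / c2h2_per_n2
    return 2.0 * n2_fixed_mol  # each N2 molecule carries two N atoms
```

The same 12 mol of C2H2 reduced corresponds to 8 mol N with a ratio of 3 but only 6 mol N with a ratio of 4, which illustrates how strongly the chosen ratio affects the derived rate.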

NifH-based abundances
NifH abundances were estimated by qPCR targeting the nifH gene (Church et al., 2005a, b; Foster et al., 2007). Cellular DNA was extracted, and the gene sequences were targeted for different diazotrophic groups. Most nifH-based abundances were estimated for the three major diazotroph types: Trichodesmium, UCYN groups and heterocystous groups (Table 1c). Gene copies for UCYN were identified as UCYN-A, UCYN-B and/or UCYN-C groups. NifH-based abundances were also estimated for different groups of heterocystous cyanobacteria based on three nifH gene sequences (het-1, het-2 and het-3), which have been identified in symbioses with diatoms: Richelia-Rhizosolenia, Richelia-Hemiaulus and Calothrix-Chaetoceros, respectively (Church et al., 2005b; Foster and Zehr, 2006). Note that there is one dataset (Boström et al., 2007) that reports the abundance of the heterocystous genus Nodularia. The sensitivity of the qPCR thermocycler is usually in the range of 6-12 nifH copies per reaction. When accounting for sample filtration volume, final elution volume and the volume of the DNA extract used in the qPCR reaction, the detection limit of the assay is significantly higher. Because a variety of factors determine the assay detection limit, it varies between laboratories and even datasets and is typically on the order of 25-100 nifH copies l−1. Thus, for those data points reported as "detected but not quantifiable", 10 nifH copies l−1 could be a conservative estimate.
To estimate diazotrophic abundances, nifH gene copies are converted to number of diazotrophic cells on a 1 : 1 basis, as those diazotrophic genomes that have been sequenced (e.g., Trichodesmium and UCYN-B) contain one nifH gene copy per genome (Zehr et al., 2008), and assuming one genome copy per cell. NifH genes are present in both the vegetative cells and the heterocysts of heterocystous cyanobacteria (Foster et al., 2009b). Thus, this estimate accounts for abundances of the total cells, not just the heterocysts, of the heterocystous cyanobacteria. Limitations are associated with this extrapolation. Evidence indicates that this extrapolation may overestimate diazotrophic abundances, in which case it should be treated as an upper limit of the cell density, because multiple nifH gene copies per cell are present in some diazotrophs such as Clostridium pasteurianum (Langlois et al., 2008). Little information is available on the variability of genome copies per cell for all nifH phylotypes (Langlois et al., 2008). It is possible that, when there is excess phosphorus or if diazotrophic cells are carbon/energy limited rather than nutrient limited, they might accumulate more than one genome copy per cell. However, this extrapolation can also underestimate diazotrophic abundances, as it is known that DNA and RNA extractions are not 100 % efficient and that efficiency may vary among species (Foster et al., 2009b). The extraction efficiencies, and even genome copies per cell, are currently under investigation (J. Zehr, personal communication, 2011).
In addition, non-replicating deoxyribonucleic acid (DNA) can comprise up to 90 % of the total DNA in oligotrophic regions (Winn and Karl, 1986), and DNA sampled in natural environments may represent non-living (detrital or non-replicating) particulate matter (Holm-Hansen et al., 1968; Holm-Hansen, 1969; Winn and Karl, 1986; Bailiff and Karl, 1991; Arin et al., 1999). There is also evidence suggesting that non-living DNA is less important than originally thought (Dortch et al., 1983).
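The scaling from per-reaction sensitivity to a per-litre detection limit can be sketched as follows. The function name, the implicit assumption of 100 % extraction efficiency, and the example volumes are illustrative assumptions, not values taken from the compiled datasets.

```python
def detection_limit_copies_per_litre(copies_per_reaction, template_ul,
                                     elution_ul, filtered_litres):
    """Scale the qPCR per-reaction sensitivity to copies per litre of seawater.

    Assumes (hypothetically) lossless DNA extraction: only template_ul of the
    elution_ul extract is assayed, and the extract represents filtered_litres.
    """
    copies_in_extract = copies_per_reaction * (elution_ul / template_ul)
    return copies_in_extract / filtered_litres
```

With a hypothetical sensitivity of 10 copies per reaction, 5 µl of a 50 µl extract assayed and 2 l filtered, the limit is 50 nifH copies l−1, which falls within the 25-100 copies l−1 range quoted above.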

Log-normal distribution and quality control
Diazotrophic abundances and N2 fixation rates in the ocean can range from 0, when diazotrophs are below detection or truly absent at that location and time, to very high values during diazotrophic bloom phases. Cell abundances and N2 fixation rates hence vary by several orders of magnitude, are often not normally distributed and are positively skewed (long tail of high values). However, the datasets (excluding zero values) are approximately log-normally distributed, as is common for biological and ecological properties (Koch, 1966; Campbell, 1995). Mathematically, calculating the (arithmetic) mean and standard error of the log-transformed data and then back-transforming them exponentially results in the geometric mean $\bar{x}_g$ and geometric standard error $\mathrm{SE}_g$, which should be used in estimating the mean and the error of the mean for a log-normal distribution in the format $\bar{x}_g \times\!/\, \mathrm{SE}_g$ ($\bar{x}_g$ multiplied and divided by $\mathrm{SE}_g$) instead of $\bar{x}_a \pm \mathrm{SE}_a$ ($\bar{x}_a$: arithmetic mean; $\mathrm{SE}_a$: arithmetic standard error) (Limpert et al., 2001; Doney et al., 2009). Thus in this study, the error range for a geometric mean $\bar{x}_g$ is represented as between $\bar{x}_g / \mathrm{SE}_g$ and $\bar{x}_g \times \mathrm{SE}_g$.

Earth Syst. Sci. Data, 4, 47-73, 2012, www.earth-syst-sci-data.net/4/47/2012/

We control for data quality by using Chauvenet's criterion to flag suspicious outliers (Glover et al., 2011). The criterion generally applies to normally distributed datasets and rejects data whose probability of deviation from the mean is less than 1/(2n), where n is the number of data points. However, considering that (1) the datasets are log-normally distributed and (2) valid diazotrophic abundances and N2 fixation rates can be infinitesimally low or zero, Chauvenet's criterion in our practice is applied to the log-transformed non-zero data to flag outliers only on the high side. Those nifH-based abundances that were reported as "detected but not quantifiable" are not included in the application of Chauvenet's criterion. Although not used in the application of Chauvenet's criterion, data with zero values are kept in the database, as they represent valuable ecological information. Note that the criterion is applied separately to the volumetric data points of each taxonomic type. The criterion is also applied to the depth-integrated total N2 fixation rates and the depth-integrated total biomass estimated from cell counts and nifH-based abundances, as these depth-integrated values are used later for the global estimates. First, the mean $\bar{x}_{\log}$ and the standard deviation $\sigma_{\log}$ of the log-transformed data are calculated. These are used to calculate the critical value $x^*_{\log}$, defined such that values exceed the mean $\bar{x}_{\log}$ by this amount with a probability of one-half of 1/(2n), assuming a normal distribution (in log-transformed space). One-half of 1/(2n) is used because Chauvenet's criterion is a two-tailed test, and we only reject data at one tail on the high side. Thus, all data with log-transformed values higher than $\bar{x}_{\log} + x^*_{\log}$ are flagged. All the data points not flagged by Chauvenet's criterion are accepted. However, not all the suspicious outliers flagged by Chauvenet's criterion have to be rejected, as our datasets are not strictly log-normally distributed and the distribution estimated from the existing samples may not represent the true distribution well, especially at sites with unusual environmental conditions. Thus for each flagged outlier, we evaluate whether its extraordinarily high value is reasonable or spurious based on the specific environmental conditions and/or discussion with the original data contributor.
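The two statistical steps described in this section, the geometric mean with its multiplicative standard error and the one-sided Chauvenet flag in log space, can be sketched with the Python standard library. This is an illustrative reading of the procedure rather than the authors' code: the function names are hypothetical, zeros are simply skipped before the log transform, and the sample standard deviation (n − 1 denominator) is an assumption.

```python
import math
from statistics import NormalDist, mean, stdev


def geometric_mean_and_se(values):
    """Return (geometric mean, geometric SE) of the non-zero values."""
    logs = [math.log(v) for v in values if v > 0]
    if len(logs) < 2:
        raise ValueError("need at least two non-zero values")
    gm = math.exp(mean(logs))
    gse = math.exp(stdev(logs) / math.sqrt(len(logs)))
    return gm, gse  # error range: gm / gse to gm * gse


def chauvenet_high_flags(values):
    """Flag high-side outliers among non-zero values in log10 space.

    A value is flagged when its log exceeds mean + z * sigma, where z is the
    normal quantile leaving an upper-tail probability of 1/(4n), i.e. half of
    the two-tailed Chauvenet probability 1/(2n). Zeros are never flagged.
    """
    logs = [math.log10(v) for v in values if v > 0]
    n = len(logs)
    if n < 2:
        raise ValueError("need at least two non-zero values")
    mu, sigma = mean(logs), stdev(logs)
    z = NormalDist().inv_cdf(1.0 - 1.0 / (4.0 * n))
    threshold = mu + z * sigma
    return [v > 0 and math.log10(v) > threshold for v in values]
```

Flagged values are candidates for inspection only; as described above, whether a flagged point is rejected is decided case by case.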

Biomass conversion
The cell counts and nifH-based abundances are converted to C biomass using conversion factors (Table 2). As discussed above, for C biomass estimates all the counts of Trichodesmium in colonies and cells have been converted to number of trichomes by assuming 200 trichomes colony−1 and 100 cells trichome−1, which are widely used estimates for Trichodesmium (LaRoche and Breitbarth, 2005).
To determine the biomass conversion factor for trichomes of Trichodesmium, we utilize size measurements of cultured cells of different Trichodesmium species (Hynes et al., 2012) and estimate carbon content using the model of Verity et al. (1992) (Tables 3 and S1). The format of the Verity et al. (1992) model is

$$C = 0.433 \times V^{0.863},$$

where C is the cell carbon content in pg and V is the cell volume in µm3. This model was used because it was based on many data points across a wide range of cell sizes and species types, including cyanobacteria. T. erythraeum is the smallest Trichodesmium species, with an estimated C content of 65 pg C cell−1 (Table 3) or 6.5 ng C trichome−1 (using 100 cells trichome−1), which is consistent with the back-calculated 42 ± 1 pg C cell−1 from measurements of the iron-to-carbon ratio and iron content per cell (Tuit et al., 2004; Goebel et al., 2008). Thus, the Verity et al. (1992) model used here appears to be suitable for Trichodesmium. However, T. erythraeum is relatively small compared to other Trichodesmium species. The estimated carbon content for other Trichodesmium species is 110-250 pg C cell−1 (Table 3) or 11-25 ng C trichome−1 (using 100 cells trichome−1). Carbon content for Trichodesmium was derived from elemental analysis coupled with direct trichome counts at 40 stations in the tropical Atlantic in 1994 and 1996 (see the dataset associated with Carpenter et al., 2004 in our database), yielding a conversion factor of 10 ± 12 ng C trichome−1. Some other studies observed higher C contents for Trichodesmium colonies, such as 9.7 µg C colony−1 in the Pacific (Mague et al., 1977) and 10.9 and 11. To accommodate all these estimates, we use default conversion factors of 300 pg C cell−1 (30 ng C trichome−1, using 100 cells trichome−1) for Trichodesmium, with its range estimated as 100-500 pg C cell−1 (Table 2).
The biomass conversion factor for UCYN-A is difficult to calculate because there is no isolate in culture. UCYN-A cells are spherical in shape and the estimate of ∼1 µm in diameter by Goebel et al. (2008) is the only measurement of UCYN-A cell size, which was determined using fluorescence-activated cell sorting (FACS) coupled with real-time qPCR. This gives a cell volume estimate of 0.5 µm3 and a default conversion factor of 0.2 pg C cell−1 for UCYN-A using the Verity et al. (1992) model, with its range estimated by varying the default conversion factor by ± 50 %, i.e., 0.1-0.3 pg C cell−1 (Table 2).
UCYN-B (Crocosphaera) cells are also spherical in shape, with a reported diameter range of 3-5 µm in laboratory isolates (Goebel et al., 2008) and 3-8 µm in natural samples and cultures (Webb et al., 2009; Moisander et al., 2010). Thus, UCYN-B cells have a range in volume of 14-270 µm3 and in cellular carbon content of 4-50 pg C cell−1 using the Verity et al. (1992) model (Table 2). By assuming a diameter of 5 µm and thus a volume of 65 µm3, we also estimate the default conversion factor of 20 pg C cell−1 for UCYN-B.
The only successful isolation and laboratory culture of a UCYN-C strain, designated TW3 (Taniuchi et al., 2012), shows that the cells are 2.5-3.0 µm in width and 4.0-6.0 µm in length, which gives a cellular volume of 20-42 µm3 and a cellular carbon content of 5-11 pg C cell−1 for UCYN-C using the Verity et al. (1992) model. However, this range is an estimate from only one UCYN-C strain. For example, the nifH gene of UCYN-C is most similar to the nifH gene of the benthic Cyanothece (Zehr, 2011), and the cell dimensions of two Cyanothece strains, BH63 and BH68, have been reported as 4-5 µm width by 7-8 µm length (Reddy et al., 1993), which equals 90-160 µm3 and leads to a conversion factor of 15-24 pg C cell−1 using the Verity et al. (1992) model. By merging these two estimates, the default conversion factor of 10 pg C cell−1 with a range of 5-24 pg C cell−1 is used for UCYN-C (Table 2).
The trichomes of Richelia and Calothrix are composed of three to ten vegetative cells and one terminal heterocyst (Janson et al., 1999; Foster and Zehr, 2006). The sizes of vegetative cells and heterocysts have been measured for multiple Richelia and Calothrix samples (Foster et al., 2011). We use these values to estimate the C contents of vegetative cells and heterocysts using the Verity et al. (1992) model (Tables 4 and S2). As the sizes and the C contents differ between vegetative cells and heterocysts, the average C content per Richelia or Calothrix cell is calculated by assuming each trichome is composed of one heterocyst and three, five or ten vegetative cells (Table 4). The number of vegetative cells per trichome does not greatly impact the estimate of average C content per cell (Table 4). Based on these estimates, a default biomass conversion factor of 10 pg C cell−1 with a range of 2-80 pg C cell−1 is used for Richelia, and of 10 pg C cell−1 with a range of 5-20 pg C cell−1 for Calothrix (Table 2).
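The size-to-carbon estimates for the spherical UCYN-A and UCYN-B cells can be reproduced approximately with the sketch below. It assumes that the Verity et al. (1992) power law takes the commonly cited form C = 0.433 × V^0.863 (C in pg, V in µm3); these coefficients and the function names are stated here as assumptions for illustration.

```python
import math


def sphere_volume_um3(diameter_um):
    """Volume of a spherical cell (um3) from its diameter (um)."""
    return math.pi / 6.0 * diameter_um ** 3


def verity_carbon_pg(volume_um3):
    """Cell carbon (pg C) from cell volume (um3).

    Assumed coefficients: C = 0.433 * V**0.863 (commonly cited form of the
    Verity et al., 1992 model).
    """
    return 0.433 * volume_um3 ** 0.863


# A 1 um cell (UCYN-A) and a 5 um cell (UCYN-B), as quoted in the text.
ucyn_a_pg = verity_carbon_pg(sphere_volume_um3(1.0))
ucyn_b_pg = verity_carbon_pg(sphere_volume_um3(5.0))
```

Under these assumptions the two cells come out at roughly 0.25 and 16 pg C cell−1, which round to the default factors of 0.2 and 20 pg C cell−1 given in the text.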
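The default conversion factors and ranges quoted in this section can be collected in a lookup table and applied to an abundance. This is a minimal sketch: the table and function names are hypothetical, the numerical factors are those stated in the text, and nifH copies are assumed to have already been equated with cells on a 1 : 1 basis.

```python
# (default, low, high) conversion factors in pg C per cell, as quoted in the text.
CONVERSION_PG_C_PER_CELL = {
    "Trichodesmium": (300.0, 100.0, 500.0),
    "UCYN-A": (0.2, 0.1, 0.3),
    "UCYN-B": (20.0, 4.0, 50.0),
    "UCYN-C": (10.0, 5.0, 24.0),
    "Richelia": (10.0, 2.0, 80.0),
    "Calothrix": (10.0, 5.0, 20.0),
}

PG_PER_UG = 1e6  # picograms per microgram


def biomass_ug_c_per_m3(taxon, cells_per_m3):
    """Return (default, low, high) C biomass in ug C m-3 for an abundance."""
    factors = CONVERSION_PG_C_PER_CELL[taxon]
    return tuple(cells_per_m3 * f / PG_PER_UG for f in factors)
```

For example, 10^6 UCYN-B cells m−3 correspond to a default of 20 µg C m−3 with a range of 4-50 µg C m−3, which makes the width of the uncertainty band explicit.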

Results of quality control
Most data types are well approximated by a log-normal distribution (Figs. 2 and S1), except for Calothrix cell counts and Calothrix nifH-based abundances, which only have limited non-zero data points (Fig. S1b and h). Applying Chauvenet's criterion flags only 9 data points, including 1 volumetric Trichodesmium cell count (Fig. 2a), 1 volumetric Trichodesmium N2 fixation rate (Fig. 2b), 2 volumetric N2 fixation rates by UCYN (Fig. 2c), 2 volumetric whole seawater N2 fixation rates (Fig. 2d) and 3 volumetric UCYN-B nifH genes (Fig. 2e). Thus from a statistical point of view, most data are acceptable. The flagged Trichodesmium cell count of 4.4 × 10^10 trichomes m−3 and the flagged Trichodesmium N2 fixation rate of 31 391 µmol N m−3 d−1 were sampled simultaneously in the western Indian Ocean near the Kenyan coast (Fig. 3a and b) (Kromkamp et al., 1997). The authors reported that a massive bloom, which seemed to be associated with a front, was encountered and that large streaks of Trichodesmium colonies were floating at the surface. As both the high abundance and the high N2 fixation rate were observed, these two data points were very likely real. The flagged UCYN N2 fixation rates of 360 and 960 µmol N m−3 d−1 (Montoya et al., 2004) were from ∼10° S, ∼130-135° E, in the Arafura Sea in the western tropical Pacific (Fig. 3b). High N2 fixation rates are commonly found in this region, such as another flagged whole seawater N2 fixation rate of 610 µmol N m−3 d−1 (Bonnet et al., 2009) found in the nearby sea region at 6° S, 147° E, near Papua New Guinea (Fig. 3b). Recent measurements in the southwestern Pacific near New Caledonia also show extremely high N2 fixation rates (Bonnet, unpublished data). A possible reason for these high N2 fixation rates could be iron supply from volcanoes or via upwelling. Another flagged whole seawater N2 fixation rate of 13 500 µmol N m−3 d−1 (Gandhi et al., 2011) was sampled from surface water at ∼17° N, 73° E, near the Indian coast (Fig. 3b). In the same vertical profile, the N2 fixation rates sampled at 2 m and deeper were all less than 32 µmol N m−3 d−1, and all the other surface N2 fixation rates sampled in the surrounding area on this cruise were below 540 µmol N m−3 d−1. We judge that this high rate is anomalous and should be removed from the database. The three flagged data points of UCYN nifH-based abundances of ∼2-9 × 10^11 copies m−3 (Orcutt et al., unpublished data) are from the Mississippi Sound (Fig. 3c), whereas most other nifH gene data were sampled from the open ocean. Mississippi Sound is a very shallow, partially land-locked coastal environment and is very different from open ocean waters. Also, the three flagged data points of the extreme peaks in nifH-based abundance were sampled in June and July 2009 during extremely high water temperature events (∼30 °C) towards the end of the summer. This is also the time of year when the lowest seasonal dissolved inorganic N : P ratios (approximately 0.5) occurred, although the ratio increases towards 3-6 by the end of the late summer period (likely due to N2 fixation). The extreme peaks in nifH-based abundance also coincided with peaks in phytoplankton chlorophyll a (5-6 µg l−1). Thus, we believe most flagged data points, except the high surface N2 fixation rate found near the Indian coast (Gandhi et al., 2011), are due to the specific environment and not necessarily related to data quality. Hence, we have retained these values in the database. These points, however, are not included in our later analyses because their extremely high values would influence the mean values. It is also important to note that we have not excluded data based on an assessment of the protocols of sampling, handling, preservation or measurement.

Data distribution
Figure 3 shows the spatial distribution of the three sub-databases. The Atlantic Ocean has the best data coverage in all three sub-databases, especially in the North Atlantic. In the Pacific Ocean, the coverage is limited, especially in the South Pacific. There is almost no data coverage for the Indian Ocean except for four datasets in the Arabian Sea (Capone et al., 1998; Mazard et al., 2004), near the Kenyan coast (Kromkamp et al., 1997) and in the Madagascar Basin (Poulton et al., 2009). There are also some data points in inner seas. N₂ fixation rates were measured almost every month at the BATS station in 1995-1997 and at Station ALOHA in 2005-2010. These are the only two sites with long-term sustained time series of N₂ fixation measurements. Most data were collected in the tropical and subtropical regions, with latitudinal coverage of 50° S-50° N for cell counts of diazotrophs, 40° S-60° N for N₂ fixation rates and 30° S-60° N for nifH-based abundances (Fig. 4). Most data were collected in the 1990s and 2000s, with some cell counts and N₂ fixation rates collected in the 1960s and 1970s and very few cell counts collected in the 1980s (Fig. 5a, c and e). The monthly distribution of the data tends to be random for cell counts (Fig. 5b) and N₂ fixation rates (Fig. 5d), while most of the nifH gene data were collected in March, April and July (Fig. 5f).

N₂ fixation rates
After the data are binned onto a 3° × 3° grid and the geometric mean of the data in each bin is calculated, depth-integrated N₂ fixation rates are found to be highest in the western tropical Atlantic near the Caribbean Sea and in the subtropical North Pacific near the Hawaiian Islands, on the order of 100-1000 µmol N m⁻² d⁻¹ (Fig. 6a). In most other regions, depth-integrated N₂ fixation rates are on the order of 1-100 µmol N m⁻² d⁻¹ (Fig. 6a).
The volumetric N₂ fixation rates are also analyzed on a 3° × 3° grid in the five vertical layers of 0-5 m, 5-25 m, 25-62.5 m, 62.5-137.5 m and 137.5-250 m. The geometric mean values of each grid box are illustrated in Fig. 6b-f. Note that the depth-integrated and volumetric values are not from exactly the same data sources. They overlap for some data sources but also have locations that are not in common, as some of the data were originally reported as depth-integrated values and the other depth-integrated values were calculated herein. N₂ fixation rates generally decrease with depth (Fig. 6b-f). When compared horizontally in each layer, the database reveals that there are several regions with high N₂ fixation rates: the subtropical North Pacific in all the layers, surface waters of the western Pacific and the tropical Atlantic (especially in the west) in 0-25 m (Fig. 6b-f).
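The gridding step described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used to build the maps: the cell indexing by floor(latitude / 3) and floor(longitude / 3), the function name and the sample records are all assumptions. Zero values are skipped because a geometric mean is undefined for them.

```python
import math
from collections import defaultdict

def grid_geometric_means(records, cell_deg=3.0):
    """Bin (lat, lon, value) records onto a regular grid and return
    the geometric mean of the non-zero values in each occupied cell."""
    logs = defaultdict(list)
    for lat, lon, value in records:
        if value <= 0:
            continue  # zeros cannot enter a geometric mean
        # Hypothetical cell index: integer row and column of the grid box
        key = (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
        logs[key].append(math.log(value))
    return {key: math.exp(sum(v) / len(v)) for key, v in logs.items()}

# Hypothetical records: (latitude, longitude, rate in µmol N m^-2 d^-1)
records = [
    (10.5, -60.2, 10.0),
    (11.0, -61.0, 1000.0),   # same 3° cell as the record above
    (22.7, -158.0, 200.0),
    (22.0, -158.5, 0.0),     # zero value, ignored
]
means = grid_geometric_means(records)
```

The same routine could be applied per depth layer by filtering the records to one layer before binning.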

Diazotrophic abundance and biomass
Cell counts and nifH-based abundances are used to estimate diazotrophic C biomass using the default biomass conversion factors (Table 2), except for a few datasets where the contributors measured the conversion factors. The cell count data demonstrate that Trichodesmium is the dominant diazotroph, except that the abundance of Calothrix can be as large as that of Trichodesmium (Table 5). However, the number of depth-integrated cell count samples for Calothrix is very limited (Table 5), and they all come from one cruise in the subtropical North Pacific (Villareal et al., unpublished). As the average Trichodesmium cell size, and thus the biomass conversion factor, is much larger than that of heterocystous cyanobacteria (Table 2), Trichodesmium constitutes more than 97 % of diazotrophic biomass based on cell count data (Table 5). Cell count data do not include UCYN groups. The geometric mean of nifH-based abundances shows that UCYN-A is the most abundant among all the diazotrophic groups including Trichodesmium, and the abundances of UCYN-B, UCYN-C and Richelia are also comparable to that of Trichodesmium in the volumetric data (Table 5). However, Trichodesmium still dominates the diazotrophic biomass (Table 5) because its biomass conversion factor is higher than that of the other groups. Comparison of the geometric means shows that nifH-based abundances are mostly one order of magnitude higher than cell-count-based abundances for Trichodesmium and heterocystous cyanobacteria, except that the volumetric Trichodesmium abundances are comparable in both cell-count-based and nifH-based geometric means (Table 5). In the North Atlantic Ocean, where both cell counts and nifH-based abundances were frequently measured (Fig. 3a and c), histograms of Trichodesmium abundances derived from both the cell count data (assuming 100 cells trichome⁻¹) and nifH-based data are in agreement (Fig. 7). This evidence provides some support for the reliability of nifH-based abundance for Trichodesmium.
For comparison, the arithmetic mean and standard error are also calculated for each diazotrophic group (Table 5). Note that zero-value data points have to be excluded for calculating geometric means, while these data points can be used to calculate arithmetic means. The arithmetic means are mostly one to several orders of magnitude larger than the geometric means, especially in the nifH-based datasets (Table 5). The estimated biomass of UCYN groups increases greatly and becomes comparable to that of Trichodesmium if the arithmetic means are used (Table 5), which, however, is certainly due to the high values within the approximate log-normal distributions dominating the calculation of the arithmetic mean.
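The contrast between the two kinds of mean can be illustrated with a short Python sketch. It uses a hypothetical sample spanning several orders of magnitude, keeps zeros in the arithmetic mean and drops them from the geometric mean, as described above. The function name and the numbers are illustrative assumptions only.

```python
import math

def summarize(values):
    """Return (arithmetic mean, standard error, geometric mean).

    Zeros are included in the arithmetic statistics and excluded
    from the geometric mean.
    """
    n = len(values)
    arith = sum(values) / n
    variance = sum((v - arith) ** 2 for v in values) / (n - 1)
    std_err = math.sqrt(variance / n)
    positive = [v for v in values if v > 0]
    geo = math.exp(sum(math.log(v) for v in positive) / len(positive))
    return arith, std_err, geo

# Hypothetical abundances covering four orders of magnitude plus one zero
sample = [0.0, 1.0, 10.0, 100.0, 10000.0]
arith, std_err, geo = summarize(sample)
```

Here the single largest value pulls the arithmetic mean far above the geometric mean, which is the behaviour noted for the UCYN groups.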
Total diazotrophic C biomass estimated from cell counts is displayed in spatial maps for a given depth-integral and for the five vertical layers of 0-5 m, 5-25 m, 25-62.5 m, 62.5-137.5 m and 137.5-250 m by showing the geometric means of each 3° × 3° grid bin (Fig. 8a-f). The depth-integrated cell-count-based diazotrophic biomass is higher in the western than in the eastern Atlantic (Fig. 8a). The depth-integrated cell-count-based diazotrophic biomass is also high in the subtropical North Pacific near Hawai'i, while it is low in other regions of the subtropical North Pacific because mostly only heterocystous cyanobacteria were counted at these sampling sites (Fig. 8a). The cell-count-based diazotrophic biomass shows maxima at the surface and decreases with depth (Fig. 8b-f). The surface cell-count-based diazotrophic biomass is high in the tropical Atlantic (Fig. 8b and c), which is consistent with the high N₂ fixation rates found in the same region. In the southern Atlantic, the cell-count-based diazotrophic biomass is low in all layers (Fig. 8b-f). The high cell-count-based diazotrophic C biomass in the Arabian Sea is from one dataset (Capone et al., 1998) reporting an extensive Trichodesmium bloom, which may not represent the mean level of diazotrophic biomass in the Arabian Sea. The total diazotrophic biomass estimated from nifH-based abundances is presented for a given depth-integral and for the five vertical layers of 0-5 m, 5-25 m, 25-62.5 m, 62.5-137.5 m and 137.5-250 m by showing the geometric means of each 3° × 3° grid bin (Fig. 9a-f). Both the depth-integrals and the results for vertical layers show high nifH-based C biomass in the tropical Atlantic (Fig. 9a-f), which is consistent with the high N₂ fixation rates found in the same region. High nifH-based biomass was also found in the southwestern Pacific (Fig. 9a, c and d), where diazotrophic cell count data are not reported. The nifH-based biomass is generally high in 0-62.5 m (Fig. 9b-d) and decreases below 62.5 m (Fig. 9e and f).

First-order estimates for global N₂ fixation rate and diazotrophic biomass
Many analyses can be conducted with the ∼12 000 data points in the database, depending on the objectives of the users. Here we show a simple example using the database to conduct first-order estimates of the global N₂ fixation rate and diazotrophic biomass. We divide the global ocean into 6 regions: the North and the South Atlantic Ocean, the North and the South Pacific Ocean, the Indian Ocean and the Mediterranean Sea (Tables 6, 7 and 8). Two methods, geometric mean and arithmetic mean, were used for the estimation. As the data are not evenly distributed in space and intensive sampling was conducted in some regions (Fig. 3), the depth-integrated values are first binned onto a 3° × 3° grid to partially avoid this bias. Geometric and arithmetic means are calculated for each bin, which are then used to calculate geometric and arithmetic means for each region. The areal sum of the total N₂ fixation rate for each region is calculated by multiplying the geometric or arithmetic mean by the ocean area of that region. Note that the volumetric data points are not used in this simple example, although they can also provide valuable information for the global estimates. Global N₂ fixation rate and diazotrophic biomass are then estimated by summing the estimates from all 6 regions. As 99 % of the data in the database were collected within a latitudinal span of ∼40° S-55° N, we assume that oceanic N₂ fixation is negligible outside of this latitudinal range. Given known temperature constraints on diazotrophy, this seems like a reasonable assumption; however, one could further refine global estimates if significant N₂ fixation is found outside this latitudinal range in the future.
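The two-stage averaging and the areal sum described above can be sketched as follows. This is a hedged illustration rather than the calculation behind Table 6: the bin means and the ocean area are hypothetical, and the unit conversion assumes 365 days per year and a molar mass of 14.007 g mol⁻¹ for N.

```python
import math

MOLAR_MASS_N = 14.007   # g N per mol N
DAYS_PER_YEAR = 365.0   # assumption for the annual conversion

def regional_geometric_mean(bin_means):
    """Geometric mean of the per-bin geometric means of one region."""
    positive = [m for m in bin_means if m > 0]
    return math.exp(sum(math.log(m) for m in positive) / len(positive))

def areal_sum_tg_n_per_yr(mean_rate_umol_m2_d, area_m2):
    """Convert a mean rate in µmol N m^-2 d^-1 and an ocean area in m^2
    into a regional total in Tg N yr^-1."""
    grams_per_year = (mean_rate_umol_m2_d * 1e-6 * MOLAR_MASS_N
                      * DAYS_PER_YEAR * area_m2)
    return grams_per_year / 1e12  # 1 Tg = 1e12 g

# Hypothetical per-bin means (µmol N m^-2 d^-1) and region area (m^2)
bin_means = [10.0, 100.0, 1000.0]
region_mean = regional_geometric_mean(bin_means)
total = areal_sum_tg_n_per_yr(region_mean, 1.0e13)
```

Replacing `regional_geometric_mean` with a plain average of the bin means gives the arithmetic variant of the same estimate.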
In the estimates of N₂ fixation rate, the geometric mean of the depth-integrated rates for the North Atlantic Ocean is, surprisingly, lower than those for the North and the South Pacific (Table 6), although the North Atlantic has attracted the most studies of diazotrophs. Using the geometric means, the total N₂ fixation rate in the North Atlantic is estimated at 1.7 Tg N yr⁻¹, one order of magnitude lower than the estimates for the North and the South Pacific Ocean (35 and 24 Tg N yr⁻¹, respectively) (Table 6). The arithmetic mean N₂ fixation rate in the North Atlantic, however, is one order of magnitude higher than the geometric mean, which makes the total arithmetic rate of 32 Tg N yr⁻¹ more comparable to those of the North Pacific (56 Tg N yr⁻¹) and the South Pacific (46 Tg N yr⁻¹) (Table 6). The data indicate that, although high N₂ fixation rates were obtained in the North Atlantic, especially in the western tropical North Atlantic (Fig. 6a), low N₂ fixation rates were more frequently found in this region. More measurements are needed to confirm the high estimates of N₂ fixation rate for the Pacific Ocean, as the measurements in the North Pacific were mostly made in the subtropical gyre near the Hawaiian Islands (Fig. 6a), and the sampling in the South Pacific has been limited to data mostly from two cruises (Garcia et al., 2007; Raimbault and Garcia, 2008). The N₂ fixation rates were also low in the South Atlantic (Table 6). Too few depth-integrated N₂ fixation rates have been collected in the Indian Ocean to estimate the means for this basin. By summing the geometric mean rates from all the regions, the global N₂ fixation rate (not including the Indian Ocean) is estimated to be 62 (error range: 53-73) Tg N yr⁻¹ (Table 6), which is at the low end of the current geochemical estimates of 100-200 Tg N yr⁻¹ for marine pelagic N₂ fixation (Gruber and Sarmiento, 1997, 2002; Karl et al., 2002; Galloway et al., 2004; Deutsch et al., 2007; Gruber, 2008). Note that, as described in Sect. 2.2, the error range for a geometric mean in this study is represented as the interval between the geometric mean divided by and multiplied by the corresponding geometric standard error. The estimate of global N₂ fixation rate using the arithmetic mean is higher, at 140 (standard error: 9.2) Tg N yr⁻¹ (Table 6), which is within the range of the current geochemical estimates.
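The way regional estimates and their uncertainties are combined into global values can be sketched in Python. Following the text and the Table 6 caption, the error range of a geometric mean is the mean divided and multiplied by its geometric standard error, the global geometric range is the sum of the regional lower and upper bounds, and the errors of arithmetic means are propagated as independent errors (here assumed to mean addition in quadrature). The regional central values below are the ones quoted in the text; the geometric standard error of 1.2 and the arithmetic standard errors are hypothetical placeholders, not values from Table 6.

```python
import math

def geometric_error_range(geo_mean, geo_std_err):
    """Error range of a geometric mean: divide and multiply by the GSE."""
    return geo_mean / geo_std_err, geo_mean * geo_std_err

def global_geometric(regions):
    """regions: list of (estimate, lower, upper); each column is summed."""
    return (sum(r[0] for r in regions),
            sum(r[1] for r in regions),
            sum(r[2] for r in regions))

def global_arithmetic(regions):
    """regions: list of (estimate, standard error); independent errors
    are combined in quadrature (assumed form of the propagation)."""
    estimate = sum(r[0] for r in regions)
    std_err = math.sqrt(sum(r[1] ** 2 for r in regions))
    return estimate, std_err

# Geometric areal sums from the text (Tg N yr^-1) with a hypothetical GSE
geo_regions = []
for value in (1.7, 35.0, 24.0):
    low, high = geometric_error_range(value, 1.2)
    geo_regions.append((value, low, high))
geo_total, geo_low, geo_high = global_geometric(geo_regions)

# Arithmetic areal sums from the text with hypothetical standard errors
arith_total, arith_se = global_arithmetic([(32.0, 3.0), (56.0, 4.0), (46.0, 12.0)])
```

Only three regions are included in this sketch, so the totals differ slightly from the global values in the text, which also include the remaining regions.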
In a manner similar to our global estimates of the N₂ fixation rate, we have also calculated global mean diazotrophic biomass from cell-count-based biomass data (Table 7). At many sampling sites in the Pacific Ocean, only heterocystous cyanobacteria were counted (Table 1a), and thus very low total diazotrophic biomass was reported for these sites (Fig. 8a). We therefore give special consideration to the Pacific by estimating mean biomass for Trichodesmium and heterocystous cyanobacteria separately before summing the values from these two taxonomic groups (Table 7). The average diazotrophic biomass in the North Atlantic is similar to or higher than that in the North Pacific, as shown by both geometric and arithmetic means (Table 7). The global diazotrophic biomass (not including the Mediterranean Sea) is estimated from the cell-count-based data as 2.1 (error range: 1.4-3.1) Tg C using the geometric mean or 18 (standard error: 1.8) Tg C using the arithmetic mean (Table 7).
The global diazotrophic biomass is also estimated from depth-integrated nifH-based data. However, the depth-integrated data points of nifH-based biomass are limited, with no or inadequate data points in the South Atlantic, the Indian Ocean and the Mediterranean Sea (Table 8), and thus the estimate could be highly biased. The geometric mean nifH-based diazotrophic biomass is extremely high in the South Pacific compared to other regions (Table 8). The global estimate (the Pacific plus the North Atlantic) of diazotrophic biomass from nifH-based data is 89 (error range: 52-150) Tg C using the geometric mean or 590 (standard error: 70) Tg C using the arithmetic mean, both of which are dominated by the estimate for the South Pacific (Table 8).
By comparing the global estimate of N₂ fixation rates with the global estimates of cell-count-based or nifH-based diazotrophic biomass (using geometric means) (Tables 6, 7 and 8) and assuming a molar C : N ratio of 106 : 16 for diazotrophic cells, the turnover time of diazotrophic cellular N due to N₂ fixation is estimated to be 2 days and 92 days, respectively.
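The turnover times quoted above can be checked with a short calculation. The sketch below converts C biomass to N biomass with the molar C : N ratio of 106 : 16 and divides by the global geometric-mean N₂ fixation rate of 62 Tg N yr⁻¹. The atomic masses and the 365-day year are standard values assumed here, and the function name is illustrative.

```python
ATOMIC_MASS_C = 12.011  # g per mol C
ATOMIC_MASS_N = 14.007  # g per mol N
DAYS_PER_YEAR = 365.0   # assumption for the annual conversion

def turnover_days(biomass_tg_c, fixation_tg_n_per_yr, c_to_n_molar=106.0 / 16.0):
    """Turnover time (days) of cellular N given C biomass and N2 fixation."""
    # Mass of C -> moles of C -> moles of N -> mass of N (all in Tg-equivalents)
    biomass_tg_n = biomass_tg_c / ATOMIC_MASS_C / c_to_n_molar * ATOMIC_MASS_N
    return biomass_tg_n / fixation_tg_n_per_yr * DAYS_PER_YEAR

print(round(turnover_days(2.1, 62.0)))   # cell-count-based biomass
print(round(turnover_days(89.0, 62.0)))  # nifH-based biomass
```

With the global values from the text, the two calls give about 2 and 92 days, matching the reported turnover times.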

Limitations
There are a number of limitations with the database and the global estimates. First and foremost, the data points are not uniformly distributed in the world ocean. Historically, scientists have also sought out regions with high diazotrophic abundances and, hence, there is a higher possibility of artificially elevated N₂ fixation activity and diazotrophic biomass relative to true regional means. As the current database does not cover some regions such as the coastal upwelling zones, our estimates could change substantially if values differ within these biogeographic domains.
As mentioned above, it was recently established that the most commonly applied method used to measure N₂ fixation, the ¹⁵N₂ assimilation method (Table 1b), has underestimated the true rates because the ¹⁵N₂ bubbles injected into seawater do not attain equilibrium with the surrounding water (Mohr et al., 2010). Two independent method comparisons conducted in the Pacific and Atlantic Oceans demonstrated that the measured rates are significantly higher when the ¹⁵N₂ tracer is added in a dissolved form to the incubations, resulting in an instantaneous ¹⁵N isotopic equilibrium in the seawater (Großkopf et al., 2012; Wilson et al., 2012). The variability in the underestimation was linked to the composition of the diazotrophic community (Mohr et al., 2010; Großkopf et al., 2012), indicating that it will be difficult to correct the historical rate measurements made with the ¹⁵N₂ method. However, these measurements can be treated as minimum estimates that may increase in the future. These measurements are still valuable when comparing N₂ fixation rates in different regions, as they are approximately log-normally distributed and range over ∼6 orders of magnitude (Fig. 1).
It is also notable that the difference between the geometric and the arithmetic mean N₂ fixation rate is larger in the North Atlantic than in the other basins (Table 6). A strict log-normal distribution is symmetric in log space and its geometric mean is located at the peak of the distribution, i.e., at the highest probability. However, the depth-integrated total N₂ fixation rates in our database, after being binned onto a 3° × 3° grid, are not strictly log-normally distributed in each basin (Fig. 10). The North Atlantic has the highest spatial coverage, and the N₂ fixation rates in this basin span 6 orders of magnitude, from 10⁻³ to 10³ µmol N m⁻² d⁻¹ (Fig. 10). Its distribution is not symmetric, with more values observed on the left (lower) side. Thus, the geometric mean N₂ fixation rate for the North Atlantic is one order of magnitude lower than the peak of the distribution. The arithmetic mean in this basin is more than one order of magnitude higher than the geometric mean. In the North and South Pacific, the N₂ fixation rates span only 2 orders of magnitude, without any data lower than 10 µmol N m⁻² d⁻¹ (Fig. 10). Thus, the difference between the geometric and arithmetic means is smaller in the Pacific than in the North Atlantic. Although the geometric mean N₂ fixation rate in the North Atlantic is about one order of magnitude lower than those in the North and South Pacific, the peaks of the distributions are actually much closer in these basins (Fig. 10; also see Table 6). Compared to the North Atlantic, the Pacific is not intensively sampled for N₂ fixation. Thus, we need more samples with wider spatial coverage from the Pacific to better evaluate the global N₂ fixation rate.
Based on depth-integrated data, the global mean estimate of diazotrophic biomass from cell counts is 1-2 orders of magnitude lower than that from nifH-based abundances (Tables 7 and 8), with both estimates contributed mostly by the dominant taxon, Trichodesmium (Table 5). But considering the log-normal distributions, a difference of one or two orders of magnitude is relatively small compared to the span of the distribution, especially when the number of samples is limited (see Fig. 7 as an example). More samples are needed in order to confidently validate the use of nifH data for estimating diazotrophic abundances and the assumption of a 1 : 1 conversion from nifH copies to cell densities.
The above global estimates are based on default values of the biomass conversion factors. To show the effects of variation in the biomass conversion factors, we have applied the upper and lower bounds of these conversion factors for all diazotrophic subtypes and revised the global diazotrophic biomass estimates following the same procedure described above. The estimated global geometric mean diazotrophic biomass can vary by about ± 70 % from the default estimates, ranging from 0.8 to 3.5 Tg C (default 2.1 Tg C) based on cell count data, or from 28 to 170 Tg C (default 89 Tg C) based on nifH-based data (Table 2).

Conclusions and recommendations for use
The first global database of oceanic diazotrophic measurements has been constructed with sub-databases for N₂ fixation rates, cell counts for diazotrophs and nifH-based abundances. This database provides useful information on the spatial patterns of N₂ fixation and diazotrophic biomass in the world ocean. Depth-integrated values were used to conduct a first-order estimate of the global N₂ fixation rate and diazotrophic biomass. Spatial coverage is the main limitation of the database, as there are still vast oceanic areas where diazotrophic activity and biomass have never been measured. For example, measurements in the Indian Ocean, the South Atlantic and the subtropical southern Pacific are a high priority in this regard. A finer understanding of the range and applicability of fixed biomass conversion factors is another concern. Although careful analyses have been conducted to derive a reliable global biomass, our estimates may vary by ± 70 % depending on the conversion factor. More direct elemental analyses are required to further narrow the range of biomass conversion factors. Relatively higher abundances tend to be found from nifH-based data than from the cell count data. The estimated global mean diazotrophic biomass from nifH-based abundances is 1-2 orders of magnitude higher than that derived from cell counts. However, in the well-sampled region of the North Atlantic, the Trichodesmium abundances estimated from these two types of data match each other better. Further investigations and more data are needed to identify what causes the discrepancies between these two estimates. However, it is clear that microscopic cell counts cannot, at present, account for UCYN-A, which is very abundant in the ocean.
Given the recent findings that most previous N₂ fixation rates were underestimated when the ¹⁵N₂ tracer was added to the incubations as gas bubbles, it can be expected that higher N₂ fixation rates will be reported in the near future using the new method of adding ¹⁵N₂ in a dissolved form. Each data record needs to document the ¹⁵N₂ method that was applied, so that future users can identify which data records are likely underestimated.
The database is stored permanently at PANGAEA and can be easily accessed by users. It will be routinely updated with new measurements. The database can be used to study the level of oceanic N₂ fixation activity and its temporal and spatial variations on local, regional and global scales, as well as for constraining the relative contribution of new nitrogen inputs from N₂ fixation and other sources. The diazotrophic biomass data, along with data from other functional groups, can be used to study phytoplankton community structure. The database can also be used to validate geochemical estimates of N₂ fixation and to parameterize and validate biogeochemical models, keeping in mind that N₂ fixation rate measurements have been underestimating the true N₂ fixation rates.

Figure 2. Histogram of data points on logarithmic scale (blue bars) and the critical values for quality control using Chauvenet's criterion (dashed red lines). Values higher than the critical values are rejected. (a) Trichodesmium cell counts, (b) Trichodesmium N₂ fixation rates, (c) UCYN N₂ fixation rates, (d) whole seawater N₂ fixation rates, and (e) UCYN-B nifH-based abundance. See Supplement Fig. S1 for figures for other types.

Figure 3. Spatial distributions of collected diazotrophic data in number of data points (binned onto a 1° × 1° grid), including cell counts (panel a), N₂ fixation rates (panel b) and nifH-based abundances (panel c). Blue diamonds mark the locations of the data rejected by Chauvenet's criterion.

Figure 4. Latitudinal distribution of the data for (a) cell counts for diazotrophs, (b) N₂ fixation rates, and (c) nifH-based abundances.

Figure 5. Temporal distribution of the data points in year and month for (a)-(b) cell counts for diazotrophs, (c)-(d) N₂ fixation rates, and (e)-(f) nifH-based abundances.

Figure 7. Histogram of depth-integrated (panel a) and volumetric (panel b) non-zero data points of cell-count-based (blue) and nifH-based (red) Trichodesmium abundances in the North Atlantic (10° S-50° N). Each Trichodesmium trichome is assumed to comprise 100 cells. Data values are on logarithmic scale. Those "detected but not quantifiable" nifH-based abundances, which are assigned 10 × 10³ cells m⁻³ (see text for details), are not included.

Figure 10. Histograms of depth-integrated total N₂ fixation rates (after being binned onto a 3° × 3° grid) in the North Atlantic (blue), South Atlantic (green), North Pacific (red) and South Pacific (black). For each basin, the filled diamond and the empty circle mark the location of the geometric mean and the arithmetic mean, respectively.

Table 1a. Summary of data points for cell counts of diazotrophs, including volumetric measurements of Trichodesmium, unicellular cyanobacteria and heterocystous cyanobacteria and their depth-integrals.

Table 1b. Summary of data points for N₂ fixation rates, including volumetric measurements of Trichodesmium, unicellular cyanobacteria and heterocystous cyanobacteria and their depth-integrals.
d Computed from vertical profiles unless marked for those reported by data providers as depth-integrals. e Data are reported by data providers as depth-integrated N₂ fixation rates. f Do not include integrated profiles unless they are reported by data providers as depth-integrated values (as marked by e).

Table 1c. Summary of data points for nifH-based abundances from qPCR assays, including volumetric measurements of Trichodesmium, unicellular cyanobacteria and heterocystous cyanobacteria and their depth-integrals.

Table 2. Estimated default and the upper and lower bounds for the biomass conversion factors, and their impacts on global biomass estimates (based on geometric mean).
a The low/high biomass is calculated when all their subtypes use low/high conversion factors. b Assuming 100 cells trichome⁻¹.
zero-value data points) are approximately log-normally distributed (Figs. 2 and S1), as is typical of many biological and ecological properties that are induced by biological mechanisms

Table 3. Conversion factors estimated for Trichodesmium cells based on size measurements of cultured Woods Hole Trichodesmium species (Hynes et al., 2012). Carbon contents are calculated using the Verity et al. (1992) model.

Table 4. Conversion factors estimated for Richelia and Calothrix trichomes based on biovolume measurements by Foster et al. (2011) and assuming trichome composition of one heterocyst and three, five or ten vegetative cells. Carbon contents are calculated using the Verity et al. (1992) model. Numbers are mean ± standard deviation when applicable.

Table 5. Abundances and estimated biomass (using default conversion factors in Table 2) of each diazotrophic group from cell counts and nifH-based data, shown by geometric and arithmetic mean. Note that data points with zero values are excluded when calculating geometric means. Error ranges for geometric means shown in parentheses are estimated by dividing and multiplying the geometric means by one geometric standard error. Standard errors for arithmetic means are shown in parentheses. Depth-integrated and volumetric data points are analyzed separately.

Table 6. Estimates of N₂ fixation rate for the global oceans based on geometric and arithmetic means. Note that data points with zero values are excluded when calculating geometric means. Depth-integrated N₂ fixation rates are first binned onto a 3° × 3° grid, and geometric and arithmetic means are calculated for each bin, which are then used to calculate geometric and arithmetic means for each region. Areal sum is calculated by multiplying geometric or arithmetic means by ocean area. Error ranges for geometric means shown in parentheses are estimated by dividing and multiplying the geometric means by one geometric standard error. Standard errors for arithmetic means are shown in parentheses. Error range for the global geometric mean is estimated by simply summing the lower and upper bounds of the error ranges of each region. Errors for global arithmetic means are propagated from (independent) errors in each region. ND: no data. NQ: not qualified to calculate areal sum because of too few samples.
Earth Syst. Sci. Data, 4, 47-73, 2012, www.earth-syst-sci-data.net/4/47/2012/

Table 7. Estimates of diazotrophic biomass for the global oceans using cell-count-based data. The same method as in Table 6 is used, except that the estimates for Trichodesmium and heterocystous cyanobacteria (CYN) are calculated separately in the North and South Pacific Ocean. ND: no data. NQ: not qualified to calculate areal sum because of too few samples.

Table 8. Estimates of diazotrophic biomass for the global oceans using nifH-based data. The same method as in Table 6 is used. ND: no data. NQ: not qualified to calculate areal sum because of too few samples.