A global viral oceanography database (gVOD)

Virioplankton are a key component of the marine biosphere, maintaining the diversity of microorganisms and stabilizing ecosystems. They also contribute greatly to nutrient cycling by releasing organic matter after lysis of their hosts. In this study, we constructed the first global viral oceanography database (gVOD) by collecting 10 931 viral abundance (VA) data and 727 viral production (VP) data, along with host and relevant oceanographic parameters when available. Most VA data were obtained in the North Atlantic (32 %) and North Pacific (29 %) oceans, while the southeast Pacific and Indian oceans were quite undersampled. The mean VA in the global ocean was 1.17(±3.31) × 10⁷ particles mL⁻¹. The lytic and lysogenic VP in the global ocean was 9.87(±24.16) × 10⁵ and 2.53(±8.64) × 10⁵ particles mL⁻¹ h⁻¹, respectively. Average VA in coastal oceans was higher than that in surface open oceans (3.61(±6.30) × 10⁷ versus 0.73(±1.24) × 10⁷ particles mL⁻¹), while average VP in coastal and surface open oceans was similar. Vertically, VA, lytic VP and lysogenic VP decreased from surface to deep oceans by about 1 order of magnitude. The total number of viruses in the global ocean estimated by bin-averaging and the random forest method was 1.56 × 10³⁰ and 1.49 × 10³⁰ particles, leading to estimates of global ocean viral biomass of 35.9 and 34.4 Tg C, respectively. We expect that the gVOD will be a fundamental and useful database for laboratory, field and modeling studies in marine ecology and biogeochemistry. The full gVOD database (Xie et al., 2020) is stored in PANGAEA (https://doi.org/10.1594/PANGAEA.915758).


Introduction
Virioplankton are the most abundant biological entities and one of the largest genetic reservoirs in the ocean (Breitbart, 2012; Fuhrman, 1999). With an estimated ∼10²³ marine microbes being infected every second, viruses play important roles in affecting microbial mortality, regulating community composition and impacting biogeochemical cycles (Suttle, 2005; Zhang et al., 2007). Viruses were estimated to kill ∼20 % to 40 % of marine bacterioplankton every day, a rate similar to that caused by zooplankton grazing (Fuhrman, 1999). In particular, virus-mediated cell lysis effectively "shunts" approximately 25 % of the photosynthetically fixed carbon, which otherwise would be transferred to higher trophic levels, to the dissolved organic matter (DOM) pool, partly forming the basis of the microbial loop and leading to the recycling of nutrients (Suttle, 2007; Wilhelm and Suttle, 1999). Furthermore, viral lysis can contribute to the biological pump through the release of sticky lysates that accelerate the aggregation and sinking of carbon into the deep sea (Suttle, 2005).
Compiling observations of viral abundance and activity in the global ocean is necessary and urgent for understanding the spatiotemporal distributions of viruses, exploring the factors that control viral processes, qualitatively and quantitatively assessing virus-host interactions and viral functioning in marine ecosystems, and improving the predictions of large-scale marine ecosystem and Earth system models. Two previous studies (Bar-On and Milo, 2019; Wigington et al., 2016) summarized viral abundance data in the ocean and estimated viral biomass as well as the virus-to-prokaryote ratio. However, the lack of host parameters, such as bacterial production, and oceanographic parameters, such as temperature, salinity and nutrient concentrations, limits the use of these datasets in broader oceanographic contexts. More importantly, there is no public database of viral activity in the global ocean, which substantially hinders our understanding of the ecological and biogeochemical roles of virioplankton on a global scale. In addition, the ecological functions of viruses are tightly linked to their life strategies, mainly lytic and lysogenic infection (Wommack and Colwell, 2000). The significance of viruses in oceanic biogeochemistry is mainly reflected through lytic infection, which results in cell lysis and the release of DOM. In contrast, temperate viruses that adopt lysogenic infection can influence microbial diversity and metabolism by transferring new genes to their hosts and altering the expression of host genes, without killing their hosts for many generations until an environmental or cellular trigger causes them to enter the lytic cycle. Lysogenic infection hence serves as a molecular time bomb (Paul, 2008). Therefore, it is necessary to include quantitative and qualitative data on viral life strategies in a viral oceanographic database.
In this study, we construct the first global viral oceanography database, namely gVOD, by collecting data on viral abundance (VA), lytic viral production (VP) and lysogenic VP, as well as other related viral, host and oceanographic metadata when available. Based on the database, we estimate the total viral number and biomass in the global ocean. In addition, we compare VA and VP data generated with different techniques to provide references for evaluating possible technical biases.

Database summary
In the gVOD, direct measurements of three core parameters (VA, lytic VP and lysogenic VP), as well as accessory viral, prokaryotic and oceanographic parameters when available, were collected from published papers or acquired from lead authors or principal investigators (Table 1). Sampling information including date, latitude, longitude, depth and methods was included for each data record. We used an ocean depth shallower or deeper than 200 m as the criterion to identify coastal or open-ocean samples. The open-ocean samples were further separated into surface and deep samples, collected at 0-200 m and > 200 m, respectively.
The quality-controlled database consists of 10 931 VA data points (Table A1), 608 lytic VP data points and 119 lysogenic VP data points (Table A2). Most VA (99.2 %) and lytic VP (98.4 %) data and all lysogenic VP data have accompanying prokaryotic abundance data (Table 1). For some samples, the abundances of flagellates, picoeukaryotes, Synechococcus and Prochlorococcus are also available. Prokaryotic productivity measurements cover 22.1 % of VA, 57.7 % of lytic VP and 76.5 % of lysogenic VP data. The most widely available environmental parameters are salinity and temperature, providing oceanographic information for about half of VA, two-thirds of lytic VP and nearly all lysogenic VP data. Oxygen and chlorophyll a concentration data are also well represented, particularly for VA. The concentrations of different nutrients, including nitrate, silicate and phosphate, are available for many samples. Other environmental parameters (pH, light intensity, dissolved organic carbon concentration) are relatively scarce. Moreover, because the frequency of virally infected cells is often calculated, independently or together with VP, to quantify the impact of viral infection within the microbial community (Chen et al., 2019; Payet and Suttle, 2013; Weinbauer et al., 2003), the frequencies of lytic infection (n = 438) and lysogenic infection (n = 266) reported in the literature were also collected into the database to facilitate future exploration of marine viral activities. Lastly, we collected 83 viral decay rate data, 206 viral burst size data and 111 virus-mediated mortality data, which can be useful for certain studies. The gVOD is a compilation of all the data available, to the best of our knowledge, up to 2019. We plan to update the database every 5 years.
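The depth criterion described above can be sketched as a small classifier. This is a minimal illustration in Python and not the authors' code: the function name `classify_sample` and its arguments are hypothetical, "ocean depth" is assumed to mean the bottom depth at the station, and the handling of a bottom depth of exactly 200 m is an assumption because the text does not specify it.

```python
def classify_sample(bottom_depth_m: float, sample_depth_m: float) -> str:
    """Assign a sample to one of the three categories used in the text."""
    if bottom_depth_m < 0 or sample_depth_m < 0:
        raise ValueError("depths must be non-negative")
    # Assumption: "ocean depth" is the water-column (bottom) depth at the station,
    # and a station at exactly 200 m is treated as coastal.
    if bottom_depth_m <= 200:
        return "coastal"
    # Open-ocean samples: surface (0-200 m) versus deep (> 200 m).
    if sample_depth_m <= 200:
        return "surface open ocean"
    return "deep open ocean"


print(classify_sample(4000, 1500))
```

A station over a 4000 m water column sampled at 1500 m is therefore labeled as a deep open-ocean sample.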

Viral abundance
The viral abundance in this database was counted using one of the following three methods. In the first method, viruses were harvested by ultracentrifugation onto copper grids, stained with uranyl acetate and then enumerated using transmission electron microscopy (TEM) (Akaike, 1974). In the second method, viruses were collected onto 0.02 µm filters, stained with a nucleic-acid-specific fluorescent dye (e.g., SYBR Green I) and then counted under an epifluorescence microscope (EFM) (Noble and Fuhrman, 1998). In the third method, viruses were stained with a fluorescent dye (e.g., SYBR Green I or SYBR Gold), counted using a flow cytometer (FCM) and identified on the basis of the green fluorescence versus side scatter signal (Brussaard, 2004; Marie et al., 1999). The details of these three approaches have been described elsewhere (Weinbauer, 2004).

Lytic viral activity
Lytic VP is a key measure, widely used to assess the activity of lytic viruses at the community level and the roles of viruses in marine ecosystems. In this database, lytic VP was estimated by one of the following five methods. The first method estimated VP as the expected viral release rate, calculated by multiplying the fraction of virally infected cells (mainly prokaryotes), prokaryotic productivity (assuming an equal prokaryotic mortality rate) and burst size obtained from TEM studies (Noble and Steward, 2001) or (Weinbauer et al., 2002). For notational simplicity, in this paper we label this method FPB after the three variables used in the estimation (fraction of virally infected cells, prokaryotic productivity and burst size). In the second method, called the radioactive incorporation approach (RIA), lytic VP was estimated by determining viral DNA synthesis rates using a labeled radiotracer (e.g., ³H-, ³²P- or ¹⁴C-labeled thymidine or leucine) and a conversion factor to quantify the radiotracer incorporated into viral particles (Noble and Steward, 2001; Zimina et al., 1973). The third method estimated lytic VP from the viral decay rates (VDRs), assuming that the abundance of virus particles is in a steady state, so that the loss rate of virus particles is balanced by the production rate (Heldal and Bratbak, 1991). The fourth method used fluorescently labeled viral tracers (FLVTs) to measure the dilution rates from the decay of labeled viruses and net changes in the non-labeled viruses in a natural viral community (Noble and Fuhrman, 2000). The fifth method quantified the increase in viral abundance during a time course incubation using a virus dilution or virus reduction approach (VRA) (Winget et al., 2005), which avoids new viral infection by reducing viral abundance using pore-size filters or tangential flow filtration systems.
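Two of the estimates described above reduce to simple arithmetic and can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the procedure used in any of the cited studies: the function names and example values are hypothetical, the FPB function simply multiplies the three variables named in the text, and the VRA function assumes that production is taken as the least-squares slope of viral abundance against incubation time, without any correction for the reduced host abundance after filtration.

```python
def lytic_vp_fpb(frac_infected, prok_production, burst_size):
    """FPB estimate: infected fraction x prokaryotic production x burst size.

    prok_production is in cells mL^-1 h^-1, so the result is in
    viral particles mL^-1 h^-1.
    """
    if not 0.0 <= frac_infected <= 1.0:
        raise ValueError("frac_infected must lie between 0 and 1")
    return frac_infected * prok_production * burst_size


def lytic_vp_vra(times_h, va_per_ml):
    """VRA-style estimate: least-squares slope of viral abundance over time.

    Assumption: a single linear fit over the whole incubation.
    """
    n = len(times_h)
    if n != len(va_per_ml) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_t = sum(times_h) / n
    mean_v = sum(va_per_ml) / n
    sxx = sum((t - mean_t) ** 2 for t in times_h)
    if sxx == 0:
        raise ValueError("time points must not all be identical")
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times_h, va_per_ml))
    return sxy / sxx  # particles mL^-1 h^-1


# Hypothetical example values, for illustration only.
print(lytic_vp_fpb(0.1, 1.0e4, 25))
print(lytic_vp_vra([0, 3, 6, 9], [1.0e6, 1.3e6, 1.6e6, 1.9e6]))
```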

Lysogenic viral activity
Lysogenic VP is generally measured by detecting the proviruses (temperate viruses) that adopt lysogenic infection in the environment. Lysogenic VP in this database was estimated using the VRA described above after provirus induction by mitomycin C (Weinbauer et al., 2002). Hence, the lysogenic VP was estimated as the difference in viral abundance per unit time between the mitomycin-C-treated and the control samples.
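The difference described above can be written out explicitly. The sketch below is a simplified two-point illustration in Python with hypothetical names and values; it assumes that both incubations run for the same duration and that the net increase in each is measured between a start and an end count. Negative results are returned unchanged, since the text does not say how they are handled.

```python
def lysogenic_vp(ctrl_start, ctrl_end, mitc_start, mitc_end, hours):
    """Excess viral production in the mitomycin-C treatment over the control.

    Abundances are in particles mL^-1, so the result is in
    particles mL^-1 h^-1.
    """
    if hours <= 0:
        raise ValueError("incubation time must be positive")
    control_increase = ctrl_end - ctrl_start
    induced_increase = mitc_end - mitc_start
    return (induced_increase - control_increase) / hours


# Hypothetical incubation: both bottles start at 1.0e6 particles mL^-1.
print(lysogenic_vp(1.0e6, 1.6e6, 1.0e6, 2.2e6, 6.0))
```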

Quality control
We conducted quality control for the VA, lytic VP and lysogenic VP data of the database. One negative lytic VP value (Wells and Deming, 2006) was removed. All zero-value (below detection limit) data were kept in the database but were not included in the following analyses. For the positive-value data, we applied Chauvenet's criterion to their log-transformed values to identify outliers (Glover et al., 2011): a datum was treated as an outlier when its probability of deviating from the observed mean was lower than 1/(2n), where n was the number of data samples. Outliers were marked in the database and not included in the following analyses.
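Chauvenet's criterion as described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation. It assumes that the log-transformed values are treated as normally distributed, that the sample standard deviation is used, that base-10 logarithms are taken and that the test is applied in a single pass; none of these details is stated in the text. The function name `chauvenet_outliers` is hypothetical.

```python
import math
import statistics


def chauvenet_outliers(values):
    """Return a list of booleans flagging outliers among positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("all values must be positive before log-transforming")
    logs = [math.log10(v) for v in values]
    n = len(logs)
    if n < 3:
        return [False] * n  # too few data to judge
    mean = statistics.fmean(logs)
    sd = statistics.stdev(logs)  # sample standard deviation (assumption)
    if sd == 0:
        return [False] * n
    threshold = 1.0 / (2.0 * n)
    flags = []
    for x in logs:
        z = abs(x - mean) / sd
        # Two-sided probability of a deviation at least this large
        # under a normal distribution.
        p = math.erfc(z / math.sqrt(2.0))
        flags.append(p < threshold)
    return flags


# Hypothetical abundances (particles mL^-1) with one extreme value at the end.
sample = [1e6, 2e6, 1.5e6, 3e6, 2.5e6, 1.2e6, 1.8e6, 2.2e6, 1.4e6, 2.8e6, 1e12]
print(chauvenet_outliers(sample))
```

In this example only the last value is flagged; the ten values between 10⁶ and 3 × 10⁶ are retained.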

Total number and biomass of viruses in the global ocean
Based on the VA data of our database, we estimated the total number of viruses in the global ocean using two methods. The first method was bin-averaging of the VA data. The second method used the random forest (MATLAB machine-learning toolbox) (Breiman, 2001) to construct a model of VA based on sampling latitude, longitude, month and depth. VA data were binned to 1° × 1° with 44 vertical layers, and the mean VA of each bin, if data were available, was fed into the random forest. When implementing the random forest, 75 % of the samples were randomly selected for training the model, while the remaining data were used for model validation. The trained model was then used to predict VA for each bin and then to estimate the total viral number in the global ocean.
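The binning and the 75 %/25 % split described above can be sketched as follows. This Python sketch is illustrative only and is not the authors' MATLAB code. The function names, the layer edges and the random seed are hypothetical (the 44 layer boundaries are not given in the text), and the random forest itself is not reproduced here.

```python
import math
import random
from bisect import bisect_right
from collections import defaultdict


def bin_mean_va(records, layer_edges_m):
    """Average VA in 1 x 1 degree cells and depth layers.

    records: iterable of (lat, lon, depth_m, va).
    layer_edges_m: ascending layer boundaries; layer i spans
    [edge i, edge i+1). Returns {(lat_idx, lon_idx, layer_idx): mean VA}.
    """
    sums = defaultdict(lambda: [0.0, 0])
    for lat, lon, depth_m, va in records:
        layer = bisect_right(layer_edges_m, depth_m) - 1
        if layer < 0 or layer >= len(layer_edges_m) - 1:
            continue  # sample lies outside the assumed layer range
        key = (math.floor(lat), math.floor(lon), layer)
        sums[key][0] += va
        sums[key][1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


def split_bins(keys, train_fraction=0.75, seed=0):
    """Randomly split bin keys into training and validation subsets."""
    keys = sorted(keys)
    random.Random(seed).shuffle(keys)  # the seed is an arbitrary choice
    n_train = round(train_fraction * len(keys))
    return keys[:n_train], keys[n_train:]


# Hypothetical records and layer edges, for illustration only.
records = [
    (10.2, 120.7, 5, 1e7),
    (10.9, 120.1, 8, 3e7),
    (10.5, 121.3, 5, 2e6),
    (10.5, 120.5, 500, 1e6),
]
means = bin_mean_va(records, [0, 10, 200, 1000])
train_keys, validation_keys = split_bins(means.keys())
print(means)
```

The first two records fall in the same cell and layer and are averaged; the other two each form their own bin.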
The viral biomass of the global ocean was calculated from the virus numbers using a conversion factor of 0.023 fg C per viral particle, which was based on an empirical relationship between the carbon content of the heads of marine viruses (C_head) and their size (Jover et al., 2014), where the radius of the viral head, r, was set to the average of 26.3 nm from the Tara Oceans expedition data (Brum et al., 2015).
In this paper, all the uncertainties reported in parentheses after the means are standard deviations, except for the estimates of total viral number and biomass of the global ocean, for which the standard errors of the mean are reported. The mean values are used in those estimates, so the uncertainties of the means are of most interest.
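The conversion from viral number to biomass is simple arithmetic and can be checked with a short Python sketch. The head-carbon relationship is not reproduced in the text above; the form used in `head_carbon_atoms` is therefore an assumption that should be checked against Jover et al. (2014). It is included here only because it yields roughly 0.023 fg C per particle at r = 26.3 nm. All function names are hypothetical.

```python
AVOGADRO = 6.02214076e23      # atoms per mole
CARBON_MOLAR_MASS_G = 12.011  # grams per mole


def head_carbon_atoms(r_nm):
    """Carbon atoms in a viral head of radius r_nm.

    Assumed form of the relationship; verify against Jover et al. (2014).
    """
    return 41.0 * (r_nm - 2.5) ** 3 + 130.0 * (
        7.5 * r_nm ** 2 - 18.75 * r_nm + 15.63
    )


def carbon_per_virus_fg(r_nm):
    """Carbon content per viral particle in femtograms."""
    grams = head_carbon_atoms(r_nm) * CARBON_MOLAR_MASS_G / AVOGADRO
    return grams * 1e15


def viral_biomass_tg_c(n_viruses, fg_c_per_virus=0.023):
    """Total viral carbon in Tg C (1 fg = 1e-15 g, 1 Tg = 1e12 g)."""
    return n_viruses * fg_c_per_virus * 1e-15 / 1e12


print(round(carbon_per_virus_fg(26.3), 3))
print(round(viral_biomass_tg_c(1.56e30), 1))
print(round(viral_biomass_tg_c(1.49e30), 1))
```

With the rounded factor of 0.023 fg C per particle, 1.56 × 10³⁰ particles correspond to about 35.9 Tg C. For 1.49 × 10³⁰ particles the same arithmetic gives about 34.3 Tg C, close to the reported 34.4 Tg C; the small difference presumably reflects rounding of the conversion factor.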

Data distribution
Most VA data were collected in the Northern Hemisphere (particularly in tropical and subtropical regions), while fewer data were collected in the Southern Hemisphere (Fig. 1a-c). In total, nearly two-thirds of the VA data were sampled in the North Atlantic Ocean (32 %) and North Pacific Ocean (29 %) (Fig. 2a). In addition, six long-term time series of VA were included in the compilation (Fig. 1a). Most VA data were sampled in the surface ocean (≤ 200 m, 71 %), while fewer data were sampled in the deep ocean (> 200 m, 29 %), particularly below 1000 m (Fig. 3a). VA samples were most abundant in summer and fewest in winter (Fig. 4a).
Lytic VP data are far more numerous in the Northern Hemisphere than in the Southern Hemisphere (Fig. 1d-f), with almost half of the lytic VP data sampled in the North Pacific Ocean (31 %) and North Atlantic Ocean (18 %) (Fig. 2b). A majority of lytic VP data (86 %) were collected in the surface ocean (Fig. 3b), while the deep samples were mostly from the North Atlantic and the western and northeastern Pacific oceans (Fig. 1e). There were seasonal biases in the lytic VP data, most of which were sampled in summer and few in autumn (Fig. 4b). More lytic VP data were sampled in open oceans (63 %) than in coastal waters (37 %) (Fig. 5). Almost every lytic and lysogenic VP datum was accompanied by VA measurements.
Lysogenic VP data were very limited in both surface and deep oceans (Fig. 1g and h), with deep samples even fewer than the already scarce surface ones (Fig. 3c). The Northern Hemisphere had slightly more lysogenic VP data than the Southern Hemisphere (Fig. 1i), with most lysogenic VP data sampled in the North Pacific (29 %), the North Atlantic (24 %) and the South Atlantic (23 %) oceans (Fig. 2c). Similar to lytic VP data, lysogenic VP data tended to be collected in spring and summer rather than in other seasons, particularly winter (Fig. 4c). Lysogenic VP data were also far more numerous in the open ocean (77 %) than in coastal waters (23 %) (Fig. 5).
In summary, most viral data were sampled in the North Atlantic and northeast Pacific oceans (Figs. 1 and 2), and more data were sampled in the surface than in the deep ocean (Fig. 3). Viral data also tended to be sampled in summer (Fig. 4). Although coastal samples were fewer in total than open-ocean samples (Fig. 5), sampling was denser in the coastal zones given their relatively small area in the global ocean.

Viral abundance in the global ocean
In the surface oceans, VA (n = 7768) mostly varied in the order of 10⁶ to 10⁸ particles mL⁻¹, with mean VA in coastal waters (3.61(±6.3) × 10⁷ particles mL⁻¹) about 5 times higher than that in the open oceans (7.3(±12.4) × 10⁶ particles mL⁻¹) (Fig. 6a and b). VA in the coastal South Atlantic Ocean and the Mediterranean and Baltic seas was higher than that in other coastal oceans (Fig. 6a). Although VA across different surface open oceans was distributed in similar ranges, the average VA in the Pacific (particularly in its southern portion) was higher than that in other basins (Fig. 6b), a pattern previously found in another study using fewer data than this study. VA decreased with depth, with VA in the global deep ocean (1.26(±2.44) × 10⁶ particles mL⁻¹, n = 3164) about 1 order of magnitude lower than that in the surface (Fig. 7a and b). The vertical profiles in different open-ocean basins showed more clearly that VA in the Pacific was higher than that in the Atlantic in surface (< 1000 m) waters, while the difference did not exist in deeper oceans (Fig. 7b).
In our database, most VA samples were measured using FCM (7353, 67.26 %) and EFM (3465, 31.71 %), while only 112 (1.03 %) VA samples were counted using TEM (Table A1). Previous studies have shown that VA counted using FCM, which became more popular in studies after 2014 (Table A1), correlated strongly with VA counted using EFM (Brussaard et al., 2010; Marie et al., 1999; Payet and Suttle, 2008). Our data showed that VA obtained by FCM and EFM was consistent in similar environments. For deep open-ocean samples, VA measured using TEM was substantially lower than that measured using the other two methods (Fig. 8). However, because far fewer VA data points were obtained using TEM than using the other methods (Fig. 8 and Table A1), we cannot conclude that TEM substantially underestimated VA in the deep-water samples. Nevertheless, our database provides references for methodological comparison in the future.
The total number of global ocean viruses estimated by binning the VA data (Fig. 7a and b) is 1.56(±0.2) × 10³⁰ particles (mean ± s.e.), which is very close to the estimate of 1.49(±0.14) × 10³⁰ particles (mean ± s.e.) using the random forest method (Fig. A1). Both values are consistent with the previous estimate of 1 × 10³⁰ (Suttle, 2007). Using the conversion factor of 0.023 fg C per viral particle (see Methods), our two values of total viral number give estimates of total viral biomass in the global ocean of 35.9 ± 0.46 and 34.4 ± 0.32 Tg C, respectively, in line with a recent estimate of 30 Tg C (Bar-On and Milo, 2019).

Viral production
In the surface ocean, lytic VP (n = 523) varied greatly from 10³ to 10⁷ particles mL⁻¹ h⁻¹ in different ocean basins (Fig. 6c and d). The overall mean and standard deviation of lytic VP in the global ocean were 9.87(±24.16) × 10⁵ (range: 0.00746 × 10⁵ to 350 × 10⁵) particles mL⁻¹ h⁻¹. Lytic VP values in the surface open Pacific Ocean were about 1 order of magnitude higher than those in the surface open Atlantic Ocean (Fig. 6d), a pattern consistent with VA (Fig. 6b). Lytic VP in the surface Arctic Ocean was much lower than that in other basins, which was expected considering its much lower biological productivity (Fig. 6c and d). Although the lytic VP data in the deep waters (n = 82) were insufficient for meaningful statistical analyses (Fig. 9), the existing data showed a general trend that VP decreased by 1 order of magnitude from the surface to the deep open oceans (Fig. 7c). Unlike VA, average lytic VP in coastal waters was close to that in the surface open ocean (Fig. 6c).
Most of the lytic VP data (84.4 %) in this database were estimated by VRA, suggesting that VRA has been widely used in the literature and has become a standard method to estimate VP across different marine environments. Several studies have compared different approaches for estimating lytic VP, revealing that the VRA method was more reliable and less laborious, compared to the probable overestimation by the FLVT approach and the potential underestimation by the RIA method, though such comparisons were mainly restricted to the coastal ocean (Karuza et al., 2010; Rastelli et al., 2016; Winget et al., 2005). Additionally, although a meaningful comparison of reported lytic VP values between disparate marine ecosystems is complicated by the inherent variability among approaches, the lytic VP rates in this database might provide a tentative global-scale insight into methodological comparison. Our statistics showed that, in similar environments, the lytic VP rates determined by FLVT and VRA were higher than those measured by RIA. For coastal samples, such differences among methods were not obvious (Fig. 9). However, due to the limited number of samples using methods other than VRA (Fig. 9 and Table A2), we did not have adequate data to tell whether the differences in VP were caused by the measurement methods or by the randomness of the samples. Hence, more measurements of lytic VP using multiple approaches simultaneously will be needed to better evaluate the differences among them.
The lysogenic VP data were too few (surface n = 85, deep ocean n = 34) for meaningful comparisons across different ocean basins or between the surface and deep waters, although the results were plotted for readers' reference (Fig. 6e and f). The overall lysogenic VP in the global ocean was estimated at 2.53(±8.64) × 10⁵ (range: 0.00132 × 10⁵ to 68.8 × 10⁵) particles mL⁻¹ h⁻¹, which was about one-third of the level of lytic VP, although more data will be needed to better compare the two types of VP.

Code availability
The MATLAB code for calculating the total number of viruses can be found in the Supplement or obtained from the corresponding authors on request.

Conclusion
We constructed a global ocean viral database (gVOD) by compiling 10 931 VA data, 608 lytic VP data and 119 lysogenic VP data. This database may be useful for global-scale studies of viral processes and their roles in marine ecosystems and biogeochemical cycles. The VA, lytic VP and lysogenic VP data were highly variable. Most VA data were counted using flow cytometers and epifluorescence microscopes, while the virus reduction approach was the most popular method for estimating VP. The lytic VP is about 3 times higher than the lysogenic VP. The calculation using the database is also consistent with previous estimates of viral numbers and biomass in the global ocean.
Our database shows that current investigations are limited in spatiotemporal coverage. The VA dataset has poor coverage in the South Pacific and Indian oceans. The lytic VP dataset does not have good coverage in the South Pacific, northwest Pacific, Indian and South Atlantic oceans. The lysogenic VP data are very few in the global ocean. Vertically, all viral data were sampled much less in mesopelagic and deep oceans than in the surface oceans. Thus, measurements of viral parameters in these regions and depths should be given high priority. In addition, more viral data should be sampled in winter to avoid seasonal biases.
The database is stored in a public data repository (PANGAEA) and will be updated regularly when new data become available. We hope that the database will be valuable for field and modeling studies in marine ecology, biogeochemistry and other areas of oceanography.
Appendix A
Table A1. Sources and methods of viral abundance data. EFM: counted by epifluorescence microscopy; FCM: counted by flow cytometry; TEM: counted by transmission electron microscopy. Data marked by * are those collected in a previous dataset (Wigington et al., 2016).

Author contributions. RZ and YWL conceived and designed the structure of the database and the mathematical analyses of the data. LX, WW, LC, XC and YH collected the data and described the metadata. LX, NJ, RZ and YWL conducted the quality control and analyses of the data. LX, RZ and YWL led the writing of the paper, with contributions from all the co-authors.