A metadata template for ocean acidification data

Instruments Data Provenance & Structure


Introduction
Since the start of the Industrial Revolution, human activities have released large amounts of carbon dioxide (CO₂), causing atmospheric CO₂ to increase by over 40 % compared to the highest CO₂ levels in the last 800 000 years (Hönisch et al., 2009). However, only about half of the emitted CO₂ stays in the atmosphere. The remaining portion has been taken up by the ocean and terrestrial systems. Based on the most recent data, from 2004 to 2013, the global oceans have absorbed about 26 % of anthropogenic CO₂ emissions (Le Quéré et al., 2014). While effectively slowing the increase of atmospheric CO₂ and hence global climate change, the oceanic uptake of CO₂ is decreasing seawater pH, carbonate ion (CO₃²⁻) concentration, and carbonate saturation state. The process is commonly called ocean acidification (OA) (Caldeira and Wickett, 2003; Feely et al., 2004; Orr et al., 2005).
Over the past 10 years, OA studies have expanded significantly. Based on the bibliographic database of the European Project on Ocean Acidification (EPOCA), the number of OA publications averaged about 10-20 per year from 1990 to 2005, and then increased sharply to about 270 publications per year by 2011 (Laffoley and Baxter, 2012). Much of this increase was from studies on biological responses to OA (Nisumaa et al., 2010, 2012). For example, publications of this type accounted for over 80 % of all OA papers in 2011 (Laffoley and Baxter, 2012). With the rapid growth in research and the parallel rise in publications, the need for a comprehensive OA metadata template to facilitate archiving and access to these data has been recognized by the international ocean acidification data management community (Hansson et al., 2014).
Metadata is structured information that describes an information resource (e.g., an oceanographic data set), enabling its discovery and access (Guenther and Radebaugh, 2004). For OA studies, a metadata record documents such information as what was measured; by whom; when (temporal coverage); where (geographic coverage); how it was sampled and analyzed; with what instruments and following what protocol; and finally what its units of measure and data quality were (Pesant et al., 2010).
Metadata is critical to data discovery. It enables data sets to be found through relevant criteria, and provides information about the locations of the data sets. Metadata also helps to document information about the data sets, so that they can be understood and utilized beyond the original use. Metadata plays an extremely important role in supporting archiving and preservation of data, facilitating interoperability, and synthesizing legacy data. It serves as the key to ensuring that a data set will be accessible to future researchers (Guenther and Radebaugh, 2004).
While metadata templates for chemical OA data have been available for a long time, the lack of a template for biological response OA data is a significant hindrance to the effective management of these data. Establishing a metadata template is thus fundamental for research into biological responses to OA. We present a metadata template developed in collaboration with many OA researchers. It is applicable to a broad spectrum of OA data sets, including those from studies of biological responses to OA.

Requirements for the OA metadata template
The envisioned OA metadata template responds to the needs expressed by the community to meet three requirements: (1) to enable data discovery, (2) to document information about OA data sets in a consistent manner, and (3) to be broadly applicable to many types of OA studies. These three requirements served as the guiding principles in the development of the OA metadata template.

Data discovery
One of the main functions of metadata is to enable data discovery. During the OA metadata template development, data discovery was emphasized when decisions were made on the selection of elements and their organization.

Documenting data
Another important role of the metadata template is to document information about a data set. The value of a data set increases significantly if it comes with the documentation needed to understand and use it. If such information could be collected, stored, discovered, and accessed in a consistent manner, the data set would be more readily available for improved assessments of marine ecosystem vulnerability and better OA forecasting capabilities.

Broad applicability
The template targets a broad spectrum of OA data sets. Ocean acidification covers a wide range of oceanographic subject areas, including chemical observations, biological monitoring, physiological response experiments, model studies, and paleoceanography studies. If the metadata template can be constructed to apply to many types of OA data sets, the OA data management effort will be much more effective.

Development process
The OA metadata template development involved two steps:  Mize et al., 2011) to generate the ocean acidification ISO metadata template. The ISO 19115-2 standard was chosen to take advantage of such sections as "MI:Acquisition", which allows data providers to capture information about ships and other data collection platforms in a structured, machine-readable format. All of the fields in the original ISO 19115 standard are captured within ISO 19115-2, so there is no potential loss of content in choosing to use this version of the standard.

OA metadata template
The OA metadata template (Ocean Acidification Data Stewardship Team, 2014) consists of three files: (1) a submission form (a Microsoft® Excel spreadsheet) that can be used to prepare the metadata, (2) an instruction file (in Adobe® Portable Document Format) that defines the fields and lists their hierarchical relationships, and (3) an XML file (encoded with the ISO 19115-2 standard) that can be fed into data search portals for OA data discovery.
In the following sections, the main structure of the metadata template, its elements, and their hierarchical relationships will be described.

Variable metadata sections
The term "Variables" (or "Parameters") refers to the observed or derived properties of a study (e.g., temperature, salinity, dissolved oxygen (DO), chlorophyll, and larval survival rate). Hereafter, we will use the word Variable(s), although Parameter(s) is an acceptable synonym for this discussion. Table 1 lists some commonly used variables in OA research. Variables are treated as the focal point of the entire metadata template because we expect them to be the single most important metadata elements that would be used as search terms to locate a data set. Furthermore, all types of OA research generate some kind of variables, regardless of their sampling scheme, experiment setup, or model inputs. Therefore, the treatment of variables as the focal point of the template allows the template to apply to many types of OA data.

Controlled vocabularies for variables
In order to meet the data discovery goal, it is important to maintain controlled vocabularies for the variables (Table 2).
As an example, the variable "dissolved inorganic carbon" is also commonly referred to as total carbon dioxide. If no controlled vocabulary is used, data consumers would have to search the database using all possible variations of the term in order to locate every available data set. Controlled vocabularies, however, would allow data consumers to locate the data sets by using the corresponding term in the controlled vocabulary.
Ideally, controlled vocabularies should be standardized across all OA data centers. The Ocean Acidification International Coordination Centre (OA-ICC) has been leading the efforts in developing controlled vocabularies for OA data documentation. SeaDataNet, the "Pan-European infrastructure for ocean and marine data management", also provides controlled vocabularies for many kinds of broader oceanographic services.
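The search problem described above can be illustrated with a small lookup table. The following is only a sketch: the dictionary contents, the abbreviations "TCO2" and "TALK", and the function name `to_preferred_term` are illustrative assumptions and are not taken from the OA-ICC or SeaDataNet vocabularies.

```python
from typing import Optional

# Hypothetical controlled vocabulary: each preferred term lists the
# free-text variants that should resolve to it (illustrative values only).
CONTROLLED_VOCABULARY = {
    "dissolved inorganic carbon": [
        "dissolved inorganic carbon",
        "total carbon dioxide",
        "DIC",
        "TCO2",
    ],
    "total alkalinity": ["total alkalinity", "TA", "TALK"],
}


def _normalize(term: str) -> str:
    """Lower-case a term and collapse runs of whitespace."""
    return " ".join(term.lower().split())


# Reverse index: every normalized variant points at its preferred term.
SYNONYM_INDEX = {
    _normalize(variant): preferred
    for preferred, variants in CONTROLLED_VOCABULARY.items()
    for variant in variants
}


def to_preferred_term(term: str) -> Optional[str]:
    """Return the controlled-vocabulary term for a free-text name, if known."""
    return SYNONYM_INDEX.get(_normalize(term))


print(to_preferred_term("Total Carbon Dioxide"))
```

With such an index, a data consumer searches on one preferred term, and the mapping from variants is maintained in a single place by the vocabulary owner.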

Key child elements of a variable
The child metadata elements of a variable are organized around the variable itself to form a "variable metadata section" (Table 2). The following fields form the skeleton structure of the variable metadata section: observation type, whether the variable is a manipulation condition or a response variable, and the biological subject on which the variable is studied.
"Observation type" identifies the way a variable was captured in relation to its observational context. It could be a generic term that describes how a variable is collected. For example, for chemical OA studies, the observation type could be "Surface underway", "Time series", or "Profile" (Table 3). For physiological response OA studies, such terms as "Laboratory experiment", "Pelagic mesocosm", "Benthic mesocosm", or "Natural perturbation site study" could be used (Pesant et al., 2010).
A metadata element called "In-situ/manipulation/response" is also added as a child element of a variable. In physiological response OA studies, variables could fall into several categories. For example, carbon-related variables, e.g., pH, partial pressure of carbon dioxide (pCO₂), dissolved inorganic carbon (DIC), and total alkalinity (TA), are often manipulated to simulate future OA conditions, while other variables, e.g., calcification rate, growth rate, and larval survival rate, are monitored to understand the responses of the organisms to OA. The former group of variables can be categorized as "Manipulation condition variables", and the latter can be labeled as "Response variables". To make this metadata element applicable to other types of OA studies, we expanded its scope to cover in-situ observed variables, hence the name "In-situ/manipulation/response".
Table 1. Some commonly used ocean acidification variables, their definitions, and recommended abbreviations.

Variable name [abbreviation or formula]
Definition

Water temperature [Temp]
In situ temperature of the water.

Salinity
Salinity is the salt content of a water sample or body of water. Salt content is measured following the United Nations Educational Scientific and Cultural Organization (UNESCO) standard known as the Practical Salinity Scale (PSS), as the conductivity ratio of a seawater sample to a standard KCl solution. PSS is a ratio and has no units.
[Salinity] Salinity measured from discrete bottle sampling
[CTDSAL] Salinity recorded by the conductivity, temperature, depth (CTD) sensor

Dissolved inorganic carbon
Dissolved inorganic carbon (DIC) is the sum of the concentrations of all inorganic carbon species in the ocean: bicarbonate ion (HCO₃⁻), carbonate ion (CO₃²⁻), and un-ionized carbon dioxide (CO₂). About 90 % of DIC is present as bicarbonate ion; the proportion of carbonate ion is about a factor of 10 less (∼ 10 %), and that of un-ionized carbon dioxide is another factor of 10 less (< 1 %). DIC is often inaccurately called total CO₂, which also includes inorganic CO₂ species bound to organic molecules.

Total alkalinity
Total alkalinity of seawater is the number of moles of hydrogen ions equivalent to the excess of proton acceptors (bases formed from weak acids with a dissociation constant K ≤ 10⁻⁴·⁵ at 25 °C and zero ionic strength) over proton donors (acids with K > 10⁻⁴·⁵) in 1 kg of sample. In the oceans, most of the negatively charged ions present are bicarbonate (HCO₃⁻) and carbonate (CO₃²⁻).

pH [pH]
pH is defined as the negative decimal logarithm of the activity of the hydrogen ion in a solution and is a measure of acidity: pH = −log₁₀[H⁺]. Commonly used pH scales include the free scale, total scale, seawater scale, and National Bureau of Standards (NBS) scale. The free scale includes only the effect of free hydrogen ions. The total scale includes the effect of both free hydrogen ions and hydrogen sulfate ions. The seawater pH scale includes the effect of free hydrogen ions, hydrogen sulfate ions, and fluoride ions. The International Union of Pure and Applied Chemistry (IUPAC) defines a series of buffer solutions across a range of pH values (often denoted with the NBS designation). These solutions have a relatively low ionic strength (∼ 0.1) compared to that of seawater (∼ 0.7), and, as a consequence, are not recommended for use in characterizing the pH of seawater, since the ionic strength differences cause changes in electrode potential.

Partial pressure (or fugacity) of carbon dioxide in water [pCO₂ (or fCO₂) water]
The partial pressure of an ideal gas in a mixture is equal to the pressure it would exert if it occupied the same volume alone at the same temperature. The fugacity (f) of a real gas is an effective pressure which replaces the true mechanical pressure in accurate chemical equilibrium calculations. It is equal to the pressure of an ideal gas that has the same chemical potential as the real gas.

Mole ratio of carbon dioxide in the atmosphere [xCO₂-atm]
The concentration of carbon dioxide in the atmosphere is often recorded as the mole ratio of carbon dioxide molecules to all molecules in the atmosphere. Its unit is parts per million (ppm, or ppmv), unlike the unit of pCO₂ (or fCO₂) water, which is the microatmosphere (µatm).

Dissolved oxygen
The concentration of oxygen dissolved in water.
[Oxygen] Dissolved oxygen measured by chemical analysis (e.g., Winkler titration)
[CTDOXY] Dissolved oxygen measured by optical probes on a CTD rosette

Chlorophyll [CHL]
Chlorophyll is a green pigment found in cyanobacteria and the chloroplasts of algae and plants. All oxygenic photosynthetic organisms use chlorophyll a, but differ in accessory pigments like chlorophyll b.

Chromophoric (or colored) dissolved organic matter [CDOM]
Chromophoric or colored dissolved organic matter is the optically measurable component of the dissolved organic matter in water. It occurs naturally in aquatic environments primarily as a result of tannins released from decaying detritus.

Earth Syst. Sci. Data, 7, 117-125, 2015 www.earth-syst-sci-data.net/7/117/2015/

Photosynthetically active radiation
Photosynthetically active radiation is a measurement of the spectral range (wave band) of solar radiation from 400 to 700 nm that photosynthetic organisms are able to use in the process of photosynthesis.

Nitrate [Nitrate]
In seawater, inorganic nitrogen is found in several different forms, including ammonia/ammonium (NH₃/NH₄⁺), nitrate (NO₃⁻), and nitrite (NO₂⁻). The nitrate ion is the conjugate base of nitric acid (HNO₃), with a molecular formula of NO₃⁻. It is the principal form of fixed dissolved inorganic nitrogen assimilated by organisms.

Nitrite [Nitrite]
The nitrite ion (NO₂⁻) is a symmetric anion with equal N-O bond lengths. Upon protonation, the unstable weak acid nitrous acid (HNO₂) is produced.
In colorimetric analysis, nitrate is often quantitatively reduced to nitrite, so that the reduced nitrate and the nitrite originally present in seawater are determined together. The sum of the nitrite (NO₂⁻) and nitrate (NO₃⁻) concentrations is often reported when this analytical method is used.

Ammonia/ammonium [Ammonia/ammonium]
Ammonia and ammonium form an acid/base pair. Ammonia is a compound of nitrogen and hydrogen with the formula NH₃. The ammonium cation is a positively charged polyatomic ion with the chemical formula NH₄⁺. It is formed by the protonation of ammonia (NH₃).

Phosphate [Phosphate]
Phosphoric acid is a mineral (inorganic) acid with the chemical formula H₃PO₄. The conjugate base of phosphoric acid is the dihydrogen phosphate ion (H₂PO₄⁻), which in turn has a conjugate base of hydrogen phosphate (HPO₄²⁻), which has a conjugate base of phosphate (PO₄³⁻). Phosphate, a salt of phosphoric acid, is considered the most important phosphorus species that is immediately biologically available in seawater. Phosphate exists predominantly in the form of HPO₄²⁻ in seawater.

Silicate [Silicate]
Silicon exists in seawater usually as silica (SiO₂) or silicates (SiO₄⁴⁻ or SiO₃²⁻). Silicate is commonly used to refer to the sum of silica and silicate. Dissolved silicate concentrations in seawater range from < 1 µmol kg⁻¹ in surface waters to ∼ 180 µmol kg⁻¹ in the deep North Pacific.

Chlorofluorocarbons [CFC]
Chlorofluorocarbons are gases that are synthetic halogenated methanes. They were introduced as industrial coolants in the 1930s and afterward. In oceanography, they are used as tracers of ocean circulation.

Delta carbon-13 [DELC13]
Delta carbon-13 (δ¹³C) is a measure of the ratio of the carbon isotopes ¹³C : ¹²C (carbon-13 : carbon-12) in the sample to that in the reference standard, reported in parts per thousand (per mil, ‰).

Delta carbon-14 [DELC14]
Delta carbon-14 (Δ¹⁴C) is a measure of the ratio of the carbon isotopes ¹⁴C : ¹²C (carbon-14 : carbon-12) in the sample to that in the reference standard, reported in parts per thousand (per mil, ‰). Δ¹⁴C represents the "normalized" value of δ¹⁴C. Normalized means that the effect of fractionation is removed.

Delta oxygen-18 [DELO18]
Delta oxygen-18 (δ¹⁸O) is a measure of the ratio of the oxygen isotopes ¹⁸O : ¹⁶O (oxygen-18 : oxygen-16) in the sample to that in the reference standard.

Delta nitrogen-15 [DELN15]
Delta nitrogen-15 (δ¹⁵N) is a measure of the ratio of the nitrogen isotopes ¹⁵N : ¹⁴N (nitrogen-15 : nitrogen-14) in the sample to that in the reference standard.

In biological studies, many of the measured variables are attached to a specific organism or a biological community. For example, the variable "Larval survival rate" is not detailed enough without mentioning the species on which the larval survival rate was studied. An element called "Biological subject" is where users identify the organism or biological community to which the observation applies.

Additional child elements of a variable
The four metadata elements discussed above (variable name, observation type, in-situ/manipulation/response, and biological subject) form the skeleton structure of any "variable metadata section". They are also the main discovery metadata elements that would be used to locate data in a data search portal. However, additional metadata elements are needed to make the variable independently understandable.
"Variable abbreviation" documents the abbreviation or formula of a variable in the data files. "Full variable name" spells out the detailed descriptive name of the variable. For manipulation condition variables, their manipulation methods, e.g., bubbling carbon dioxide or adding acid or base to the solution, can be recorded in a field called "Manipulation method". "Units" should be reported in accordance with the National Institute of Standards and Technology (NIST) International System of Units (SI). In addition, whether a variable is "Measured or calculated", and its "Calculation method and parameters", are also important pieces of information to document (Table 2).
Instrumentation is split into two categories: "Sampling instruments" and "Analyzing instruments". A common mistake in data management practices is to treat sampling and analyzing instruments as interchangeable. For example, if a researcher measured dissolved oxygen (DO) using an oxygen sensor attached to a conductivity, temperature, depth (CTD) rosette, some people may document the instrument of the study as the CTD rosette, while others may think the instrument is the oxygen sensor. Both should be recorded: the CTD rosette as the sampling instrument and the oxygen sensor as the analyzing instrument. Instruments that are used to collect water samples or deploy sensors are here defined as sampling instruments. Examples of sampling instruments include a CTD rosette, a Niskin bottle, and a flow-through pump onboard a research vessel. The term "Analyzing instruments", however, refers to instruments that are used to analyze water samples collected with the sampling instruments, or sensors that are mounted on the sampling instruments to measure some variables of the water. For example, the analyzing instrument for a pH measurement could be a glass electrode coupled with a pH meter, or a spectrophotometer. In addition, we created a free-text field called "Detailed sampling and analyzing information" to allow users to capture additional details of their sampling and analyzing procedures beyond what instruments are used.
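One possible in-memory representation of a variable metadata section is sketched below. It is an illustration only: the class name, attribute names, and example records are assumptions made here for clarity, and they do not reproduce the field names of the submission form or the ISO 19115-2 encoding.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VariableSection:
    """One variable metadata section; attribute names are illustrative."""
    # Skeleton (discovery) elements
    variable_name: str
    observation_type: str
    insitu_manipulation_response: str
    biological_subject: Optional[str] = None
    # A few of the additional child elements
    units: Optional[str] = None
    sampling_instrument: Optional[str] = None
    analyzing_instrument: Optional[str] = None


def find_sections(sections: List[VariableSection], **criteria: str) -> List[VariableSection]:
    """Return the sections whose attributes equal every given criterion."""
    return [
        section for section in sections
        if all(getattr(section, key) == value for key, value in criteria.items())
    ]


# Hypothetical records, loosely following the examples in the text.
records = [
    VariableSection("Dissolved oxygen", "Profile", "In-situ observation",
                    units="umol/kg", sampling_instrument="CTD rosette",
                    analyzing_instrument="Oxygen sensor"),
    VariableSection("pH", "Laboratory experiment", "Manipulation condition variable",
                    analyzing_instrument="Spectrophotometer"),
    VariableSection("Larval survival rate", "Laboratory experiment", "Response variable",
                    biological_subject="Example species", units="%"),
]

responses = find_sections(records, insitu_manipulation_response="Response variable")
print([section.variable_name for section in responses])
```

Keeping the sampling and analyzing instruments as separate attributes makes the CTD rosette versus oxygen sensor distinction explicit in every record.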
Several elements that elaborate on the sample size and data quality are also included in the template. "Uncertainty" is an open-text field that allows users to document information about the data quality of the variable. Input to this field could be the standard deviation of the measurements (e.g., 1 %, 2 µmol kg⁻¹), instrument error logs, or other information related to the quality control of the variable. The data quality flag conventions/rules of the variables can be stored in the field "Data quality flag description". Input to this field includes the name of the data quality flag convention or descriptions of the data quality flag rules, e.g., "2-good quality data", "3-questionable data", and "4-bad data".
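As a brief illustration of why the flag description matters to a data consumer, the sketch below filters measurements by quality flag. The three flag meanings are the ones quoted above; the function name `keep_flagged` and the sample values are hypothetical, and other flag conventions may define additional codes.

```python
# Flag meanings quoted in the text; a real convention may define more codes.
FLAG_DESCRIPTIONS = {
    2: "good quality data",
    3: "questionable data",
    4: "bad data",
}


def keep_flagged(values, flags, accepted=(2,)):
    """Return the values whose quality flag is one of `accepted`."""
    if len(values) != len(flags):
        raise ValueError("values and flags must have the same length")
    return [value for value, flag in zip(values, flags) if flag in accepted]


# Hypothetical pH measurements with their quality flags.
ph_values = [8.05, 7.98, 9.99]
ph_flags = [2, 3, 4]

print(keep_flagged(ph_values, ph_flags))
print(keep_flagged(ph_values, ph_flags, accepted=(2, 3)))
```

Without the "Data quality flag description" field, a consumer would have no reliable way to know which codes to pass as `accepted`.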
In addition to "Biological subject", a field called "Species identification code" is added to document the standard species IDs, if such information is available (Table 2). Using the reference databases from the Integrated Taxonomic Information System (http://www.itis.gov/) or the World Register of Marine Species (http://marinespecies.org/) is recommended, although use of any other reference database is allowed. The name of the reference database can also be recorded in this metadata field. The field "Life stage of the biological subject" documents the development stages of the species. Input to this field can be eggs, larvae, juveniles, adults, etc. The use of the British Oceanographic Data Center (BODC) parameter semantic model biological entity development stage terms (S11) is recommended.
"Method reference" is reserved to hold the bibliographic citation information of the method used to measure the variable. Most OA studies involve collaboration among multiple scientists. Therefore, it is important to document the researchers' information for each variable, to give them credit for their sampling and analyzing efforts and to allow traceability for future questions. For the sake of brevity, only researchers' names and their affiliated institutions are recorded.

A study of the OA effect on the shell calcification rate of a species in an indoor aquarium, by bubbling CO₂ to raise the acidity of the water.

Pelagic mesocosm
A mesocosm study has the advantage over standard laboratory experiments in that it maintains a natural community under close to natural, self-sustaining conditions, taking into account relevant aspects from "the real world" such as indirect effects, biological compensation and recovery, and ecosystem resilience (Riebesell et al., 2010). The "pelagic" zone is defined as any water in a sea or a lake that is neither close to the bottom nor near the shore. (none)

Benthic mesocosm
A mesocosm study has the advantage over standard laboratory experiments in that it maintains a natural community under close to natural, self-sustaining conditions, taking into account relevant aspects from "the real world" such as indirect effects, biological compensation and recovery, and ecosystem resilience (Riebesell et al., 2010). The "benthic" zone is the ecological region at the lowest level of a body of water such as an ocean or a lake. (none)

Natural perturbation site study
The natural perturbation site study is based on looking directly at how organisms, communities, and ecosystems react to high/low pH and carbonate saturation state in the real world, replete with all its biodiversity, ecosystem interactions, and adaptation to the ambient chemistry. (none)

Model output
Generation of data from numerical models. (none)

Investigators
"Investigators" information is needed to credit researchers for their overall data collection and analysis efforts. In addition to the basic address and contact information, we recommend the use of personal identifiers (e.g., ORCID, ResearcherID) to unambiguously define the investigator, and recommend using a controlled vocabulary for organizations as well.

Temporal and spatial coverage
Temporal and spatial information provide important data constraints. The spatial coverage could be as broad as an ocean basin, such as the North Pacific Ocean, or as specific as a local river or bay. For biological studies, the bounding box coordinates and geographic names are used to document the location of the water collection. An additional field called "Location of organism collection" is used to document where the organisms were collected.

Platforms and sampling IDs
"Platforms" often refer to the research vessels that carry out the research. However, platforms could be something other than a ship (e.g., glider, Argo, satellite) or something that is fixed (e.g., moored buoys, towers). "Expedition code (EXPOCODE)" consists of the four-character International Council for the Exploration of the Sea (ICES) platform code and the sailing date in the YYYYMMDD format. For a list of ICES platform codes, please refer to http://vocab.ices.dk. "Section ID" is the identification number for a research cruise section or leg. It was commonly used during the World Ocean Circulation Experiment (WOCE) studies, which often had many repeating cruises on a single section, e.g., A16N. "Cruise ID" is the particular ship cruise number (e.g., MT901), or other alias for the cruise. For example, a cruise ID (e.g., A16N_2013) could consist of a section ID (e.g., A16N) and the sampling year (e.g., 2013).
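The two identifiers described above are simple concatenations, which the following sketch makes concrete. The function names are hypothetical, the platform code "33RO" and the sailing date are used purely as illustrative inputs, and the check that the platform code is four alphanumeric characters is an assumption of this sketch rather than a rule stated by the template.

```python
from datetime import date


def make_expocode(platform_code: str, sailing_date: date) -> str:
    """Join a platform code and a sailing date (YYYYMMDD) into an EXPOCODE."""
    code = platform_code.strip().upper()
    # Assumption: platform codes are four alphanumeric characters.
    if len(code) != 4 or not code.isalnum():
        raise ValueError(f"unexpected platform code: {platform_code!r}")
    return f"{code}{sailing_date:%Y%m%d}"


def make_cruise_id(section_id: str, sampling_year: int) -> str:
    """Build a cruise ID of the form SECTION_YEAR, e.g. A16N_2013."""
    return f"{section_id}_{sampling_year}"


# Illustrative inputs only.
print(make_expocode("33RO", date(2013, 8, 3)))
print(make_cruise_id("A16N", 2013))
```

Generating the codes from their parts, rather than typing them by hand, keeps the date format consistent across metadata records.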
Citation, references, and supplementary information
"Citation" refers to bibliographic information about how the data set should be cited. When working with a data center to publish their data, data producers may only need to provide the author list to complete the citation, as the other portions of the citation are typically the responsibility of the data publisher. When compiling an author list, we recommend using the format of Lastname1, Firstname1 Middlename1; Lastname2, Firstname2 Middlename2; . . . For data centers, we recommend the use of the styles described in the FORCE11 Joint Declaration of Data Citation Principles (https://www.force11.org/datacitation) to organize data citations. "References" are bibliographic citations of publications, e.g., papers and cruise reports, that describe the data set. Researchers often submit their data to an archive after their work has been published. It is important to share related publications to help future users better understand the data set. The "Supplementary information" field is reserved for any information critical to understanding the data set that does not fit into any other existing fields.
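The recommended author list format can be produced mechanically from structured names. The sketch below assumes each author is supplied as a (last name, first name, middle name) tuple, which is a choice made here for illustration; the names used are placeholders, not real authors.

```python
def format_author_list(authors):
    """Format (last, first, middle) tuples as 'Last, First Middle; ...'.

    An empty middle name is simply omitted.
    """
    formatted = []
    for last, first, middle in authors:
        given = " ".join(part for part in (first, middle) if part)
        formatted.append(f"{last}, {given}")
    return "; ".join(formatted)


# Placeholder names for illustration only.
authors = [("Lastname1", "Firstname1", "Middlename1"),
           ("Lastname2", "Firstname2", "")]
print(format_author_list(authors))
```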

Conclusions
The creation of a common metadata template to manage biological response OA data sets is a major effort by the OA research and data management community. We described a metadata template that applies to many types of OA data sets, including chemical OA data sets and those describing the biological responses to OA. In addition to serving OA data management efforts effectively, the template can be used by the OA research community for documenting their OA data sets, sharing data sets among researchers, and submitting data sets to data centers. The metadata template files are stored at the National Oceanic and Atmospheric Administration institutional repository with a digital object identifier (DOI) of 10.7289/V5C24TCK. The metadata development approach documented here can benefit other scientific data management programs in terms of metadata template development.
Derek Manzello, Renee Carlton, Jessica Morgan (NOAA), and David Kline (Scripps Institution of Oceanography) for their comments on the template. Rob Ragsdale (US Integrated Ocean Observing System), Emilio Mayoga (University of Washington), and Sara Haines (University of North Carolina) contributed to the development of definitions for some of the variables. We are indebted to Dean Perry and Dylan Redman (NOAA Northeast Fisheries Science Center) for allowing us to use their metadata records as a real-world example of our metadata submission form.

Table 2. Variable metadata section, with child metadata elements organized around the variable/parameter.

Table 3. Commonly used observation types of a variable in ocean acidification studies.

Bounding box coordinates refer to the rectangular box whose edges are defined by the northernmost and southernmost latitudes, and the westernmost and easternmost longitudes. "Geographic names" are the names of the seas or water bodies where the sampling takes place.
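The bounding box definition above amounts to taking the extremes of the sampling positions. The following is a minimal sketch under stated assumptions: positions are (latitude, longitude) pairs in decimal degrees, the station coordinates are made up for illustration, and data sets that cross the 180° meridian are not handled.

```python
def bounding_box(positions):
    """Return the bounding box of (latitude, longitude) pairs in decimal degrees.

    Assumes the data set does not cross the 180 degree meridian.
    """
    positions = list(positions)
    if not positions:
        raise ValueError("at least one position is required")
    latitudes = [lat for lat, _ in positions]
    longitudes = [lon for _, lon in positions]
    return {
        "north": max(latitudes),
        "south": min(latitudes),
        "west": min(longitudes),
        "east": max(longitudes),
    }


# Hypothetical station positions.
stations = [(36.5, -25.0), (45.0, -20.5), (40.2, -29.3)]
print(bounding_box(stations))
```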