MOSAIC (Modern Ocean Sediment Archive and Inventory of Carbon): A (radio)carbon-centric database for seafloor surficial sediments

Mapping the biogeochemical characteristics of surficial ocean sediments is crucial for advancing our understanding of global element cycling, as well as for assessing the potential footprint of environmental change. Despite their importance as long-term repositories for biogenic materials produced in the ocean and delivered from the continents, biogeochemical signatures in ocean sediments remain poorly delineated. Here, we introduce MOSAIC, a (radio)carbon-centric database that seeks to address this information void. The goal of this nascent database is to provide a platform for development of regional to global-scale perspectives on the source, abundance and composition of organic matter in marine surface sediments, and to explore links between spatial variability in these characteristics and biological and depositional processes. The database has a continental margin-centric focus given both the importance and complexity of continental margins as sites of organic matter burial. It places emphasis on radiocarbon as an underutilized yet powerful tracer and chronometer of carbon cycle processes, with a view to complementing radiocarbon databases for other earth system compartments. The database infrastructure and interactive web-application are openly accessible and designed to facilitate further expansion of the database. Examples are presented to illustrate large-scale variabilities in bulk carbon properties that emerge from the present data compilation.

, with tree-rings, corals and other annually-resolved archives providing information on historical variations in ¹⁴C in the atmosphere and surface reservoirs (Friedrich et

In this study, we present MOSAIC (Modern Ocean Sediment Archive and Inventory of Carbon), a database designed to provide a window into the spatial variability in geochemical and sedimentological characteristics of surficial ocean sediments on regional to global scales. MOSAIC represents the starting point of an on-going endeavor to compile data from prior and on-going studies in order to build a comprehensive, continental margin-centric picture of the distribution and characteristics of organic matter accumulating in modern ocean sediments. The database infrastructure has been configured for facile incorporation of new data, for expansion of included parameters, and for retrieval of data in an accessible and citable format. MOSAIC is realized in an interactive web environment that allows users to visualize, select and download data. This infrastructure is built using open-source (or optional open-source) software (SI Table 1). The overarching goal is for MOSAIC to serve as a data platform

https://doi.org/10.5194/essd-2020-199

The focus of MOSAIC is on the coastal ocean (continental margins), with limited inclusion of data from deep ocean settings. Attention is also restricted to surficial sediments (nominally the upper ~1 m) that are most effectively sampled with shallow coring systems designed to recover an intact sediment-water interface (e.g., hydraulically-damped multicorer, box corer). This restriction reflects the focus on processes associated with deposition, early diagenesis, and burial of organic matter, rather than on down-core investigations used for paleoceanographic and paleoclimate reconstruction. Sediment depth profile data are primarily used to examine diagenetic profiles and to constrain sedimentation rates, mixed layer depths and redox gradients, as well as to determine carbon fluxes and inventories.

Scope of data acquisition
The data currently comprising the MOSAIC database were extracted from over two hundred publications. No unpublished data are included in the on-line version, and the focus of the database in this initial phase of implementation is on a suite of commonly measured sediment parameters (e.g., sampling depth, carbon content and δ¹³C) that are available in high abundance. A non-exhaustive list of the most important parameters cataloged in the MOSAIC database can be found in Table 1. A more comprehensive list of parameters that are targeted for inclusion in the near future can be found in the Supplemental Information (SI).

Core parameters
The database was established based on selected key parameters, with a particular emphasis on the radiocarbon content of OC, as well as other basic properties that provide broader geochemical and sedimentological context (Table 1). Geochemical parameters include total organic carbon (TOC) and total nitrogen (TN) content, organic carbon/total nitrogen ratios, and the carbon isotopic composition (δ¹³C and ¹⁴C values) of OC. Sedimentological parameters are yet to be implemented in the on-line version but will include parameters such as grain size, mineral

6.3.10). The relational aspect of the database means that data (e.g., related to sample or location specifics) are stored in data tables which are connected (or related) by a unique identifier. "Normalized" implies that redundancies are eliminated in the structure of the database (e.g., a variable such as water depth occurs only once in the database; Codd, 1990). A schematic of the detailed database structure can be found in SI Figure 2. The database structure contains entries for key geochemical parameters pertaining to ocean sediment core samples, including organic matter content, isotopic signature, and composition, as well as texture and sedimentological parameters. Information can be collected for bulk samples as well as for, e.g., size and density fractions. Furthermore, the database is designed to enable additional modules that can accommodate data related to other sample suites, such as sinking particulate matter from the ocean water column (e.g., time-series sediment traps) or riverine samples. It includes an exclusivity option which can be used to indicate whether data are in the public domain or not (e.g., pending publication of separate contributions).

Reporting conventions are detailed in SI Table 2. Units as specified in the original papers were used (listed in SI). Where possible, ¹⁴C information was collected as Δ¹⁴C; alternatively, it was collected as Fm, and all Δ¹⁴C values were converted to Fm (Stuiver and Polach, 1977). Efforts are underway to further harmonize the data and convert all data to Δ¹⁴C for the next iteration of the MOSAIC database.
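As an illustration of the conversion between the two radiocarbon reporting conventions, the following minimal sketch applies the commonly used relation Δ¹⁴C = (Fm · exp[(1950 − y)/8267] − 1) × 1000 ‰, following the conventions of Stuiver and Polach (1977). The function names and the choice of reference year y (taken here as the year of sample collection) are assumptions made for this example; they are not taken from the MOSAIC code base.

```python
import math

# Mean life of 14C in years, based on the 5730-year half-life (5730 / ln 2).
MEAN_LIFE_YR = 8267.0


def fm_to_delta14c(fm, year):
    """Convert fraction modern (Fm) to Delta14C in per mil.

    `year` is the reference year used for the decay correction
    (assumed here to be the year of sample collection).
    """
    return (fm * math.exp((1950.0 - year) / MEAN_LIFE_YR) - 1.0) * 1000.0


def delta14c_to_fm(delta14c, year):
    """Inverse conversion: Delta14C (per mil) back to fraction modern."""
    return (delta14c / 1000.0 + 1.0) * math.exp((year - 1950.0) / MEAN_LIFE_YR)


# Illustrative values only, not taken from the database.
example_fm = 0.85
example_year = 2010
example_delta = fm_to_delta14c(example_fm, example_year)
print(f"Fm = {example_fm} collected in {example_year}: Delta14C = {example_delta:.1f} per mil")
```

Because the two functions are exact inverses for a given reference year, values can be harmonized in either direction without loss, provided the reference year is recorded with each sample.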

The MOSAIC Pipeline
Data are incorporated into MOSAIC through a five-step pipeline: (1) data ingestion, (2) quality control, (3) transformation and structuring, and (4) addition to a user-friendly MySQL database interface, which is (5) made available to users via a website (Figure 1). This design enables users to query the collected data and to augment and extend the existing database using familiar spreadsheet software (Microsoft Excel®, LibreOffice). The associated app allows any user to interactively select, visualize and query data without using database

Data ingestion
Data are entered into the database by filling in a pre-structured spreadsheet file with set vocabularies. The user selects relevant parameter inputs from drop-down menus that streamline data entry and assist in execution of subsequent SQL queries. Excel files were designed for specific datasets, and within each Excel file there are three sub-tabs corresponding to groups of the normalized MOSAIC SQL database (more details on database structure are provided in the database). These tabs are (i) a sample-related tab, (ii) a geopoint-related tab (i.e., location), and (iii) an author-related tab (i.e., paper). Certain variables pertaining to sample coordinates and depth are required for data submission (i.e., latitude, longitude, water depth and sample core depth). In this first version of MOSAIC, filled-in spreadsheet files with specified units and pre-defined lists can be sent to mosaic@erdw.ethz.ch for ingestion into the database.

The next step involves transforming data (using Python code) from Excel into csv files that are compatible with the normalized relational database structure in SQL. This is done by (i) adding unique identifiers to the data and (ii) transforming the data into appropriate csv files. Importantly for the database structure, unique identifiers are created for each appropriate database table (SI Figure 2). For example, for a specific location, an individual sediment core may yield multiple samples (i.e., core sections corresponding to different depth intervals), with multiple measurements (e.g., ¹³C, ¹⁴C and %TOC) performed on each sample (section). In this example, the location is assigned a unique geopoint location identifier, the core receives a unique identifier, and each sample (section) is given a unique identifier.

A Python script transforms the Excel files designed for facile data ingestion into a form compatible with the normalized database. It does so by automatically creating the compatible csv files, including the unique identifiers for the primary keys. The script can be adapted to a dataset and is provided in the SI. The MOSAIC SQL database allows for a direct upload of csv files following data quality assessment, addition of identifiers and creation of the csv files. At present, a member of the ETH Biogeoscience group is allocated to undertake this task upon receipt of files.

Efforts to collect all water depth information are ongoing; ancillary information will be obtained from the GEBCO bathymetric grid (GEBCO, 2020).
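The identifier assignment described for the transformation step can be sketched as follows. This is a minimal illustration under stated assumptions, not the script provided in the SI: the column names (`latitude`, `longitude`, `water_depth_m`, `core_name`, `sample_depth_cm`, `toc_percent`), the table layout and the example values are all hypothetical. The sketch checks the required fields, assigns one identifier per location, per core and per sample, stores water depth only once (with the location), and serializes each table to csv.

```python
import csv
import io

# Hypothetical column names for the fields required at submission.
REQUIRED = ("latitude", "longitude", "water_depth_m", "sample_depth_cm")


def normalize(rows):
    """Split flat spreadsheet rows into geopoint, core and sample tables."""
    geopoints, cores, samples = {}, {}, []
    for row in rows:
        missing = [k for k in REQUIRED if row.get(k) in (None, "")]
        if missing:
            raise ValueError(f"missing required fields: {missing}")
        # One geopoint per unique coordinate pair; water depth is stored here only.
        geo_key = (row["latitude"], row["longitude"])
        if geo_key not in geopoints:
            geopoints[geo_key] = {
                "geopoint_id": len(geopoints) + 1,
                "latitude": row["latitude"],
                "longitude": row["longitude"],
                "water_depth_m": row["water_depth_m"],
            }
        geo_id = geopoints[geo_key]["geopoint_id"]
        # One core per (location, core name) pair.
        core_key = (geo_id, row["core_name"])
        if core_key not in cores:
            cores[core_key] = {
                "core_id": len(cores) + 1,
                "geopoint_id": geo_id,
                "core_name": row["core_name"],
            }
        # Every row is one sample (core section) linked to its core.
        samples.append({
            "sample_id": len(samples) + 1,
            "core_id": cores[core_key]["core_id"],
            "sample_depth_cm": row["sample_depth_cm"],
            "toc_percent": row.get("toc_percent", ""),
        })
    return list(geopoints.values()), list(cores.values()), samples


def to_csv(table):
    """Serialize a list of dictionaries to csv text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(table[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(table)
    return buf.getvalue()


# Made-up rows for illustration: two sections of one core, plus a second site.
flat_rows = [
    {"latitude": "36.50", "longitude": "-122.00", "water_depth_m": "950",
     "core_name": "MC-01", "sample_depth_cm": "0.5", "toc_percent": "1.2"},
    {"latitude": "36.50", "longitude": "-122.00", "water_depth_m": "950",
     "core_name": "MC-01", "sample_depth_cm": "1.5", "toc_percent": "1.1"},
    {"latitude": "37.00", "longitude": "-123.00", "water_depth_m": "2100",
     "core_name": "BC-02", "sample_depth_cm": "0.5", "toc_percent": "0.8"},
]
geopoints, cores, samples = normalize(flat_rows)
print(to_csv(geopoints))
print(to_csv(cores))
print(to_csv(samples))
```

Keeping the identifiers as simple integer keys mirrors the idea of primary keys linking the tables; a production script would also need to reconcile new identifiers with those already present in the database.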