A status report on a section-based stratigraphic and palaeontological database – the Geobiodiversity Database

Big data are significant for quantitative analysis and contribute to data-driven scientific research and discoveries. Here we give a brief introduction to the Geobiodiversity Database (GBDB), a comprehensive stratigraphic and palaeontological database, and its data. The GBDB includes abundant geological records from China and has supported a series of scientific studies on the Paleozoic palaeogeography and the tectonic and biodiversity evolution of China. The data held in the GBDB, including newly collected data, are described in detail, and the statistical results and structure of the data are given. We compare the GBDB with the largest palaeobiological database, the Paleobiology Database (PBDB), and with the geological rock database Macrostrat. The GBDB and the other databases are complementary in palaeontological and stratigraphic research. The GBDB will continue to provide users with access to detailed palaeontological and stratigraphic data based on publications. Non-structured palaeontological and stratigraphic data will also be included in the GBDB and linked to its existing data, making the GBDB more widely useful for both researchers and anyone who is interested in fossils and strata. The GBDB fossil and stratum dataset (Xu, 2020) is freely downloadable from https://doi.org/10.5281/zenodo.4245604.


Introduction
Palaeontology and stratigraphy have become quantitative disciplines of geoscience, and the use of numerical methods in both fields has increased rapidly since the 1960s (Shaw, 1964; Schwarzacher, 1975; Kemple et al., 1989, 1995; Sepkoski, 1992, 2002; Alroy et al., 2001; Hammer and Harper, 2006; Rong et al., 2007). Quantitative analyses based on big data of fossil and stratum records have become more common recently, especially in the study of biodiversity evolution (Alroy et al., 1994, 2001, 2008; Hautmann, 2016; Fan et al., 2020), graphic correlation of strata (Kemple et al., 1989; Fan et al., 2013b), palaeoecology (Muscente et al., 2018), mass extinction (Muscente et al., 2019), and palaeogeography (Ke et al., 2016; Hou et al., 2020). Professional databases, such as the Paleobiology Database (PBDB), Macrostrat (https://macrostrat.org/) and the Geobiodiversity Database (GBDB), store and provide a large volume of fossil record data and make many quantitative studies possible. Well-structured stratigraphic and palaeontological databases and user-friendly, accessible data are significant to the quantitative development of the discipline and, furthermore, push forward digital Earth science in the era of big data (Guo, 2017). In this paper, we present the update and improvement of a comprehensive database of stratigraphy and palaeontology, the Geobiodiversity Database (GBDB), together with its data, brief history, and development. Comparisons with related databases are also given.

A brief history of the Geobiodiversity Database
The Geobiodiversity Database (GBDB) was started in 2006 and has provided an online service since 2007. At that time there was a strong and urgent demand for the quantitative understanding of fossil and stratum records from China, and the database was initially supported by the national project "Organism origination, radiation, extinction and recovery during the key geological ages" (973 Project) (Rong et al., 2006). The Paleobiology Database (PBDB) was already a large palaeontological database that included plenty of fossil occurrence data from publications in European languages; however, fossil and stratum data from China were overlooked for the time being, owing to the language barrier or the relatively small contribution from China. The initial purpose of the GBDB was to accommodate fossil and stratum data, geological sections, and fossil collections from China and, furthermore, to recognize biodiversity changes through the geological ages of China (Rong et al., 2006).
Since the start of the GBDB, there have been at most ten data entry clerks at any one time, including master's or PhD students, assistant researchers and non-professional employees, digitizing palaeontological and stratigraphic descriptions "from the page into cyberspace" (Normile, 2019) and aligning these data with standards that are acceptable to international researchers, so that a researcher can quickly link them to carry out quantitative analyses that would previously have omitted Chinese data.
The GBDB was designed to facilitate regional and global scientific collaborations focusing on palaeobiodiversity, systematics, palaeogeography, palaeoecology, regional correlation, and quantitative stratigraphy.
Basic functions of data input and output were gradually added and enhanced. In 2013, a huge volume of palaeontological and stratigraphic data were included in the GBDB, such as taxonomy, identification features, occurrence, opinion, lithostratigraphy, biostratigraphy, chemostratigraphy, radioisotopic dating, reference, and palaeogeographic map (Fan et al., 2013a; Fan et al., 2014). Additionally, a few online statistical and visualization tools were embedded, such as (1) Time Scale Creator (integrated into the GBDB in 2010), a stratigraphic visualization tool designed by Jim Ogg and Adam Lugowski (http://www.tscreator.com), and (2) GeoVisual (integrated into the GBDB in 2010 and updated in 2012), a tool used for geographic visualization and preliminary biogeographic analysis.
One of the exclusive features of the GBDB is its abundant geological section data, which are readily exported for several numerical correlation tools, such as Constrained Optimization (CONOP) (Kemple et al., 1995) and SinoCor. SinoCor was designed and updated by Fan et al. (2002) and Fan and Zhang (2000; 2004). Its correlation resembles that of CONOP but requires a unique file format. SinoCor and CONOP are independent outgrowths of graphic correlation. The geological section data of the GBDB can also be used in other related professional tools, such as Graphcor, PAST, and CONMAN (see Hammer and Harper, 2006; Fan et al., 2013b).
The GBDB became the formal database of the International Commission on Stratigraphy in August 2012 at the 34th International Geological Congress in Brisbane, Australia. As a result, the GBDB achieved the goal of integrating stratigraphic standards (e.g. the GSSPs) with a comprehensive and authoritative web-based stratigraphic information service for global geoscientists, educators, and the public.
Since 2011, stratigraphic and palaeontological records of the early Paleozoic, especially the Ordovician and Silurian periods, have been quantitatively analyzed, and a series of scientific findings has been published. The related research themes include the Ordovician and Silurian palaeogeography and tectonic evolution of South China (Chen et al., 2012, 2017a), the spatio-temporal pattern of Ordovician and Silurian marine organisms from China (Chen et al., 2017b; Zhang et al., 2014a), the Permian-Triassic transition and extinction (Shen et al., 2011; Wang et al., 2014; Ke et al., 2016), and the Paleozoic palaeogeographic evolution of South China (Chen et al., 2018; Zhang et al., 2014b; Hou et al., 2020). Recently, nearly all GBDB data on Paleozoic marine organisms were used to analyze biodiversity evolution. Though all data were from China, the Paleozoic geological sections of China cover several palaeocontinents and can be taken to reflect global biodiversity change.
In 2017, the GBDB became a data partner of the British Geological Survey (BGS) and started to digitize the fossil and stratum data and establish the datasets for the BGS. This time-consuming task is still being carried out by the GBDB data entry team. The BGS has amassed and housed about 3 million fossils gathered over more than 150 years at thousands of sites across the British Isles.
At the end of 2018, the head of the GBDB, Dr. Fan J.X., left the NIGP, CAS, and Dr. Xu H.H. took over the GBDB. Since 2019, the new working group has continued the GBDB work of data collecting, processing and visualization as the GBDB group did during 2007-2018, input more data on fossil terrestrial organisms (such as insects and plants), and re-designed the database and the website according to the feedback collected from GBDB users. The GBDB is ushering in a new start.

The data of the Geobiodiversity Database
The Geobiodiversity Database (GBDB) was designed as a stratigraphic and palaeontological database with a geological section-based input format, which means that data entry clerks or any scientific users must input the metadata according to geological sections or assumed sections. Every metadata record contains all geological information of a geological section, including its basic units (beds or layers), sediment color, lithology, thickness, horizon, locality, palaeo-block, geological age, biostratigraphy, geochemistry, palaeoecology, radioisotopic age, fossil collections, and any available original information on the rock or fossil specimens or fossil samples from the fieldwork. An individual geological section can normally be subdivided into dozens of basic units when it is input into the GBDB. Such information-rich geological section records can be found in the stratigraphic and palaeontological literature. Sometimes the geological sections are not easy to obtain directly, and help from professional experts is necessary.
However, many palaeontological descriptions or reports lack a detailed stratigraphic description. The GBDB includes these records as assumed sections, i.e. they are treated as geological sections that may contain only a very small portion, for example a single bed (unit) or collection. Borehole core records, many of which are from oil companies, are also included. It is noteworthy that in the GBDB fossil occurrence data were included in the stratigraphic records and could not be queried directly; this has been improved in our update. The palaeontological data are linked to the fossil collections from individual geological sections and borehole cores. The data include taxonomy (species, genus, family, order, class and division), major group, synonymy (opinion data from different authors) and description (key features) (Figure 1). Although the GBDB is geological section-based, fossil occurrences can be exported from it, and it is thus compatible with fossil occurrence-based databases. Most fossil collections and occurrences of all sections from China are included in the GBDB (Figure 3). Subsequent authors have amended a portion of the fossil taxa from these sections in further studies; in this way, there are also plenty of opinion data in the GBDB.
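The section-based organization described above can be illustrated with a minimal Python sketch. The class, field, and function names used here (`Section`, `Unit`, `Collection`, `occurrences`, `cooccurring`) are hypothetical and chosen only for illustration; they are not the GBDB schema, and the example data are made up. The sketch shows how occurrence records might be derived from section records and how taxa sharing one unit can be listed.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Collection:
    """A fossil collection gathered from one unit (hypothetical structure)."""
    name: str
    taxa: List[str] = field(default_factory=list)


@dataclass
class Unit:
    """A basic unit (bed or layer) of a section."""
    number: int
    lithology: str
    thickness_m: float
    collections: List[Collection] = field(default_factory=list)


@dataclass
class Section:
    """A geological section, or an assumed section holding a single unit."""
    name: str
    locality: str
    units: List[Unit] = field(default_factory=list)
    assumed: bool = False


def occurrences(section: Section) -> Iterator[Tuple[str, int, str, str]]:
    """Flatten a section into (section, unit number, collection, taxon) rows."""
    for unit in section.units:
        for coll in unit.collections:
            for taxon in coll.taxa:
                yield (section.name, unit.number, coll.name, taxon)


def cooccurring(section: Section, unit_number: int) -> List[str]:
    """List the distinct taxa recorded in one unit, sorted by name."""
    found = set()
    for unit in section.units:
        if unit.number == unit_number:
            for coll in unit.collections:
                found.update(coll.taxa)
    return sorted(found)


# Made-up example data, for illustration only.
demo = Section(
    name="Demo section",
    locality="Demo locality",
    units=[
        Unit(1, "shale", 2.5, [Collection("C1", ["Taxon a", "Taxon b"])]),
        Unit(2, "limestone", 1.0, [Collection("C2", ["Taxon b"])]),
    ],
)
rows = list(occurrences(demo))
```

In this sketch, a report without stratigraphic detail would simply be stored as a `Section` with `assumed=True` and a single `Unit`, so the same flattening step applies to both kinds of record.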
Since 2017, the GBDB has recorded data on the Global Boundary Stratotype Sections and Points (GSSPs) of the International Commission on Stratigraphy, including detailed information on each GSSP and some panoramas and three-dimensional scans of individual GSSPs, for example the Changhsingian GSSP in Changxing, Zhejiang Province, southeastern China (http://www.geobiodiversity.com:8080/Panorama/47/output/).
Since August 2017, the British Geological Survey (BGS) and the GBDB have collaborated in stratigraphic and palaeontological data processing. The GBDB data working team helps to digitize the geological reports from the BGS archive and to build separate datasets for it.
Since 2019, the GBDB has begun to include borehole core data from petroleum companies, such as the China National Offshore Oil Corporation (Tianjin and Qingdao, China) and the China National Petroleum Corporation (Karamay, Xinjiang, China).
In brief, as many stratigraphic and palaeontological records as possible are collected from the original geological publications. Since its establishment, the GBDB data team has conscientiously collected and included stratigraphic and palaeontological data from the Chinese literature. The detailed statistics are given here (Table 1) (see Xu, 2020).
For a long time, studies of biodiversity evolution were based on marine organism fossil records. For example, the earliest quantitative analysis of biodiversity through geological time, which drew the conclusion of the five mass extinctions (Raup and Sepkoski, 1982), and a series of related studies were based on family or genus records of marine fossil organisms (much of this work was based on PBDB data; see Jablonski, 1994; Sepkoski, 1992, 2002; Alroy et al., 2001, 2008; Rong et al., 2006). There have also been quantitative studies based on terrestrial organism fossil records (e.g. Alroy, 1998, 2001).
There have been quantitative studies on plant diversity in the Silurian and Devonian periods, a time significant for early plant evolution and diversification (Xiong et al., 2013), and on plant diversity change across the Permian-Triassic boundary (Xiong and Wang, 2007). The mass extinction at the end of the Permian was the greatest extinction in geological history and wiped out over 95% of marine organisms (Jablonski, 1994). Both plant diversity studies used fossil record data from South China and listed the data as supplementary materials of the published papers. It took the authors of the two studies a few years to complete the data collection, even though the data came only from the South China palaeo-block.
An inconvenient fact is that fossil databases of terrestrial organisms are not as good as those of marine ones, and that the nonmarine fossil record is necessarily less complete and less widespread. For a long time, the GBDB focused on the fossil records of marine organisms. Since 2019, the GBDB has conscientiously collected terrestrial fossil and stratum data, and fossil terrestrial organisms are now a distinctive feature of the database. The fossil plant record dataset includes 738 Devonian plant species occurrences from global localities and thousands of Mesozoic plant species occurrences from China.
Besides plant fossils, the terrestrial organism fossil record data of the GBDB include insect fossil records, which greatly increased after the GBDB took over the international fossil insect database of the International Palaeoentomological Society, EDNA (https://fossilinsectdatabase.co.uk/), which holds details of the holotypes of all fossil insects in the world.
The EDNA database was named after Edna Clifford, who started the recording of new species on a card index system, and was designed as an update of Handlirsch's 1906-1908 "Die fossilen Insekten und die Phylogenie der rezenten Formen", which listed all the then-known fossil insect species. Handlirsch recorded 5,160 species in 1906. The database is detailed in its contents: it records taxonomic information, synonym details, and references for every species (including the page number where it is introduced); for holotypes, site details, stratigraphic information, and geological details are also recorded. All the data have been obtained from exhaustive literature searches.
The EDNA database aims to be a complete, fully interactive list of all the species of insects named from the fossil record, with the site, geological age, and reference for each holotype. Updating and checking will be ongoing, and the data available will be greatly improved if details of omissions and errors are sent to the administrator for incorporation. The database comes from an exhaustive literature search and, in the 2019 edition, contains 28,439 species names (including synonyms) extracted from 5,218 references (Figure 3d). The data are held in 38 fields, all of which are searchable, independently or in combination, and the output can contain any one or more fields as required.
Fields include generic and specific names, citation, subfamily, family, superfamily, division, suborder, and order; author, title, journal, date of publication, and the page on which the species is first described; age data, including stage, epoch, subperiod, period, era, and age (range) in millions of years; bed, member, formation, and group; and site name, nearest feature (town, river, etc.), county, state, country, and continent (Figure 4). For all taxonomic ranks, citations can be included and both junior and senior synonyms displayed. Natural History Museum London Library call numbers are also included.
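As a rough illustration of searching fields "independently or in combination" and choosing the output fields, the following Python sketch may help. It is only a sketch under stated assumptions: the field names, the grouping in `FIELD_GROUPS`, and the functions `search` and `project` are hypothetical and cover just a few of the 38 fields; they do not describe the actual EDNA software, and the records are made up.

```python
from typing import Dict, Iterable, List

# Hypothetical grouping of a few of the fields named in the text;
# the exact field names of the real database are not given in the paper.
FIELD_GROUPS: Dict[str, List[str]] = {
    "taxonomy": ["genus", "species", "family", "order"],
    "reference": ["author", "title", "journal", "year", "page"],
    "age": ["stage", "epoch", "period", "era"],
    "stratigraphy": ["bed", "member", "formation", "group"],
    "site": ["site_name", "country", "continent"],
}

Record = Dict[str, str]


def search(records: Iterable[Record], **criteria: str) -> List[Record]:
    """Return records whose fields match every criterion (case-insensitive)."""
    wanted = {key: value.lower() for key, value in criteria.items()}
    return [
        rec for rec in records
        if all(rec.get(key, "").lower() == value for key, value in wanted.items())
    ]


def project(records: Iterable[Record], fields: List[str]) -> List[Record]:
    """Keep only the requested output fields; missing fields become ''."""
    return [{name: rec.get(name, "") for name in fields} for rec in records]


# Made-up records, for illustration only.
records = [
    {"genus": "Genus a", "species": "alpha", "period": "Jurassic", "country": "Country x"},
    {"genus": "Genus b", "species": "beta", "period": "Jurassic", "country": "Country y"},
    {"genus": "Genus c", "species": "gamma", "period": "Permian", "country": "Country x"},
]
hits = search(records, period="jurassic", country="country x")
names = project(hits, ["genus", "species"])
taxonomy = project(hits, FIELD_GROUPS["taxonomy"])
```

Combining criteria narrows the result (here, period and country together), while `project` restricts the output to any chosen subset of fields.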

Database comparisons and discussions
A comparison is made between the GBDB and the fossil occurrence-based Paleobiology Database (PBDB), which was founded in 1998 and has become the largest palaeobiological database. Data of the PBDB include fossil taxa, collections, opinions (palaeobiological views of different authors), and related publications. The data volume of the PBDB is larger than that of the GBDB (Table 1). The most noticeable difference is that the PBDB has little information about geological sections, whereas the GBDB is known for its large number of geological sections.
By November 2020, 26,450 geological sections had been recorded in the GBDB, the geological ages of which range from Ediacaran to Cenozoic (Table 1). As mentioned above, the GBDB is geological section-based; every record is subdivided into detailed parts when it is input into the database. The fossil occurrence and collection data can also be exported from the GBDB, just like those in the PBDB. Nevertheless, the number of fossil taxa recorded in the GBDB is about 30% of that in the PBDB, whilst the number of fossil occurrence records in the GBDB is about 40% of that in the PBDB (Table 1). This is because the two databases have different histories: the PBDB was founded in 1998 and the GBDB in 2007 (Figure 1). The PBDB has a history of comprehensive backup, mirror sites and multiple portals (e.g. Fossilworks: Fossilworks.org), and a user-training guide, all of which are things that the GBDB needs to improve. The GBDB has held several workshops during international academic meetings in recent years, but there is much to be done to improve the data quality and quantity.
The stratigraphic records in the GBDB are reminiscent of Macrostrat (https://macrostrat.org/), a platform for the aggregation and distribution of geological data relevant to the spatial and temporal distribution of sedimentary, igneous, and metamorphic rocks as well as data extracted from them. Macrostrat aims to become a community resource for the addition, editing, and distribution of new stratigraphic, lithologic, environmental, and economic data. By November 2020, Macrostrat had recorded 1,534 regional rock columns, 35,163 rock units, and 2,540,323 geologic map polygons. Macrostrat has many exclusive data of composite geological sections, e.g. sections that are compiled from several places in one basin and may have a completeness and thickness that never accumulated in one place.
It is also worth noting that Macrostrat records mostly geological data from North America, whilst the GBDB includes nearly all stratigraphic data of sediments from China; igneous and metamorphic rocks were also recorded if they were reported in sedimentary units.

Updates and prospects
Since the GBDB website went online in 2007, there have been few updates. During the management change at the end of 2018, a survey was carried out among some GBDB users, and a lot of feedback was received. According to these suggestions and feedback, we sorted out the existing problems of the GBDB and its website and comprehensively updated the server and the website, making the database a safe data bank and the website a new and friendly portal (GBDB 2.0, relative to the previous version). The new website has optimized data input and output, the search engine, and the data examination system.
During data input, the raw data are checked by registered authorizers. This check aims to make sure that the data are valid, not that they agree with the authorizer's own point of view. Knowledge is updated quickly today, and it is normal to have a mixture of valid and obsolete information to a certain extent, for example taxonomic synonymies or old radioisotopic dates that need to be recalculated with a better decay constant. The GBDB serves only as a data bank and does not support any particular academic viewpoint. The authorizers make sure the data are valid, but the users need to choose which data to use and analyze. A huge volume of opinion data is retained in the GBDB.
Data visualization has been developed. All data are plotted on the world map on the homepage, which also displays the total data volume in the upper right corner. The view is centered on China, and the map can be zoomed in or out using the mouse; users can also give comments. The old version of the GBDB remains available through an entrance on the homepage, for users who prefer the old format and wish to use the GBDB in the way they have learned. The GBDB has also developed applications for mobile devices, through which users can examine the GBDB data and give comments.
In the next step, more data visualization and analytical tools (Figure 5) will be embedded in the new GBDB website and made publicly available for stratigraphic and palaeontological research.
The GBDB and the PBDB are complementary in their great volumes of geological section and fossil occurrence data. Through the geological sections, the GBDB data record the thickness of individual fossil samples and provide important evidence of the co-existence of fossil organisms. The fossil taxa of the two databases include both widely distributed and endemic fossils, and both those published in English (and other languages) and those published in Chinese. The GBDB and Macrostrat are complementary in stratigraphic study to some extent, as together the two databases contain records from both North America and China. Data from these databases therefore provide the possibility to conduct various stratigraphic and palaeontological analyses.
The GBDB, just like the PBDB and Macrostrat, will continually and assiduously provide users with access to detailed palaeontological and stratigraphic data based on publications. Multiple formats compatible with common software, such as CONOP and SinoCor, will be downloadable from the GBDB. Statistical and analytical tools will be easy to use in the GBDB.
Additionally, the GBDB is collecting non-structured palaeontological and stratigraphic data, including images and three-dimensional models of fossil specimens, panorama images of geological sections, tomographic image stacks, and references. We will build links between these non-structured data and the palaeontological and stratigraphic data that the GBDB has collected for years. Comprehensive information will be shown after searching for an individual item that is related to any fossils or strata, making the GBDB more widely useful for both researchers and anyone who is interested in fossils and strata.
Author contribution: HX and ZN designed the project, developed the model, and performed the simulations. HX prepared and revised the manuscript with contributions from ZN. Y-SC provided technical support.

Competing interests:
The authors declare that they have no conflict of interest.
Data availability: The current dataset, archived via Zenodo, represents a static version of November 2020: