Dataset of antiarch placoderms (the most basal jawed vertebrates) throughout Middle Paleozoic

Antiarch placoderms, the most basal jawed vertebrates, have the potential to illuminate the origin of the last common ancestor of jawed vertebrates. Quantitative study based on credible data is more convincing than qualitative study. To reveal the distribution of antiarchs in space and time, we created a comprehensive structured dataset of antiarchs comprising 64 genera and 6025 records. This dataset, which includes associated chronological and geographic information, was digitized manually from academic publications into the DeepBone database. We implemented a paleogeographic map marker to visualize the biogeography of antiarchs. The comprehensive data on Antiarcha allow us to chart its biodiversity and rate of variation throughout its duration. Structured data on antiarchs have tremendous research potential, including testing hypotheses about biodiversity changes, distribution, differentiation, and population and community composition. The data will also be easily accessible to other tools for generating new understanding of the evolution of early vertebrates. The data file described in this paper is available at https://doi.org/10.5281/zenodo.5639529 (Pan and Zhu, 2021).

Explaining the spatial and temporal distribution of early vertebrates is a prerequisite for understanding their biogeographic exchange. Although Zigaite and Blieck (2013) advocated quantitative analysis to define biogeographic patterns of early vertebrates, efficient quantitative analyses of the dispersal of early vertebrates are still lacking, mostly because no comprehensive data collection on early vertebrates has been accomplished.
The main objective of this study is to present an unprecedented structured dataset of Antiarcha that can facilitate research on the geospatial distribution patterns of antiarchs and quantitative study of early vertebrates. This dataset is the first step toward global coverage of the vertebrate fossil dataset for analyzing Middle Paleozoic biogeography and paleogeography. Revealing the distribution of antiarchs in the paleomap background with reference to the global paleogeographic reconstructions of Scotese (2002)
The Antiarcha dataset differs from other datasets in its basic unit, which is the specimen ID coupled with the occurrence and other detailed data. All the specimens are referenced to taxa and literature.
Because the Antiarcha dataset was designed as a vertebrate paleontological dataset with a specimen-based input format, data entry assistants input the metadata according to the published specimen or virtual specimen.
Since there is no satisfactory approach that can automatically extract paleontological data from the literature, we recruited several data entry assistants, including relevant researchers and master's and PhD students, to collect and curate the data. To guarantee the quality of the data, we designed a four-step data processing procedure (Fig. 2).

Data Processing and Quality Control
We built a tailored web page that provides a better user interface for the data entry assistants to fill in the rows of paleontological data. The data are then reviewed by other related experts so that a researcher can quickly access them and perform quantitative analysis reliably (Fig. 2). This workflow was modeled on that of the Geobiodiversity Database (GBDB) (Xu et al., 2021). Almost all antiarch literature was published in English, Russian, French, German, and Chinese. The data entry assistants could handle the literature in Chinese and English well. Literature in French, Russian, and German was handled by paleontology postgraduates who know these languages well.

Data Visualization
The analysis is conducted on paleomaps in raster tile form (Scotese, 1998, 2016). We first convert the excavation locations from current GPS to paleo-GPS using TrackPoint software (Ke et al., 2016; Scotese, 2002). However, the construction of paleontological maps and networks for visual analysis is based on paleomaps. Thus, we convert these latitude and longitude coordinates into pixel coordinates of raster tile maps using the Web Mercator algorithm. The Web Mercator algorithm is a variant of the Mercator projection and the de facto standard for Web mapping applications (Battersby et al., 2014). Specifically, the projection can be modelled as:

$$x = R\,\lambda, \qquad y = R\,\ln\!\left[\tan\!\left(\frac{\pi}{4} + \frac{\varphi}{2}\right)\right]$$

where $R$ is the long (semi-major) axis of the Earth; $\lambda$ is the longitude in radians, with value range $[-\pi, \pi]$, east longitude positive and west longitude negative; and $\varphi$ is the latitude in radians, with value range $[-\pi/2, \pi/2]$, north latitude positive and south latitude negative.
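The projection described above can be sketched in a few lines of Python. This is a minimal illustration of the standard spherical Mercator formulas, not the authors' code; the function names are hypothetical, and the WGS84 semi-major axis is an assumed value for the Earth's long axis.

```python
import math

# Assumption: WGS84 semi-major axis in metres, used as the "long axis of the Earth".
EARTH_SEMI_MAJOR_AXIS_M = 6378137.0


def mercator_forward(lon_deg, lat_deg, radius=EARTH_SEMI_MAJOR_AXIS_M):
    """Project longitude/latitude (degrees) to spherical Mercator x, y (metres)."""
    lam = math.radians(lon_deg)   # longitude in [-pi, pi], east positive
    phi = math.radians(lat_deg)   # latitude in (-pi/2, pi/2), north positive
    x = radius * lam
    y = radius * math.log(math.tan(math.pi / 4 + phi / 2))
    return x, y


def mercator_inverse(x, y, radius=EARTH_SEMI_MAJOR_AXIS_M):
    """Recover longitude/latitude (degrees) from spherical Mercator x, y (metres)."""
    lon = math.degrees(x / radius)
    lat = math.degrees(2 * math.atan(math.exp(y / radius)) - math.pi / 2)
    return lon, lat
```

The inverse function is included only as a sanity check: projecting a point and inverting it should return the original coordinates up to floating-point error. The projection is undefined at the poles, so latitudes of exactly ±90° should not be passed in.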
As for the paleomap, we also need to account for the map scaling and canvas size. The final formula therefore involves the following quantities: $L$ is the length of the plane map; $z$ is the size of the zoom scale, $z \in \mathbb{N}^*$; $i$ indexes the geological period, taken as 1 in the analysis of a single geological period and denoting the $i$-th geological period in cross-period analysis; $(x_i, y_i)$ are the pixel coordinates of the node in the $i$-th geological period; and $d$ is the fixed pixel distance between the map centers of adjacent geological periods.
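To make the role of the map length, zoom scale, period index, and fixed pixel distance concrete, the following Python sketch converts a longitude/latitude pair into canvas pixel coordinates. It is an assumption-laden illustration, not the authors' formula: it uses the common Web Mercator pixel convention (origin at the top-left corner, y increasing downward), treats the zoom scale as a linear multiplier of the map length, and shifts each later geological period horizontally by the fixed pixel distance. The function and parameter names are hypothetical.

```python
import math


def lonlat_to_pixel(lon_deg, lat_deg, map_length=256, zoom=1,
                    period_index=1, period_spacing=0):
    """Convert lon/lat (degrees) to pixel coordinates on a square Web Mercator canvas.

    Assumptions (not taken from the source):
      * the canvas side is map_length * zoom pixels;
      * x grows eastward from the left edge, y grows southward from the top edge;
      * maps of successive periods are laid out side by side, period_spacing
        pixels apart, with period_index starting at 1.
    """
    lam = math.radians(lon_deg)
    phi = math.radians(lat_deg)
    side = map_length * zoom
    # Normalise the Mercator plane [-pi, pi] x [-pi, pi] to [0, side] x [0, side].
    x = (lam + math.pi) / (2 * math.pi) * side
    y = (math.pi - math.log(math.tan(math.pi / 4 + phi / 2))) / (2 * math.pi) * side
    # Horizontal offset for the map of the given geological period.
    x += (period_index - 1) * period_spacing
    return x, y
```

With the defaults, the point at longitude 0 and latitude 0 falls at the canvas center, and latitudes near ±85.05° map to the top and bottom edges, which is why Web Mercator tiles are usually clipped at that latitude.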

Biodiversity Visualization
We calculate the genus and species biodiversity and plot it as a bar chart. To better view the results, we use the kernel density estimation (KDE) algorithm to smooth the curve and estimate the biodiversity.
In statistics, KDE is a fundamental data smoothing algorithm that supports inferences about a population based on finite data samples.
The counts of genera or species in the various time slots (visualized as the bins in the charts) are defined as $(x_1, x_2, \ldots, x_n)$. In statistics, they can be regarded as samples drawn from a univariate distribution with an unknown density $f$. To estimate the shape of this distribution $f$, its kernel density estimator is defined as

$$\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)$$

where $K$ is the kernel function used for estimation (we use the Epanechnikov kernel for better smoothing) and $h > 0$ is a smoothing parameter called the bandwidth. To better assess the biodiversity changes, we take the first derivative of the estimated continuous density function as an estimate of the rate of variation.
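The estimator and its first derivative can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the Epanechnikov kernel is written in its standard form $K(u) = \tfrac{3}{4}(1 - u^2)$ for $|u| \le 1$, and the bandwidth must be chosen by the user.

```python
def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def epanechnikov_derivative(u):
    """Derivative of the Epanechnikov kernel with respect to u."""
    return -1.5 * u if abs(u) < 1.0 else 0.0


def kde(x, samples, h):
    """Kernel density estimate at x for the given samples and bandwidth h > 0."""
    n = len(samples)
    return sum(epanechnikov((x - xi) / h) for xi in samples) / (n * h)


def kde_derivative(x, samples, h):
    """First derivative of the kernel density estimate at x (rate of variation)."""
    n = len(samples)
    # The chain rule contributes an extra factor 1/h, hence n * h * h.
    return sum(epanechnikov_derivative((x - xi) / h) for xi in samples) / (n * h * h)
```

A positive derivative at a given point indicates that the smoothed curve is rising there, and a negative one that it is falling. A larger bandwidth gives a smoother curve at the cost of blurring short-lived peaks.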

Data Overview
This dataset, which was extracted manually from 126 published papers or books, consists of 64 genera and 6025 records, covering all antiarch lineages. The 6025 records include 5867 fossil specimens that have been systematically described and documented, and 158 virtual specimens, which were introduced to carry the taxon information when no specimen was assigned to the referred records. The quantitative distribution of specimens is given in the supplement. Each record has at least one reference within our dataset, and specimens lacking a precise age are excluded. We converted the unstructured data from the literature into structured data, in as much detail as possible, for further research. Table 1 shows the framework of our dataset. Among all the referred specimens, 6.51% belong to Yunnanolepidoidei, 2.86% belong to Sinolepidoidei, 78.92% belong to 'Bothriolepidoidei', and 11.71% belong to Asterolepidoidei. All the fossil sites of the constituent groups are plotted in Figure 3.
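The record rules described above (specimen-based records, virtual specimens, at least one reference, exclusion of records without a precise age) can be illustrated with a small Python sketch. Everything in it is hypothetical: the field names do not reproduce Table 1, and the records are placeholders rather than entries from the dataset.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SpecimenRecord:
    # All field names are illustrative assumptions, not the dataset's schema.
    specimen_id: str
    genus: str
    suborder: str
    is_virtual: bool = False          # True when no physical specimen was assigned
    age: Optional[str] = None         # None models a record lacking a precise age
    references: List[str] = field(default_factory=list)


def is_admissible(record):
    """A record needs a precise age and at least one literature reference."""
    return record.age is not None and len(record.references) >= 1


def suborder_percentages(records):
    """Percentage of admissible records per suborder, rounded to two decimals."""
    kept = [r for r in records if is_admissible(r)]
    counts = Counter(r.suborder for r in kept)
    return {name: round(100.0 * n / len(kept), 2) for name, n in counts.items()}


# Placeholder records for demonstration only.
demo_records = [
    SpecimenRecord("demo-001", "GenusA", "Yunnanolepidoidei", age="StageX", references=["RefA"]),
    SpecimenRecord("demo-002", "GenusB", "Sinolepidoidei", age="StageY", references=["RefB"]),
    SpecimenRecord("demo-003", "GenusC", "Bothriolepidoidei", age="StageY", references=["RefC"]),
    SpecimenRecord("demo-004", "GenusD", "Bothriolepidoidei", is_virtual=True,
                   age="StageZ", references=["RefD"]),
    SpecimenRecord("demo-005", "GenusE", "Asterolepidoidei", references=["RefE"]),  # no age
    SpecimenRecord("demo-006", "GenusF", "Asterolepidoidei", age="StageZ"),  # no reference
]
```

Calling `suborder_percentages(demo_records)` drops the last two placeholder records and reports the share of each remaining suborder, mirroring how the percentages in the text are computed over the referred specimens.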

As Young (1990a) noted, biogeographic data must be interpreted in the context of paleogeographic hypotheses, so we plot our data on a paleomap (Fig. 5) to generate an outline of their past.
The analysis of biogeographic trends in Paleozoic vertebrates is highly dependent on fossil data and paleocontinent reconstruction. Due to the easy access of the paleo-geographic coordinates calculator

The Paleogeographic Distribution of the Antiarcha Dataset
We plotted these fossil sites on the paleogeographic map (Fig. 5), except for the Silurian Shimenolepis, which is the earliest record of Yunnanolepidoidei and the only documented antiarch specimen before the Devonian (Wang, 1991; Zhao et al., 2016). Most of the fossil sites were positioned around the equator.
In the present scenario, the suborder Yunnanolepidoidei apparently originated as early as Silurian in the

First Appearance Record
The first appearance record of a taxon or a lineage is important in paleontology and evolutionary biology because it provides a hard minimum constraint for molecular clock calibration of that taxon (Benton and Donoghue, 2007; Benton et al., 2009; Donoghue and Benton, 2007). Based on our dataset, the oldest record of yunnanolepidoids or antiarchs is Shimenolepis graniferus from the Xiaoxi Formation at Shanmen Reservoir, Lixian County, Hunan, China. Shimenolepis was first described as the oldest known placoderm, dated as Telychian of Llandovery (Janvier, 1996; Wang, 1991). However, after detailed stratigraphic work, Zhao et al. (2016) suggested that the age of Shimenolepis is late Ludlow rather than late Llandovery. Janvier and Tông-Dzuy (1998) also documented an indeterminate yunnanolepidoid (Antiarcha gen. sp. indet.) from the Do Son Formation of northern Vietnam, which is potentially another of the earliest antiarchs.
The oldest sinolepid is Liujiangolepis suni, from the Nakaoling Formation (Pragian), Xiangzhou, Guangxi, China (Wang, 1987). The oldest bothriolepidid is Houershanaspis zhangi, documented from the Danlin Formation (Pragian) at Mt. Houershan, Guizhou, southwestern China, on the basis of a bothriolepid-like anterior median dorsal plate (Lu et al., 2017). The earliest asterolepidoid records are represented by Wurungulepis and some disarticulated specimens, which were documented from the Broken River Formation, Broken River, Australia. The age of the Broken River Formation was first referred to the Eifelian, then reassigned to the Emsian (serotinus Zone) (Burrow, 1996; De Pomeroy, 1996;

Conclusions
Data are significant for quantitative analysis and contribute to data-driven scientific research. Previous