LegacyPollen 1.0: a taxonomically harmonized global late Quaternary pollen dataset of 2831 records with standardized chronologies

Here we describe LegacyPollen 1.0, a dataset of 2831 fossil pollen records with metadata, a harmonized taxonomy, and standardized chronologies. A total of 1032 records originate from North America, 1075 from Europe, 488 from Asia, 150 from Latin America, 54 from Africa, and 32 from the Indo-Pacific. The pollen data cover the late Quaternary (mostly the Holocene). The original 10 110 pollen taxa names (including variations in the notations) were harmonized to 1002 terrestrial taxa (including Cyperaceae), with woody taxa and major herbaceous taxa harmonized to genus level and other herbaceous taxa to family level. The dataset is valuable for synthesis studies of, for example, taxa areal changes, vegetation dynamics, human impacts (e.g., deforestation), and climate change at global or continental scales. The harmonized pollen data and metadata as well as the harmonization table are available from PANGAEA (https://doi.org/10.1594/PANGAEA.929773; Herzschuh et al., 2021). R code for the harmonization is provided at Zenodo (https://doi.org/10.5281/zenodo.5910972; Herzschuh et al., 2022) so that datasets at a customized harmonization level can be easily established.

The numerous pollen records available in open databases, however, are not yet consistent concerning data type (e.g., pollen counts or percentages), pollen taxonomy, and nomenclature (Fyfe et al., 2009; Cao et al., 2013), and their metadata are neither approved nor harmonized. For example, palynologists identify pollen taxa to different taxonomic levels ranging from (sub)species to order, depending on the purpose of their study and the differentiability and preservation of the pollen grains. Some efforts have been made to harmonize taxonomies of pollen taxa in the databases (Fyfe et al., 2009; Giesecke et al., 2019; Mottl et al., 2021; Githumbi et al., 2022); however, a general framework is needed that can be applied to existing and newly published records.
Here we present LegacyPollen 1.0, a global taxonomically harmonized pollen dataset along with standardized metadata from 2831 sites for which recent chronologies have also been established (Li et al., 2022). This dataset is based on a general framework implemented in R, which allows customized datasets to be built and new pollen records to be included. The LegacyPollen 1.0 dataset is available at PANGAEA (https://doi.org/10.1594/PANGAEA.929773; Herzschuh et al., 2021) and provides both count and percentage pollen data. We also provide the R code and the taxa harmonization table at Zenodo (https://doi.org/10.5281/zenodo.5910972; Herzschuh et al., 2022).

Data sources
We initially downloaded 3147 late Quaternary fossil pollen records (including dating) from the Neotoma Paleoecology Database ("Neotoma" hereafter) using the Neotoma package in R (Goring et al., 2019; R Core Team, 2020). As the spatial coverage of Neotoma records is poor in certain regions, for example, in China and Siberia, these records were supplemented by 324 records compiled by Herzschuh et al. (2019) and Cao et al. (2013, 2020) and by our own data (AWI, Alfred Wegener Institute). Out of this pool, we selected 2831 records, including both raw (94.2 %) and digitized (5.8 %) data, for which standardized chronologies could be established (Li et al., 2022).

Metadata processing
After checking the metadata of all records from the Neotoma and Asian datasets, we implemented the following modifications: (1) we checked the units of the provided depth information for all records, converted depths given in meters or millimeters to centimeters, and contacted Neotoma to correct the depth information of one record (Dataset ID 27027); (2) we checked each record's archive type (e.g., peat, lake) based on its site description from Neotoma or the original publication; and (3) we integrated two records (Dataset ID 835, 3127) into a combined record (Dataset ID 70001).
We collected the sample ages from the chronologies provided by Li et al. (2022), which were newly established for all 2831 records using a standardized approach. These chronologies provide estimated ages for each centimeter. For records with sample depths given at a subcentimeter scale, we applied a linear interpolation (performed in R; R Core Team, 2020) to assign the age of each sample.
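The age assignment for subcentimeter sample depths can be sketched as follows. This is a minimal Python illustration, not the published R code; the function name `interpolate_age` and the example ages are hypothetical, and the sketch assumes that the age-depth model supplies one age for every whole centimeter bracketing the sample.

```python
import math

def interpolate_age(depth_cm, ages_by_cm):
    """Linearly interpolate a sample age between the two bracketing whole-centimeter ages.

    ages_by_cm maps integer depths (cm) to modeled ages (cal yr BP).
    """
    lower = math.floor(depth_cm)
    upper = math.ceil(depth_cm)
    if lower == upper:
        # Sample lies exactly on a whole centimeter: use the modeled age directly.
        return ages_by_cm[lower]
    fraction = (depth_cm - lower) / (upper - lower)
    return ages_by_cm[lower] + fraction * (ages_by_cm[upper] - ages_by_cm[lower])

# Hypothetical age-depth model output: one age per centimeter.
ages = {10: 1000.0, 11: 1100.0, 12: 1250.0}
print(interpolate_age(10.5, ages))  # sample halfway between 10 cm and 11 cm
print(interpolate_age(11.2, ages))  # sample between 11 cm and 12 cm
```

The same logic applies to the minimum, maximum, mean, and median ages if each is available per centimeter.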

Pollen taxa harmonization
Only terrestrial pollen taxa (including Cyperaceae) were taken into account, thus excluding aquatic pollen taxa as well as spores from mosses, ferns, fungi, and algae. First, we standardized the taxon nomenclature. To do so, we set up a master table containing all pollen taxa names from the 2831 records and made names consistent (e.g., "betula" to "Betula"), italicized all taxa below family level (e.g., "Artemisia" in roman type to "Artemisia" in italics), replaced the abbreviations with full names (e.g., "P. pumila" to "Pinus pumila"), updated names to the latest taxon nomenclature (e.g., "Gramineae" to "Poaceae"), and corrected wrong spellings (e.g., "Aluns" to "Alnus"). This master table is published in a machine-readable data format on PANGAEA (https://doi.org/10.1594/PANGAEA.929773 in the "Further details" section; Herzschuh et al., 2021). Second, we harmonized the pollen taxa according to the classification of the Angiosperm Phylogeny Group IV system (APG IV; The Angiosperm Phylogeny Group et al., 2016) and the Gymnosperm Database (https://www.conifers.org/, last access: 27 July 2021). Woody taxa were harmonized to genus level, as were some very common herbaceous taxa such as Artemisia, Thalictrum, and Rumex. All other herbaceous taxa were harmonized to family level. The various pollen taxa of heather plants were summarized at the order level as Ericales.
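The two-step procedure (nomenclature standardization, then lookup of the harmonized taxon) can be sketched as below. This is an illustrative Python sketch, not the R code on Zenodo; the dictionaries `NAME_FIXES` and `MASTER_TABLE` are tiny hypothetical excerpts, and returning `None` for names missing from the table is an assumption used here to represent excluded non-terrestrial taxa.

```python
# Hypothetical nomenclature fixes applied before the lookup
# (abbreviations, outdated names, misspellings).
NAME_FIXES = {
    "P. pumila": "Pinus pumila",
    "Gramineae": "Poaceae",
    "Aluns": "Alnus",
}

# Hypothetical excerpt of a master table: standardized name -> harmonized taxon.
# Woody taxa and selected common herbs map to genus, other herbs to family,
# and heather plants to the order Ericales.
MASTER_TABLE = {
    "Betula nana": "Betula",
    "Pinus pumila": "Pinus",
    "Alnus": "Alnus",
    "Artemisia vulgaris": "Artemisia",
    "Poaceae": "Poaceae",
    "Taraxacum": "Asteraceae",
    "Calluna vulgaris": "Ericales",
}

def standardize_name(raw_name):
    """Trim whitespace, capitalize the first letter, and apply known name fixes."""
    name = " ".join(raw_name.split())
    if name:
        name = name[0].upper() + name[1:]
    return NAME_FIXES.get(name, name)

def harmonize(raw_name):
    """Return the harmonized taxon, or None if the name is not in the master table."""
    return MASTER_TABLE.get(standardize_name(raw_name))

for raw in ["betula nana", "P. pumila", "Gramineae", "Aluns", "Sphagnum"]:
    print(raw, "->", harmonize(raw))
```

Keeping the name fixes and the taxon mapping in separate tables means the harmonization level can be changed (for example, genus to family) without touching the spelling corrections.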

Pollen data type standardization
Although most pollen records contain the count data ("raw" data hereafter), the "pollen counts" for those without raw pollen counts were backcalculated using the pollen percentages and assuming a terrestrial pollen sum of 300 pollen grains, as most of the publications did not provide a pollen sum. We replaced the original taxon name with its harmonized name and summed all counts of the harmonized taxa for each sample. As we only considered terrestrial plant taxa, some samples in the records contained no pollen counts, and those samples were excluded from the harmonized dataset. We then recalculated the terrestrial pollen percentages for each sample based on their total sum.
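The data type standardization described above can be illustrated with a short Python sketch. The function names and sample values are hypothetical; only the assumed pollen sum of 300 grains, the summing of counts per harmonized taxon, the exclusion of samples without terrestrial pollen, and the recalculation of percentages are taken from the text.

```python
from collections import defaultdict

ASSUMED_POLLEN_SUM = 300  # terrestrial pollen sum assumed for records without raw counts

def back_calculate_counts(percentages, pollen_sum=ASSUMED_POLLEN_SUM):
    """Convert pollen percentages of one sample into approximate counts."""
    return {taxon: pct * pollen_sum / 100.0 for taxon, pct in percentages.items()}

def harmonize_sample(counts, taxon_map):
    """Sum counts per harmonized taxon; taxa absent from taxon_map are dropped."""
    summed = defaultdict(float)
    for taxon, count in counts.items():
        harmonized = taxon_map.get(taxon)
        if harmonized is not None:
            summed[harmonized] += count
    return dict(summed)

def to_percentages(counts):
    """Recalculate percentages from the terrestrial sum; None marks an excluded sample."""
    total = sum(counts.values())
    if total == 0:
        return None
    return {taxon: 100.0 * count / total for taxon, count in counts.items()}

# Hypothetical mapping and samples.
taxon_map = {"Betula nana": "Betula", "Betula pubescens": "Betula", "Poaceae": "Poaceae"}
digitized = {"Betula nana": 20.0, "Betula pubescens": 30.0, "Poaceae": 50.0}
spores_only = {"Sphagnum": 12.0}

counts = harmonize_sample(back_calculate_counts(digitized), taxon_map)
print(counts, to_percentages(counts))
print(to_percentages(harmonize_sample(spores_only, taxon_map)))
```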

Structure of site metadata
The metadata for each site in the LegacyPollen 1.0 dataset include the following: Event (PANGAEA dataset identifier), Data Source, Data Type (raw or digitized), Site ID (in the source datasets), Dataset ID (in the LegacyPollen 1.0 dataset), Site Name, Location (longitude, latitude, elevation, and continent), Archive Type (e.g., peat, lake sediment core), Site Description (from original publication/Neotoma), and Reference. All site-specific metadata are available at PANGAEA (https://doi.org/10.1594/PANGAEA.929773; Herzschuh et al., 2021) in the "Further details" section (Metadata of the LegacyPollen dataset.csv).

Structure of pollen data
Sample-specific pollen metadata for the 2831 sites include depth, age (according to Li et al., 2022; minimum age, maximum age, mean age, median age), and harmonized taxon names with count and percentage data. To ease data handling, pollen counts and pollen percentages are stored in separate files, and separate files are provided for each region (western North America, eastern North America, Europe, Asia, Latin America, Africa, and the Indo-Pacific) in both CSV and TXT format. In total, 28 pollen data files (7 regions × 2 data types × 2 formats) are published at PANGAEA (https://doi.org/10.1594/PANGAEA.929773 in the "Other version" section; Herzschuh et al., 2021) and can be joined by the Dataset ID with other data products. Furthermore, we also provide the taxa harmonization table at PANGAEA (https://doi.org/10.1594/PANGAEA.929773, in the "Further details" section; Herzschuh et al., 2021).
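Joining pollen samples to site metadata by Dataset ID could look like the following Python sketch. The two inline tables are miniature hypothetical stand-ins, and the column names (`Dataset_ID`, `Site_Name`, and so on) are illustrative assumptions rather than the headers of the published files.

```python
import csv
import io

# Hypothetical miniature stand-ins for a site metadata file and a pollen percentage file.
METADATA_CSV = """Dataset_ID,Site_Name,Continent,Archive_Type
1,Lake A,Europe,lake
2,Bog B,Asia,peat
"""
POLLEN_CSV = """Dataset_ID,Depth_cm,Median_Age,Betula,Poaceae
1,10.0,1000,40.0,60.0
1,20.0,2100,55.0,45.0
2,5.0,300,10.0,90.0
"""

def join_on_dataset_id(metadata_text, pollen_text):
    """Attach the site metadata row to every pollen sample with the same Dataset ID."""
    sites = {row["Dataset_ID"]: row for row in csv.DictReader(io.StringIO(metadata_text))}
    joined = []
    for sample in csv.DictReader(io.StringIO(pollen_text)):
        site = sites.get(sample["Dataset_ID"])
        if site is not None:
            joined.append({**site, **sample})
    return joined

rows = join_on_dataset_id(METADATA_CSV, POLLEN_CSV)
for row in rows:
    print(row["Site_Name"], row["Archive_Type"], row["Depth_cm"], row["Median_Age"])
```

For the real files, the same pattern would apply after replacing the inline strings with file handles and the column names with those found in the downloaded headers.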

Spatial and temporal coverage of the dataset
Of the 2831 records included in LegacyPollen 1.0, 670 records originate from eastern North America (<105° W; Williams et al., 2000), 362 from western North America, 1075 from Europe, 488 from Asia, 150 from Latin America, 54 from Africa, and 32 from the Indo-Pacific (Fig. 1). Most records (2659 records, 93.9 %) are from the Northern Hemisphere, where the main vegetation and climate zones are covered.

Harmonized taxonomy
A total of 10 110 terrestrial pollen taxa or taxa notations were obtained from the 2831 records, which we condensed to 1002 families or genera through taxonomic harmonization (Fig. 3; Appendix Fig. A1). On average, 10.8 original taxa or taxa notations are covered by one harmonized pollen taxon, ranging from 1 to 599 (median: 2). Overall, Asteraceae (599), Fabaceae (437), and Apiaceae (276) are the pollen taxa with most variants.
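Summary statistics of this kind (mean, median, and the taxa with most variants) can be derived from a harmonization table as sketched below. The table excerpt is hypothetical and far smaller than the published one, so the printed numbers are illustrative only.

```python
from collections import Counter
from statistics import mean, median

def variants_per_harmonized_taxon(harmonization_table):
    """Count how many original names or notations map to each harmonized taxon."""
    return Counter(harmonization_table.values())

# Hypothetical excerpt: original name or notation -> harmonized taxon.
table = {
    "Artemisia": "Artemisia",
    "Artemisia-type": "Artemisia",
    "Compositae": "Asteraceae",
    "Asteraceae undiff.": "Asteraceae",
    "Taraxacum-type": "Asteraceae",
    "Alnus": "Alnus",
}

variant_counts = variants_per_harmonized_taxon(table)
print("mean:", mean(variant_counts.values()))
print("median:", median(variant_counts.values()))
print("most variants:", variant_counts.most_common(1))
```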
The biggest differences between the taxa names and notations before harmonization and those after harmonization can be found in Europe (with a mean of 42 variants per harmonized taxon) and in eastern and western North America (average of 22), with both regions also exhibiting the highest record density (Fig. 4). A large number of tropical and subtropical tree and shrub taxa can be found in the Southern Hemisphere; these are harmonized to genus level and are therefore subsumed into fewer harmonized taxa, and they have a higher taxa diversity overall than the Northern Hemisphere continents. In the Southern Hemisphere, the most taxa and variants are harmonized for Fabaceae, as this is the most common family found in tropical rainforests and dry forests of Latin America and Africa.
Europe has the most harmonizations of herbaceous taxa from open landscapes, e.g., Asteraceae, Apiaceae, and Caryophyllaceae. In North America and Asia, several species or species groups of major woody taxa are harmonized to their respective genus levels, e.g., Alnus and Acer in North America and Betula and Quercus in Asia. The Pinus Haploxylon and Diploxylon subgenera are subsumed into the genus level Pinus, as the differentiation to subgenus level is not provided consistently.

Code and data availability
The data are published in the PANGAEA repository (https://doi.org/10.1594/PANGAEA.929773, in the "Other version" section; Herzschuh et al., 2021) in both comma-separated values (.CSV) and tab-delimited text (.TXT) formats for the LegacyPollen 1.0 dataset of counts per continent and the LegacyPollen 1.0 dataset of percentages per continent. Site metadata, as well as a taxa harmonization master table, are provided in the "Further details" section.
The R code for taxa harmonization is stored on Zenodo (https://doi.org/10.5281/zenodo.5910972; Herzschuh et al., 2022). Downloading data from the Neotoma Paleoecology Database, harmonizing the pollen taxa, and assigning ages to sample depth data to create customized datasets can thus be easily done.

Discussion
Quality of the LegacyPollen 1.0 dataset
To our knowledge, LegacyPollen 1.0 is the largest harmonized fossil pollen dataset; it includes more than twice the number of records integrated into previously published datasets (e.g., Fyfe et al., 2009: 1032 records; Trondman et al., 2015: 636 records; Marsicek et al., 2018: 642 records; Giesecke et al., 2019: 749 records; Mottl et al., 2021: 1181 records; Githumbi et al., 2022: 1128 records). Several regions have poor pollen-record coverage either because no records are available due to the scarcity of suitable archives (e.g., continental interiors) or because available records were not compiled and integrated into Neotoma. Ongoing initiatives to compile pollen data from Africa and Latin America will allow a straightforward extension of the LegacyPollen 1.0 dataset using the provided framework.
A further advantage of the LegacyPollen 1.0 dataset is that it is accompanied by consistent metadata, allowing subsetting of the dataset. Aside from information about the location and archive type, the metadata also include sample ages that were inferred from recently revised chronologies (Li et al., 2022) along with their age uncertainties (i.e., output from BACON; Blaauw and Christen, 2011), and the framework and R code also allow customized reestablishment of the age-depth models.
Generally, the temporal coverage is good from about 14 ka cal BP. Rather few records cover the glacial period, which is mainly due to an absence of archives, as many lakes and peatlands were dry or covered by ice sheets. Marine isotope stage 3 is covered by many more records from Asia than from Europe and North America.
Taxonomic harmonization is required for multi-site synthesis studies (Fyfe et al., 2009; Trondman et al., 2015; Marsicek et al., 2018; Herzschuh et al., 2019; Routson et al., 2019; Mottl et al., 2021; Zheng et al., 2021; Githumbi et al., 2022). This is particularly true when numerical approaches are applied that measure the compositional dissimilarity between pollen spectra; for example, between fossil and modern sites for climate reconstructions using the modern analogue technique or regression methods, or among fossil records for beta-diversity studies (Birks et al., 2012). If taxa are not harmonized, an inferred high dissimilarity between two spectra may originate solely from differences in taxa nomenclature. On the other hand, if all taxa are harmonized to a taxonomic level that is too high, the ecological signal may be lost (Giesecke et al., 2019). We applied an intermediate level of harmonization, using growth form (i.e., woody vs. nonwoody) as additional guidance. We assume that our approach best reflects the typical presentation of pollen data, which is mainly limited by the pollen morphological features visible at 400× magnification using light microscopy and the typical taxa identification precision of most pollen analysts.
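The effect of unharmonized nomenclature on compositional dissimilarity can be made concrete with a small Python sketch. It uses the squared-chord distance, a dissimilarity measure commonly paired with the modern analogue technique; that choice, the function names, and the two spectra are assumptions made here for illustration and are not specified in the text above.

```python
import math

def squared_chord_distance(spectrum_a, spectrum_b):
    """Squared-chord distance between two pollen spectra given as proportions (0 to 1)."""
    taxa = set(spectrum_a) | set(spectrum_b)
    return sum(
        (math.sqrt(spectrum_a.get(t, 0.0)) - math.sqrt(spectrum_b.get(t, 0.0))) ** 2
        for t in taxa
    )

def rename(spectrum, mapping):
    """Apply a name mapping and sum proportions that end up under the same taxon."""
    renamed = {}
    for taxon, value in spectrum.items():
        key = mapping.get(taxon, taxon)
        renamed[key] = renamed.get(key, 0.0) + value
    return renamed

# Two identical assemblages recorded under different nomenclature (hypothetical values).
site_1 = {"Gramineae": 0.6, "Betula": 0.4}
site_2 = {"Poaceae": 0.6, "Betula": 0.4}
name_map = {"Gramineae": "Poaceae"}

before = squared_chord_distance(site_1, site_2)
after = squared_chord_distance(rename(site_1, name_map), site_2)
print("before harmonization:", before)
print("after harmonization:", after)
```

Before harmonization, the grass pollen is treated as two unrelated taxa and the spectra appear strongly dissimilar; after renaming, the distance drops to zero because the assemblages are in fact identical.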
Plant taxa distribution changes based on the mapping of pollen taxa can yield information about glacial refugia and past migration patterns, as, for example, previously implemented for Quercus (Brewer et al., 2002), Picea (van der Knaap et al., 2005; Zhou and Li, 2012), Larix (Cao et al., 2020), east Asian tree taxa (Cao et al., 2015), and European broadleaf forest (Woodbridge et al., 2014; Fyfe et al., 2015). With the establishment of LegacyPollen 1.0, a Northern Hemisphere-wide analysis of past changes in distributional ranges is now possible, which would help us, for example, to better understand the different postglacial colonization patterns of Larix in Europe, North America, and Siberia (Herzschuh, 2020). Such an understanding of past range changes can underpin conservation management via the use of species distribution modeling at a broad scale enhanced by the higher spatial resolution and larger extent of LegacyPollen 1.0.
Studies aiming at broad-scale pollen-based vegetation reconstructions can benefit from the harmonized LegacyPollen 1.0 dataset, including those performed via biomization approaches (Prentice et al., 1996), multisite ordination or classification approaches (e.g., two-way indicator species analysis; Hill, 1996; Fletcher and Thomas, 2007; Connor and Kvavadze, 2009), or approaches relating modern to fossil datasets (e.g., the modern analogue technique; Overpeck et al., 1985). Furthermore, quantitative vegetation reconstructions (e.g., the Regional Estimates of Vegetation Abundance from Large Sites (REVEALS) model; Sugita, 2007) can be easily implemented, as a synthesis of relative pollen productivity estimates is already available for the Northern Hemisphere. Such quantitative information about changes in taxa cover can be directly compared to vegetation model outputs (Dallmeyer et al., 2021) at regional to continental scales, which is a potentially more accurate approach than first translating pollen and model outputs to biomes.
Pollen-based climate reconstructions are the backbone of paleoclimate synthesis studies for the continents (Marcott et al., 2013; Marsicek et al., 2018; Routson et al., 2019; Kaufman et al., 2020a, b). The reconstruction of mean annual temperature (T_ann), mean annual precipitation (P_ann), and mean temperature in July (T_July) using LegacyPollen 1.0 as input is under way in the ongoing LegacyClimate 1.0 project. This will substantially increase the number of records and close data gaps in the global temperature datasets, thus enabling the evaluation of climate simulations at the hemispheric scale (Wu et al., 2013; Hao et al., 2019). It will contribute to the "Holocene conundrum" debate (Liu et al., 2014) and to the discussion of the relationship between temperature and precipitation change (Trenberth, 2011; Routson et al., 2019).
Human activities are an important driver of vegetation change, in addition to climate and other natural forces (Ellis and Ramankutty, 2008; Mottl et al., 2021; Pavlik et al., 2021). Deforestation during the Holocene period is of particular relevance, and, with the help of the LegacyPollen 1.0 dataset, this can now be investigated at the hemispheric scale. The harmonized chronologies of the LegacyPollen 1.0 dataset allow for the analysis of similarities and dissimilarities between continents in the temporal pattern of deforestation.