the Creative Commons Attribution 4.0 License.
The Reading Palaeofire Database: an expanded global resource to document changes in fire regimes from sedimentary charcoal records
Sandy P. Harrison
Jennifer L. Clear
Kevin J. Edwards
Mary E. Edwards
Graciela Gil Romera
Martin P. Girardin
José Antonio López-Sáez
Eniko Katalin Magyari
Cécile C. Remy
Pierre J. H. Richard
Charles E. Umbanhowar
- Final revised paper (published on 11 Mar 2022)
- Supplement to the final revised paper
- Preprint (discussion started on 19 Aug 2021)
- Supplement to the preprint
RC1: 'Comment on essd-2021-272', Daniel Gavin, 08 Sep 2021
CC1: 'Reply on RC1', John Williams, 16 Sep 2021
I’m adding a comment to Dan’s review, in my role as one of the leaders of the Neotoma Paleoecology Database and its Leadership Council, to add follow-on information about Neotoma and how we might be able to help:
*Neotoma is a community-curated data resource, in which coalitions of expert Data Stewards upload data to Neotoma and curate data in Neotoma. These coalitions are organized as Constituent Databases, which usually represent a particular type of data, perhaps bounded by region, e.g. the International Ostracode Database, the European Pollen Database, the North American Pollen Database. In this framework, it would be quite possible for the Reading Paleofire Database to join Neotoma as a Constituent Database and become the body charged with uploading and curating (and, effectively, governing) charcoal data uploaded to Neotoma.
*As Dan noted, Neotoma’s primary mission is to store original measured variables, such as charcoal counts. Neotoma rarely stores derived variables, such as e.g. influx. This focus on primary data allows Stewards to focus on catching and removing the kind of errors that Dan flags. In this approach, derived variables such as influx are calculated outside of Neotoma, by using Neotoma software services to extract data from Neotoma and then calculate the derived variables in, e.g. R. This is the same approach followed by paleoclimatological users of Neotoma data, who pull the raw data from Neotoma and then apply paleoclimatic transfer functions externally. This approach supports reproducible workflows in which everyone starts with the same raw data and then can build their own analytical pipelines to conduct their particular analyses of interest.
*All contributions of data to Neotoma are voluntary. We try to encourage data contributions by providing a series of services to contributors and users. These include: 1) All datasets in Neotoma are assigned persistent unique digital identifiers (DOIs) with associated landing pages (e.g. https://data.neotomadb.org/14194). 2) Neotoma data can be searched and retrieved through third-party portals such as NOAA’s National Centers for Environmental Information – Paleoclimatology (https://www.ncdc.noaa.gov/paleo-search/) and the Earth Life Consortium (https://earthlifeconsortium.org/). 3) The Neotoma APIs (https://api.neotomadb.org/api-docs/) and R package (https://cran.r-project.org/web/packages/neotoma/index.html) provide several options for large-scale search and retrieval of Neotoma data. The map-based graphical user interface Neotoma Explorer (https://apps.neotomadb.org/explorer/) supports quick browsing and viewing of data. 4) Snapshots of the full Neotoma database are periodically posted to Neotoma’s home page in PostgreSQL (https://www.neotomadb.org/snapshots) and to FigShare (https://figshare.com/search?q=neotoma+database). 5) Support of multi-proxy paleoecological datasets, so that e.g. charcoal datasets can be stored and analyzed in conjunction with other proxies such as fossil pollen.
*Neotoma is endorsed as a repository by the US National Science Foundation, the US Geological Survey, Past Global Changes (PAGES), and the American Quaternary Association. Neotoma is registered as an American Geophysical Union COPDESS data resource and is accredited by the ICSU World Data System, with a CoreTrustSeal accreditation pending. Data uploaded to the Neotoma relational database are housed at the Center for Environmental Informatics at Penn State and are protected by data backup and archiving policy that facilitates swift backup and long-term archiving. Data are protected by multiple measures, including redundant disk storage, off-site mirroring, file-system snapshotting, regular tape backup, and duplication of the backup set.
*There are three potential barriers to having the Reading Paleofire Database join Neotoma as a Constituent Database, all of which are real but have potential solutions:
1) The time involved in preparing data for upload via Tilia and doing the associated quality checks. Some of this time needs to be spent no matter what, given the issues flagged in Dan’s review. Neotoma has some resources to help: A) We often run Tilia training workshops for interested Data Stewards; B) we have a network of Stewards that can offer advice and help troubleshoot, and probably could prepare a few demo charcoal datasets as examples; and C) we have an active Slack channel (https://join.slack.com/t/neotomadb/shared_invite/zt-cvsv53ep-wjGeCTkq7IhP6eUNA9NxYQ) where people can post questions and solutions.
2) Data governance – who makes decisions about the data? This issue is often a concern to first-time data contributors. Neotoma addresses this through its Constituent Databases, in which sets of Data Stewards have authority to upload and modify data of a particular type, but are not allowed to modify other data types. For example, a vertebrate paleontologist Data Steward cannot modify a fossil pollen record, nor can a palynologist Steward modify a vertebrate record. Neotoma also keeps track of which Stewards have worked on which records, so that curatorial decisions (e.g. adding a new age-depth model) are linked to the people making the decisions.
3) Data completeness in the Paleofire Database. Neotoma does require, for example, that original age controls such as radiocarbon dates be uploaded with the dataset. Counts are strongly preferred over derived variables. This may mean that not all records in the Paleofire database can be uploaded to Neotoma.
All of the above is informational. I personally see a strong logic for adding the charcoal database to Neotoma, for the reasons that Dan outlines, and particularly to better support multi-proxy paleoecological research. I’m happy to help if there’s interest in connecting the resources. But, it’s up to the authors, of course, to decide how best to proceed.
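The workflow described above, in which derived variables such as influx are calculated outside the database from raw counts, can be sketched as follows. This is a minimal illustration in Python rather than R, and it assumes a simple interval-based definition (concentration multiplied by the sedimentation rate between adjacent samples). The class name, field names and sample values are hypothetical and are not taken from Neotoma or the RPD.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CharcoalSample:
    depth_cm: float    # sample depth in the core
    age_yr: float      # modelled age at that depth (e.g. cal yr BP)
    count: float       # raw charcoal particle count
    volume_cm3: float  # analysed sediment volume


def concentration(sample: CharcoalSample) -> float:
    """Particles per cm3 of sediment."""
    return sample.count / sample.volume_cm3


def influx(samples: List[CharcoalSample]) -> List[float]:
    """Influx (particles cm^-2 yr^-1) for every sample except the last.

    Assumption: the sedimentation rate for sample i is taken from the
    interval between sample i and sample i + 1.
    """
    ordered = sorted(samples, key=lambda s: s.depth_cm)
    result = []
    for upper, lower in zip(ordered, ordered[1:]):
        thickness = lower.depth_cm - upper.depth_cm
        duration = lower.age_yr - upper.age_yr
        if duration <= 0:
            raise ValueError("ages must increase with depth")
        result.append(concentration(upper) * thickness / duration)
    return result
```

Because influx depends on the age model, swapping in a different chronology only requires changing the `age_yr` values and rerunning the calculation; the stored counts and volumes stay untouched.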
AC2: 'Reply on CC1', Sandy Harrison, 26 Sep 2021
Response to Jack Williams
Thank you, Jack, for providing additional information and for your willingness to encourage the incorporation of data from the RPD in Neotoma. I strongly support the idea that these data should be lodged with Neotoma. However, as explained in the response to Dan's comments, my feeling is that we should encourage the individual contributors to lodge this data rather than uploading the RPD itself.
The RPD was created in order to conduct analyses on palaeofire regimes. Although we have quality controlled the data as far as possible, we cannot guarantee that all the data we have ingested is error free and indeed Dan has pointed out that there are errors with his sites. Thus, it would be better for the original data providers to lodge their data with Neotoma. We will do all that we can to encourage this. However, if the data contributors do not have the resource to do this, we would be happy to work with you to facilitate incorporating data that has been checked into Neotoma.
AC1: 'Reply on RC1', Sandy Harrison, 26 Sep 2021
Thanks, Dan, for your comments on the manuscript. We can certainly improve the current database and ms based on your suggestions, but our aim was not to create the perfect database but rather to create a dataset that we can use for meaningful analyses and that is both more comprehensive than existing resources and has fewer errors. We will clarify this in a revised version of the text. Furthermore, we will revise the text so as to indicate possible issues which a user needs to be careful about.
In answer to your general points:
(1) The relationship between the RPD, Neotoma and the GCD
Our main purpose in putting these data together was to create a database that could be used for a series of planned analyses. We were driven to do this because the charcoal data coverage in Neotoma is limited, we recognised that there were problems with some of the sites we are working with in the various versions of the GCD, and there are a lot of data that we wished to use in our analyses that are currently not in any database or long-term repository. It is not our intention to create a permanent data repository for charcoal data and we agree that people generating charcoal data should ensure that they lodge their data with a long-term repository such as Neotoma. Since we are aware that there are still errors in the data set we have put together, we are reluctant to upload the RPD as a whole to Neotoma. However, we have encouraged individual data contributors to lodge their records in Neotoma or another suitable data repository. We are also happy for the data we have assembled in the RPD to be incorporated into Neotoma and/or the GCD, so that they can be used by the wider community. We welcome Jack's positive encouragement to do this, and if it appears that individual data contributors do not have the resource to do this, we will work with Neotoma on the best way forward to ensure data are not lost.
(2) Universal application of BACON age-depth modelling.
Our main purpose in putting these data together was to create a database that could be used for a series of planned analyses. For this reason, we decided to use a standard approach to age modelling and to use the latest appropriate calibrations. We recognise that this approach may not be suited to every site and that users might want to use alternative approaches for their own analyses. For this reason, we provide the information about the dates available for each site in the table "date-info". This includes all radiocarbon dates, other radiometric dates, correlative dates and core top ages as provided in the original publications or by the authors of specific records. We also indicate when the original age model excluded specific dates and why this was done. In addition, we provide the age for each sample based on the author's original age model in the "chronology" table. Thus, the RPD parallels the Neotoma structure both in terms of archiving the base data to create age models and in terms of providing an alternative chronology. We will revise the text describing the construction of the new age models to make it clear that while we are providing the models (and the uncertainties associated with them), the user can access the original age models and can also use the dates provided to construct their own age models.
(3) Raw versus processed data.
We agree that the ideal is to archive raw data (count, area or mass). However, as you rightly point out, this was not available for all of the sites repatriated from the GCD. Specifically, 99.9% of the data repatriated from the GCD does not have raw data (855 out of 856 entity records). Furthermore, it was not available for all of the new sites included in the RPD. Specifically, raw data is not available for 77% of the new data we have included in the data base. During the construction of the data base, we have prioritised the inclusion of raw data (23%) or concentration data (54%) for new records wherever possible. There are some cases where we have both count and concentration data for the same records (n=24 from 9 different sites); we can remove the concentration data for all but 2 of these records for which we do not have information on sample size. In some cases, we have been able to replace influx measures with raw or concentration data for existing records taken from the GCD (n=43 sites). We may not have done this for all sites where the raw data are available, and this should certainly be a priority for future improvements to the data set. We agree that it is necessary to be careful in making analyses with the RPD data to ensure that we don't double calculate influx or concentration. We will add a caveat about this in the text.
(4) Errors in repatriated data.
The five sites that you list as containing errors were taken directly from the GCD and we apologise for not checking directly with you about these sites. We can certainly correct this information before publication of the RPD. The broader issue here of course is how many errors there might be in the rest of the data taken from various versions of the GCD. Given that our goal is to use the data for analyses, rather than to construct a permanent data base, we hope that these errors will be trapped and corrected as we go forward. However, we will correct the errors that you have pointed out in the data for Yahoo Lake, Cooley Lake, Clayoquot Lake, Rockslide Lake and Yahoo Lake. We will also check the Neotoma holdings and see if these provide additional raw count data that can be used to update the RPD records.
Response to specific suggestions:
(1) Inclusion of Neotoma IDs. We originally included the GCD ids for various sites, but this was confusing because the ids changed between versions of the GCD. We do not include the Neotoma ids for individual sites because so few of the sites are currently in Neotoma. However, we do include a field in the entity table which identifies the source of the data (i.e. whether it was from Neotoma, a specific version of the GCD, or a new contribution from one of the co-authors) and this should make it possible for users to track back and find the original data. This will also facilitate them being able e.g. to combine charcoal and other types of environmental data archived at Neotoma.
(2) Checking measurement units and changing to raw values where possible. The co-authors have already checked sites which they contributed, and we have included raw values where these are still available. We have expended a considerable effort on data checking for other sites but agree that we can and should do further checks. However, the use of the data compilation is the ideal washing machine here and we are sure that it will be easier to clean up the data as errors become apparent through use. In addition to the corrections for the five sites listed above, and checking of the Neotoma holdings, we will run a further check for measurement units for the new sites in the data base (currently 50% of the records).
(3) Analytical sample volume. We agree that it is relatively simple and that it would be useful to separate the size and the units here and will implement this. We will take the opportunity to standardise the units further e.g. to remove units that are expressed as multiples (ax100, ax1000) and to convert different weights (e.g. mg, g, kg) to a standard unit (g). We will add text to point out that these conversions have been made so that the reported data might not appear to be the same as previously published data.
(4) Separate counts, volume, concentration, influx etc. We did not do this because of the tendency for people to provide entries for all columns, which could lead to confusion if influx is recalculated using new age models. Rather than create separate columns for each type of count, we will focus on ensuring that the information given is correct and in trying to obtain raw data wherever possible.
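As an illustration of the unit standardisation described in point (3) above, the sketch below converts mass values reported in mg, g or kg to grams. It is only an assumed approach: the function name and the conversion table are hypothetical and do not describe the scripts actually used for the RPD, and the handling of units expressed as multiples is not covered.

```python
# Conversion factors from each supported mass unit to grams.
MASS_TO_GRAMS = {"mg": 0.001, "g": 1.0, "kg": 1000.0}


def to_grams(value: float, unit: str) -> float:
    """Convert a mass to grams; raise ValueError for unsupported units."""
    key = unit.strip().lower()
    if key not in MASS_TO_GRAMS:
        raise ValueError(f"unsupported mass unit: {unit!r}")
    return value * MASS_TO_GRAMS[key]
```

Rejecting unknown units outright, instead of passing the value through unchanged, makes records that still need manual checking easy to find.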
RC2: 'Comment on essd-2021-272', Anonymous Referee #2, 22 Nov 2021
This manuscript presents the Reading Palaeofire Database (RPD), a SQL database of palaeofire records. The authors took great care in collating new palaeofire records, combining these new records with published databases (such as the Global Charcoal Database (GCD) and its various versions), correcting errors, collecting metadata, and assigning consistent and up-to-date chronologies to each record.
Overall, I readily acknowledge that the RPD was a formidable undertaking and that its publication could be very useful to the palaeofire field. However, I do have several concerns which would need to be addressed before I could recommend publication. In addition to more specific comments regarding components of the manuscript and its associated SI, I also have several more general comments which are made below. I would also like to state that I broadly agree with the comments made by Drs Gavin and Williams. As such, I would urge the authors to note the recurrent themes in my comments/criticisms of this manuscript and the associated RPD data product.
The manuscript in its current state is ambiguous with regard to the RPD’s relationship with the Global Palaeofire Database. I mean this both in a direct sense (i.e. how much of the data in the RPD was directly pulled from the GPD, both earlier versions and the most up to date web version?) as well as in a logistical sense (was the Global Palaeofire Working Group involved in the creation of the RPD?). Statements about these relationships are ambiguous (see specific comments below regarding L98-99, 266-269, and SI Table 1). For example, the current presentation of the RPD does not allow the viewer to tell from where each dataset came. Despite the assertion that the RPD “is a community effort,” it is rather unclear how members of this community (e.g., the Global Palaeofire Working Group) were engaged and involved in the process. More broadly, I am very curious as to why the RPD is not being directly integrated into the existing community framework of the Global Palaeofire Database and Working Group?
Despite the merits of this manuscript and its associated RPD, I feel that if released in its current form (i.e. as a self-standing database) and without a plan for community integration, then this work could have a detrimental impact on the broader palaeofire field and the willingness of its members to share data. By downloading multiple community-driven databases that embody the spirit of open data, making improvements and expansions, and then creating a separate and less accessible database (see comment below), I would argue this is a step in the wrong direction. Why go through all of the work of improving community databases to then refuse to return the database to these same communities and in these same frameworks? Despite its flaws, the GCD’s web interface is significantly more ‘available’ than an SQL database (see comment below). The same would be true for Neotoma. As Drs Gavin and Williams have previously expressed in their comments, interfacing the RPD with existing community databases would achieve greater impact and utility for the palaeofire community.
Although I understand that SQL is open source and theoretically available to all, I assert that providing the data only in a SQL format poses an equity and accessibility issue. Use of the SQL database requires downloading a large program to read the files. To then use or access the data requires knowledge of a programming language to construct queries. I argue that this is an undue burden on accessibility. For anyone without knowledge of SQL, accessing the data would potentially require hours of instruction and learning. As a test case, it took nearly two hours for me to download and install MySQL, and then to import the data. Barring integration with community databases, I think the simplest way to address this issue would be to also provide text or csv files of the tables that make up the SQL database. This way, anyone (even those without knowledge of this database style and language) could access the data freely and easily. I feel that ignoring the hurdle that the SQL format poses to potential users of the RPD would represent undue gatekeeping in direct contradiction to ESSD’s aim of “furthering the reuse of high-quality data of benefit to Earth system sciences.” Alternatively, I reiterate the suggestions of Drs Gavin and Williams to integrate the RPD with either Neotoma or the Global Palaeofire Database.
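To give a sense of how little is needed to produce such csv files, here is a minimal sketch that writes each table of a database to its own file. It assumes a Python DB-API connection; the standard-library sqlite3 module is used as a stand-in in the usage note so the example runs without a server, whereas the RPD itself is distributed for MySQL and would need a MySQL driver. The function name and the table name `entity` in the usage note are illustrative only.

```python
import csv
from pathlib import Path
from typing import Iterable, List


def export_tables_to_csv(conn, tables: Iterable[str], out_dir: str) -> List[Path]:
    """Write one CSV file per table and return the paths written.

    `conn` is any DB-API 2.0 connection. Table names are supplied by the
    caller and checked to be plain identifiers before being used in SQL.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for table in tables:
        if not table.isidentifier():
            raise ValueError(f"unexpected table name: {table!r}")
        cur = conn.cursor()
        cur.execute(f"SELECT * FROM {table}")
        header = [col[0] for col in cur.description]  # column names
        path = out / f"{table}.csv"
        with path.open("w", newline="", encoding="utf-8") as fh:
            writer = csv.writer(fh)
            writer.writerow(header)
            writer.writerows(cur.fetchall())
        written.append(path)
    return written
```

With an open connection `conn`, a call such as `export_tables_to_csv(conn, ["entity"], "rpd_csv")` would write `rpd_csv/entity.csv` with a header row followed by the table contents.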
Manuscript main text – Specific Comments:
L24-30: References? These statements should be supported by relevant literature.
L33-36: Same comment as above.
L66: I think a clear statement as to the logistical relationship between RPD and the Global Palaeofire Database as well as the GPWG is needed here. Was the RPD as created by these authors done so independently from the GPWG? Was there community involvement and input during the creation of the RPD?
L98-99: Does this mean that any/all charcoal datasets which were not previously publicly available were provided by one of the authors of this article? Relating to my query regarding L66 above, I think a statement is needed to explain who the authors of this article are and what constituted a ‘contribution’ to this article, especially as the bulk of the RPD is derived from the GCD and GPWG. Were original authors of datasets in the GCD version not included as co-authors by virtue of their having made their data publicly available?
L103-108: Although the download of MySQL and the importation of the RPD was relatively straightforward and aided by the documentation provided by the authors, I wonder if this format poses an equity and accessibility issue. Namely, could the tables also be provided as csv or text documents so that those less technologically inclined could still view the RPD without using SQL queries?
L126-127: However, in many cases, I suspect that the accuracy of GPS coordinates for sites is not this great. How can trailing nought values in these latitudes and longitudes be differentiated as being reflective of accurate GPS location versus merely artefacts of non-exact GPS?
L159-160: Charcoal measurements are not always made in terms of volume (e.g., by dry mass basis). How were these types of measurements integrated into the RPD? Or were they simply omitted?
L266-269: This sentence is misleading as both versions 3 and 4 are several years old and many datasets have since been added to the online version of the GPD. E.g. as of 16 November 2021, the GPD contains 1231 cores. This is not a fair comparison. A more direct measure of the RPD’s value would be the number of new records (not sourced from the GPD or any earlier versions of the GCD).
L326: Here again, the ‘community’ needs to be defined (similar to my comment regarding L66 and L98-99).
L329: It is very difficult to judge whether there is actually expanded coverage (as per my comments regarding L266-269). More direct comparison or quantifications are needed to assess the validity of this statement.
Tables 1, 2, and 4: For the fields which were ‘Selected from predefined list’, it would be prudent to provide the choices contained within these lists. Upon reading the SI, I see there are tables containing these. It might be good to note this in the main text.
Figure 4: The legend text is fairly small and hard to read, please consider making this text larger.
Supplementary Information – Specific Comments:
SI Table 1: An extremely useful and important field that is missing from this table is the source of each site (i.e. which version of the GCD did each come from, was it a new addition by the authors of this article, etc.).
SI Table 1: I believe that all of the datasets coming from the NOAA database are inaccurately cited. As per the NOAA database site: “Please cite original publication, online resource and date accessed when using this data. If there is no publication information, please cite Investigator, title, online resource and date accessed.” For example, Bass Lake Kandiyohi County should be attributed to Marlon and Umbanhowar.
SI Table 1: There appear to be duplicates of the Wild Tussock site.
- AC3: 'Reply on RC2', Sandy Harrison, 05 Jan 2022
This data paper presents a new SQL database of sedimentary charcoal records. An important value and main motivation of this database, and of earlier versions of this database, is to understand the history of fire in the Earth system, at regional to global scales. As presented in the manuscript, there was considerable effort placed on quality checking, development of new age models, and adding site and entity metadata. The new database represents a large improvement in the breadth and depth of past databases. The authors claim several issues of data errors in the most recent database (the GCD, on paleofire.org). While paleofire.org provides a user-friendly interface to the data, I agree that metadata on chronological control is lacking.
We thus have a situation where there are multiple databases with different PIs sponsoring the databases. To be up front regarding conflicts of interest, I have no vested interest nor history of involvement in these databases except in their early genesis > 12 years ago when I participated in the first publications of this database. Since then, my main interest has been on site-level interpretation, or at most intercomparisons of a few sites. Thus, I am familiar with the nature of these data and I appreciate the goal of using these data in Earth-system science.
I found the manuscript to sufficiently describe the data. Modest database skills are required to read the SQL database. As described by the authors, the database is unique given the additional sites, age models, and metadata added. However, it is built from existing databases and my comments below address this issue.
My first concern is parallels with the Neotoma Paleoecology Database. The RPD has a data table structure that parallels Neotoma. All the variables in the RPD map directly to a variable in a table in Neotoma. The metadata forms in Tilia (which uploads to Neotoma) and the chronology information are much more thorough in Neotoma than in the RPD. Furthermore, many (if not most) of these records contain other data types (pollen, geochemistry) that may be archived in Neotoma. Thus, the effort at age model development would better serve the research community had the data been uploaded into Neotoma. Of course, this can still be done. As stated on the Neotoma database website: "Neotoma will enable joint analysis of multiproxy datasets to address paleoenvironmental questions that transcend those possible with single-proxy databases." Currently, only 25 sites in Neotoma have macrocharcoal data, and 19 sites have microcharcoal data. However, many of the entities in the RPD exist in Neotoma for pollen and other data types. Neotoma is fully capable of defining all >100 types of units for measurements (area, counts, size fractions). I understand the history that led to this situation: the GPD began many years before Neotoma was archiving charcoal data. I now think Neotoma is the proper home for sedimentary charcoal data.
However, there are limitations in the Neotoma database with respect to charcoal data for research. The API interface for Neotoma, as exists in the Neotoma package in R, can batch-download datasets. The API interface for downloading charcoal data may not be functioning for charcoal datasets. I am not very familiar with this interface. Once functional, R scripts could generate the charcoal data and various other data from the same entities (e.g., LOI, pollen of fire-sensitive taxa). Thus in the absence of this interface, stand-alone databases such as this are needed.
A second concern is the universal application of Bacon age-depth modeling. Bacon is superb when dating density is high enough to result in overlapping PDFs of the calibrated ages. In contrast, in situations when, for example, a 14,000-year record is dated by only five well-separated dates, in my experience, Bacon produces interpolations that are indistinguishable from linear interpolation. In such cases, spline fits using, for example, Clam, will not result in abrupt changes in sedimentation rate at the dated depths and will not miss dates that are a little out of line. Often, original authors have put considerable effort into the unique situations of their sediment record, and thus applying a canned approach (as described by the ageR package) may not be as desirable as simply updating the original model with INTCAL20 calibration.
This second concern provides another argument for Neotoma for dataset archiving. Neotoma has separate geochronology tables (a list of all dates measured) and chronology tables (the age controls used for a particular age model or "chron"). New "chrons" may be added to existing data sets.
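For readers unfamiliar with the baseline mentioned above, the following is a minimal sketch of a linear-interpolation age-depth model. It is not Bacon, Clam or the ageR workflow; it simply joins the dated depths with straight lines, which is why the sedimentation rate changes abruptly at each control point. It assumes the control ages are already calibrated, and the function names are hypothetical.

```python
from bisect import bisect_right
from typing import Callable, Sequence


def linear_age_model(control_depths: Sequence[float],
                     control_ages: Sequence[float]) -> Callable[[float], float]:
    """Return a function mapping depth to age by linear interpolation.

    Assumptions: at least two control points, distinct depths, and ages
    that have already been calibrated. No extrapolation is attempted.
    """
    pairs = sorted(zip(control_depths, control_ages))
    depths = [d for d, _ in pairs]
    ages = [a for _, a in pairs]
    if len(depths) < 2:
        raise ValueError("need at least two age controls")
    if any(d1 <= d0 for d0, d1 in zip(depths, depths[1:])):
        raise ValueError("control depths must be distinct")

    def age_at(depth: float) -> float:
        if not depths[0] <= depth <= depths[-1]:
            raise ValueError("depth lies outside the dated interval")
        # Index of the control point at the bottom of the enclosing segment.
        i = min(bisect_right(depths, depth), len(depths) - 1)
        d0, d1 = depths[i - 1], depths[i]
        a0, a1 = ages[i - 1], ages[i]
        return a0 + (a1 - a0) * (depth - d0) / (d1 - d0)

    return age_at
```

With controls at 0, 100 and 200 cm dated to 0, 1000 and 4000 years, the implied accumulation rate drops from 0.1 cm per year to roughly 0.033 cm per year exactly at the 100 cm control, which is the kind of abrupt change that a spline fit smooths out.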
Finally, a third concern is that this database does not store the raw data values at most sites, but rather some derived value. Specifically, the TYPE variable for most entities is 'concentration' or 'influx' or some other value whose meaning is very unclear ('other'). I assume this derives from the original data collection efforts in the original databases, in which raw data may not have been archived. (By 'raw data' I mean the measured value for each sample, e.g., 'count' or 'area', or 'mass'). Many statistics depend on these values (e.g., the CharAnalysis program). It is also just good practice to archive raw values, not some derived value. For example, the new age models cannot be used to recalculate new influx rates if the data are not stored as raw values (or at least concentration data). Scanning the database, less than 20% of the entities have 'count' or 'raw count' under TYPE, while most are concentration. Furthermore, converting from concentration back to raw values (multiplying by analytical_sample_volume) is not easily done because the 'analytical sample volume' is a string field with values such as '5cm3'. I note that there was no effort to locate the raw data values (as they are available for several sites that are reported as influx), but rather to use whatever units were provided in the earlier databases. Analysis of the RPD will need to be made very carefully to prevent errors such as double-calculating influx rates or concentrations.
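The back-conversion mentioned above could look something like the sketch below, which parses a volume string such as '5cm3' into a number and multiplies it by the concentration. Only the '5cm3' form is documented in this comment; accepting 'ml' as an equivalent unit is an assumption, and the function names are hypothetical.

```python
import re

# Matches values such as "5cm3", "2.5 cm3" or "1 ml" (the "ml" form is an
# assumption; only "5cm3" is documented).
_VOLUME_RE = re.compile(r"^\s*([0-9]*\.?[0-9]+)\s*(cm3|ml)\s*$", re.IGNORECASE)


def parse_volume_cm3(text: str) -> float:
    """Extract the numeric volume in cm3 from a string field."""
    match = _VOLUME_RE.match(text)
    if match is None:
        raise ValueError(f"cannot parse sample volume: {text!r}")
    return float(match.group(1))  # 1 ml equals 1 cm3, so no rescaling


def concentration_to_count(concentration_per_cm3: float, volume_text: str) -> float:
    """Recover the raw value as concentration multiplied by sample volume."""
    return concentration_per_cm3 * parse_volume_cm3(volume_text)
```

Strings that do not match the expected pattern raise an error instead of being guessed at, since a silently misread volume would propagate into every recalculated count.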
I examined five of my sites within the database. I found errors in each of these sites. Correct data exist on Neotoma and on my personal web page (for most):
It is interesting to surmise the sources of these errors. In my case, I provided the original raw data in spreadsheets which included measured values, concentration, and influx columns. Thus, over time, through different versions of databases, information was lost or changed. If these issues were detected in the five sites that I checked, there is reason to believe such issues exist with all the sites.
I recognize the effort placed into creating this database. I see much work was spent on adding new sites, the ageR age models, and developing the MySQL schema, and adding some new metadata fields. However, it is quite concerning that errors in old databases are perpetuated into new databases. I highly recommend Tilia and Neotoma as a means of correcting errors. The data upload process into Neotoma (performed by data stewards) involves several quality control procedures. Uploading to Neotoma is time consuming because 1) the errors in the GCD and RPD will preclude using a bulk upload method, requiring checking against original publications, and possibly contacting authors, and 2) the quality control checks in Neotoma often detect additional errors.
My recommendation for a major revision is for a large but very do-able job:
1) Check if the site and entity exist in Neotoma and provide a table that matches the entity to the Neotoma entity.
2) Check the measurement units and change values to raw measurement units where possible. Also provide correct thickness values wherever possible. The numerous coauthors should be able to help with this task.
3) Change the analytical_sample_volume to double and include a new variable for volume units or standardize all samples to cm3.
4) Have the sample table contain four columns for 1) value (e.g., count), 2) volume, 3) concentration, and 4) influx (calculated using the new age model when possible). In theory, columns 3 and 4 could be omitted and generated on the fly as needed. However, it will not be possible to have raw values for all sites, thus requiring these four columns to exist.