A database of databases for Common Era paleoclimate applications

Evans, Michael N.; Lücke, Lucie J.; Fan, Kevin J.; Zhu, Feng

doi:10.5194/essd-2025-364

Preprints

https://doi.org/10.5194/essd-2025-364

Preprints

01 Jul 2025

| 01 Jul 2025

Status: this preprint is currently under review for the journal ESSD.

A database of databases for Common Era paleoclimate applications

Michael N. Evans, Lucie J. Lücke, Kevin J. Fan, and Feng Zhu

Abstract. We present a database of curated databases (DoD2k version 1) developed for Common Era (1–2000 A.D.) paleoclimate research. The DoD2k leverages existing community efforts, many of which arise from the PAGES (Past Global Changes) 2k working group, and the codebase developed by the paleoclimate data informatics communities over the past decade. Using a common, compact set of terms for metadata and data management, we merge five existing curated databases. These individual curated databases represent a range of approaches, from single archive-single observation to multiarchive-multiobservation collections, and span a total of 14 archives, 49 data types, and 4613 records within the Common Era. We then use a multistage algorithm to remove duplicates, checking against a common set of metadata and comparison metrics. We illustrate the value of the DoD2k with two applications. In the first, we extract the moisture and temperature subset of records and perform an empirical orthogonal function (EOF) analysis on the resulting multi-archive, multi-observation dataset. In the second, we show that calcite speleothem oxygen isotopic composition is consistent with proxy system simulations. DoD2k may also be useful for paleoclimatic detection and attribution analysis using proxy system modeling, data assimilation, and deep learning for the development and testing of improved proxy system models.

Received: 18 Jun 2025 – Discussion started: 01 Jul 2025

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Download & links

Michael N. Evans, Lucie J. Lücke, Kevin J. Fan, and Feng Zhu

Status: final response (author comments only)

AC1: 'Updated python environment and codebase', Michael Evans, 03 Aug 2025

To assist reviewers and users, we have created a minimal python environment
dod2k-env.yml
which contains only those packages needed for running the dod2k functions and notebooks.
We have also checked and revised notebooks to replace absolute paths with relative paths.
Please see the file
https://github.com/lluecke/dod2k/blob/main/Quickstart.md
for how to get started with the dod2k environment, functions, notebooks and products. In particular, users of the S_analysis

notebook will need to create a directory called speleothem_modeling_inputs and retrieve and unzip modeling source

there.

Citation: https://doi.org/10.5194/essd-2025-364-AC1
RC1:
'Comment on essd-2025-364', Anonymous Referee #1, 20 Aug 2025

General comments (overall quality of the preprint)
Summary:

This manuscript presents a “database of PAGES 2k databases,” or dod2k, which provides a set of tools for translating between, and combining records from, PAGES 2k databases. This type of toolkit has long been needed and I’m grateful that the author team has taken it on. Overall, the manuscript and the set of Python functions and codes are very well thought out and of high quality. They present notebooks for the applications that they showcase in the manuscript, which are useful and straightforward, and the notebooks can be run while reading the manuscript side-by-side.
I did have some concerns regarding the clarity of the workflow as presented in the manuscript and the leap that users must make between reading the paper and actually using the code, which is not as straightforward as it could be. I have made a few suggestions below for clarifying the text and the workflow figure, and for providing a more thorough Quick Start guide and a simple tutorial (which can be as simple as an example set of more thoroughly commented notebooks walking through the creation of the dod2k_dupfree_dupfree dataset that the authors use for their examples in the manuscript). Critically, I believe the author team should do some beta testing of these tutorials and the QuickStart guide with a few new users who have had *no* involvement in the project whatsoever. The errors and troubleshooting that I encountered while reviewing most likely happened because when putting together the Github repo, readme, and quick start guides, the author team was already so familiar with the code and the conceptual workflow that it was easy to take certain steps for granted. These additions will make the database/toolkit (is it a database or a toolkit?...see below) more widely usable and improve the overall accessibility of the manuscript. My comments below mostly focus on making sure a reader of the manuscript can easily understand and follow along with the process of loading the databases and removing duplicates, as well as following the example applications given.
Is dod2k more accurately called a toolset than a database?

As catchy as ‘dod2k’ is, I go back and forth on whether the product here is actually a database. Actually I think it is more accurately called a set of tools. Yes the result of using the tools is to load all the databases’ record into one super-database, but dod2k does not have its own datasets and metadata. For example if a new temperature record is published, it would be submitted to the pages2k temperature database and added to the next release, rather than it being added to dod2k directly. In addition, the final “dod2k” product ultimately depends on decisions the user makes about duplicates—it is not the same set of records every time. So I think a more accurate title would be something like DT2k (databaseTools2k) or DIOP2k (Database InterOPerability 2k) or something similar. This is sort of semantic but the distinction is important when thinking through the process for updating the databases themselves plus the tools included in dod2k, see below.
Stewardship and updates to dod2k and the underlying databases:

The notebooks provided here are tailored to the most current versions of the Pages 2k databases and include some workarounds for the bugs or peculiarities that are specific to those databases and their specific versions (for example the Palmyra record in the pages2k load notebook). That means that if/when the individual databases are updated, the dod2k code may break or some sections of the notebooks may become moot, while other sections may need to be added for the new inevitable peculiarities that arise. Given that updating the databases themselves is always a large endeavor, what is the plan for updating dod2k/dt2k/diop2k when new versions of databases are released to the community? Will updates to dod2k be governed by the author team of this paper, or will community users contribute code? If the latter, who will review and commit those changes? Can this discussion be added to the text?
Usability of the dod2k tools and some additional resources for new users:

As is always the case with Python, the trickiest part especially for non-expert users is getting the environment set up and getting example scripts or notebooks to run without breaking. The paleoclimate community uses a variety of programming languages and it is likely that many (if not most) of the prospective users of dod2k have only a basic knowledge of Python and some (many?) users will not have any knowledge at all (compared with, for example, R, which seems to be more popular in the community). When I initially started to review the paper, I cloned the github repository and tried setting up the environment and running the notebooks, but then I started running into enough errors requiring troubleshooting that I did not feel the manuscript was ready to review. The editor reached out to the authors and they then posted a comment providing an updated .yml file for a python environment (dod2k-env.yml) and a QuickStart, both of which were very useful. I was able to successfully get most of the notebooks running. Still, there is an underlying issue which is that it feels that the dod2k workflow and tools need to be better clarified in the text, and probably more rigorously beta tested with users from the broader paleoclimate community before it is really ready for wide release. See my comments on section 2.2 and the quickstart guide below.
Something that would help immensely for all users (regardless of their python experience) would be adding some resources to allow users to get started with the dod2k tools. At minimum, it would be helpful to include with this publication a “tutorial” that is simply a set of notebooks that exactly reproduce the steps taken in this manuscript to produce the “dod2k_dupfree_dupfree” version of the dod2k dataset, which appears to be the one used for the examples in sections 3 and 4. These notebooks should include more extensive commenting to walk a user through the decisions that were made to produce that version of the dataset, including in the notebook directly some comments on the rationale/justification for each of the decisions about duplicate cases. I think this is what was intended with the README file in the duplicate-screened database that is created with the screening notebooks, but the examples in the github repo just have a simple README that lacks operator comments on the duplicate screening.
After doing this, a short section should then be inserted into the beginning of the Results section of the manuscript that describes the process of creating dod2k_dupfree_dupfree, with the names of the notebooks and the order in which they are used very clearly stated so that a reader can follow along and create their own version of dod2k_dupfree_dupfree if they like. This addition will also help with overall reproducibility of the figures in the manuscript.
For inspiration regarding tutorials, I suggest looking to the example of Pyleoclim, which provides a number of extremely accessible tutorials that are easy to run out of the box: https://github.com/LinkedEarth/PyleoTutorials

Specific comments (individual scientific questions/issues)
.yml file:
The original cfr-env.yml fails to complete a working environment because of a problem with pyvsl. The new yml file, dod2k-env.yml, successfully produces a working conda environment which is great. However the new .yml file did not include any version of jupyter notebook so that had to be installed separately. I recommend adding the jupyter and notebook lines back to the dod2k-env.yml that was previously in the older .yml file.
Comments on the process of removing duplicates:
The notebooks for identifying and handling duplicates take a very long time to run (at least 2 hours on my machine), and due to connection timeouts I was not able to successfully run the dup_detection notebook even with multiple attempts. This is even with loading only two of the databases. Aside from connection timeouts I also encountered this error: “File Save Error for dup_detection.ipynb. Failed to fetch.” It’s fine if code takes a long time to run, that’s just reality. But perhaps in the manuscript and in the QuickStart/readme files, it would be helpful to specify at the outset of the section on duplicate handling that there is a reasonable version of the dod2k dataset, dod2k_dupfree_dupfree, that is ready to be used should the user want to do so rather than starting out with making their own decisions regarding duplicates. Many users may want to skip the full duplicate detection/decision-making process at first and just jump in to exploring a version of the compiled dataset where some reasonable decisions have already been made (eg, dod2k_dupfree_dupfree). This will help people get started working with all the compiled datasets right away.
I did really appreciate that the csv file already exists in the Github so that a user can just proceed without running the full duplicate detection process. That was a nice touch. It may be helpful to clarify the name of that csv file in that cell of the notebook where it directs a user to comment out that cell if they wish to use an existing csv file.
In addition to the above, to promote useability I wonder if it is possible to change the way that duplicates are written to the csv file while it’s running, so that it’s possible to pick up where it left off if the notebook finishes only partway through? I get that this may not be possible, but it’s worth looking in to.
Some suggestions for the decision-making for duplicates: The process of prompting the user to manually make decisions is nicely thought out here and I appreciate the decision-making metadata for future reproducibility purposes. I find the figures a little tough to parse, though. Can both Y-axes for both datasets be plotted, and then specify which Y axis pertains to which record? Also, I could not quite tell what the grey line is exactly? I see that it is something about the differences between the records but it jumps around so wildly that it was sort of hard to tell, and differences were often very far from zero even with records that had high degrees of correlation. Plus it is a little distracting and makes it hard to look very closely at the underlying records. Could it be added to a sub-panel? And/or just explained better in terms of what the units are.
Finally, there seem to be a number of “dod2k_dupfree” folders with different initials appended to them (and one with “_dupfree” again appended), but no clear description of what the differences are (though I gather from the paper these are probably subsets of moisture and temperature records, and not someone’s initials…?). Can this go in a readme file somewhere?

Other specific comments on the manuscript and notebooks:
- In line 69, the text says “The framework is extensible and can incorporate new databases or updates to existing ones.” From what I can tell though, a user is supposed to use the versions of the databases that are stored in the github repository. That means the user should essentially download each of the databases from this github repository, rather than download them from their official public repositories or websites? Which ultimately means that as updates to the individual databases are released, this github repository has to then also be updated, yes? Can this be clarified? And, this is where it would be very helpful to include an overall discussion of plans to keep dod2k updated (see my “overall” comment above).
- Section 2.2: The workflow figure is great, but I found the actual description of the workflow in the text to be lacking some key, concrete information. For example, to load all of the databases, does one need to go through and execute each individual load notebook? Can you specify where those load notebooks are located (they aren’t in the main directory of the github repo, and I found them, but it would be easy to just specify here)? Can someone load just a few of the databases, or will that break things later? I started by just loading 2 of the databases, since the duplicate detection process was so lengthy I was hoping this would cut down on time. It wasn’t actually successful (see above) but if it had been, would that have worked, or does a user actually need to load in all the databases? I looked for concrete workflow examples in Section 2.5 and 2.6, and in the example applications in 3.2, but those still don’t list the complete workflow to reproduce the figures made here. In addition to the tutorial I mentioned above, can you also include more specifics in Fig 1, rather than just a conceptual overview? Or another figure that lists the specific workflow for one (or both) of the example applications, including the names of the notebooks? And finally, as I said under “overall comments” include a set of notebooks that walks through the creation of dod2k_dupfree_dupfree including the rationale for decisions about duplicates.
In addition to the text in section 2.2: The Quick Start guide that was added is helpful, but it should be revised to contain files in the order that one should use them, following the steps in the manuscript and in the workflow figure.
- Section 2.3: The way this is described here is not really accurate. The text says “A virtual environment for running the Python functions, scripts and Jupyter notebooks built within a Jupyterhub installation 105 (https://tljh.jupyter.org/en/) can be found at the aforementioned github repository in the file cfr-env.yml.” Looking at the Github repository, it does not seem like Jupyterhub is actually necessary. I believe this text should actually say “Users can create a virtual environment to run the Python functions, scripts, and Jupyter notebooks using the dod2k-env.yml file found at the aforementioned github.”
Unless there is some reason that users need to be using Jupyterhub? From what I can tell, users still need to download all the datasets and code in the github repository, right? If that’s not the case, that needs to be clarified here as well as in the README file on github.
Anyway, this language should be clarified and ideally, more depth given for how to get this set up on an individual’s laptop and a group server (probably the two most common approaches) so that it is easier for users to access dod2k and the notebook examples here. It is OK to assume some working knowledge of Python, but since the paleoclimate community will be approaching this database with very different experience levels, providing some concrete info on getting everything set up in a straightforward manner will go a very long way with the community. Clarify that all a user needs to reproduce what is done in the paper and then perform their own analyses, is the ability to use Jupyter notebooks, and to set up an environment with specific package versions etc.
For what it’s worth, I did not use JupyterHub, I just cloned the github repository to an Ubuntu server that I use, and worked through all the examples there. If people need instructions on how to use Jupyter lab or Jupyter notebooks there are some helpful tutorials by LinkedEarth that you can recommend in the manuscript and/or on the Github: https://github.com/LinkedEarth/PyleoTutorials
- One thing you may wish to consider, following Pyleoclim's footsteps, is to encourage first-timer users run the notebooks with myBinder. This way they do not need to install anything. I tried to use myBinder to run the dod2k notebooks in order to review this preprint in the first place, but ran into errors importing packages when trying to run the load scripts, so something is not right there. This would be very useful for the DoD2k user community and allows people to get started checking out the codebase without having to figure things out on their local machines.

Technical corrections (compact listing of purely technical corrections)
Just one minor correction:
Line 20: “may take tens to thousands of years to be fully realized” makes it sound like it may take tens of thousands of years for people to figure out internal climate variability. I am rather more optimistic than that. I suggest rephrasing.

Citation: https://doi.org/10.5194/essd-2025-364-RC1
- AC2: 'Reply on RC1', Michael Evans, 31 Oct 2025
  
  The comment was uploaded in the form of a supplement: https://essd.copernicus.org/preprints/essd-2025-364/essd-2025-364-AC2-supplement.pdf
  
  Citation: https://doi.org/10.5194/essd-2025-364-AC2
RC2:
'Comment on essd-2025-364', Julien Emile-Geay, 21 Aug 2025
Review of “A database of databases for Common Era paleoclimate applications” by Evans et al

Summary:
The article presents an attempt at synthesizing paleoclimate proxy records across 5 different databases with partial overlap. A detailed procedure for identifying and removing duplicates is described, and two applications of analysis on the unified database are shown. This careful work will be suitable for publication after minor revisions.

Scientific Comments:

Given the scope of the journal, my comments will focus on the data and associated code.
1) Since this is, in part, an attempt at standardization, I would like to point out that many (3/5) of the constituent databases have recently been updated on lipdverse.org, and now use terminology that espouses the community-sourced LinkedEarth ontology (https://linked.earth/ontology/), which itself uses a number of controlled vocabularies (https://lipdverse.org/vocabulary/). Some of these vocabularies have been aligned to relevant terms in the NCEI PaST Thesaurus (https://www.ncei.noaa.gov/products/paleoclimatology/paleoenvironmental-standard-terms-thesaurus). To the extent possible, it would be good to align the terms used in this study (cf Table 1) to those standards, and refer to them in the text so readers are more aware of them.

2) While the database itself will undeniably be useful for some applications, I believe the associated workflows are of greater value still. In particular, the workflow to identify and remove duplicates addresses a recurring issue in this line of work, and to my knowledge it is the first published instance of such a workflow being described in detail, and shared in code form.
Unfortunately, there is no universal standard for sharing workflows. It is very helpful that the authors made notebooks and auxiliary Python modules available through GitHub, but the notebooks are still a little rough around the edges (cf a lot of commented out old code) and lack a narrative. I would like to invite the authors to organize their cleaned-out notebooks as a JupyterBook, and share it through a gallery like PaleoBooks: https://linked.earth/PaleoBooks/. I believe the work will have greater visibility there, and will have more enduring value to the community.
Editorial Comments:

The paper is well written, though I have a handful of suggestions.
there is inconsistent terminology throughout the manuscript, sometimes referring to PAGES2K, or PAGES2k. The proper nomenclature is PAGES 2k (lowercase, with space).

L117: complimentary —> should be “complementary”

Table 2: it looks like the authors loaded the constituent databases from various static files. For most up to date information on PAGES 2k, Iso2k, and CoralHydro2k, it is recommended that they download the latest from lipdverse.org, as many updates were made this summer.

L167: “The evidently true duplicate records …”. How many such duplicates were found, and what fraction of the total number does that represent?

L238: “the sensor model in PRYSM (Dee et al., 2015),” —> It should be noted that this is the sensor model introduced by Partin et al (2013), http://dx.doi.org/10.1130/G34718.1.

Section 4: it is odd to put results in the discussion. I recommend renaming Section 4 “Applications” and having a very short Section 5 called “discussion” or “outlook” that incorporates what is currently in Section 4.3. Conclusions could be merged in there too.

L310: references should be parenthetical (\citep{}, not \citet{}).

L327: “ filtering by dictionary terms we have employed” —> filtering by THE dictionary terms we have employed (missing THE).

In surveying S_analysis.ipynb, if appears that the authors re-invented the wheel in how they implemented something as straightforward as linear regression. I recommend that they use statsmodels (https://www.statsmodels.org/stable/), as it will provide all the information the authors need via model summaries, and lead to a more lightweight notebook, less prone to errors since statsmodels has been more thoroughly vetted.

Julien Emile-Geay
Citation: https://doi.org/10.5194/essd-2025-364-RC2
- AC3: 'Reply on RC2', Michael Evans, 31 Oct 2025
  
  The comment was uploaded in the form of a supplement: https://essd.copernicus.org/preprints/essd-2025-364/essd-2025-364-AC3-supplement.pdf
  
  Citation: https://doi.org/10.5194/essd-2025-364-AC3
RC3:
'Comment on essd-2025-364', Nicholas McKay, 06 Sep 2025
In this manuscript, the authors describe the rationale and methodology behind the assembly of a “Database of Databases” for paleoclimate for Common Era, and include two use cases to demonstrate the potential utility of this merged data product. The manuscript is well written and illustrated and addresses a common problem: how to use a collection of related, but not custom-built, data compilations to address a problem that could benefit from a larger collection of data.

The authors identify metadata integration, including non-overlap and terminology differences, and duplicate handling as the primary challenges in this exercise. This is consistent with my experience. The approach and methodology for identifying, handling and tracking the choices made in the deduplication process is well done, and a valuable addition to the literature.

The code and data to load in the databases, align their terminology, and remove duplicates is all available, as is the code for the use cases and figures. As always, it’s great to have access to all of this to get into the details of how the authors did what they did. I thank the authors for following best practice here! Using the instructions I was able to run all the notebooks except one (“load_pages2k_vv2.ipynb”) , which seemed to hang during the `pdb = cfr.ProxyDatabase().fetch('PAGES2kv2')` command.

Having the codebase is very helpful, however I do think that it would be a challenge for others to try to build on this approach to add additional or alternate compilations using the same design. It’s certainly more of a reference with examples than it is a tool to easily create new compilations.

Overall, I think this is a worthwhile contribution and I think the community will find the DoD very helpful. Indeed, I expect the database itself (as opposed to the methodologies described to create) to be widely used and a starting point for many researchers keen to take a data-intensive look at many aspects of Common Era climate. And because of this expectation, I have a few suggestions to make that database both more useful, and less prone to misuse.

First, while I appreciate the need for a reduced set of metadata while integrating across datasets, there are three additional fields that I think are critically needed. Fortunately, this metadata is available in most of the original compilations so the authors would not be starting from scratch.

interpretation direction: A field that describes whether the variable is positively or negatively related to the interpreted variable. The need for this field is evident in the first use case, where the authors multiply the tree ring and coral d18O datasets by -1 to allow for more direct comparison with the other proxies. Although it’s true that these relationships due vary by proxy and archiveType, there are many examples of variable interpretations within archive and proxy classes. Lake sediment d18O is one example where this is often variable between lakes. Critically, there are also several examples in the literature where the interpretation direction was not properly applied, and this can lead to substantially wrong conclusions. Given its importance, including explicit interpretation direction for each interpreted dataset (and interpretation) is critical. After adding these metadata, I suggest replotting the PCA results using these metadata rather than the class-specific data.

Seasonality: Many of the datasets are interpreted to be more sensitive to climate variability during some parts of the year than others, and it’s very valuable to be able to filter by interpreted seasonality to test various hypotheses.

Variable name: The database includes both units (what units the variable was reported in) and proxy, which describes the type of physical, chemical, and/or biological systems that imprint climatic condition onto the archive, but not the name of the variable itself. This is often similar (or identical) to the proxy, but not always. For example, chironomid data are measured as count or assemblage data, but are included in the database as a calibrated temperature variable. So the proxy is correctly “chironomid” and the units are correctly “degC”, but the variable name should be “temperature” or something similar. This is particularly useful for making coherent axes labels using the metadata form the database.

My final major suggestion is that, to the extent possible, citation information for the original studies is provided in the manuscript and in the database. These data compilations are critical for large-scale study of the Common Era, but also tend to make it more difficult for the authors of the original study to be credited. Users will likely be unable to cite all of the studies if they’re using the whole databases, but many will only use a subset and it’s helpful to enable citation of the original studies whenever possible. Ideally this would be an additional field in the database.
Citation: https://doi.org/10.5194/essd-2025-364-RC3
- AC4: 'Reply on RC3', Michael Evans, 31 Oct 2025
  
  The comment was uploaded in the form of a supplement: https://essd.copernicus.org/preprints/essd-2025-364/essd-2025-364-AC4-supplement.pdf
  
  Citation: https://doi.org/10.5194/essd-2025-364-AC4

Michael N. Evans, Lucie J. Lücke, Kevin J. Fan, and Feng Zhu

Data sets

DoD2k Database of Databases for Common Era Paleoclimatology Michael N. Evans et al. https://doi.org/10.25921/sptp-g618

Interactive computing environment

DoD2k compile proxy database Python functions and Jupyter notebooks Lucie J. Luecke https://zenodo.org/records/15676256

Michael N. Evans, Lucie J. Lücke, Kevin J. Fan, and Feng Zhu

Viewed

Total article views: 1,523 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	BibTeX	EndNote
1,391	88	44	1,523	30	44

HTML: 1,391
PDF: 88
XML: 44
Total: 1,523
BibTeX: 30
EndNote: 44

Views and downloads (calculated since 01 Jul 2025)

Month	HTML	PDF	XML	Total
Jul 2025	211	25	21	257
Aug 2025	213	27	10	250
Sep 2025	865	11	4	880
Oct 2025	82	15	2	99
Nov 2025	20	10	7	37

Cumulative views and downloads (calculated since 01 Jul 2025)

Month	HTML	PDF	XML	Total
Jul 2025	211	25	21	257
Aug 2025	213	27	10	250
Sep 2025	865	11	4	880
Oct 2025	82	15	2	99
Nov 2025	20	10	7	37

Viewed (geographical distribution)

Total article views: 1,502 (including HTML, PDF, and XML) Thereof 1,502 with geography defined and 0 with unknown origin.

Country	#	Views	%

Latest update: 08 Nov 2025

Short summary

We present a database of databases (DoD2k) for Common Era (1–2000 A.D.) paleoclimate research. The DoD2k leverages existing code and 5 curated databases, eliminates duplicates, and contains 4613 records. We analyze for common features across moisture and temperature sensitive records, and we test cave carbonate data simulations against observations. DoD2k is expected to be useful for detecting climate change on decadal timescales and for improving data models and paleoclimate reconstructions.


Total:	0
HTML:	0
PDF:	0
XML:	0