The IPY 2007–2008 data legacy – creating open data from IPY publications

The International Polar Year (IPY) 2007–2008 was a synchronized effort to simultaneously collect data from polar regions. As it was the fourth in a series of IPYs, the demand for interdisciplinarity and new data products was high. However, despite all the research done on land, people, ocean, ice and atmosphere and the large amount of data collected, no central archive or portal was created for IPY data. In order to improve the availability and visibility of IPY data, a concerted effort between PANGAEA – Data Publisher for Earth and Environmental Science, the International Council for Science (ICSU) World Data System (WDS), and the International Council for Scientific and Technical Information (ICSTI) was undertaken to extract data resulting from IPY publications for long-term preservation. Overall, 1380 IPY-related references were collected. Of these, only 450 contained accessible data. All data were extracted, quality checked, annotated with metadata and uploaded to PANGAEA. The 450 articles dealt with a multitude of IPY topics – plankton biomass, water chemistry, ice thickness, whale sightings, Inuit health, alien species introductions by travellers or tundra biomass change, to mention just a few. Both the Arctic and the Antarctic were investigated in the articles, and all realms (land, people, ocean, ice and atmosphere) and a wide range of countries were covered. The data compilation can now be found with the identifier doi:10.1594/PANGAEA.150150, and individual parts can be searched using the PANGAEA search engine (www.pangaea.de) and adding “+project:ipy”. With this effort, we hope to improve the visibility, accessibility and long-term storage of IPY data for future research and new data products.


Introduction
The International Polar Year (IPY) 2007-2008 was a synchronized effort of over 60 nations and numerous organizations and institutes to simultaneously collect data from polar regions (Krupnik et al., 2011). From March 2007 until March 2009, a broad range of research topics was addressed, from glaciology to biology, from biochemistry to biophysics, from oceanography to physiology, and from atmospheric to social sciences and even human health – in other words, research on land, people, ocean, ice and atmosphere. The IPY 2007-2008 was the fourth in a series of international polar initiatives, the previous ones having been held in 1882-1883 (11 nations) and 1932-1933 (46 nations); furthermore, the International Geophysical Year (IGY) of 1957-1958 was inspired by the previous IPY (67 countries; Bulkeley, 2010). A major outcome of the IGY was the creation of the World Data Centres (WDC) (Krupnik et al., 2011), which became the World Data System (WDS) in 2012.
The IPY 2007-2008 Data Policy (IPY, 2008) states that "IPY generated data should be carefully and thoughtfully collected, used collaboratively, and adequately preserved." It also requires that "IPY data, including operational data delivered in real time, are made available fully, freely, openly, and on the shortest feasible timescale." The IPY was intended to help deepen the understanding of polar environmental change and its impact on society, which requires the creative use of a myriad of data from many disciplines (Parsons et al., 2011). However, despite all the data collected, and plans for a full-time, professional data unit (IPY-DIS – IPY data and information service), in the end no central archive or portal was funded for IPY data (Parsons et al., 2011).
Published by Copernicus Publications.
A. Driemel et al.: IPY data legacy
A project-specific subset of the Global Change Master Directory (GCMD; http://gcmd.nasa.gov/portals/ipy/) is what comes closest to an IPY data portal. Here, a set of 642 metadata descriptions is listed for IPY (accessed 22 April 2015), comprising documents, images, articles and contact links to scientists and institutes. Thus, GCMD only contains data set descriptions and links either directly to external sources (data sets) or to the data centres where data are supposedly stored. For specific projects or countries, a multitude of smaller data centres can be found offering IPY-related data: data on local or traditional community knowledge and observations, for example, are distributed via the Exchange for Local Observations and Knowledge in the Arctic (http://eloka-arctic.org). The US National Snow and Ice Data Center (http://nsidc.org/data) and the Advanced Cooperative Arctic Data and Information Service (www.aoncadis.org) as well as the Database of the Norwegian Polar Institute (http://metadata.data.npolar.no/datasets) and the DAMOCLES database of the Norwegian Meteorological Institute (http://damocles.met.no) offer data on Arctic IPY research. Norway also initiated the Norwegian permafrost database, NORPERM, during IPY 2007-2008 (Juliussen et al., 2010). The New Zealand National Institute of Water and Atmospheric Research operates a Coastal and Marine Data Portal (www.os2020.org.nz) for selected projects, Australia stores data in the Australian Antarctic Data Centre (www1.data.antarctica.gov.au), and Canadian polar data can be obtained via the Polar Data Catalogue (www.polardata.ca). Fortunately, there has been some success in facilitating access to some of these archives by working toward a common Arctic data portal; see http://nsidc.org/acadis/search/. For a good illustration of the inherent problems and challenges connected to creating common data portals, see Mokrane and Parsons (2014).
As can be seen, IPY data are available from all these distributed sources (and the above-mentioned ones are just some examples). Nevertheless, in the world of data centres and data providers, there are various things a scientist searching for data has to contend with. The problems most often encountered are the following: (1) project databases are not maintained after the project ends and data links thus return 404 errors; (2) the actual data are nowhere to be found, i.e. huge amounts of metadata are not backed by data; (3) the data have not been quality checked, abbreviations are not explained and/or units are missing; (4) the data are not geocoded; (5) data and metadata are stored in different files, and considerable effort has to be put into combining both to make sense of the data; (6) file-based archives do not harmonize contents, which significantly complicates the integration of data from different sources; (7) the data are hidden in a map or figure, and the source data are not accessible; (8) data centres tend to be focused on a very specific discipline, which hinders interdisciplinary work; (9) finding data is hard and tedious, even with powerful search engines (Parsons et al., 2011); and, lastly, (10) the data are not freely available. The so-called IPY-DIS (http://ipydis.org) had the intention of centralizing and improving this situation (Parsons et al., 2011). However, the website is no longer supported. This example nicely illustrates the importance of stable and permanent links and reliable long-term maintenance of databases.
Generally speaking, IPY data are often fragmented, sometimes poorly managed, hidden or hard to find. A large part of the IPY knowledge is recorded in publications, but the related data are mostly contained in PDF tables and are thus not machine-readable and unavailable for further processing.

Implementation
Our motivation to extract IPY data from publications had the following aims:
- make data machine-readable and thus usable for the public;
- allow the integration into existing data and thus the compilation of individual new data products with, e.g., a data warehouse (http://wiki.pangaea.de/wiki/Data_warehouse), which serves the IPY demand of interdisciplinarity to create new knowledge.
In order to address these issues, a concerted effort between PANGAEA – Data Publisher for Earth and Environmental Science, the International Council for Science (ICSU) World Data System (WDS), and the International Council for Scientific and Technical Information (ICSTI) was undertaken to extract data resulting from IPY publications for long-term preservation. The data rescue started in 2013 and ended at the beginning of 2015. It was organized into iterative tasks: researching and identifying legacy data from the scientific literature, extracting (i.e. capturing or digitizing) the numeric values, generating ISO 19115 (ISO, 2014) standard metadata, performing quality assurance and control processes on captured tables, and publishing data and their metadata with appropriate citations through PANGAEA (http://www.pangaea.de).
The process of researching and identifying legacy data in IPY publications began with the compilation of a list of 1380 references by ICSTI, using keywords relevant to IPY projects as well as author and project names retrieved from IPY-DIS data files. Bibliographic searches using Web of Science and Pascal databases were conducted with broad search criteria in order not to miss relevant articles. This bibliography served as a basis for the PANGAEA editor to filter out journal articles containing extractable data – either from the articles themselves (in the form of tables) or from the supplement supplied with the publication. Extraction and digitization of the data were performed such that numerical data or data tables were transformed into an ASCII format. Preparation of data included a technical quality control and editorial review (checking for typos, the correctness of geocoding and units, the precision of values, etc.) and annotation with metadata. Data and metadata were imported into the relational database of the PANGAEA system.
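The kind of technical quality control mentioned above (geocoding and units) can be illustrated with a short sketch. It is only an illustration under stated assumptions, not the procedure used by the PANGAEA editors: the tab-separated layout, the column names Latitude and Longitude, the units mapping and the function name check_table are all hypothetical, and the sample values are invented.

```python
import csv
import io


def check_table(tsv_text, units):
    """Return a list of problems found in a tab-separated data table.

    Hypothetical helper: `units` maps each data column name to its unit.
    """
    problems = []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")

    # Every data column (all but the coordinates) should carry a unit.
    for column in reader.fieldnames or []:
        if column not in ("Latitude", "Longitude") and not units.get(column):
            problems.append(f"column '{column}' has no unit")

    # Every row should be geocoded with plausible coordinates.
    for row_number, row in enumerate(reader, start=1):
        try:
            lat = float(row["Latitude"])
            lon = float(row["Longitude"])
        except (KeyError, TypeError, ValueError):
            problems.append(f"row {row_number}: missing or non-numeric coordinates")
            continue
        if not -90.0 <= lat <= 90.0:
            problems.append(f"row {row_number}: latitude {lat} out of range")
        if not -180.0 <= lon <= 180.0:
            problems.append(f"row {row_number}: longitude {lon} out of range")
    return problems


# Invented sample table: the second row carries an impossible latitude.
sample = "Latitude\tLongitude\tIce thickness\n-65.5\t117.2\t0.85\n95.0\t10.0\t1.20\n"
for problem in check_table(sample, {"Ice thickness": "m"}):
    print(problem)
```

Checks of this kind only catch formal errors; the editorial review described above (typos, precision of values) still requires a human reader.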

Results of literature extraction
In total, 450 of the 1380 articles collected by ICSTI fulfilled the criteria needed for PANGAEA. These 450 articles contained 1270 extractable data sets (i.e. data tables), which were assembled into 450 so-called parents: if an article contained several data sets, the data sets ("children") were combined into a general parent with slightly reduced (general) metadata and with links to the single data sets containing all metadata and data. The parents have a clearly defined citation showing their status as a supplement to the paper in question. All 1270 data sets are now available to the public by open access (doi:10.1594/PANGAEA.150150); they are identified with a citable and persistent DOI (which eliminates 404 errors) and are directly linked to the article and authors from which they originate. The data can be searched for with the PANGAEA search engine using standard search terms and adding "+project:ipy". In PANGAEA, there are no metadata without data, and no data without metadata. All data taken at a specific point in space and time are geo- and time-coded. When downloading several files of similar data, the tables can be directly compared. PANGAEA is laid out as a permanent facility, holding mandates of the ICSU World Data System as well as the World Meteorological Organization (WMO). Therefore, PANGAEA guarantees the long-term availability and accessibility of the archived data and metadata in secure and machine-readable formats. For an example of an IPY data set in PANGAEA, see Fig. 1 (Toyota et al., 2011b).

In total, 1400 different parameter-unit combinations were used to describe the data. Most parameters belonged to the fields of marine taxa (abundances, biomass etc.) and chemistry (water chemistry, organic pollutants etc.); see Table 1 for PANGAEA parameter groups. General parameters included unspecific descriptive ones such as length, height, distance, type, area, number of XY, percentage, and so on. Often, a comment had to be added to these parameters (e.g. doi:10.1594/PANGAEA.836655) to describe out-of-the-way (i.e. exotic) data. We extracted articles on plankton biomass, water chemistry and ice thickness, but also the ones dealing with whale sightings, Inuit health, alien species introductions by travellers or tundra biomass change. Both the Arctic and the Antarctic were investigated in the articles, and all realms (land, people, ocean, ice and atmosphere) and a wide range of countries were covered (see Fig. 2).

Earth Syst. Sci. Data, 7, 239-244, 2015, www.earth-syst-sci-data.net/7/239/2015/
Please note that the data sets mentioned here can consist of core documentations, single radiosonde flights and other lower-level data types.

One drawback of this kind of data extraction is that it is too time consuming to engage the authors of the paper in a proof-reading process. E-mail addresses are often outdated, or authors do not reply in time, so the whole process would no longer be feasible. However, as the data have been published in an article and thereby also have been approved for public reuse (see, e.g., http://www.copdess.org/statement-of-commitment/), we assumed that they had been quality checked by the authors before publication. Our approach also has the drawback that publication-related data may represent only subsets of the original research data. But in our opinion, digitizing subsets is better than having no data whatsoever.
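The parent/child organisation described in this section can be pictured with a small sketch. It is not PANGAEA's actual data model: the record fields (id, article, parameters), the function name build_parents and the sample records are hypothetical and serve only to show how several data sets from one article could be combined under a parent that carries reduced, general metadata and links to its children.

```python
from collections import defaultdict


def build_parents(children):
    """Group child data sets by source article into one parent per article."""
    by_article = defaultdict(list)
    for child in children:
        by_article[child["article"]].append(child)

    parents = []
    for article, group in by_article.items():
        parents.append({
            "article": article,
            # Links to the children, which keep the full metadata and data.
            "children": [child["id"] for child in group],
            # Reduced (general) metadata: just the union of parameter names.
            "parameters": sorted({p for child in group for p in child["parameters"]}),
        })
    return parents


# Invented sample records for illustration only.
children = [
    {"id": "child-1", "article": "Article A", "parameters": ["Depth", "Salinity"]},
    {"id": "child-2", "article": "Article A", "parameters": ["Depth", "Temperature"]},
    {"id": "child-3", "article": "Article B", "parameters": ["Ice thickness"]},
]
for parent in build_parents(children):
    print(parent["article"], parent["children"], parent["parameters"])
```

Under this assumed layout, the 1270 children reported above would collapse into 450 parents, one per article.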

Other IPY-related data in PANGAEA
As a voluntary contribution of PANGAEA to IPY 2007-2008, various old books on IPY 1882-1883 research were digitized, and extractable data were archived in PANGAEA (Krause et al., 2010). This additional effort resulted in 94 data sets for IPY 1 (http://www.pangaea.de/search?q=ipy-1), which can now be compared to and/or combined with recent IPY data. PANGAEA also contains 705 IPY 2007-2008-related singular data sets not connected to publications. Because these data sets derive from continuous polar research observations and measurements (mostly from Germany), they are not explicitly labelled as IPY data. An overview of these data sets, which complement the IPY data discussed above, can be found in Table 2.
Last but not least, PANGAEA was one of the data sources used (Van de Putte et al., 2014) for a major outcome of the IPY 2007-2008: the Biogeographic Atlas of the Southern Ocean (http://atlas.biodiversity.aq). The Biogeographic Atlas of the Southern Ocean was published as a compilation of all benthos data available so far from the Southern Ocean floor. Because many scientists from the international community archived their data in PANGAEA, the repository was able to make a substantial contribution to this census as an IPY legacy.

Availability and distribution of IPY-related data in PANGAEA
All data sets related to the 450 publications are accessible via doi:10.1594/PANGAEA.150150. Apart from the standard search via the PANGAEA search engine, IPY data in PANGAEA are distributed and can be found in various ways: the content of PANGAEA is integrated into the Data Citation Index of Thomson Reuters. It is furthermore distributed via web services through portals, search engines and library catalogues (http://wiki.pangaea.de/wiki/Portal).
To give an example, the IPY data set of Toyota et al. (2011b) can be found easily via Google with various three- to four-word search terms (try, e.g., sipex snow toyota, tateyama ice sipex, or searching for the complete title of the article). IPY data sets can also be found via the DataCite Metadata search (http://data.datacite.org/10.1594/PANGAEA.839311) and via the catalogue of the German National Library of Science and Technology (TIB). WorldCat – the world's largest network of library content and services – incorporates the content of all repositories, catalogues and databases which can be utilized following the OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) standard and thus also includes the PANGAEA content with its IPY collection. Last but not least, Elsevier journals display a data reference and a map as soon as article-related data are archived in PANGAEA (see, e.g., doi:10.1016/j.dsr2.2010.12.002; Toyota et al., 2011a). Efforts to make the PANGAEA content available via the GCMD and in particular via the specific IPY and Arctic/Antarctic portals are in progress.
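Two of the access routes named in this section, persistent DOIs and OAI-PMH harvesting, can be sketched in a few lines of Python. The sketch only builds request URLs and performs no network access. The helper names doi_to_url and oai_list_records_url are hypothetical, and the OAI-PMH base URL is a placeholder rather than PANGAEA's actual endpoint; the verb ListRecords and the arguments metadataPrefix and set are taken from the OAI-PMH standard, which requires every repository to support the oai_dc metadata format.

```python
from urllib.parse import urlencode


def doi_to_url(doi):
    """Turn a 'doi:...' identifier into a resolvable https://doi.org/ link."""
    identifier = doi.strip()
    if identifier.lower().startswith("doi:"):
        identifier = identifier[4:]
    return "https://doi.org/" + identifier


def oai_list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL for a repository base URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec  # optional selective harvesting
    return base_url + "?" + urlencode(params)


# The IPY compilation DOI given in the text.
print(doi_to_url("doi:10.1594/PANGAEA.150150"))
# Placeholder base URL (an assumption, not a real endpoint).
print(oai_list_records_url("https://repository.example.org/oai"))
```

A harvester such as a library catalogue would issue the second kind of request repeatedly and follow the resumption tokens returned by the repository; that part is omitted here.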

Conclusions
Our effort can be described as successful post-IPY data management. However, notwithstanding the overall positive effect of this work, it also shows the limitations of data management which is not synchronized with science activities in a timely way. Large parts of existing data from IPY projects are still stored in publications only – the IPY publication database is still growing and now contains almost 4000 entries for IPY 2007-2008 (Goodwin et al., 2012, http://www.nisc.com/ipy/), but no funds remain to extract the relevant data. Another problem encountered by the PANGAEA editor performing the digitization was that often only graphs and figures were included in the article, with no numerical data that could be extracted. Even more data are probably still stored on personal computers and will never be available to the public. The lesson that data are better shared still has to be learnt.
The remaining aims of the data extraction were to improve the visibility of IPY data through portals, search engines and library catalogues and to allow open access to IPY results (open data, CC-BY licence).

Figure 2. Overview of Arctic and Antarctic sample sites of the 450 IPY articles now stored in PANGAEA. Green denotes singular events; the orange lines are from a radar sounding data set of Young et al. (2011).

Table 1. Occurrence of different PANGAEA parameter groups used for classifying the IPY data sets.