Description of the ERA-CLIM historical upper-air data

Historical, i.e. pre-1957, upper-air data are a valuable source of information on the state of the atmosphere, in some parts of the world dating back to the early 20th century. However, to date, reanalyses have only partially made use of these data.


Introduction
Historical, aerological data represent the only primary source of observations of the free atmosphere before the satellite era. Even though the aerological network reached today's density only from the International Geophysical Year 1957/58 on, large amounts of data are in fact available for the time before. The earliest observations go back to the late 19th century, and the longest, almost continuous observational series at a single location begins in 1905 (Lindenberg, Germany). However, the data sources are often scattered over a large number of archives, and many of them have not yet been digitised. The latter, besides the limited awareness of their very existence, is the main reason why, to date, no aerological data from the period before 1948 have been assimilated into any reanalysis.
Reanalyses are often used as a complete substitute for primary observational data in atmospheric studies. There are several good reasons for this: (1) reanalyses are spatially and temporally complete (with respect to their own space and time resolution); (2) they offer a larger number of variables of interest, including some that are not regularly observed or cannot be observed directly; (3) in the ideal case, the assimilation of various observational data into a physics-based, atmospheric model should lead to a better quality of the reanalysis compared to the single, primary observations on average; and (4) unlike the variety of observational data, reanalyses are available in a universal format (netCDF) which is easy to handle, and for which simple analyses are easily possible with standard software. Therefore it is not surprising that, based on the number of citations, the NCEP/NCAR (National Centers for Environmental Prediction/National Center for Atmospheric Research) 50-Year Reanalysis (Kalnay et al., 1996; Kistler et al., 2001) and the ECMWF's (European Centre for Medium-Range Weather Forecasts) ERA-40 Re-Analysis (Uppala et al., 2005), the only full reanalyses (i.e. including surface, upper-air and satellite data) reaching back to 1957 or further, have been the most cited atmospheric data sets since their publication (1146/13 311 and 545/3094 citations in 2012/in total, Web of Knowledge: http://apps.webofknowledge.com/, last access: 22 October 2013).
The popularity and undoubted benefits of reanalyses, as well as the fact that a number of scientifically interesting climate and weather extremes in the first half of the 20th century are not covered by these reanalyses, have led to undertakings such as the Twentieth Century Reanalysis (20CR) project (Compo et al., 2011) that have tried to expand the time span covered by reanalyses further back into the past. However, 20CR only assimilates surface pressure, using sea surface temperatures and sea ice as boundary conditions. This approach certainly has advantages, such as a more stable observational network over time, which should lead to improved homogeneity compared to other products that make use, e.g. of satellite data. On the other hand, the method seems less reliable in the tropics, with its small surface pressure differences due to the small horizontal component of the Coriolis acceleration, and also in the Arctic and over the oceans (Compo et al., 2011; Brönnimann et al., 2013).
Another approach to extend reanalyses back into the past is to digitise and subsequently assimilate the abovementioned additional historical, meteorological observations that have not been available before. Even for the period 1948-1957, new upper-air data are expected to improve the quality of future reanalyses, particularly in the tropics and in the Southern Hemisphere. This is the approach pursued by the European FP7 (7th Framework Programme) project ERA-CLIM (European Reanalysis of Global Climate Observations; http://www.era-clim.eu). The project partners in work package 1 (WP1), which dealt with the recovery, imaging and digitisation of historical upper-air observations and metadata, were the Climatology Group at the Institute of Geography of the University of Bern, Switzerland (UBERN, WP1 lead), Météo-France in Toulouse, France (METFR), the Russian Research Institute for Hydrometeorological Information (RIHMI) in Obninsk, Russia, and the Fundação da Faculdade de Ciências da Universidade de Lisboa at the Dom Luiz Institute of the University of Lisbon (FFCUL, Portugal).
In the framework of the project, more than 1.3 million station days of upper-air data have been inventoried, more than 200 000 images of the sources taken, and ca. 750 000 station days digitised. Data were recovered for large parts of the globe, but with a focus on regions that have so far been less well covered, such as the tropics, the polar regions and the oceans, and on very early 20th century upper-air data from Europe and the US. The data rescue activities of ERA-CLIM were organised in close coordination with the broader Atmospheric Circulation Reconstructions over the Earth initiative (ACRE; http://www.met-acre.org), and, in the case of surface pressure and temperature data, in cooperation with the International Surface Pressure Databank (ISPD; http://reanalyses.org/observations/international-surface-pressure-databank) and the International Surface Temperature Initiative (ISTI, http://www.surfacetemperatures.org). The new upper-air data complement the Comprehensive Historical Upper-Air Network (CHUAN, Stickler et al., 2010), which already compiled large amounts of pre-1957 upper-air data, and which is also planned to be assimilated into the new ECMWF full reanalyses. The new ERA-CLIM data are available from the PANGAEA data repository (doi:10.1594/PANGAEA.821222).
In this paper, we describe the ERA-CLIM upper-air data that result from the ERA-CLIM data rescue activities, including the search for the sources and the metadatabase (http://www.oeschger-data.unibe.ch/metads), which contains the complete metainformation on all inventoried records (Sect. 2), the imaging and digitisation process (Sect. 3), and the unit conversions, reformatting and quality checking procedures applied to the data (Sect. 4). A detailed discussion of the distribution of the data in space and time and of their usefulness, e.g. for analysing past weather extremes, can be found in Stickler et al. (2013), and will therefore not be repeated in the present paper. Finally, Sect. 5 draws conclusions and gives an outlook.

Cataloguing of historical data and metadatabase
The development of the ERA-CLIM historical, observational data set started with the search for and systematic cataloguing of potential data sources. While some project partners were more or less aware of the data sources they planned to digitise or at least knew in which archive to search for the respective metadata (RIHMI, FFCUL), a detailed inventory of available archives or potential sources first needed to be established in the case of others (UBERN, METFR). The METFR, RIHMI and FFCUL upper-air data rescue activities focussed on their home institutional archives (even though METFR has meanwhile also started with the inventorying of sources in the French National Archives), whereas UBERN [...] 1).
Earth Syst. Sci. Data, 6, 29-48, 2014, www.earth-syst-sci-data.net/6/29/2014/
The preparatory work resulted in a catalogue of historical, meteorological data that are available in hard copy, digitally imaged or on microfilm in a large number of archives or libraries worldwide. To facilitate the coordination between the different project partners and to prevent unnecessary, duplicate efforts during ongoing or future projects within the data rescue community, the collected metadata have been made publicly available in the form of a centralised, web-based metadatabase (http://www.oeschger-data.unibe.ch/metads) that contains all inventoried records that could be obtained for imaging, whether or not they have been imaged or digitised, including all relevant information on the data. In total, more than 450 000 digital images have been taken (including surface and atmospheric transmission data, more than 200 000 for upper-air data alone). More than 1.3 million station days of upper-air data and more than 1.5 million station days of surface data have been catalogued. Of these, more than 700 000 station days each of upper-air and surface data have been digitised to date. The total number of digitised/inventoried records is 13/16 for atmospheric transmission data, 80/214 for surface data, 61/101 for moving upper-air data (i.e. data from ships etc.), and 735/1783 for fixed station upper-air data. The largest single sources of moving upper-air data and upper-air data that have been inventoried, sorted according to the estimated number of station days, are listed in Tables 2 and 3.
The metadatabase is organised according to the different types of observations, and contains a unique, project-internal ID for each entry, followed by detailed information on the contact person/institution and data owner, station and/or vehicle information (the latter in the case of "moving upper-air records", e.g. identifiers or vehicle name, measurement network, type of measurement platform, location or region, and brief reference to the source of data), the period of station operation and of the record, the frequency of observations and the estimated record volume (station days), the vertical coordinate (e.g. pressure or geometrical altitude for upper-air data) or the number of resolved wavelengths (for transmission data), the available parameters and possibly instruments used, and the current status of the record (levels 0-5, corresponding to imaged, digitised in native format, converted to common format/units, integrated to database, quality checked, and homogenised, respectively). Measurement platforms in the case of upper-air data are aircraft, pilot balloons, kites, registering balloons, captive balloons and radiosondes (see also Table 9 for the type specification in the data files). The metadatabase also holds a unique collection identifier assigned by the ECMWF to each different source. This allows, e.g. for excluding certain data sources during assimilation.
The digital images of all upper-air sources as well as the digitised data produced by all groups participating in the project (raw and corrected) were collected and stored centrally at UBERN. The images will be made available as stitched PDF files, sorted according to the different sources, via a web link from each record inside the metadatabase (http://www.oeschger-data.unibe.ch/metads). In the following subsections, we describe the imaging and digitisation process at the different institutions in more technical detail.
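The six status levels listed above lend themselves to an ordered enumeration. The following is only a minimal sketch: the class name, member names and helper function are hypothetical and are not taken from the metadatabase itself.

```python
from enum import IntEnum


class RecordStatus(IntEnum):
    """Status levels 0-5 of a record, in the order given in the text.

    The member names are illustrative assumptions, not database fields.
    """
    IMAGED = 0
    DIGITISED_NATIVE = 1
    CONVERTED = 2
    INTEGRATED = 3
    QUALITY_CHECKED = 4
    HOMOGENISED = 5


def is_digitised(status: RecordStatus) -> bool:
    """A record counts as digitised once it has reached level 1 or higher."""
    return status >= RecordStatus.DIGITISED_NATIVE
```

Because the levels are ordered, a simple comparison is enough to select, for example, all records that have at least been converted to the common format.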

UBERN
A large part of the data sources of types a, b or c (see Table 1) digitised at UBERN could be downloaded directly from the web in digitally imaged form, particularly all images obtained from the NOAA Central Library Foreign Climate Data website (http://docs.lib.noaa.gov/rescue/data_rescue_home.html, e.g. Vols. 1904-1912 and 1914-1926 of the Täglicher Wetterbericht, the Pakistan Daily Weather Report, the Daily Weather Report (Cairo, Egypt) and the Boletim mensal das observações meteorológicas listed in Table 3), but also the 1914-1949 Monthly Weather Review supplements that contain early single ascent US kite data, and a few other journal publications such as the early 20th century volumes of the Nachrichten von der Gesellschaft der Wissenschaften zu Göttingen (including the data publications on the Samoa Observatory mentioned in Sect. 2), and some reports, such as the one from the US military operation Highjump 1946/47 in Antarctica (see Table 2). All other sources had to be ordered from different libraries via library or interlibrary loan, mainly in Switzerland and Germany, but also in Austria, Denmark, Norway and the UK. A few relatively large sources were imaged on location in the central library of the Karlsruhe Institute of Technology (Germany), because it was not possible to obtain the respective meteorological periodicals via interlibrary loan (Upper Air Data, India Vols. 1928-1936; Suomen Meteorologinen Vuosikirja (Meteorological Yearbook of Finland) Vols. 1901-1937; Täglicher Wetterbericht (Daily Weather Report, Germany) Vols. 1913, 1927-1934; see Table 3 for exact references).
All imaging was done with the help of an RSX repro rack with an RTX camera arm and an RB 5055 HF high-frequency lighting unit (Kaiser, Germany). A standard Nikon D80 digital camera with a Nikon DX AF-S NIKKOR 18-135 mm lens was used. Practical tests with different camera settings led to the following optimum setting that was subsequently applied: manual mode (M), exposure time: 1/40 s (variable), aperture: F22 (variable), ISO: 100, image optimisation: intense, image quality: fine, image size: L, grid lines: on, remote-control release: on, automatic focus: on, monitor: off (to save battery lifetime). In the case of some thick books, first all left and then all right pages had to be photographed separately using an angle of 45° between the camera axis and the book plane in order to keep the pages flat. The flexible Kaiser Repro set, which allows for rotation of both the camera arm and the lighting arms, proved to be very helpful in this context.
The images were then cut to a standard size using the IrfanView (http://www.irfanview.com) Batch Conversion function. This function was also used to quickly rotate large numbers of images. To rename the final image files to a standard notation, the program Bulk Rename Utility was used (http://www.bulkrenameutility.co.uk). The files were saved in JPEG (Joint Photographic Experts Group) format and later merged into multi-page PDF files for distribution to the digitisers.
For a relatively small part of the self-imaged, printed sources, it was possible to use optical character recognition (OCR) software for digitisation (the German Atlantic Meteor Expedition, the Bulletin of the Mount Weather Observatory, see Sect. 2, and the supplements of the Monthly Weather Review mentioned above). In most cases, the formats of the many relatively small sources changed too often (from source to source and even from year to year inside sources such as weather reports), or special characters appeared in the tables, which made it too complicated to apply OCR. In the case of the NOAA Central Library downloads, the image quality was generally too low (resolution and contrast not optimal) for OCR to be useful, and many sources were handwritten. In our case, with mainly tables and numbers to be digitised, ABBYY FineReader 10 Professional (ABBYY, Russia) proved to be the optimal software product, with a clearly lower error rate compared to OmniPage 16 (Nuance, USA).
However, for most sources, digitisation had to be done manually by keying. For this purpose, Excel templates were created for each different type of source, containing an info tab with an image of a typical source page, describing exactly which data on the pages had to be digitised, and a template tab containing the data entry mask. The info tab was especially useful because the pages sometimes contained additional data that were not in the scope of the project. Also, for some sources certain parameters such as time or surface altitude were not given with each observation, but only once for the whole month at the top of the page or even on a totally different page. Furthermore, introducing the given format and units had the advantage of making the assistants more aware of possible value ranges and errors in the data. This was important, since the assistants were also given the task of flagging questionable, illegible or implausible values, which required a certain expert knowledge. These flags were later transferred into flag columns following each data column in the final files and used for quality checks. Generally, the templates closely followed the format of the sources themselves to facilitate the typing procedure. Only existing values needed to be entered by the assistants, and they had to be entered as is (i.e. without doing any unit conversions, etc.). Empty fields of the tables were filled with −999 at the end of the process. Any further work with the data was done as post-processing (see Sect. 4).
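The final steps of the keying workflow (filling empty fields with the missing-value code and placing a flag column after each data column) can be sketched as follows. This is an illustration only: the function name, the list-based row layout and the flag code `"q"` are assumptions, not the project's actual templates or flag values.

```python
MISSING = "-999"  # missing-value code used for empty table fields


def finalise_row(values, flags=None):
    """Return a row in which each data value is followed by its flag.

    values: keyed cell contents as strings, "" for an empty field.
    flags:  optional dict mapping a column index to a flag string
            set by the assistant (hypothetical flag codes).
    """
    flags = flags or {}
    row = []
    for index, raw in enumerate(values):
        cell = raw.strip()
        row.append(cell if cell else MISSING)  # fill empty fields
        row.append(flags.get(index, ""))       # flag column follows the value
    return row


# Example: the second field is empty, the third was flagged as questionable.
print(finalise_row(["12.5", "", " 7"], {2: "q"}))
```

Values are kept as strings on purpose, in line with the rule that they are entered as is and converted only during post-processing.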

FFCUL
The pages containing the upper-air data were scanned with two A3 scanners (MFC-6490 CW printer, Brother, USA and Must Page Express A3 USB 1200 Pro, Mustek, USA) with a resolution of at least 300 dpi. Most of the images produced were in TIFF format and were used as input in an OCR program (ABBYY FineReader 8.0 Professional Edition) to obtain Excel tables (digitisation process). The remaining files were typed manually in order to recover the data.
The OCR output was transferred into previously prepared tables in Excel format. The pilot balloon data were mostly subjected to this procedure, whereas the radiosonde data were all typed into previously prepared Excel tables. The initial format of the recovered tables was retained, because they were already arranged with days in rows and levels in columns. Every table corresponded to a month of data. A first level of quality control was applied when the OCR and typing processes were performed, signalling potentially wrong values. The pilot balloon data were relatively easy to deal with because they only contained wind direction and speed, but the radiosonde data contained surface pressure and temperatures, wind and geopotential values for each pressure level, as well as tropopause values. The geopotential heights were coded in the original documents, sometimes by omitting the first and last digit, and needed to be restored to their real values. This was done by saving the monthly table values into tab-delimited ASCII (American Standard Code for Information Interchange) files and passing them to a Fortran program that calculated the required values and that also converted the wind speeds published in knots into metres per second (for conversion constants used, see Table 4). The Fortran program also verified that the wind direction "Calm" corresponded to 0 m s−1. The wind directions were multiplied by 10 to obtain the values in degrees (initially published in tens of degrees). The radiosonde records for Maputo (1952-1954) contained significant level values that we chose to digitise. The radiosonde records of later years did not include these levels. Particular care was necessary with the heights at which the pilot balloon values were taken, as these changed from year to year, as did the time of pilot balloon ascents.
All values were supplied to UBERN in tab-delimited ASCII files where one line represented an ascent, containing the year, month, day and hour (in UTC, coordinated universal time), as well as all the values measured by the device. Columns contained the several parameters at each pressure level or height. The process of more refined quality control and flagging was left to UBERN to perform, as described in Sect. 4.
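The wind conversions described above are simple enough to sketch in a few lines. The original processing was done in Fortran; the Python version below is a stand-in for illustration. The conversion factor assumes the international knot (1852 m per hour), which may differ slightly from the constant in Table 4, and the representation of calm winds by the string "Calm" is likewise an assumption.

```python
KNOT_TO_MS = 1852.0 / 3600.0  # assumed: international knot in m/s
CALM = "Calm"                 # assumed representation of calm wind


def convert_wind(direction, speed_knots):
    """Convert one wind report to (direction in degrees, speed in m/s).

    direction: wind direction in tens of degrees, or CALM.
    Returns (None, 0.0) for calm wind and raises ValueError if a calm
    report carries a non-zero speed.
    """
    speed_ms = speed_knots * KNOT_TO_MS
    if direction == CALM:
        if speed_ms != 0.0:
            raise ValueError("calm wind reported with non-zero speed")
        return None, 0.0
    return int(direction) * 10, speed_ms


direction_deg, speed_ms = convert_wind(27, 10)
print(direction_deg, round(speed_ms, 2))
```

Raising an error for an inconsistent calm report, rather than silently correcting it, leaves the decision to the person checking the source image.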

RIHMI
At RIHMI-WDC, the collection contained book publications, which provided the most data, as well as observation booklets from different observatories and stations, and expedition reports. The fraction of data contained in typographically printed publications was about 80 % of that contained in all materials, while handwritten materials provided the remaining 20 %. Nevertheless, the paper in most of the typographical publications was aged, and the quality of the printed text was low. This led to numerous errors in the OCR processing, which could however be detected efficiently during the QC (quality check) process, as will be detailed in Sect. 4.
The printed and handwritten materials were first scanned. The ELAR PlanScan C2X and ELAR PlanScan A2M high-production units were used for scanning. The page images of each publication were combined into electronic books in PDF format by scanning and image managing software.
OCR was the most essential part of the digitisation process at RIHMI. Like some of the other participants of the project (FFCUL, UBERN), we used ABBYY FineReader 10 Software (ABBYY, Russia). At RIHMI, this software demonstrated acceptable quality of recognition and a smaller number of errors compared to other similar software products. In addition, it has a Russian language interface, thus allowing for its use by personnel who had no foreign language skills.
The decisions on how to arrange the OCR process were taken case by case, based on the characteristics of the different table formats in each publication. Only a few image sources were acceptable for recognition of whole pages, whereas most sources contained structures of tabular data that enabled recognition of separate parts (tabular blocks) of the pages. Nevertheless, the experience of the personnel in each case permitted us to outline separate parts of similar content for recognition, and to collect the contained data step by step in Microsoft Excel template sheets using a copy-paste mechanism.
For handwritten sources as well as for printed, but unstructured or otherwise improper materials that could not be digitised by OCR, manual digitisation with Microsoft Access templates was used.

METFR
METFR has launched a national action creating an inventory of upper-air data sources and holdings in mainland France and overseas territories. Meanwhile, METFR has also started inventorying sources in the French National Archives, and in the departmental archives of the French West Indies.
In total, 400 records for fixed upper-air stations have been inventoried and catalogued from the holdings of 37 French archives: the Météo-France centres and French institutional archives in mainland France, and Météo-France archives in the French overseas departments and territories. The three largest holdings of early mainland France upper-air data sources are located in mainland France (library of Météo-France in Saint-Mandé, Directorate of Climatology in Toulouse and French National Archives in Fontainebleau), while the largest part of the overseas data sources is scattered all over the French overseas departments and territories (French West Indies, French Guyana, French Polynesia, Réunion Island and New Caledonia).
The largest upper-air sources encompass the Météo-France climate database and four long collections of daily reports and bulletins in hardcopy (see Table 3).

BDCLIM (1948-1958)
This is the main source for French digitised radiosonde data for the period 1948-1958 with pressure, temperature and humidity observations ordered by altitude level and provided on mandatory and required pressure levels, standard geopotential levels, and significant levels, as detailed in Durre et al. (2006). Twelve long-term series of French stations including three overseas territories and two Antarctic stations have been provided complementary to already available data from CHUAN. A total of 42 168 additional ascents have been extracted from BDCLIM and provided to the ERA-CLIM data set. The original sources of these data are the reports named Compte Rendu Aérologique and Compte Rendu Vent (see Table 3).

Compte Rendu Quotidien (CRQ, 1923-1957)
This is the main source for French upper-wind data from pilot balloons for the period 1923-1957. The collection of original, handwritten four-page reports contains surface data on the first three pages and upper-wind data on the last page: speed and direction given on pre-printed, geometrical altitude levels below 10 000 m a.s.l. Part of the collection of these original, handwritten daily reports, in hard copy for France and as microfiches for northern Africa, has been imaged in the framework of the project ERA-CLIM. Imaging was performed at several locations by private companies in mainland France, Guadeloupe, and Nouméa, with different professional flatbed scanners (300 dpi), and by Météo-France agents using cameras: eight long-term series in mainland France (1923-1936), three stations in the overseas departments and territories, and three stations in the Antarctic territories. The files were saved in JPEG format and in multi-page PDF format. The names of the files contain the type of the document (CRQ), the name of the station and the date of the ascent. A total of 55 792 pages of CRQ containing upper-wind data have been imaged. Seven of the imaged upper-wind mainland long-term series and one New Caledonian long-term series (Nouméa, period 1947-1957) have been manually keyed by two private companies (48 700 ascents). For this purpose, digitisation was first prepared by a meteorologist with expertise in historical upper-air reports, who carefully scrutinised the recording information in all the documents (readability of the images, identification of the station, date and hours of observation, units, levels, gaps, etc.), and afterwards by an expert on technical specifications, who described exactly how and which data had to be digitised according to the description made by the first expert. Digitisation was first checked by a meteorologist in a sampling inspection.

Observations Quotidiennes (1937-1939, 1947-1957)
This is the main source for upper-air observations in mainland France 1937-1939 and 1947-1957. The Observations Quotidiennes are an institutional publication containing pilot balloon observations for 20 stations and the early operational radiosonde data from Trappes. The collection of this publication, stored at the library of Météo-France, has been imaged by a private company with a professional flatbed scanner, but the upper-air data have not yet been digitised. The files are in JPEG and multi-page PDF format.

Compte Rendu Aérologique and Compte Rendu Vent (1948-1957)
These are the largest sources for French (mainland France, overseas, Antarctic, ex-colonies) radiosonde data in hard copy. The pre-printed and handwritten daily aerological reports are stored on site at each upper-air station archive, with one exception: the reports for the Antarctic territories are archived in Toulouse. A microfiche listing, dated from 1985, with pressure, temperature, humidity and wind has been found at METFR for the period 1948-1957. A total of 43 979 microfiches have been imaged in order to recover the wind data that are still not integrated into CHUAN and the Météo-France climate database.

(1929-1930, 1933-1916, 1951-1957)
This collection of daily bulletins published by the French meteorological service contains upper-air data from northern Africa. The collection stored at the library of Météo-France, which is unfortunately incomplete, has been imaged by a private company with a professional flatbed scanner, but the upper-air data have not yet been digitised. The files are in JPEG and multi-page PDF format.

More data sources were identified than could be imaged, and more data sources were imaged than could be digitised within ERA-CLIM. Digitisation was focussed on early pilot balloon data. Metadata are a valuable source of information, too. Météo-France has recovered old instructions on how to fill in daily climate reports, course books on upper-air observations, and notes on measuring instruments. All these documents have been imaged and can be downloaded in PDF format from the Météo-France library website (http://bibliotheque.meteo.fr/).

Unit conversions, reformatting and QC
After the digitisation, units and format needed to be standardised for all data of the same type (unit conversion constants used are listed in Table 4). Four tab-delimited ASCII data formats are used for the final files: two formats each for fixed station upper-air data and moving upper-air data (i.e. data from ships, aircraft, manned balloons, etc.), on geometrical altitude and on pressure levels, but independent of the observation platform (pilot balloon, radiosonde, kite, aircraft, registering balloon, tethered balloon). Both pressure level and altitude level formats are flexible in that they allow for a maximum of 50 and 100 arbitrary levels in each line of data, respectively. Observation values from one single ascent can be reported either in one line or in several lines (e.g. if a different time is reported for each observation level). The data formats, the parameters and variables contained, and the respective units used in the final files are listed in Tables 5-10. Due to the very large size of the ASCII files, the data archives are being made available as compressed RAR (Roshal ARchive) files.
The first column always gives the type of observation (i.e. aircraft, kite, pilot balloon, etc.). For the moving platform format, the geographical position is explicitly contained for each line of observations in the file, whereas it is specified in the inventory for the fixed station records. The following columns contain the date and time stamp (hours in UTC). Both the pressure level and the altitude level format have a fixed number of columns (corresponding to the maximum of 50/100 reported levels in each line for pressure/altitude level data). Due to the large number of arbitrary levels encountered in the historical data, each level value (in hPa or m a.s.l.) is reported explicitly in front of the respective variable columns.
The first variable is geopotential height/pressure for pressure/altitude level data, followed by temperature, wind direction and speed, u and v wind, relative humidity, dew point difference, and specific humidity. Each value column is followed by a flag column to mark, e.g. implausible, suspicious or interpolated values. A flag suffix exists in some cases to discriminate data obtained during ascents from those obtained during descents. Table 9 lists the flag values used in the upper-air and moving upper-air formats. Table 10 gives the numbers used to specify the type of observation (observational platform) in the upper-air data files.
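Since the files carry both wind direction/speed and the u and v components, it may help to recall how the two representations are related. The sketch below uses the usual meteorological convention (direction is the direction from which the wind blows, measured clockwise from north); that the ERA-CLIM files follow exactly this convention is an assumption, and the function name is illustrative.

```python
import math


def wind_components(direction_deg, speed_ms):
    """Return (u, v) in m/s for a meteorological wind direction.

    u is positive towards the east, v is positive towards the north.
    The direction is where the wind comes from, hence the minus signs.
    """
    theta = math.radians(direction_deg)
    u = -speed_ms * math.sin(theta)
    v = -speed_ms * math.cos(theta)
    return u, v


# A westerly wind (from 270 degrees) blows towards the east: u > 0, v ~ 0.
u, v = wind_components(270.0, 10.0)
print(round(u, 3), round(v, 3))
```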
The complete conversions, reformatting and QC were done at UBERN for the UBERN and FFCUL data.RIHMI did unit conversions, reformatting to a proper format and QC for their own data (described in detail in the following paragraphs).Nevertheless, UBERN again did a raw QC of the finished RIHMI data and reformatted it to the common, agreed final format.METFR applied some basic QC to their own data which were delivered to UBERN in the final format.
At UBERN, thousands of implausible or suspicious values had already been flagged during the digitisation process. These values were re-checked against the respective images afterwards. All files belonging to one record listed in the inventory were finally merged together before the quality checks were done and the data were reformatted. The additional raw quality control which was also applied to the data delivered by FFCUL, RIHMI and METFR was in general

Table 5. Definition of the columns in the moving upper-air geometrical height level format ASCII files used for the ERA-CLIM data. Suffixes .1/.2 added to flag values signify observation values obtained during ascent/descent of a kite or tethered balloon (see Table 7). n runs from 0 to 100.

The new, surface-only based reanalysis of ECMWF (ERA-20C; Poli et al., 2013) that was also produced in the framework of ERA-CLIM, was used to calculate reanalysis departures for each digitised value (including the values from METFR that were not subjected to range checks by UBERN), and for all parameters: temperature, pressure, geopotential height, wind and humidity. To improve the quality of the data, all temperature values with absolute departures > 30 K were re-checked manually. The value of 30 K was chosen globally, based on scatter plots of reanalysis departure against collection identifiers, as a cut-off value representative of strong outliers. This process led to the correction and/or flagging of an additional 2325 temperature values (ca. 2 ‰ of all temperature values collocated with reanalysis values).
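The selection of temperature values for manual re-checking can be illustrated with a short sketch. Only the 30 K cut-off comes from the text; the function name and the list-based inputs are assumptions, and the collocation of observations with reanalysis values is taken as already done.

```python
DEPARTURE_LIMIT_K = 30.0  # cut-off for strong outliers, as stated in the text


def manual_check_indices(observed, reanalysis, limit=DEPARTURE_LIMIT_K):
    """Return indices of values whose absolute departure exceeds the limit.

    observed and reanalysis are collocated temperatures in the same unit
    (K or degrees Celsius; the departure is the same in both).
    """
    return [
        index
        for index, (obs, rean) in enumerate(zip(observed, reanalysis))
        if abs(obs - rean) > limit
    ]


# Only the second value departs by more than 30 K from the reanalysis.
print(manual_check_indices([250.0, 290.0, 230.0], [252.0, 255.0, 260.0]))
```

The comparison is strict, so a departure of exactly 30 K is not selected, matching the "> 30 K" criterion.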

We decided to use a rather relaxed outlier criterion for temperature because the scatter of the respective departures was found to be quite large. A more detailed QC would therefore have required either further filtering criteria to reduce, in an objective way, the number of values to be checked manually (e.g. using information on the climate zones of the stations), or an automated flagging procedure with a smaller, but possibly more subjective, cut-off value (e.g. by simply defining the upper and lower 1-2 percentiles of all digitised values as outliers, based on typical error rates detected in quality-checked atmospheric data from previous digitisation projects).
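The two flagging strategies discussed here can be illustrated with a minimal Python sketch. It is not the code used in ERA-CLIM: the function names are hypothetical, the departure is assumed to be observation minus reanalysis, and the tail-based alternative is applied to whatever list of values is passed in.

```python
import math

CUTOFF_K = 30.0  # global cut-off for absolute temperature departures named in the text


def departures(observed, reanalysis):
    """Departures in K, assumed here to be observation minus reanalysis."""
    return [o - r for o, r in zip(observed, reanalysis)]


def flag_fixed_cutoff(deps, cutoff=CUTOFF_K):
    """Indices of values whose absolute departure exceeds the cut-off."""
    return [i for i, d in enumerate(deps) if abs(d) > cutoff]


def flag_tails(values, fraction=0.01):
    """Alternative sketch: flag the lowest and highest `fraction` of the values.

    At least one value is flagged on each side, which is an illustrative
    choice and not taken from the paper.
    """
    n = len(values)
    k = max(1, math.floor(n * fraction))
    order = sorted(range(n), key=lambda i: values[i])
    return sorted(set(order[:k] + order[-k:]))


if __name__ == "__main__":
    obs = [250.0, 231.5, 290.0, 215.0]   # made-up temperatures in K
    rea = [252.0, 265.0, 255.0, 214.0]   # made-up collocated reanalysis values in K
    deps = departures(obs, rea)
    print(flag_fixed_cutoff(deps))
    print(flag_tails(deps, fraction=0.25))
```

The fixed cut-off leaves the number of flagged values open, whereas the tail-based variant always flags a fixed share of the data regardless of how many values are actually erroneous, which is one way to read the "more subjective" caveat above.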
On the "warm" side of the departures (which also tended to show larger outliers on average), we found that the large scatter can at least partly be explained by strong lower-tropospheric inversions, which are not always correctly represented in the reanalysis system. The scatter is further enhanced by the difference between the reanalysis and observation times, combined with frontal passages or dislocations of frontal systems in the reanalysis in the midlatitudes. On the "cold" side, outliers > 30 K were often connected to previously undetected sign errors (missing negative sign) in the digitised data (which were subsequently corrected), while negative departures with absolute values < 30 K can again be caused by spatio-temporal incoherences between frontal passages in the reanalysis vs. observations in the midlatitudes. For pressure, geopotential height, wind and humidity, an additional QC based on reanalysis feedback has not yet been applied to the data, but is planned in the framework of their inclusion into CHUAN V2.0.

Earth Syst. Sci. Data, 6, 29-48, 2014, www.earth-syst-sci-data.net/6/29/2014/
The upper-air data digitised at RIHMI were tested in a more sophisticated way than at UBERN, e.g. by checking for vertical consistency using Excel formatting options, as described in the next paragraph. Though OCR processing was found to be 4-5 times faster than manual keying at this institution, errors specific to the recognition process appeared at RIHMI due to the low quality of the original sources. The most frequent of these errors were "," instead of "." and vice versa (such errors resulted in an incorrect type of cell values), distortions of digits, such as "1" instead of "4", "9" instead of "2" or "0" instead of "8", and losses of minus signs and other symbols.
A very simple and effective instrument for the pre-check was found in the Microsoft Excel conditional formatting option. This feature fills the cells of a column with a colour depending on the numeric value contained in the cell, scaled to the whole range of the column values. Cells containing errors that lead to character values instead of numeric values ("," instead of ".") are not coloured at all, and are thus easily detected and corrected. Using conditional formatting with a colour scale, say, from blue (minimal values) to red (maximal values), further made it easy to detect errors connected to outliers or errors in the order of numeric values (say, a violation of the order of decreasing pressure and increasing height within one radiosonde launch). All these errors were detected, re-checked in the original publications or page images, and corrected in the Excel sheets.
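The two kinds of errors that the colour scale makes visible, text-like cells and violations of the vertical order, can also be detected programmatically. The following Python sketch is only an illustration of the idea and not the procedure used at RIHMI, which relied on Excel; all function names are hypothetical.

```python
def to_number(cell):
    """Return a float for a clean numeric cell, or None for a text-like cell
    (for example '850,0', where OCR produced a comma instead of a period)."""
    try:
        return float(cell)
    except (TypeError, ValueError):
        return None


def non_numeric_cells(column):
    """Indices of cells that cannot be read as numbers."""
    return [i for i, cell in enumerate(column) if to_number(cell) is None]


def order_violations(values, increasing):
    """Indices i at which values[i] breaks the expected strict order
    relative to values[i - 1] within one ascent."""
    bad = []
    for i in range(1, len(values)):
        previous, current = values[i - 1], values[i]
        in_order = current > previous if increasing else current < previous
        if not in_order:
            bad.append(i)
    return bad


if __name__ == "__main__":
    raw_pressure = ["1000.0", "850,0", "700.0"]   # made-up OCR output
    print(non_numeric_cells(raw_pressure))

    pressure_hpa = [1000.0, 850.0, 900.0, 700.0]  # should decrease with height
    height_m = [100.0, 1500.0, 1200.0, 3000.0]    # should increase with height
    print(order_violations(pressure_hpa, increasing=False))
    print(order_violations(height_m, increasing=True))
```

As with the colour scale, a flagged index only points to a suspicious position; whether the flagged value or its neighbour is wrong still has to be decided against the original source.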
For the data manually digitised at RIHMI, pre-checking of possible values (min/max values, lists of acceptable values, etc.) was implemented based on Microsoft Access options. Although manual digitisation is a simpler but slower process, it was optimised as far as possible.
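A pre-check of this kind amounts to a range test or a membership test per field. The Python sketch below illustrates the principle only; the field names and limits are hypothetical and are not the settings used in the Access forms at RIHMI. The platform letters are borrowed from the "type" column of Table 3 purely as an example of a list of acceptable values.

```python
# Hypothetical limits for illustration; not the limits used at RIHMI.
LIMITS = {
    "month": (1, 12),
    "day": (1, 31),
    "hour": (0, 23),
    "minute": (0, 59),
}

# Example list of acceptable values (platform letters as in the Table 3 caption).
ALLOWED = {
    "platform": {"A", "CB", "K", "P", "R", "RB"},
}


def check_record(record, limits=LIMITS, allowed=ALLOWED):
    """Return the names of fields whose values fail the range or list check."""
    problems = []
    for name, value in record.items():
        if name in limits:
            low, high = limits[name]
            if not (low <= value <= high):
                problems.append(name)
        elif name in allowed and value not in allowed[name]:
            problems.append(name)
    return problems


if __name__ == "__main__":
    keyed = {"month": 13, "hour": 12, "platform": "X"}  # made-up keyed entry
    print(check_record(keyed))
```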
The further operations with data at RIHMI included the import of the Excel or Access data into SAS software, followed by numerous transformations and data management operations, such as sorting, transposing and recalculation of variables, based on the SAS programming language. After that, SAS was used to check the data by means of tabulated statistics and statistical graphs.

A special, unexpected challenge for RIHMI occurred in connection with observation times. When the first version of the RIHMI data set was created, it still contained observation dates and times as "book values" (and, for the general goal of creating digital archives of the originals, it is useful to preserve these values). However, data ready to be assimilated for reanalysis are required to be given in UTC. A special block of operations was required to convert the digitised date/time values to unified UTC date/time values.
These digitised "book" interpretations differed between books, and even within the same book they differed between stations and between observational subperiods. Primarily, they were UTC, Moscow time, local time, etc. The transfer of date and time values into unified UTC proved to be a non-trivial problem. Specifically, the vast east-west range of the territory of the former USSR and the peculiarities of the time zone arrangement may lead to serious errors in this transfer. In borderline cases, even the month or the year could change.
Based on the introductory text paragraphs, comments, table descriptions, notes, and so on in the books used at RIHMI, it was possible in most cases to identify, case by case, which "book time" had been used, and to calculate the appropriate time shift. The correction that was done in SAS then included several steps. The first step was to take the "book values" of year, month, day, hour, and minutes, to apply special transformation functions and to compose a BOOK DATETIME variable in the SAS format. The second step was to apply special shifting functions for date and time variables to produce a new UTC DATETIME variable. The third step was to apply special transformation functions to the UTC DATETIME variable in order to decompose it into separate, modified UTC values of year, month, day, hour, and minutes.
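The three steps can be sketched in Python as follows. This is an illustration only: the original processing was done in SAS, the function name is hypothetical, and the offset of the "book time" from UTC is passed in as a parameter because it had to be determined case by case from the sources. The offset in the usage example is made up.

```python
from datetime import datetime, timedelta


def book_to_utc(year, month, day, hour, minute, utc_offset_hours):
    """Convert "book" date/time components to UTC components.

    utc_offset_hours is the assumed offset of the book time from UTC
    (book time = UTC + offset); it may be fractional for local time.
    """
    # Step 1: compose one datetime value from the book components.
    book_datetime = datetime(year, month, day, hour, minute)
    # Step 2: shift the composed value to UTC.
    utc_datetime = book_datetime - timedelta(hours=utc_offset_hours)
    # Step 3: decompose the shifted value into separate UTC components.
    return (
        utc_datetime.year,
        utc_datetime.month,
        utc_datetime.day,
        utc_datetime.hour,
        utc_datetime.minute,
    )


if __name__ == "__main__":
    # Made-up example: an ascent logged at 01:00 on 1 January 1950 in a
    # book time assumed to be 3 h ahead of UTC.
    print(book_to_utc(1950, 1, 1, 1, 0, utc_offset_hours=3))
```

In this made-up case the shift moves the observation back across midnight on 31 December, so day, month and year all change, which is the kind of borderline case mentioned above. Doing the shift on a composed date/time value rather than on the hour alone handles such rollovers automatically.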
After all of the described operations, the data were output in column TXT (text file) format, so that each value had a complete set of spatial coordinates (horizontal and vertical) and a temporal coordinate (unified UTC).
The final data have been submitted to the ECMWF data repository and to the project work package dealing with homogenisation and quality assessment of upper-air data, led by the University of Vienna.

Conclusions and outlook
In this paper, we have presented a newly available, historical upper-air data set that has been produced in the framework of the EU FP7 project ERA-CLIM (http://www.era-clim.eu). More than 1.3 million station days have been inventoried, more than 200 000 images of the data sources including useful metadata have been taken, and more than 700 000 station days have been digitised. The data have been subjected to the quality control procedures described in the paper. Generally, these consisted of range checks. For the data digitised at RIHMI, additional vertical consistency checks were applied, and for all temperature data, a manual outlier inspection was done for all values with absolute departures > 30 K from the new ERA-20C surface-only based reanalysis of ECMWF, also a product of ERA-CLIM. The new data are being integrated into CHUAN V2.0, applying quality control and homogenisation methods introduced in Wartenburger et al. (2014), and will presumably be assimilated into upcoming, full reanalyses at ECMWF and at similar centres. Furthermore, the University of Vienna is undertaking an effort to homogenise a combined upper-air data set merged from CHUAN including the new ERA-CLIM upper-air data plus IGRA (Durre et al., 2006) and the ECMWF upper-air holdings, using well-established homogenisation methods (Ramella Pralungo et al., 2013; Haimberger, 2007; Haimberger et al., 2012). This data set will encompass pilot balloon, kite and radiosonde data. The ERA-CLIM data are available on the PANGAEA data repository (doi:10.1594/PANGAEA.821222).
The historical upper-air data rescue efforts will continue in the follow-up project ERA-CLIM 2. On the one hand, the related ERA-CLIM 2 activities will focus on the remaining data that were inventoried, but not yet digitised, during ERA-CLIM (see the ERA-CLIM metadatabase, including the whole record inventory, at http://www.oeschger-data.unibe.ch/metads). On the other hand, additional data sources have already been identified in different archives and libraries (such as the Met Office library, the Karlsruhe Institute of Technology central library, or the Federal Maritime and Hydrographic Agency of Germany), or even imaged (e.g. at the Swedish Meteorological and Hydrological Institute).

Table 1.
Types of data sources identified.
performed a general, web-based literature search on historical upper-air and atmospheric transmission data, concentrating on tropical, polar, oceanic, and very early observations. METFR inventoried and catalogued all pre-1957 upper-air data available in the Météo-France climate database and not included in CHUAN. Generally, nine types of data sources were identified (see Table 1).

Table 2.
The largest single inventoried ERA-CLIM sources of moving upper-air data (> 100 station days).

Table 3.
The largest single inventoried ERA-CLIM sources of fixed station upper-air data (> 15 000 station days). In the "type" column, A stands for aircraft, CB for captive balloon, K for kite, P for pilot balloon, R for radiosonde, and RB for registering balloon.

Table 4.
Unit conversion constants used.
l. until 1930, and below 8000 m a.s.l. thereafter (with the geometrical altitude given in m a.g.l. until 1926 and in m or km a.s.l. afterwards). The collection of this type of document is to date the only source of long-term pilot balloon series for the whole of mainland France during the period 1923-1936, for the overseas departments during the period 1946-1957, for the French southern Antarctic territories during the period 1950-1957 and for the North African ex-colonies during the period 1923-1925. The whole collection of CRQ for mainland France is stored in paper form at the French National Archives in 1304 boxes. The gap during WWII depends on the location of the station. One long series without a gap in the south of France has been inventoried and imaged: Antibes-Nice.

Table 6.
Definition of the columns in the moving upper-air pressure level format ASCII files used for the ERA-CLIM data. Suffixes .1/.2 added to flag values signify observation values obtained during ascent/descent of a kite or tethered balloon (see Table 7). n runs from 0 to 50.

Table 7.
Definition of the columns in the fixed station upper-air geometrical height level format ASCII files used for the ERA-CLIM data. Suffixes .1/.2 added to flag values signify observation values obtained during ascent/descent of a kite or captive balloon (see Table 7). n runs from 0 to 100.

Table 8.
Definition of the columns in the fixed station upper-air pressure level format ASCII files used for the ERA-CLIM data. Suffixes .1/.2 added to flag values signify observation values obtained during ascent/descent of a kite or captive balloon (see Table 7). n runs from 0 to 50.

Table 9.
Meaning of flag values used in the upper-air data files.

At RIHMI, SAS was used to check the values of some key elements, such as month, day, hour and minutes, and to visualise the data with statistical graphs, such as box-and-whisker plots. All this enabled us to detect errors. The erroneous or suspicious values were re-checked using the original sources, the corrections were made with direct feedback into the Excel and Access sheets, and the processing was repeated. Such checks, based on tabulated statistics and on statistical graphs, together with the previous checks based on Microsoft Excel conditional formatting, allowed us to detect and immediately correct most of the erroneous values in the upper-air data at RIHMI. On the other hand, such checks are not comprehensive, and further checking of the data was performed at UBERN and RIHMI using feedback from the Operational Feedback Archive (OFA) of ECMWF, as described above.

Table 10.
Numbers used to specify observational platform ("data type") in the upper-air data files.