Historical ecological surveys serve as a baseline and provide
context for contemporary research, yet many of these records are not
preserved in a way that ensures their long-term usability. The
National Eutrophication Survey (NES) database is currently only
available as scans of the original reports (PDF files) with no
embedded character information. This limits its searchability,
machine readability, and the ability of current and future
scientists to systematically evaluate its contents. The NES data
were collected by the US Environmental Protection Agency between
1972 and 1975 as part of an effort to investigate eutrophication in
freshwater lakes and reservoirs. Although several studies have
manually transcribed small portions of the database in support of
specific studies, there have been no systematic attempts to
transcribe and preserve the database in its entirety. Here we use
a combination of automated optical character recognition and manual
quality assurance procedures to make these data available for
analysis. The performance of the optical character recognition
protocol was found to be linked to variation in the quality
(clarity) of the original documents. For each of the four archival
scanned reports, our quality assurance protocol found an error rate
between 5.9 and 17 %. The goal of our approach was to
strike a balance between efficiency and data quality by combining
entry of data by hand with digital transcription technologies. The
finished database contains information on the physical
characteristics, hydrology, and water quality of about 800 lakes in
the contiguous US (Stachelek et al., 2017; https://doi.org/10.5063/F1639MVD). Ultimately, this database could be
combined with more recent studies to generate meta-analyses of water
quality trends and spatial variation across the continental US.
Introduction
Effective management of inland freshwater lakes requires an
understanding of the factors that affect water quality and how
these factors change over time. One of these factors, termed
eutrophication, occurs when excess nutrient inputs from human
activities fuel increases in algal growth, which can cause hypoxia
and decreases in water clarity. Eutrophication of surface waters
from increased phosphorus and nitrogen loading has been observed in
connection with altered land use, especially in areas of rapid
urbanization and intensive agriculture. As human
populations and their impacts continue to grow, eutrophication is
expected to become more widespread. Historical datasets are needed in order
to track, understand, and manage eutrophication in lakes and
reservoirs because they serve as an important baseline for modern
studies.
Figure 1. Survey locations colored by sampling year (1972 northeastern: light blue;
1973 southeastern: blue; 1974 central: light green; 1975 western: green).
The US Environmental Protection Agency (EPA) designed and
implemented the National Eutrophication Survey (NES) in order to
investigate the extent of eutrophication in freshwater lakes and
reservoirs across the contiguous US. Sampling took place in over
800 lakes and reservoirs from 1972 to 1975 and included a variety
of physical, chemical, and biological metrics including data on
nutrients and nutrient loading, hydrologic retention time,
morphometry, and plankton community diversity. Each lake was
sampled on a monthly basis for a period of 1 year. Except for the
phytoplankton distribution subset, which we did not transcribe
(see Stomp et al., 2011), the NES data are provided as annual
averages. Unlike current EPA National Lakes Assessments (NLAs) that select
a random sample of lakes across the US, the NES targeted only lakes
impacted directly or indirectly by municipal sewage treatment plant
discharge. Until recently, these
data were only available in their entirety as four separate scanned
reports representing the northeastern and north-central
(northeastern), eastern and southeastern (southeastern), central,
and western regions of the US (Fig. 1). In
the remainder of the present paper we refer to the former two
regions as simply the northeastern and southeastern regions.
To our knowledge, there have been no attempts to transcribe the
data into a usable, searchable digital database despite its use in
previous studies. For example, large portions of the dataset were
used to examine large-scale relationships between residence time
and phytoplankton abundance (Soballe and Kimmel, 1987). Also, it was
used to predict eutrophication incidence in a Bayesian framework
(Lamon and Stow, 2004). Smaller portions of the data were used to
explore drivers of nutrient loading. However, to our
knowledge, the only study to use the NES dataset and provide
a publicly available data supplement is that of Stomp et al. (2011), but
their data supplement was limited to a small subset of the
available variables relating to phytoplankton community diversity.
The present study is the first to leverage digital transcription
technologies to unlock the full NES dataset. In this paper, we
describe the digital transcription of the full NES dataset with the
goal of making the dataset openly accessible to the research
community. Specifically, our objective was to exactly reproduce the
contents of the original dataset rather than to evaluate its
scientific integrity. We introduce and publish the data in an open
format that requires no proprietary software. It can be easily
downloaded, used for analysis, and amended. The provided summary
statistics and figures also allow users to quickly assess the
utility of the data. Finally, the code and raw data files are
provided to facilitate the extraction of fields not represented in
our completed dataset (mostly phytoplankton diversity data).
Methods
Data were collected from multiple locations within the water column
and included in situ measurements as well as laboratory
analyses. Flow estimates and drainage area calculations were
provided by the US Geological Survey and were determined from flow gauges when
present. More detailed information on sampling methods, units,
equipment, and accuracy can be found in the EPA survey methods
publication (USEPA, 1975). Due to the historical nature of the
dataset, the NES sampling design differs from more modern efforts.
For example, the original NES data were
collected from four separate regions of the US over the course of
4 years, whereas current assessments complete nationwide
sampling in a single summer. As such, NES data values represent the
mean of measurements taken in the spring, summer, and fall in
either 1972 (northeastern), 1973 (southeastern), 1974 (central), or
1975 (western) rather than summer measurements taken in a single
year.
We obtained the NES archival scanned reports from the EPA National
Service Center for Environmental Publications (available at:
https://www.epa.gov/nscep). The data for the four NES regions are
contained in four separate files, one per region. We extracted the data from each
file using automated techniques followed by manual quality
assurance and checking of each value. To begin, we enhanced
(de-noised) each file using the local adaptive thresholding algorithm
as provided by the ImageMagick program (v6.8.9-9; available at
https://www.imagemagick.org/). Next, we processed the
enhanced files using the Tesseract optical character recognition
(OCR) program (Smith, 2007; Ooms, 2017). The output of
these initial extraction steps was recorded in a set of “raw
data” files in which each file contains the raw unprocessed text of
each document page. The contents of specific fields in the raw data
were extracted to a database using the automated rules provided by
the nesR software package (Stachelek, 2017). Finally, all values in
the database were manually checked for accuracy against the
original scanned reports. Inaccurate OCR outputs were
corrected by hand in the final database. Because our goal was to
reproduce the data from the original reports and not to verify the
technical correctness of the original data, we only changed values
if they did not match the original data reports. For example, we did
not change data from the five NES lakes that had phosphate
(PO4) values exceeding their corresponding total phosphorus (TP) values
despite the fact that this is not physically possible (PO4 is
a component of TP).
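The "flag but do not change" policy described above can be illustrated with a minimal sketch. It is written in Python for illustration only (the authors' workflow used R), and the record layout, the field names `po4` and `tp`, and the example values are all hypothetical rather than taken from the NES reports.

```python
def flag_po4_exceeds_tp(records):
    """Return the names of lakes whose reported PO4 exceeds reported TP.

    The input records are not modified; the check only identifies rows
    that a user may want to treat with caution.
    """
    flagged = []
    for rec in records:
        po4, tp = rec.get("po4"), rec.get("tp")
        if po4 is None or tp is None:
            continue  # cannot compare when either value is missing
        if po4 > tp:
            flagged.append(rec["lake"])
    return flagged


# Hypothetical values (mg/L) used only to exercise the function.
lakes = [
    {"lake": "A", "po4": 0.02, "tp": 0.05},
    {"lake": "B", "po4": 0.09, "tp": 0.07},  # PO4 > TP: flagged, not edited
    {"lake": "C", "po4": None, "tp": 0.11},  # missing PO4: skipped
]
print(flag_po4_exceeds_tp(lakes))
```

Keeping the check separate from the data preserves the transcription as a faithful copy of the source while still letting downstream users find physically implausible rows.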
We provide the final dataset in an open nonproprietary format
(comma-delimited, *.csv). In addition, we generated metadata
descriptions from the contents of the original scanned reports. All
calculations, table construction, and figure generation were
performed in R and saved as reproducible R scripts
(R Core Team, 2017). Table and figure generation was accomplished with
the use of the reshape2, plyr, and sp packages (Wickham, 2016; Pebesma and Bivand, 2017).
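As a rough illustration of the kind of per-region summary reported in the tables (count, mean, and standard deviation), the following Python sketch groups values by region and skips missing entries. It is not the authors' R code; the row layout, the `total_n` field name, and the numbers are hypothetical. It also shows why a standard deviation cannot be reported when a region has a single measurement.

```python
import statistics
from collections import defaultdict


def summarize(rows, variable):
    """Return {region: (n, mean, sd)} for one variable, ignoring missing values."""
    by_region = defaultdict(list)
    for row in rows:
        value = row.get(variable)
        if value is not None:
            by_region[row["region"]].append(value)

    summary = {}
    for region, values in by_region.items():
        n = len(values)
        mean = statistics.mean(values)
        # The sample SD needs at least two observations; use None otherwise.
        sd = statistics.stdev(values) if n > 1 else None
        summary[region] = (n, mean, sd)
    return summary


# Hypothetical rows used only to exercise the function.
rows = [
    {"region": "western", "total_n": 0.5},
    {"region": "western", "total_n": 0.7},
    {"region": "western", "total_n": None},  # missing value is skipped
    {"region": "northeastern", "total_n": 0.3},  # single measurement
]
for region, stats in summarize(rows, "total_n").items():
    print(region, stats)
```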
Table 1. Number of measurements (n) for each variable in each NES region.

Variable             Western  Central  Northeastern  Southeastern
Drainage area        122      138      171           232
Surface area         152      177      200           245
Mean depth           149      174      174           242
Total inflow         124      138      170           232
Retention time       124      140      158           230
Alkalinity           153      177      200           245
Conductivity         153      176      200           245
Secchi depth         153      177      200           245
Total P              153      177      200           245
Total inorg. P       153      177      200           245
Total inorg. N       153      177      200           245
Total N              152      176      1             245
P pt. source mun.    52       83       139           189
P pt. source ind.    7        1        10            24
P pt. source sep.    65       88       111           175
P nonpt. source      122      133      167           231
P total inputs       122      133      167           231
N pt. source mun.    52       84       139           189
N pt. source ind.    7        1        8             22
N pt. source sep.    77       90       111           184
N nonpt. source      122      129      167           231
N total inputs       122      129      167           231
P total exports      119      132      167           227
P retention          99       115      144           201
P load per area      122      133      167           231
N total exports      119      133      166           227
N retention          88       111      122           170
N load per area      122      135      167           231

Results
The final NES dataset contains observations from 775 lakes and the
distribution of these lakes was spatially variable. Although there
were more lakes measured in the northeastern and southeastern US,
the number of locations was close to evenly distributed among the
remaining regions (Fig. 1,
Table 1). Specifically, the numbers of lakes sampled in
each region were as follows: northeastern – 200 lakes,
southeastern – 245 lakes, central – 177 lakes, and western –
153 lakes.
Figure 2. Map of log-scaled alkalinity (mg L^-1) interpolated using inverse distance weighting.
Figure 3. Map of Secchi depth (m) interpolated using inverse distance weighting.
Table 2. Mean and standard deviation (SD) for each variable in each NES region.

Region                          Western              Central              Northeastern         Southeastern
Variable                        Mean ± SD            Mean ± SD            Mean ± SD            Mean ± SD
Drainage area (km^2)            2.5×10^4 ± 7.8×10^4  2.1×10^4 ± 7.5×10^4  3.2×10^3 ± 1.4×10^4  5.3×10^3 ± 1.4×10^4
Surface area (km^2)             44.57 ± 99.83        54.38 ± 1.4×10^2     27.25 ± 99.01        42.7 ± 1.4×10^2
Mean depth (m)                  16.71 ± 27.08        5.97 ± 4.49          7 ± 9.37             6.4 ± 6.07
Total inflow (m^3 s^-1)         52.1 ± 1.1×10^2      31.82 ± 71.77        23.1 ± 65.26         82.6 ± 2.3×10^2
Retention time (yr)             7.27 ± 43.32         2.78 ± 6.98          2.01 ± 4.77          0.59 ± 1.12
Alkalinity (mg L^-1)            1.7×10^2 ± 3.7×10^2  1.5×10^2 ± 91.51     1.2×10^2 ± 1.6×10^2  72.18 ± 66.25
Conductivity (µΩ)               4.9×10^2 ± 1.0×10^3  6.4×10^2 ± 7.6×10^2  3.3×10^2 ± 4.0×10^2  2.5×10^2 ± 2.2×10^2
Secchi depth (m)                2.86 ± 2.64          1.2 ± 0.91           1.81 ± 1.71          1.22 ± 0.82
Total P (mg L^-1)               0.07 ± 0.13          0.11 ± 0.16          0.16 ± 0.35          0.12 ± 0.27
Total inorg. P (mg L^-1)        0.04 ± 0.11          0.04 ± 0.07          0.11 ± 0.3           0.05 ± 0.15
Total inorg. N (mg L^-1)        0.14 ± 0.23          0.33 ± 0.58          0.47 ± 0.66          0.72 ± 0.91
Total N (mg L^-1)               0.62 ± 0.65          1.22 ± 1.11          0.12                 1.56 ± 1.25
P pt. source mun. (kg yr^-1)    2.5×10^4 ± 8.7×10^4  2.3×10^4 ± 5.6×10^4  3.5×10^4 ± 1.5×10^5  4.5×10^4 ± 1.1×10^5
P pt. source ind. (kg yr^-1)    2.5×10^4 ± 4.0×10^4  1.3×10^4 ± NA        2.7×10^4 ± 4.9×10^4  1.7×10^4 ± 4.5×10^4
P pt. source sep. (kg yr^-1)    56.62 ± 1.4×10^2     60.62 ± 93.67        1.6×10^2 ± 3.4×10^2  98.55 ± 2.3×10^2
P nonpt. source (kg yr^-1)      1.4×10^5 ± 4.2×10^5  1.8×10^5 ± 6.8×10^5  5.6×10^4 ± 2.1×10^5  1.9×10^5 ± 5.5×10^5
P total inputs (kg yr^-1)       1.5×10^5 ± 4.7×10^5  2.0×10^5 ± 7.0×10^5  8.7×10^4 ± 3.4×10^5  2.3×10^5 ± 5.8×10^5
N pt. source mun. (kg yr^-1)    7.8×10^4 ± 2.5×10^5  7.3×10^4 ± 1.7×10^5  1.4×10^5 ± 5.4×10^5  1.4×10^5 ± 3.8×10^5
N pt. source ind. (kg yr^-1)    2.3×10^7 ± 6.1×10^7  4.0×10^3 ± NA        1.6×10^5 ± 4.2×10^5  1.7×10^5 ± 5.6×10^5
N pt. source sep. (kg yr^-1)    5.7×10^6 ± 5.0×10^7  2.2×10^3 ± 3.5×10^3  4.3×10^3 ± 5.5×10^3  3.3×10^3 ± 6.7×10^3
N nonpt. source (kg yr^-1)      1.8×10^6 ± 4.9×10^6  1.8×10^6 ± 4.4×10^6  1.2×10^6 ± 4.1×10^6  3.1×10^6 ± 8.9×10^6
N total inputs (kg yr^-1)       6.8×10^6 ± 5.7×10^7  1.8×10^6 ± 4.3×10^6  1.3×10^6 ± 4.6×10^6  3.2×10^6 ± 9.0×10^6
P total exports (kg yr^-1)      6.2×10^4 ± 1.7×10^5  7.4×10^4 ± 1.9×10^5  7.3×10^4 ± 3.1×10^5  1.9×10^5 ± 6.3×10^5
P retention (%)                 47.77 ± 28.5         57.55 ± 26.01        36.93 ± 25.2         42.7 ± 23.34
P load per area (g m^-2 yr^-1)  5.61 ± 21.36         3.3 ± 9.22           8.46 ± 97.49         9.43 ± 17.06
N total exports (kg yr^-1)      1.6×10^6 ± 4.0×10^6  1.2×10^6 ± 2.8×10^6  1.2×10^6 ± 4.9×10^6  3.0×10^6 ± 8.3×10^6
N retention (%)                 39.33 ± 27.13        43.41 ± 23.97        28.41 ± 23.62        26.28 ± 18.85
N load per area (g m^-2 yr^-1)  1.8×10^2 ± 1.1×10^3  42.67 ± 1.1×10^2     2.8×10^2 ± 9.1×10^2  1.3×10^2 ± 2.4×10^2

In addition to differences in the total number of lakes measured
in each region, there were also differences in the proportion of
lakes classified as impoundments rather than as natural lakes. For
example, about 60 % of all the lakes studied (462 of
775) were classified as impoundments, yet the northeastern region
had only 54 impoundments while the southeastern region had 168
impoundments. Conversely, the number of natural lakes sampled in
the northeastern region (146 lakes) was nearly double that of
the southeastern region (77 lakes) and roughly triple that of the
western (48 lakes) and central (42 lakes) regions.
We observed substantial spatial variation in many of the
individual lake characteristics. For example, lakes in the eastern
subregions were generally smaller and shallower than lakes in the
western subregion (Table 2). In addition, lakes
in the western subregion generally had higher alkalinity and
higher water clarity (Figs. 2 and 3). Lakes with particularly low alkalinity were
found in coastal areas, whereas lakes with particularly high
alkalinity were found in Nevada, western Washington, and parts of
North Dakota. Comparisons among regions were straightforward for
well-sampled lake chemistry parameters such as TP
but more difficult for undersampled lake chemistry
parameters. A particularly extreme example of this difficulty was
total nitrogen in the northeastern region, as this
parameter was only measured for a single lake
(Table 1).
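The alkalinity and Secchi depth maps are described as interpolated using inverse distance weighting. A minimal Python sketch of that general technique follows. It is not the authors' implementation: the power parameter of 2 and the use of planar (already projected) coordinates are assumptions, since the text does not specify them, and the sample points are hypothetical.

```python
import math


def idw(points, x, y, power=2.0):
    """Inverse-distance-weighted estimate at (x, y).

    `points` is a non-empty list of (px, py, value) samples in planar
    coordinates. Closer samples receive larger weights (1 / distance**power).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for px, py, value in points:
        distance = math.hypot(x - px, y - py)
        if distance == 0.0:
            return value  # exactly on a sample location: return it directly
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


# Hypothetical sample locations and values.
samples = [(0.0, 0.0, 1.0), (2.0, 0.0, 3.0)]
print(idw(samples, 1.0, 0.0))  # midway between two samples
print(idw(samples, 0.5, 0.0))  # closer to the first sample
```

A larger `power` makes the surface follow nearby samples more closely, while a smaller one produces a smoother map; geographic coordinates would need to be projected before distances are computed this way.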
The ability to examine these spatial trends was made possible by
our OCR procedure, which had an error rate of
6–17 % depending on region and archival report
scan quality. In total, we carried out approximately 5000
corrections to the automated data product by hand as part of our
manual quality control review. A total of approximately 650 lakes
had values for at least 80 % of the total number of
variables shown in Table 1. On an individual lake
basis, the most common “missing” data were nutrient loading
estimates for individual point- and nonpoint-source components. In
many cases, these data may not actually be missing; rather, the source in question may not have
been a component of the nutrient budget for that particular lake. For
example, not all lakes have industrial land use, so no data are
expected in these cases.
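The two quantities discussed above, the OCR error rate and the share of variables present for a lake, reduce to simple ratios. The Python sketch below shows one way they could be computed. It is illustrative only: the function names, the cell-by-cell comparison of strings, and the example values are assumptions, not the authors' quality assurance code.

```python
def ocr_error_rate(ocr_cells, verified_cells):
    """Fraction of cells whose OCR text differs from the hand-verified text."""
    if len(ocr_cells) != len(verified_cells):
        raise ValueError("cell lists must have the same length")
    if not ocr_cells:
        raise ValueError("no cells to compare")
    mismatches = sum(1 for o, v in zip(ocr_cells, verified_cells) if o != v)
    return mismatches / len(ocr_cells)


def completeness(record, variables):
    """Fraction of the listed variables that have a non-missing value."""
    present = sum(1 for name in variables if record.get(name) is not None)
    return present / len(variables)


# Hypothetical OCR output with typical confusions (l for 1, S for 5).
ocr = ["153", "l77", "200", "24S", "0.07", "0.ll", "1.2", "2.86", "52.1", "7.27"]
verified = ["153", "177", "200", "245", "0.07", "0.11", "1.2", "2.86", "52.1", "7.27"]
print(ocr_error_rate(ocr, verified))

# Hypothetical lake record with one missing and one absent variable.
lake = {"tp": 0.1, "secchi": 1.2, "tn": None}
print(completeness(lake, ["tp", "secchi", "tn", "alkalinity"]))
```

Note that a completeness score of this kind cannot distinguish a value that was never measured from a source that does not apply to the lake, which is the caveat raised in the text.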
Code and data availability
Original scanned reports from the EPA are
available from the EPA National Service Center for Environmental
Publications (https://www.epa.gov/nscep). Our cleaned and
useable data are available for download at https://doi.org/10.5063/F1639MVD. The data
are provided as a zip file, which contains all versions of the
data including the raw and quality-checked versions
(Stachelek et al., 2017). Moreover, the R package and R code used to scrape
and analyze the data are provided by Stachelek (2017) so that the
methods can be reproduced and are openly available for (re)use. All figures and summary
statistics were generated with R scripts available in the data supplement.
Discussion
We have demonstrated an approach for rescuing historical data from
scanned documents. In particular, our approach involved a two-step
process of automated data scraping followed by curation by hand and
quality assurance. Overall, we found that OCR was an efficient method for reducing the labor
associated with transcribing analog text records (e.g.,
Drinkwater et al., 2014). Unfortunately, OCR technology is not perfectly accurate. In our
case, transcription was hampered by poor print and scan quality of
the source paper documents. We discovered through our manual
validation procedure that the OCR procedure produced inaccurate
values in approximately 6–17 % of the cells in the
complete dataset (n=4836). We expect that accuracy could be
improved by varying the window size of the
local adaptive thresholding algorithm relative to the document
font size. Our ability to experiment with thresholding window size
was limited by the computational expense of these
extractions.
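To make the role of the window size concrete, the following Python sketch implements a simple form of local adaptive thresholding on a small grayscale array. It is a pure-Python illustration of the general idea, not ImageMagick's implementation; the `offset` parameter, the border handling, and the example pixel values are assumptions.

```python
def local_adaptive_threshold(image, window=3, offset=0):
    """Binarize a grayscale image given as a list of rows (values 0-255).

    A pixel becomes white (255) when it is lighter than the mean of the
    window x window neighborhood centered on it plus `offset`; otherwise
    it becomes black (0). Neighborhoods are clipped at the image border.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    height, width = len(image), len(image[0])
    result = []
    for r in range(height):
        out_row = []
        for c in range(width):
            r0, r1 = max(0, r - half), min(height, r + half + 1)
            c0, c1 = max(0, c - half), min(width, c + half + 1)
            neighborhood = [image[i][j] for i in range(r0, r1) for j in range(c0, c1)]
            local_mean = sum(neighborhood) / len(neighborhood)
            out_row.append(255 if image[r][c] > local_mean + offset else 0)
        result.append(out_row)
    return result


# Hypothetical unevenly lit "page": the background brightens from left to
# right, and two darker "ink" pixels sit at row 1, columns 1 and 5.
page = [
    [80, 100, 120, 140, 160, 180, 200],
    [80, 40, 120, 140, 160, 120, 200],
    [80, 100, 120, 140, 160, 180, 200],
]
for row in local_adaptive_threshold(page, window=3, offset=-15):
    print(row)
```

In this example no single global threshold separates ink from background, because the ink pixel on the bright side (120) is lighter than the background on the dark side (80 and 100), whereas the local comparison marks only the two ink pixels as black. The window has to be large enough to include background around each character stroke, which is why its size relative to the font size matters, and the cost of this naive version grows with the square of the window size.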
The end result of our approach was data from every lake and nearly
every variable in the NES survey dataset. The only primary subset
of the NES data that is not included in our final product is the
phytoplankton distribution data, which have already been digitally
transcribed by Stomp et al. (2011). The results of the present
study could be used to explore anthropogenic and environmental
drivers of lake eutrophication as well as to verify previously
documented trends. One example is the 2007 National Lakes
Assessment Report, which included a reanalysis of some of
the NES study lakes (USEPA, 2009). This reanalysis
considered population-level trends in the NES lakes but did not
consider trends in individual lakes or potential environmental
drivers contributing to observed trends. On a population basis,
the NLA reanalysis found that less than 30 % of the NES
lakes had increased chlorophyll and phosphorus concentrations. The
results of the present study could be used to verify these claims
as well as to compare the NES data with more recent work such as
the 2012 National Lakes Assessment. Note that NES sampling techniques
may differ from current techniques; thus, care should be taken when
making comparisons. In addition to its utility in validating
historical trends, this dataset has value because it contains data
on a number of hydrographic variables that are difficult to
estimate, such as water residence (retention) time. Such data are
critical to a variety of hydrological and water quality modeling
efforts.
Although our goal was to digitally transcribe the full NES dataset
to facilitate studies on historical nutrient loading, it is worth
noting the similarities between the present study and other
scientific record digitization initiatives. Such initiatives are
common in the climate and ocean sciences but they are just
starting to gain momentum in the biological sciences. To our
knowledge, the present study is the first large-scale attempt at
digitization of historical limnology records. We hope that by
making our analysis open and reproducible we will inspire future
efforts to recover important records from the pre-digital era.
Author contributions
All authors contributed to data quality assurance and edited the article text.
JS conceived the study and implemented the optical character recognition
code. CF, DK, and RN performed the data analysis and made figures. KK, HM,
and JS wrote major parts of the paper.
Competing interests
The authors declare that they have no conflict of
interest.
Acknowledgements
This work was developed as part of the Reproducible Quantitative Methods
course (https://cbahlai.github.io/rqm-template/) led by Christie
Bahlai, which was funded by the Mozilla Foundation, the Leona M. and Harry
B. Helmsley Charitable Trust, the Michigan State University Program in Ecology
and Evolutionary Biology, the BEACON Center for the Study of Evolution in Action, and the
Kellogg Biological Station Long-Term Ecological Research site (NSF-DEB
no. 1027253). Jemma Stachelek was supported by National Science Foundation grant ICER-1517823.
Edited by: David Carlson
Reviewed by: two anonymous referees
References
Allan, R., Brohan, P., Compo, G. P., Stone, R., Luterbacher, J., and Brönnimann, S.: The international atmospheric circulation reconstructions over the earth (ACRE) initiative, B. Am. Meteorol. Soc., 92, 1421–1425, 2011.
Bennett, E. M., Carpenter, S. R., and Caraco, N. F.: Human impact on erodable phosphorus and eutrophication: a global perspective: increasing accumulation of phosphorus in soil threatens rivers, lakes, and coastal oceans with eutrophication, AIBS Bulletin, 51, 227–234, 2001.
Brett, M. T. and Benjamin, M. M.: A review and reassessment of lake phosphorus retention and the nutrient loading concept, Freshwater Biol., 53, 194–211, https://doi.org/10.1111/j.1365-2427.2007.01862.x, 2008.
Drinkwater, R. E., Cubey, R. W., and Haston, E. M.: The use of Optical Character Recognition (OCR) in the digitisation of herbarium specimen labels, PhytoKeys, 38, 15–30, 2014.
Freeman, E., Woodruff, S. D., Worley, S. J., Lubker, S. J., Kent, E. C., Angel, W. E., Berry, D. I., Brohan, P., Eastman, R., Gates, L., Gloeden, W., Ji, Z., Lawrimore, J., Rayner, N. A., Rosenhagen, G., and Smith, S. R.: ICOADS Release 3.0: a major update to the historical marine climate record, Int. J. Climatol., 37, 2211–2232, https://doi.org/10.1002/joc.4775, 2017.
Ooms, J.: tesseract: Open Source OCR Engine for R, available at: https://CRAN.R-project.org/package=tesseract, R package version 1.6, last access: 2 February 2017.
Pebesma, E. and Bivand, R.: sp: Classes and Methods for Spatial Data, available at: https://CRAN.R-project.org/package=sp, R package version 1.2-5, last access: 2 February 2017.
R Core Team: R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, available at: https://www.R-project.org/, last access: 2 February 2017.
Smith, R.: An overview of the Tesseract OCR engine, in: Document Analysis and Recognition, ICDAR 2007, Ninth International Conference on, vol. 2, 629–633, IEEE, 2007.
Smith, V. H., Tilman, G. D., and Nekola, J. C.: Eutrophication: impacts of excess nutrient inputs on freshwater, marine, and terrestrial ecosystems, Environ. Pollut., 100, 179–196, 1999.
Smith, V. H., Dodds, W. K., Havens, K. E., Engstrom, D. R., Paerl, H. W., Moss, B., and Likens, G. E.: Comment: Cultural eutrophication of natural lakes in the United States is real and widespread, Limnol. Oceanogr., 59, 2217–2225, 2014.
Soballe, D. and Kimmel, B.: A large-scale comparison of factors influencing phytoplankton abundance in rivers, lakes, and impoundments, Ecology, 68, 1943–1954, 1987.
Stachelek, J.: nesR: Scrape Data from National Eutrophication Survey archival PDFs, https://doi.org/10.5281/zenodo.1048154, R package version 0.2, 2017.
Stachelek, J., Ford, C., Kincaid, D., King, K., Miller, H., and Nagelkirk, R.: The National Eutrophication Survey: lake characteristics and historical nutrient concentrations, Knowledge Network for Biocomplexity, https://doi.org/10.5063/F1639MVD, 2017.
Stomp, M., Huisman, J., Mittelbach, G. G., Litchman, E., and Klausmeier, C. A.: Large-scale biodiversity patterns in freshwater phytoplankton, Ecology, 92, 2096–2107, 2011.
Taranu, Z. E. and Gregory-Eaves, I.: Quantifying relationships among phosphorus, agriculture, and lake depth at an inter-regional scale, Ecosystems, 11, 715–725, 2008.
USEPA: National Eutrophication Survey Methods 1973–1976 (Working Paper No. 175), Tech. rep., United States Environmental Protection Agency, Office of Research and Development, Corvallis, OR, USA, 1975.
USEPA: National Lakes Assessment: A Collaborative Survey of the Nation's Lakes, Tech. rep., United States Environmental Protection Agency, Office of Research and Development, Washington, D.C., USA, 2009.
Wickham, H.: plyr: Tools for Splitting, Applying and Combining Data, available at: https://CRAN.R-project.org/package=plyr (last access: 2 February 2017), R package version 1.8.4, 2016.