ESSD

Earth System Science Data

Earth Syst. Sci. Data

1866-3516

Copernicus Publications

Göttingen, Germany

10.5194/essd-10-81-2018

The National Eutrophication Survey: lake characteristics and historical nutrient concentrations

National Eutrophication Survey

Stachelek

Jemma

stachel2@msu.edu

https://orcid.org/0000-0002-5924-2464

Ford

Chanse

Kincaid

Dustin

King

Katelyn

Miller

Heather

Nagelkirk

Ryan

1 Department of Fisheries and Wildlife, Michigan State University, East Lansing, MI, USA

2 Department of Earth and Environmental Sciences, Michigan State University, East Lansing, MI, USA

3 Department of Integrative Biology, Michigan State University, East Lansing, MI, USA

4 Department of Microbiology and Molecular Genetics, Michigan State University, East Lansing, MI, USA

5 Department of Geography, Environment, and Spatial Sciences, Michigan State University, East Lansing, MI, USA

6 W.K. Kellogg Biological Station, Michigan State University, Hickory Corners, MI, USA

Jemma Stachelek (stachel2@msu.edu)

16 January 2018

10 1 81–86 15 June 2017 4 December 2017 30 November 2017 18 July 2017

2018

This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/

This article is available from https://essd.copernicus.org/articles/10/81/2018/essd-10-81-2018.html

The full text article is available as a PDF file from https://essd.copernicus.org/articles/10/81/2018/essd-10-81-2018.pdf

Abstract

Historical ecological surveys serve as a baseline and provide context for contemporary research, yet many of these records are not preserved in a way that ensures their long-term usability. The National Eutrophication Survey (NES) database is currently only available as scans of the original reports (PDF files) with no embedded character information. This limits its searchability, machine readability, and the ability of current and future scientists to systematically evaluate its contents. The NES data were collected by the US Environmental Protection Agency between 1972 and 1975 as part of an effort to investigate eutrophication in freshwater lakes and reservoirs. Although several studies have manually transcribed small portions of the database in support of specific studies, there have been no systematic attempts to transcribe and preserve the database in its entirety. Here we use a combination of automated optical character recognition and manual quality assurance procedures to make these data available for analysis. The performance of the optical character recognition protocol was found to be linked to variation in the quality (clarity) of the original documents. For each of the four archival scanned reports, our quality assurance protocol found an error rate between 5.9 and 17 %. The goal of our approach was to strike a balance between efficiency and data quality by combining entry of data by hand with digital transcription technologies. The finished database contains information on the physical characteristics, hydrology, and water quality of about 800 lakes in the contiguous US (Stachelek et al., 2017, https://doi.org/10.5063/F1639MVD). Ultimately, this database could be combined with more recent studies to generate meta-analyses of water quality trends and spatial variation across the continental US.

1 Introduction

Effective management of inland freshwater lakes requires an understanding of the factors that affect water quality and how these factors change over time. One of these factors, termed eutrophication, occurs when excess nutrient inputs from human activities fuel increases in algal growth, which can cause hypoxia and decreases in water clarity. Eutrophication of surface waters from increased phosphorus and nitrogen loading has been observed in connection with altered land use, especially in areas of rapid urbanization and intensive agriculture . As human populations and their impacts continue to grow, eutrophication is expected to become more widespread . Historical datasets are needed in order to track, understand, and manage eutrophication in lakes and reservoirs because they serve as an important baseline for modern studies.

Figure 1

Survey locations colored by sampling year (1972 northeastern: light blue; 1973 southeastern: blue; 1974 central: light green; 1975 western: green).

The US Environmental Protection Agency (EPA) designed and implemented the National Eutrophication Survey (NES) in order to investigate the extent of eutrophication in freshwater lakes and reservoirs across the contiguous US. Sampling took place in over 800 lakes and reservoirs from 1972 to 1975 and included a variety of physical, chemical, and biological metrics including data on nutrients and nutrient loading, hydrologic retention time, morphometry, and plankton community diversity. Each lake was sampled on a monthly basis for a period of 1 year. Except for the phytoplankton distribution subset, which we did not transcribe (see Stomp et al., 2011), the NES data are provided as annual averages. Unlike current EPA National Lakes Assessments (NLAs) that select a random sample of lakes across the US, the NES targeted only lakes impacted directly or indirectly by municipal sewage treatment plant discharge . Until recently, these data were only available in their entirety as four separate scanned reports representing the northeastern and north-central (northeastern), eastern and southeastern (southeastern), central, and western regions of the US (Fig. 1). In the remainder of the present paper we refer to the former two regions as simply the northeastern and southeastern regions.

To our knowledge, there have been no attempts to transcribe the data into a usable, searchable digital database despite its use in previous studies. For example, large portions of the dataset were used to examine large-scale relationships between residence time and phytoplankton abundance (Soballe and Kimmel, 1987). The dataset was also used to predict eutrophication incidence in a Bayesian framework (Lamon and Stow, 2004). Smaller portions of the data were used to explore drivers of nutrient loading . However, to our knowledge, the only study to use the NES dataset and provide a publicly available data supplement is that of Stomp et al. (2011), but their data supplement was limited to a small subset of the available variables relating to phytoplankton community diversity.

The present study is the first to leverage digital transcription technologies to unlock the full NES dataset. In this paper, we describe the digital transcription of the full NES dataset with the goal of making the dataset openly accessible to the research community. Specifically, our objective was to exactly reproduce the contents of the original dataset rather than to evaluate its scientific integrity. We introduce and publish the data in an open format that requires no proprietary software. It can be easily downloaded, used for analysis, and amended. The provided summary statistics and figures also allow users to quickly assess the utility of the data. Finally, the code and raw data files are provided to facilitate the extraction of fields not represented in our completed dataset (mostly phytoplankton diversity data).

2 Methods

Data were collected from multiple locations within the water column and included in situ measurements as well as laboratory analyses. Flow estimates and drainage area calculations were provided by the US Geological Survey and were determined from flow gauges when present. More detailed information on sampling methods, units, equipment, and accuracy can be found in the EPA survey methods publication (USEPA, 1975). Due to the historical nature of the dataset, the NES sampling design differs from more modern efforts (USEPA, 2009). For example, the original NES data were collected from four separate regions of the US over the course of 4 years, whereas current assessments complete nationwide sampling in a single summer. As such, NES data values represent the mean of measurements taken in the spring, summer, and fall in either 1972 (northeastern), 1973 (southeastern), 1974 (central), or 1975 (western) rather than summer measurements taken in a single year.

We obtained the NES archival scanned reports from the EPA National Service Center for Environmental Publications (available at: https://www.epa.gov/nscep). The data for the four NES regions are contained in four separate files, one per region. We extracted the data from each file using automated techniques followed by manual quality assurance and checking of each value. To begin, we enhanced (de-noised) each file using the local adaptive thresholding algorithm provided by the ImageMagick program (v6.8.9-9; available at https://www.imagemagick.org/). Next, we processed the enhanced files using the Tesseract optical character recognition (OCR) program (Smith, 2007; Ooms, 2017). The output of these initial extraction steps was recorded in a set of “raw data” files in which each file contains the raw unprocessed text of one document page. The contents of specific fields in the raw data were extracted to a database using the automated rules provided by the nesR software package (Stachelek, 2017). Finally, all values in the database were manually checked for accuracy against the original scanned reports. Inaccurate OCR outputs were corrected by hand in the final database. Because our goal was to reproduce the data from the original reports and not to verify the technical correctness of the original data, we only changed values if they did not match the original data reports. For example, we did not change data from the five NES lakes that had phosphate (PO4) values exceeding their corresponding total phosphorus (TP) values despite the fact that this is not physically possible (PO4 is a component of TP).
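The rule-based extraction step can be illustrated with a minimal Python sketch. It is not the nesR implementation (which is written in R); the field labels, the page layout, and the function names below are hypothetical and chosen only for illustration. The sketch pulls labelled values out of one page of raw OCR text and reports tokens that do not parse as numbers, which are the kind of OCR errors that the manual check corrects. Note that in the workflow described above every value was checked by hand, not only those that failed to parse.

```python
import re

# Hypothetical field labels; the real report layout and the nesR rules may differ.
FIELD_PATTERNS = {
    "surface_area_km2": re.compile(r"SURFACE AREA:\s*(\S+)", re.IGNORECASE),
    "mean_depth_m": re.compile(r"MEAN DEPTH:\s*(\S+)", re.IGNORECASE),
}


def parse_number(token):
    """Convert an OCR token to a float, or return None if it is not a clean number."""
    try:
        return float(token.replace(",", ""))
    except ValueError:
        return None


def extract_fields(page_text):
    """Apply each extraction rule to one page of raw OCR text."""
    record = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(page_text)
        record[name] = parse_number(match.group(1)) if match else None
    return record


def unparsed_fields(record):
    """Names of fields whose value is missing or could not be read as a number."""
    return sorted(name for name, value in record.items() if value is None)


# Illustrative page text: the OCR engine has misread "3.1" as "3.l".
raw_page = "SURFACE AREA: 12.4 SQ KM\nMEAN DEPTH: 3.l METERS\n"
record = extract_fields(raw_page)
print(record)
print(unparsed_fields(record))
```

A value that parses cleanly can still be wrong (for example a "3" misread as an "8"), which is why a full comparison against the scanned page remains necessary.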

We provide the final dataset in an open nonproprietary format (comma-delimited, *.csv). In addition, we generated metadata descriptions from the contents of the original scanned reports. All calculations, table construction, and figure generation were performed in R and saved as reproducible R scripts (R Core Team, 2017). Table and figure generation was accomplished with the use of the reshape2, plyr, and sp packages (Wickham, 2016; Pebesma and Bivand, 2017).
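As an illustration of how per-region summaries of the kind reported in Tables 1 and 2 (number of measurements, mean, and standard deviation) can be derived from a comma-delimited file, the following Python sketch uses only the standard library. It is not the authors' R code. The column names and the sample rows are made up for the example and do not reflect the published file schema.

```python
import csv
import io
import statistics
from collections import defaultdict

# Made-up rows; the column names are illustrative, not the published schema.
SAMPLE_CSV = """region,secchi_depth_m
western,2.0
western,4.0
central,1.0
central,
"""


def summarize(csv_text, value_column):
    """Return {region: (n, mean, sd)}, ignoring empty cells."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        cell = row[value_column].strip()
        if cell:
            groups[row["region"]].append(float(cell))
    summary = {}
    for region, values in groups.items():
        # The sample standard deviation is undefined for a single observation.
        sd = statistics.stdev(values) if len(values) > 1 else None
        summary[region] = (len(values), statistics.mean(values), sd)
    return summary


summary = summarize(SAMPLE_CSV, "secchi_depth_m")
for region, (n, mean, sd) in summary.items():
    print(region, n, mean, sd)
```

A variable measured in only one lake of a region has no defined sample standard deviation, which is consistent with the "NA" entries in Table 2.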

Table 1

Number of measurements (n) for each variable in each NES region.

| Variable | Western | Central | Northeastern | Southeastern |
|---|---|---|---|---|
| Drainage area | 122 | 138 | 171 | 232 |
| Surface area | 152 | 177 | 200 | 245 |
| Mean depth | 149 | 174 | 174 | 242 |
| Total inflow | 124 | 138 | 170 | 232 |
| Retention time | 124 | 140 | 158 | 230 |
| Alkalinity | 153 | 177 | 200 | 245 |
| Conductivity | 153 | 176 | 200 | 245 |
| Secchi depth | 153 | 177 | 200 | 245 |
| Total P | 153 | 177 | 200 | 245 |
| Total inorg. P | 153 | 177 | 200 | 245 |
| Total inorg. N | 153 | 177 | 200 | 245 |
| Total N | 152 | 176 | 1 | 245 |
| P pt. source mun. | 52 | 83 | 139 | 189 |
| P pt. source ind. | 7 | 1 | 10 | 24 |
| P pt. source sep. | 65 | 88 | 111 | 175 |
| P nonpt. source | 122 | 133 | 167 | 231 |
| P total inputs | 122 | 133 | 167 | 231 |
| N pt. source mun. | 52 | 84 | 139 | 189 |
| N pt. source ind. | 7 | 1 | 8 | 22 |
| N pt. source sep. | 77 | 90 | 111 | 184 |
| N nonpt. source | 122 | 129 | 167 | 231 |
| N total inputs | 122 | 129 | 167 | 231 |
| P total exports | 119 | 132 | 167 | 227 |
| P retention | 99 | 115 | 144 | 201 |
| P load per area | 122 | 133 | 167 | 231 |
| N total exports | 119 | 133 | 166 | 227 |
| N retention | 88 | 111 | 122 | 170 |
| N load per area | 122 | 135 | 167 | 231 |

3 Results

The final NES dataset contains observations from 775 lakes, and the distribution of these lakes was spatially variable. Although there were more lakes measured in the northeastern and southeastern US, the number of locations was close to evenly distributed among the remaining regions (Fig. 1, Table 1). Specifically, the number of lakes sampled in each region was as follows: northeastern, 200 lakes; southeastern, 245 lakes; central, 177 lakes; and western, 153 lakes.

Figure 2

Map of log-scaled alkalinity (mg L⁻¹) interpolated using inverse distance weighting.

Figure 3

Map of Secchi depth (m) interpolated using inverse distance weighting.
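For readers unfamiliar with the interpolation method named in the captions of Figs. 2 and 3, a minimal Python sketch of inverse distance weighting follows. It is not the code used to draw the maps. The weighting exponent (`power=2.0`) and the use of planar Euclidean distance are assumptions made for the example; interpolating over longitude and latitude would normally require a projected coordinate system.

```python
import math


def idw(sample_points, target, power=2.0):
    """Inverse-distance-weighted estimate at target = (x, y).

    sample_points is a list of (x, y, value) tuples. The exponent `power`
    is an assumed default; larger values give nearby samples more weight.
    """
    numerator = 0.0
    denominator = 0.0
    for x, y, value in sample_points:
        distance = math.hypot(x - target[0], y - target[1])
        if distance == 0.0:
            # The target coincides with a sample, so return that sample's value.
            return value
        weight = 1.0 / distance ** power
        numerator += weight * value
        denominator += weight
    return numerator / denominator


# Two illustrative lakes with made-up Secchi depths (m) on a planar grid.
lakes = [(0.0, 0.0, 1.0), (2.0, 0.0, 3.0)]
print(idw(lakes, (1.0, 0.0)))  # equidistant from both lakes
print(idw(lakes, (0.5, 0.0)))  # closer to the first lake
```

Each estimate is a weighted average of the sampled values, so interpolated values always stay within the range of the observations.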

Table 2

Mean and standard deviation (SD) for each variable in each NES region.

| Variable | Western (mean ± SD) | Central (mean ± SD) | Northeastern (mean ± SD) | Southeastern (mean ± SD) |
|---|---|---|---|---|
| Drainage area (km²) | 2.5×10⁴ ± 7.8×10⁴ | 2.1×10⁴ ± 7.5×10⁴ | 3.2×10³ ± 1.4×10⁴ | 5.3×10³ ± 1.4×10⁴ |
| Surface area (km²) | 44.57 ± 99.83 | 54.38 ± 1.4×10² | 27.25 ± 99.01 | 42.7 ± 1.4×10² |
| Mean depth (m) | 16.71 ± 27.08 | 5.97 ± 4.49 | 7 ± 9.37 | 6.4 ± 6.07 |
| Total inflow (m³ s⁻¹) | 52.1 ± 1.1×10² | 31.82 ± 71.77 | 23.1 ± 65.26 | 82.6 ± 2.3×10² |
| Retention time (yr) | 7.27 ± 43.32 | 2.78 ± 6.98 | 2.01 ± 4.77 | 0.59 ± 1.12 |
| Alkalinity (mg L⁻¹) | 1.7×10² ± 3.7×10² | 1.5×10² ± 91.51 | 1.2×10² ± 1.6×10² | 72.18 ± 66.25 |
| Conductivity (µΩ) | 4.9×10² ± 1.0×10³ | 6.4×10² ± 7.6×10² | 3.3×10² ± 4.0×10² | 2.5×10² ± 2.2×10² |
| Secchi depth (m) | 2.86 ± 2.64 | 1.2 ± 0.91 | 1.81 ± 1.71 | 1.22 ± 0.82 |
| Total P (mg L⁻¹) | 0.07 ± 0.13 | 0.11 ± 0.16 | 0.16 ± 0.35 | 0.12 ± 0.27 |
| Total inorg. P (mg L⁻¹) | 0.04 ± 0.11 | 0.04 ± 0.07 | 0.11 ± 0.3 | 0.05 ± 0.15 |
| Total inorg. N (mg L⁻¹) | 0.14 ± 0.23 | 0.33 ± 0.58 | 0.47 ± 0.66 | 0.72 ± 0.91 |
| Total N (mg L⁻¹) | 0.62 ± 0.65 | 1.22 ± 1.11 | 0.12 | 1.56 ± 1.25 |
| P pt. source mun. (kg yr⁻¹) | 2.5×10⁴ ± 8.7×10⁴ | 2.3×10⁴ ± 5.6×10⁴ | 3.5×10⁴ ± 1.5×10⁵ | 4.5×10⁴ ± 1.1×10⁵ |
| P pt. source ind. (kg yr⁻¹) | 2.5×10⁴ ± 4.0×10⁴ | 1.3×10⁴ ± NA | 2.7×10⁴ ± 4.9×10⁴ | 1.7×10⁴ ± 4.5×10⁴ |
| P pt. source sep. (kg yr⁻¹) | 56.62 ± 1.4×10² | 60.62 ± 93.67 | 1.6×10² ± 3.4×10² | 98.55 ± 2.3×10² |
| P nonpt. source (kg yr⁻¹) | 1.4×10⁵ ± 4.2×10⁵ | 1.8×10⁵ ± 6.8×10⁵ | 5.6×10⁴ ± 2.1×10⁵ | 1.9×10⁵ ± 5.5×10⁵ |
| P total inputs (kg yr⁻¹) | 1.5×10⁵ ± 4.7×10⁵ | 2.0×10⁵ ± 7.0×10⁵ | 8.7×10⁴ ± 3.4×10⁵ | 2.3×10⁵ ± 5.8×10⁵ |
| N pt. source mun. (kg yr⁻¹) | 7.8×10⁴ ± 2.5×10⁵ | 7.3×10⁴ ± 1.7×10⁵ | 1.4×10⁵ ± 5.4×10⁵ | 1.4×10⁵ ± 3.8×10⁵ |
| N pt. source ind. (kg yr⁻¹) | 2.3×10⁷ ± 6.1×10⁷ | 4.0×10³ ± NA | 1.6×10⁵ ± 4.2×10⁵ | 1.7×10⁵ ± 5.6×10⁵ |
| N pt. source sep. (kg yr⁻¹) | 5.7×10⁶ ± 5.0×10⁷ | 2.2×10³ ± 3.5×10³ | 4.3×10³ ± 5.5×10³ | 3.3×10³ ± 6.7×10³ |
| N nonpt. source (kg yr⁻¹) | 1.8×10⁶ ± 4.9×10⁶ | 1.8×10⁶ ± 4.4×10⁶ | 1.2×10⁶ ± 4.1×10⁶ | 3.1×10⁶ ± 8.9×10⁶ |
| N total inputs (kg yr⁻¹) | 6.8×10⁶ ± 5.7×10⁷ | 1.8×10⁶ ± 4.3×10⁶ | 1.3×10⁶ ± 4.6×10⁶ | 3.2×10⁶ ± 9.0×10⁶ |
| P total exports (kg yr⁻¹) | 6.2×10⁴ ± 1.7×10⁵ | 7.4×10⁴ ± 1.9×10⁵ | 7.3×10⁴ ± 3.1×10⁵ | 1.9×10⁵ ± 6.3×10⁵ |
| P retention (%) | 47.77 ± 28.5 | 57.55 ± 26.01 | 36.93 ± 25.2 | 42.7 ± 23.34 |
| P load per area (g m⁻² yr⁻¹) | 5.61 ± 21.36 | 3.3 ± 9.2 | 28.46 ± 97.49 | 9.43 ± 17.06 |
| N total exports (kg yr⁻¹) | 1.6×10⁶ ± 4.0×10⁶ | 1.2×10⁶ ± 2.8×10⁶ | 1.2×10⁶ ± 4.9×10⁶ | 3.0×10⁶ ± 8.3×10⁶ |
| N retention (%) | 39.33 ± 27.13 | 43.41 ± 23.97 | 28.41 ± 23.62 | 26.28 ± 18.85 |
| N load per area (g m⁻² yr⁻¹) | 1.8×10² ± 1.1×10³ | 42.67 ± 1.1×10² | 2.8×10² ± 9.1×10² | 1.3×10² ± 2.4×10² |

In addition to differences in the total number of lakes measured in each region, there were also differences in the proportion of lakes classified as impoundments rather than as natural lakes. Overall, about 60 % of the lakes studied (462 of 775) were classified as impoundments, yet the northeastern region had only 54 impoundments while the southeastern region had 168. Conversely, the number of natural lakes sampled in the northeastern region (146 lakes) was nearly double that of any other region (77, 48, and 42 for the southeastern, western, and central US, respectively).

We observed substantial spatial variation in many of the individual lake characteristics. For example, lakes in the eastern subregions were generally smaller and shallower than lakes in the western subregion (Table 2). In addition, lakes in the western subregion generally had higher alkalinity and higher water clarity (Figs. 2 and 3). Lakes with particularly low alkalinity were found in coastal areas, whereas lakes with particularly high alkalinity were found in Nevada, western Washington, and parts of North Dakota. Comparisons among regions were easy for some well-sampled lake chemistry parameters such as TP but more difficult for undersampled lake chemistry parameters. A particularly extreme example of this difficulty was total nitrogen in the northeastern region, as this parameter was only measured for a single lake (Table 1).

The ability to examine these spatial trends was made possible by our OCR procedure, which had an error rate of 6–17 % depending on region and archival report scan quality. In total, we carried out approximately 5000 corrections to the automated data product by hand as part of our manual quality control review. A total of approximately 650 lakes had values for at least 80 % of the total number of variables shown in Table 1. On an individual lake basis, the most common “missing” data were nutrient loading estimates for individual point- and nonpoint-source components. In many cases, these data may not actually be missing; rather, they may not have been a component of the budget for that particular lake. For example, not all lakes have industrial land use, so no data are expected in these cases.
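The two quantities reported above, the OCR error rate and the per-lake data completeness, can be computed as in the following Python sketch. The cell values, variable names, and function names are invented for the example, and the sketch assumes that the error rate is defined as the fraction of cells in which the OCR output differs from the hand-checked value.

```python
def error_rate(ocr_values, checked_values):
    """Fraction of cells whose OCR value differs from the hand-checked value."""
    if len(ocr_values) != len(checked_values):
        raise ValueError("cell lists must have equal length")
    mismatches = sum(1 for o, c in zip(ocr_values, checked_values) if o != c)
    return mismatches / len(ocr_values)


def completeness(record, variables):
    """Share of the listed variables that have a non-missing value."""
    present = sum(1 for name in variables if record.get(name) is not None)
    return present / len(variables)


# Invented cells: two of the four OCR outputs contain a misread character.
ocr_cells = ["12.4", "3.l", "0.07", "l.22"]
checked_cells = ["12.4", "3.1", "0.07", "1.22"]
print(error_rate(ocr_cells, checked_cells))

# Invented lake record: the industrial point-source estimate is absent.
variables = ["surface_area", "mean_depth", "total_p", "secchi", "p_pt_source_ind"]
lake = {"surface_area": 12.4, "mean_depth": 3.1, "total_p": 0.07, "secchi": 1.22,
        "p_pt_source_ind": None}
print(completeness(lake, variables) >= 0.8)
```

As noted above, a missing value does not always indicate a transcription gap; it may mean the component was never part of that lake's nutrient budget.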

4 Code and data availability

Original scanned reports from the EPA are available from the EPA National Service Center for Environmental Publications (https://www.epa.gov/nscep). Our cleaned and usable data are available for download at https://doi.org/10.5063/F1639MVD. The data are provided as a zip file, which contains all versions of the data including the raw and quality-checked versions (Stachelek et al., 2017). Moreover, the R package and R code used to scrape and analyze the data are provided by Stachelek (2017) so that the methods can be reproduced and are openly available for (re)use. All figures and summary statistics were generated with R scripts available in the data supplement.

5 Discussion

We have demonstrated an approach for rescuing historical data from scanned documents. In particular, our approach involved a two-step process of automated data scraping followed by curation by hand and quality assurance. Overall, we found that OCR was an efficient method for reducing the labor associated with transcribing analog text records (e.g., Drinkwater et al., 2014). Unfortunately, OCR technology is not perfectly accurate. In our case, transcription was hampered by poor print and scan quality of the source paper documents. We discovered through our manual validation procedure that the OCR computations produced inaccurate values in approximately 6–17 % of the cells in the complete dataset (n = 4836). We expect that accuracy could be improved by varying the window size of the local adaptive thresholding algorithm relative to the document font size. Our ability to experiment with thresholding window size was limited by the computationally expensive nature of these extractions.
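To make the role of the window size concrete, the following Python sketch implements a simplified local adaptive threshold on a tiny grayscale grid. It is not ImageMagick's implementation, and the function name, the `offset` parameter, and the pixel values are assumptions made for illustration. Each pixel is compared with the mean brightness of its surrounding window, so a window that is too small relative to the character strokes can fail to separate ink from background.

```python
def local_adaptive_threshold(image, window, offset=0.0):
    """Binarize a grayscale image (a list of rows) using a local mean.

    A pixel becomes 1 (ink) when it is darker than the mean of the
    window x window neighbourhood around it minus `offset`, else 0.
    Neighbourhoods are clipped at the image edges.
    """
    height, width = len(image), len(image[0])
    half = window // 2
    result = []
    for r in range(height):
        row_out = []
        for c in range(width):
            r0, r1 = max(0, r - half), min(height, r + half + 1)
            c0, c1 = max(0, c - half), min(width, c + half + 1)
            neighbourhood = [image[i][j] for i in range(r0, r1) for j in range(c0, c1)]
            local_mean = sum(neighbourhood) / len(neighbourhood)
            row_out.append(1 if image[r][c] < local_mean - offset else 0)
        result.append(row_out)
    return result


# A single dark "ink" pixel on a light background (made-up brightness values).
page = [
    [200, 200, 200],
    [200, 50, 200],
    [200, 200, 200],
]
print(local_adaptive_threshold(page, window=3, offset=10))
print(local_adaptive_threshold(page, window=1, offset=10))
```

With a 3 x 3 window the dark pixel is recovered as ink, whereas with a 1 x 1 window every pixel is compared only with itself and nothing is detected. The computational cost also grows with the window area, which is one reason a systematic search over window sizes can be expensive.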

The end result of our approach was data from every lake and nearly every variable in the NES survey dataset. The only primary subset of the NES data that is not included in our final product is the phytoplankton distribution data, which have already been digitally transcribed by Stomp et al. (2011). The results of the present study could be used to explore anthropogenic and environmental drivers of lake eutrophication as well as to verify previously documented trends. One example is the 2007 National Lakes Assessment Report, which included a reanalysis of some of the NES study lakes (USEPA, 2009). This reanalysis considered population-level trends in the NES lakes but did not consider trends in individual lakes or potential environmental drivers contributing to observed trends. On a population basis, the NLA reanalysis found that less than 30 % of the NES lakes had increased chlorophyll and phosphorus concentrations. The results of the present study could be used to verify these claims as well as to compare the NES data with more recent work such as the 2012 National Lakes Assessment. Note that NES sampling techniques may differ from current techniques; thus, care should be taken when making comparisons. In addition to its utility in validating historical trends, this dataset has value because it contains data on a number of hydrographic variables that are difficult to estimate, such as water residence (retention) time. Such data are critical to a variety of hydrological and water quality modeling efforts .

Although our goal was to digitally transcribe the full NES dataset to facilitate studies on historical nutrient loading, it is worth noting the similarities between the present study and other scientific record digitization initiatives. Such initiatives are common in the climate and ocean sciences but they are just starting to gain momentum in the biological sciences . To our knowledge, the present study is the first large-scale attempt at digitization of historical limnology records. We hope that by making our analysis open and reproducible we will inspire future efforts to recover important records from the pre-digital era.

Author contributions

All authors contributed to data quality assurance and edited the article text. JS conceived the study and implemented the optical character recognition code. CF, DK, and RN performed the data analysis and made figures. KK, HM, and JS wrote major parts of the paper.

Competing interests

The authors declare that they have no conflict of interest.

Acknowledgements

This work was developed as part of the Reproducible Quantitative Methods course (https://cbahlai.github.io/rqm-template/) led by Christie Bahlai, which was funded by the Mozilla Foundation, the Leona M. and Harry B. Helmsley Charitable Trust, the Michigan State University Program in Ecology and Evolutionary Biology, the BEACON Center for the Study of Evolution in Action, and the Kellogg Biological Station Long-Term Ecological Research site (NSF-DEB no. 1027253). Jemma Stachelek was supported by National Science Foundation grant ICER-1517823.

Edited by: David Carlson

Reviewed by: two anonymous referees

References

Allan, R., Brohan, P., Compo, G. P., Stone, R., Luterbacher, J., and Brönnimann, S.: The international atmospheric circulation reconstructions over the earth (ACRE) initiative, B. Am. Meteorol. Soc., 92, 1421–1425, 2011.

Bennett, E. M., Carpenter, S. R., and Caraco, N. F.: Human impact on erodable phosphorus and eutrophication: a global perspective: increasing accumulation of phosphorus in soil threatens rivers, lakes, and coastal oceans with eutrophication, AIBS Bulletin, 51, 227–234, 2001.

Brett, M. T. and Benjamin, M. M.: A review and reassessment of lake phosphorus retention and the nutrient loading concept, Freshwater Biol., 53, 194–211, https://doi.org/10.1111/j.1365-2427.2007.01862.x, 2008.

Drinkwater, R. E., Cubey, R. W., and Haston, E. M.: The use of Optical Character Recognition (OCR) in the digitisation of herbarium specimen labels, PhytoKeys, 38, 15–30, 2014.

Freeman, E., Woodruff, S. D., Worley, S. J., Lubker, S. J., Kent, E. C., Angel, W. E., Berry, D. I., Brohan, P., Eastman, R., Gates, L., Gloeden, W., Ji, Z., Lawrimore, J., Rayner, N. A., Rosenhagen, G., and Smith, S. R.: ICOADS Release 3.0: a major update to the historical marine climate record, Int. J. Climatol., 37, 2211–2232, https://doi.org/10.1002/joc.4775, 2017.

Ooms, J.: tesseract: Open Source OCR Engine for R, R package version 1.6, available at: https://CRAN.R-project.org/package=tesseract, last access: 2 February 2017.

Pebesma, E. and Bivand, R.: sp: Classes and Methods for Spatial Data, R package version 1.2-5, available at: https://CRAN.R-project.org/package=sp, last access: 2 February 2017.

R Core Team: R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, available at: https://www.R-project.org/, last access: 2 February 2017.

Smith, R.: An overview of the Tesseract OCR engine, in: Document Analysis and Recognition, ICDAR 2007, Ninth International Conference on, vol. 2, 629–633, IEEE, 2007.

Smith, V. H., Tilman, G. D., and Nekola, J. C.: Eutrophication: impacts of excess nutrient inputs on freshwater, marine, and terrestrial ecosystems, Environ. Pollut., 100, 179–196, 1999.

Smith, V. H., Dodds, W. K., Havens, K. E., Engstrom, D. R., Paerl, H. W., Moss, B., and Likens, G. E.: Comment: Cultural eutrophication of natural lakes in the United States is real and widespread, Limnol. Oceanogr., 59, 2217–2225, 2014.

Soballe, D. and Kimmel, B.: A large-scale comparison of factors influencing phytoplankton abundance in rivers, lakes, and impoundments, Ecology, 68, 1943–1954, 1987.

Stachelek, J.: nesR: Scrape Data from National Eutrophication Survey archival PDFs, R package version 0.2, https://doi.org/10.5281/zenodo.1048154, 2017.

Stachelek, J., Ford, C., Kincaid, D., King, K., Miller, H., and Nagelkirk, R.: The National Eutrophication Survey: lake characteristics and historical nutrient concentrations, Knowledge Network for Biocomplexity, https://doi.org/10.5063/F1639MVD, 2017.

Stomp, M., Huisman, J., Mittelbach, G. G., Litchman, E., and Klausmeier, C. A.: Large-scale biodiversity patterns in freshwater phytoplankton, Ecology, 92, 2096–2107, 2011.

Taranu, Z. E. and Gregory-Eaves, I.: Quantifying relationships among phosphorus, agriculture, and lake depth at an inter-regional scale, Ecosystems, 11, 715–725, 2008.

USEPA: National Eutrophication Survey Methods 1973–1976 (Working Paper No. 175), Tech. rep., United States Environmental Protection Agency, Office of Research and Development, Corvallis, OR, USA, 1975.

USEPA: National Lakes Assessment: A Collaborative Survey of the Nation's Lakes, Tech. rep., United States Environmental Protection Agency, Office of Research and Development, Washington, D.C., USA, 2009.

Wickham, H.: plyr: Tools for Splitting, Applying and Combining Data, R package version 1.8.4, available at: https://CRAN.R-project.org/package=plyr (last access: 2 February 2017), 2016.