BCUB – a large-sample ungauged basin attribute dataset for British Columbia, Canada

Kovacek, Daniel; Weijs, Steven

doi:https://doi.org/10.5194/essd-17-259-2025

Articles | Volume 17, issue 1

Data description paper

28 Jan 2025

Abstract

The British Columbia Ungauged Basin (BCUB) dataset is an open-source, extensible dataset of attributes describing terrain, soil, land cover, and climate indices of over 1.2 million ungauged catchments in British Columbia, Canada, including trans-boundary regions. The attributes included in the dataset are similar to those found in the large-sample hydrology literature for their association with hydrological processes. The BCUB database is intended to support water resources research and practice, namely monitoring network analysis studies, or hydrological modelling where basin characterization is used for model calibration. The dataset and the complete workflow to collect and process input data, to derive stream networks, and to delineate sub-basins and extract attributes, are available under a Creative Commons BY 4.0 license. The DOI link for the BCUB dataset is https://doi.org/10.5683/SP3/JNKZVT (Kovacek and Weijs, 2023).

Received: 07 Dec 2023 – Discussion started: 22 Jan 2024 – Revised: 22 Nov 2024 – Accepted: 28 Nov 2024 – Published: 28 Jan 2025

1 Introduction

Spatial datasets available for geoscience research and practice are increasing in size, scale, resolution, and variety. Advances in the capture and processing of remote sensing data have in recent years led to open-access publication of continental- and global-scale geospatial datasets at high resolution (U.S. Geological Survey, 2024; Huscroft et al., 2018; Latifovic et al., 2010; Lehner et al., 2021; Thornton et al., 2021). Geospatial data availability has supported the emergence of large-sample hydrology (LSH) datasets, which combine streamflow and climate time series with diverse physical attributes of streamflow monitored catchments to advance hydrological research at a global scale (Addor et al., 2020). The growth of LSH datasets realizes the age of high-quality, open-access geospatial data anticipated by Hrachowitz et al. (2013) following the decade of prediction in ungauged basins (PUB).

In contrast to the more recent expansion of geospatial data products, the streamflow monitoring network in Canada has contracted over the last 3 decades. Based on the HYDAT dataset accessed at Environment Canada's national water data archive (https://www.canada.ca/en/environment-climate-change/services/water-overview/quantity/monitoring/survey/data-products-services/national-archive-hydat.html, last access: 25 October 2024), the number of streamflow observation locations across Canada peaked in the order of 2300 in the 1980s and reduced to roughly 1700 in 2022 (on average per day). According to surface water monitoring density standards developed by the World Meteorological Organization (WMO) (via Coulibaly et al., 2013), nearly 90 % of Canada's terrestrial area is under-monitored, and almost 40 % is classified as ungauged. In general this trend holds for the province of British Columbia (BC), where outside of a few small regions in the south it is predominantly classified as ungauged or poorly gauged (Coulibaly et al., 2013).

The streamflow data used in a wide range of research and practice today come from monitoring networks built over many decades, highlighting the significant lag between monitoring objectives of the past and information needs of the present. Monitoring network decisions today must anticipate information needs decades into the future.

Recent deep learning (DL) approaches to regional hydrological modelling have relied on LSH datasets to infer relationships between climate inputs and streamflow. Model performance depends on the streamflow monitoring network used for training (Gauch et al., 2021) and the inclusion of static catchment attributes (Kratzert et al., 2019), though this may be due in part to increasing training sample heterogeneity despite uncertainty in attribute values (Li et al., 2022). DL models benefit from training datasets (streamflow monitoring networks) representing hydrologically diverse catchments, yet there is no clear consensus on how to define or evaluate such diversity (Gauch et al., 2021). Simply adding stations to the network increases diversity according to the idea of “uniqueness of place” (Beven, 2000). Alternatively, a basis for comparison could be established using a large sample of ungauged catchments defined by hydrologically relevant attributes.

The continued expansion of geospatial data requires significant effort to organize and process for use in catchment-based water resources research. A large, catchment-based dataset of geophysical attributes could support research in related fields that use attributes in large-sample catchment comparisons, for example in understanding the role of land cover in changing water temperature and its effect on fish habitat (Daigle et al., 2017). Water resource management decisions are typically made at the catchment level, so research and practice may be well served by datasets that are catchment-based, diverse in characteristics, and large in size and scale to reflect the scale dependency of physical processes governing the rainfall-runoff response (Arsenault et al., 2020).

1.1 Motivation

The monitoring deficit of a region can be addressed by adding more stations, or, under limited resources, optimal network arrangements can be approximated based on models trained on existing streamflow monitoring records, combined with information about unmonitored locations (Mishra and Coulibaly, 2010; Werstuck and Coulibaly, 2017, 2018). Large-sample datasets improve prediction in ungauged basins (PUB) by learning from diversity (Addor et al., 2017), yet to evaluate the effectiveness of current monitoring networks, a network must be compared in hydrological terms to the broader region it aims to represent.

Large-sample hydrology (LSH) datasets focus on monitored catchments, providing detailed catchment characteristics and streamflow data. They represent a small fraction of the catchments found in hydrographic datasets, which detail the geometry and network structure of rivers, lakes, and catchments but lack the catchment characteristics essential for large-scale PUB studies. LSH datasets could be meaningfully complemented by a much larger set of ungauged catchments accompanied by characteristics to support regional studies across a broader range of catchments.

To complement LSH data and support hydrological studies, the goal is to develop a dataset that (i) uses only open-access data sources with continuous coverage over the study region, (ii) is derived from the highest-resolution DEM available to cover the range of catchment areas represented in large-sample hydrology (monitored catchment) datasets, (iii) is published under an open-source license, (iv) allows expansion both spatially and dimensionally as new information arises, and (v) includes full replication code built on widely used open-source libraries. Several existing datasets were reviewed for the desired qualities listed above and for their potential to support research in network optimization, prediction in ungauged catchments, and broader water resources research.

1.2 Related datasets

1.2.1 Hydrographic datasets

The BC Freshwater Atlas (FWA) (Gray, 2010) is the definitive source of freshwater feature mapping for British Columbia (BC). It contains roughly 3 million geometries representing the province-wide set of first-order fundamental component watershed units, with a reference system designed to facilitate aggregation into larger watershed assessment units. The FWA dataset is strictly limited to the administrative bounds of BC, cutting off many important trans-boundary basins at borders. Since the dataset is primarily hydrographic, it does not include catchment attribute information commonly used in rainfall-runoff model calibration. The FWA is provided with an open-use license (https://www2.gov.bc.ca/gov/content/data/open-data/open-government-licence-bc, last access: 2 December 2021), but the code used to derive the dataset is to our knowledge unpublished, and as such it is not readily replicable or extensible with consistent input data and methodology.

The National Hydrographic Network (NHN) (Geobase, 2004) contains a hydrographic feature set similar to the BC FWA. It covers all of Canada and includes trans-boundary basins along the US border, but the geometries are organized in work unit limits (WULs) which break up complete basins. The watershed attributes are similarly limited, and the code used to derive the geometries is to our knowledge unpublished.

HydroSHEDS is a dataset for global-scale applications featuring river networks, watershed boundaries, and other hydrological features derived from the NASA Shuttle Radar Topography Mission (SRTM) DEM for most of North America at a resolution of roughly 90 m. At latitudes >60° north, corresponding to the northern border of BC with the Yukon territory, HydroSHEDS catchments are derived from coarser (≈500 m) HYDRO1k elevation data (Wickel et al., 2007). Attributes derived from distinct elevation data sources are difficult to compare, as discussed in Sect. 2.2, because the stream networks (and catchment boundaries) are unique to a DEM source and to the data processing methodology (Datta et al., 2022). Studies using the HydroSHEDS dataset typically exclude catchments smaller than 100 km² (Guth, 2011; Zhang et al., 2021; Kratzert et al., 2023).

1.2.2 Large-sample hydrology datasets

Do et al. (2018) present a review of key precedents in the emergence of LSH datasets, beginning with the Global Runoff Database (GRDB), which gathered daily and monthly streamflow time series observations from over 9000 catchments around the world. The MOPEX (Model Parameter Estimation Experiment) dataset (Duan et al., 2006), which provides hydrometeorological time series data for 438 US catchments, and the GAGES-II dataset (Falcone, 2011), which features geospatial attributes of monitored catchments, both laid the groundwork for the CAMELS dataset (Addor et al., 2017). CAMELS represents a combination of both attributes and hydrometeorological time series for a larger number of catchments in the United States. The CAMELS approach has since expanded to other regions, with CAMELS-CL (Chile) (Alvarez-Garreton et al., 2018), CAMELS-BR (Brazil) (Chagas et al., 2020), CAMELS-AUS (Australia) (Fowler et al., 2021), and CAMELS-GB (Great Britain) (Coxon et al., 2020) published between 2018 and 2021. The HYSETS dataset (Arsenault et al., 2020) expanded continental-scale LSH efforts across Canada, the United States, and Mexico, combining catchment attributes with streamflow records and gridded climate products, contributing to the convergence of large-sample hydrology research. The recent Caravan dataset (Kratzert et al., 2023) standardizes and integrates these regional and global datasets into a unified resource, supporting large-scale hydrological modelling across diverse geographic contexts.

Large-sample hydrology (LSH) datasets have excluded small basins, primarily because of uncertainties in basin delineation (Arsenault et al., 2020; Addor et al., 2017) but also to ensure sufficient sample size for estimating attributes from gridded data sources at different resolution (Guth, 2011). However, the rationale for a specific threshold is generally not given. The HYSETS dataset flags basins smaller than 50 km², representing nearly one-third of the dataset.

Regional datasets such as HYSETS and CAMELS often rely on catchment polygons sourced from official governing bodies, but the methods used for delineation are to our knowledge unpublished and likely vary in both approach and underlying data source, leading to uncertainties in basin delineation and attribute estimation. This uncertainty highlights a gap that can be addressed in part with continuous and complete DEM coverage at higher resolution.

A large and diverse set of ungauged locations and associated attributes is sought to represent the decision space for monitoring network analysis and optimization and more generally to support water resources research where catchment-based geospatial attributes are relevant.

1.3 British Columbia Ungauged Basin (BCUB) database

The BCUB database features a wide array of attributes describing the terrain, land cover, soil permeability and porosity, and climate of over 1.2 million (sub-)basins. We use the term “basin” to refer to the local watershed of any confluence or outlet in a stream network, including individual upstream branches and their combination. Figure 1 shows the set of active and discontinued streamflow monitoring stations from the HYSETS dataset (Arsenault et al., 2020) that lie within the study region and includes an inset showing the density of pour points representing the BCUB dataset over a small sample area. The study region represents any terrestrial area within or upstream of any point within the BC administrative boundary (https://open.canada.ca/data/en/dataset/a883eb14-0c0e-45c4-b8c4-b54c4a819edb, last access: 21 January 2025) (dashed red line in Fig. 1) and a buffer to include trans-boundary catchments and to mitigate the edge selection bias of optimal sensor placement in random fields (Hershfield, 1965; Rouhani, 1985; Krause et al., 2006).

https://essd.copernicus.org/articles/17/259/2025/essd-17-259-2025-f01

Figure 1. The BCUB study region expands beyond the British Columbia administrative border to capture trans-boundary regions. Active and discontinued streamflow monitoring stations (those included in HYSETS; Arsenault et al., 2020) are sparse and unevenly distributed as shown in the main part of the figure, and the inset detail (left middle) shows the high density of pour points (purple) defining catchments in the BCUB dataset. Key map and inset maps © OpenStreetMap (https://www.openstreetmap.org/copyright, last access: 29 November 2024) contributors 2024. Distributed under the Open Data Commons Open Database License (ODbL) v1.0.

The attribute set describing each sub-basin follows the HYSETS dataset as much as possible and includes select additional climate indices following the CAMELS dataset (Addor et al., 2017) to demonstrate how derived parameters can be added to the dataset. Three sets of land cover indices from the North American Land Change Monitoring System (NALCMS) (Latifovic et al., 2010) representing 2010, 2015, and 2020 are included to support questions about land cover change at the basin level as called for by Addor et al. (2020). Examples of these data in use are provided in Sect. 4.

Following Wilkinson et al. (2016), to support knowledge discovery, innovation, and integration of data and methods in subsequent work, both the data and the code used to generate the data are openly available. The code is provided not to champion a particular method but to highlight the nuance involved in developing large-sample datasets, which for brevity and clarity is generally left out of dataset description papers. There are no stochastic elements in the methodology, yet there are a large number of methodological choices that yield distinct outcomes. Providing the complete code aims, at minimum, to be explicit about these choices.

Our goal with the BCUB dataset was to provide a representative set of catchment attributes that cover key groups commonly found in the literature: terrain, land cover, climate, and soil. While the attribute set is not as extensive as the attributes found in the LSH literature, we prioritized creating a transparent, extensible data product with complete code and tutorial-like supporting information. Given the rapid development of attributes in LSH research, we focused on providing a solid framework rather than the most exhaustive or up-to-date set of attributes.

2 Data and methods

2.1 Data collection and pre-processing overview

Attributes of ungauged basins were clipped from the digital elevation, land cover, soil, and climate geospatial data sources listed in Table 1 through a data preparation and processing pipeline described in Sect. 2. Individual catchment polygons were delineated from the set of pour points in the stream network representing river confluences. The stream network was derived from the 1 arcsec (30 m at the Equator) resolution USGS 3DEP (U.S. Geological Survey, 2024) digital elevation model (DEM) using the open-source software library Whitebox (version 2.3) (Lindsay, 2016). Streams are defined by a minimum upstream accumulation of 1 km² to match the smallest monitored catchment in the HYSETS dataset.
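To give a sense of what the 1 km² accumulation threshold means in raster terms, the following sketch converts an area threshold into an approximate number of contributing cells on an unprojected 1 arcsec grid. It assumes a spherical Earth and geographic (latitude and longitude) cells, which is a simplification. The function name `cells_for_area_threshold` and the example latitudes are hypothetical and are not taken from the BCUB code repository, which remains the reference for how the threshold is applied in practice.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean spherical radius (simplifying assumption)


def cells_for_area_threshold(area_km2: float, lat_deg: float,
                             cell_arcsec: float = 1.0) -> int:
    """Approximate number of grid cells needed to cover `area_km2`.

    Cells are treated as rectangles whose east-west width shrinks with
    the cosine of latitude; this is an illustration, not a geodesic
    calculation.
    """
    cell_rad = math.radians(cell_arcsec / 3600.0)
    height_m = EARTH_RADIUS_M * cell_rad                   # north-south extent
    width_m = height_m * math.cos(math.radians(lat_deg))   # east-west extent
    return math.ceil(area_km2 * 1e6 / (height_m * width_m))


# Hypothetical usage: the same 1 km² threshold spans more cells at
# higher latitudes because each cell covers less ground.
for lat in (0.0, 49.0, 60.0):
    print(lat, cells_for_area_threshold(1.0, lat))
```

Under these assumptions a cell is roughly 31 m tall everywhere, consistent with the "30 m at the Equator" figure above, while the cell count per square kilometre roughly doubles between the Equator and 60° N.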

Table 1. Summary of catchment attribute source data. Source references: U.S. Geological Survey (2024); Huscroft et al. (2018); Latifovic et al. (2010); Thornton et al. (2022).

¹ 3DEP: 3D Elevation Program, U.S. Geological Survey. ² NALCMS: North American Land Change Monitoring System, accessed at http://www.cec.org/north-american-land-change-monitoring-system/ (last access: 23 May 2024). ³ Global hydrogeology maps. ⁴ Gridded daily climate estimates on the 1 km Grid for North America, Version 4. https://daymet.ornl.gov/ (last access: 1 July 2024).

https://essd.copernicus.org/articles/17/259/2025/essd-17-259-2025-f02

Figure 2. Schematic of the BCUB development pipeline, from retrieving input datasets from external sources to creating a final database of sub-basins and their representative catchment attributes.

The study region was divided into complete basin sub-regions (no surface inflow across boundaries) as shown in Fig. 3 assembled from HydroBASINS (Lehner et al., 2021) data to simplify the automated sub-basin delineation and attribute extraction workflow. The data processing pipeline is described as follows:

Define study region and sub-regions. Level 5 and 6 watersheds from the HydroBASINS dataset were used as a first approximation to break the study region into smaller components for memory management in data pre-processing. Study region bounds were refined by deriving the covering set of basins in each region independently; see Sect. 2.2.1 for more detail about the treatment of region bounds.
Retrieve DEM data. The study region bounding box was used to download the covering set of digital elevation tiles from the USGS 3D Elevation Program (https://www.usgs.gov/3d-elevation-program, last access: 3 March 2024) (U.S. Geological Survey, 2024). In addition, lower-resolution (90 m) DEM tiles from EarthEnv DEM90 (Robinson et al., 2014) were used in the data validation analysis presented in Sect. 2.2.
Pre-process DEM raster. Hydraulic conditioning of the DEM, including depression filling, resolving flats, computing flow direction and accumulation, and stream network extraction, was carried out using the open-source geospatial analysis software Whitebox (https://www.whiteboxgeo.com/, last access: 26 May 2024) (Lindsay, 2016).
Define and filter pour points. Pour points define the outlet of each catchment, and their precise location is specific to the input DEM and pre-processing steps. Each ungauged catchment is delineated from a pour point defined by the stream network. Lake polygons from HydroBASINS were used to filter out pour points within lakes. Points are flagged (in_perennial_ice) where the 2020 NALCMS land cover classification is perennial ice and snow.
Catchment delineation. Catchment polygons were derived from sets of input pour point coordinates using the “UnnestBasins” function in Whitebox (https://www.whiteboxgeo.com/).
Attribute extraction. Catchment polygons were used as clipping masks to capture representative values from the various geospatial layers. Attribute indices were aggregated from raster and vector layers as described in Table 2.

Additional detail about pour point selection, catchment attribute extraction, and data processing follows.

https://essd.copernicus.org/articles/17/259/2025/essd-17-259-2025-f03

Figure 3. The study region is divided into complete watershed sub-regions (encoded in the “region_code” parameter) by merging level 5 & 6 HydroBASINS Arctic and North America (N.Am.) polygon sets to cover the BC boundary. The study region extends beyond the administrative border of BC to include trans-boundary basins and a minimum buffer of ≈100 km. The purpose of merging complete watershed regions is to manage computational resources in the DEM pre-processing → sub-basin delineation → attribute extraction pipeline.

Table 2. Basin attributes in the BCUB database derived from USGS 3DEP (DEM), NALCMS (land cover), GLHYMPS (soil), and NASA Daymet (climate) datasets.

¹ Spatial aspect is expressed in degrees counter-clockwise from the east direction. ² The <year> suffix specifies the land cover dataset (2010, 2015, or 2020). ³ Soil parameters follow definitions from Huscroft et al. (2018). ⁴ Only the climate parameters directly extracted from distinct Daymet source variables are shown here. Additional computed parameters are discussed in Sect. 2.1.3. ⁵ A high-precipitation event is defined as total daily precipitation greater than 5 times the annual mean, and the duration refers to the mean duration of high-precipitation events. ⁶ A low-precipitation event is defined as total daily precipitation less than 1 mm, and the duration refers to the mean duration of low-precipitation events.

2.1.1 Pour point set selection

The outlet for runoff in a catchment is defined by the pour point. The catchments in the BCUB database are delineated from pour points, which are a subset of raster cells representing the stream network. The set of pour points used for catchment delineation is called the candidate monitoring location (CML) set. By limiting the CML set to river confluences, the number of polygons to process reduces to <5 % of the complete set of stream network cells. Since changes in upstream accumulated area are small along reaches between confluences, and by extension changes in the hydrologic properties of the sub-basin are small, eliminating points along stream lines between confluences reduces redundancy and data processing.

The CML set is defined by the following criteria:

confluences, stream cells with more than two neighbouring stream cells (eight-direction grid), where the flow direction of more than one neighbouring stream cell is pointed toward the target cell, and
river outlets, intersections of stream network lines with ocean coastline, major regional watershed outlets at the study region boundary, and connections with lakes where the upstream contributing area is at least 1 km².
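The confluence criterion above can be sketched on a small grid. The example below is a minimal illustration and not the BCUB implementation. It assumes the stream network is a boolean mask and that flow direction is stored as a (row step, column step) offset pointing to the downstream cell, a hypothetical encoding chosen for readability that differs from the pointer codes used by Whitebox. The function name `find_confluences` is likewise an assumption.

```python
from typing import List, Optional, Tuple

Offset = Tuple[int, int]  # (row step, column step) to the downstream cell

# Eight-direction neighbourhood
NEIGHBOURS: List[Offset] = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                            (0, 1), (1, -1), (1, 0), (1, 1)]


def find_confluences(stream: List[List[bool]],
                     flow_dir: List[List[Optional[Offset]]]
                     ) -> List[Tuple[int, int]]:
    """Return (row, col) of stream cells meeting the confluence criteria."""
    n_rows, n_cols = len(stream), len(stream[0])
    confluences = []
    for r in range(n_rows):
        for c in range(n_cols):
            if not stream[r][c]:
                continue
            n_stream = 0  # neighbouring stream cells
            n_inflow = 0  # neighbouring stream cells draining into (r, c)
            for dr, dc in NEIGHBOURS:
                nr, nc = r + dr, c + dc
                if not (0 <= nr < n_rows and 0 <= nc < n_cols):
                    continue
                if not stream[nr][nc]:
                    continue
                n_stream += 1
                # A neighbour drains here if its offset points back at (r, c).
                if flow_dir[nr][nc] == (-dr, -dc):
                    n_inflow += 1
            # More than two stream neighbours, more than one of them inflowing
            if n_stream > 2 and n_inflow > 1:
                confluences.append((r, c))
    return confluences


# Hypothetical Y-shaped network: two headwater branches join at (1, 1).
stream = [[True, False, True],
          [False, True, False],
          [False, True, False],
          [False, True, False]]
flow_dir = [[(1, 1), None, (1, -1)],
            [None, (1, 0), None],
            [None, (1, 0), None],
            [None, None, None]]
print(find_confluences(stream, flow_dir))
```

The sketch only locates the confluence cell itself. Adding the pour points on each upstream branch, and the outlet and lake-connection criteria, would require further steps that are not shown here.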

Pour points at confluences within lakes were excluded from the pour point set, as illustrated in Fig. 4, where a red “x” denotes a spurious confluence within a lake, and a yellow triangle represents the location where a river drains into or out of a lake. Green circles represent pour points of catchments defining each upstream branch of a confluence and their combination.

https://essd.copernicus.org/articles/17/259/2025/essd-17-259-2025-f04

Figure 4. The pour point identification algorithm derives the stream network (blue lines) and identifies catchment pour points. Spurious pour points at confluences within lakes (red “x”) are excluded, while pour points defining catchment outlets at river–lake connections (yellow triangle) are added. Pour points at confluences include the upstream branches of converging streams and their combination.

The headwaters mapped in the stream network are simply a vestige of the minimum area threshold (1 km²) used to define a stream network, so they are excluded from the pour point set. Accurate headwater identification (network extent mapping) requires a more rigorous approach to address uncertainty related to stream permanence (Shavers and Stanislawski, 2020). Mutzner et al. (2016) found classical (i.e. cumulative drainage area) threshold approaches do not capture spatial variability of headwater drainage networks in mountainous regions compared to detailed field survey mapping, and statistical methods are likewise unable to resolve local topography to accurately map headwater streams at low resolution. Further discussion of uncertainty is provided in Sect. 2.2.

2.1.2 Sub-basin delineation and notes on attributes

A catchment boundary polygon was generated for each pour point in the CML set using the “unnest basins” (https://www.whiteboxgeo.com/manual/wbt_book/available_tools/hydrological_analysis.html#UnnestBasins, last access: 26 May 2024) function in the Whitebox software library (Lindsay, 2016). Attributes were derived for each sub-basin by (i) using the polygons as raster clipping masks and (ii) spatial intersection of the polygon and geospatial raster and vector data in PostGIS (PostGIS Project, 2018).

Attribute values were computed using the geometric mean of the raster pixel values contained in basin polygons in the case of soil permeability, the circular mean in the case of slope aspect, the fraction of total area in the case of land use, and the spatial mean for all other attributes. Physical attributes are described in Table 2, and metadata attributes are described in Table 3.
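The three non-default aggregation rules can be illustrated with a short sketch. It is not the BCUB code. The function names are hypothetical, permeability is assumed to be supplied in linear (not logarithmic) units, and land cover pixels are assumed to have equal area so that pixel counts stand in for area fractions.

```python
import math
from typing import Dict, Iterable, List


def geometric_mean(values: Iterable[float]) -> float:
    """Geometric mean of strictly positive values (e.g. permeability)."""
    logs = [math.log(v) for v in values]
    return math.exp(sum(logs) / len(logs))


def circular_mean_deg(angles_deg: Iterable[float]) -> float:
    """Mean direction of angles given in degrees, wrapped to [0, 360)."""
    angles = [math.radians(a) for a in angles_deg]
    s = sum(math.sin(a) for a in angles)
    c = sum(math.cos(a) for a in angles)
    mean = math.degrees(math.atan2(s, c)) % 360.0
    return 0.0 if mean == 360.0 else mean  # guard against float round-up


def class_fractions(classes: List[int]) -> Dict[int, float]:
    """Fraction of pixels in each land cover class (equal-area pixels assumed)."""
    counts: Dict[int, int] = {}
    for k in classes:
        counts[k] = counts.get(k, 0) + 1
    return {k: n / len(classes) for k, n in counts.items()}
```

The circular mean matters for aspect because an arithmetic mean of 350° and 10° gives 180°, the opposite of the direction both slopes actually face. The geometric mean is the natural average for a quantity such as permeability that varies over orders of magnitude, since it is equivalent to averaging in log space.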

Table 3. BCUB dataset metadata attributes.

* Geometries are projected to the BC Albers (EPSG:3005) coordinate reference system.

Several binary attributes are included in the attribute set to represent uncertainty in geometry and value estimates. A 'soil_flag' value of 1 indicates that the clipped soil data differ from the catchment polygon area by more than 5 % to indicate gaps in the GLHYMPS (soil) data. A 'permafrost_flag' value of 1 represents the presence of permafrost in the basin. A value of 1 for the 'in_perennial_ice' flag represents a pour point location where the land cover classification is “perennial snow and ice” as defined by Latifovic et al. (2010). A 'geometry_flag' value of 1 represents a catchment intersecting or touching an uncertain area along the region boundary whose area is ≥5 % of the catchment area, as described in Sect. 2.2.1.
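As an illustration of the 5 % coverage test behind 'soil_flag', the sketch below compares the area of the clipped soil data with the catchment polygon area. The function signature and argument names are hypothetical, and the use of a relative difference against the polygon area is an assumption about how "differ by more than 5 %" is evaluated.

```python
def soil_flag(soil_area_km2: float, basin_area_km2: float,
              tolerance: float = 0.05) -> int:
    """Return 1 when soil coverage differs from the basin area by more
    than `tolerance` (as a fraction of the basin area), otherwise 0."""
    relative_diff = abs(soil_area_km2 - basin_area_km2) / basin_area_km2
    return int(relative_diff > tolerance)


# Hypothetical usage: 6 % of a 10 km² basin has no soil data.
print(soil_flag(9.4, 10.0))
```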

2.1.3 Data processing notes

Beyond data sources, the offline approach of deriving sub-basins from source data and writing code to process attributes was adopted despite the elegant online polygon aggregation and processing approach demonstrated by Kratzert et al. (2023) in developing the Caravan dataset with use of Google Earth Engine (GEE) (Gorelick et al., 2017). Such an approach is preferable from the perspective of standardized methods of catchment attribute extraction, but for our target of ungauged catchments it does not eliminate the need for DEM pre-processing to generate stream networks, for filtering and extracting pour points, or for sub-basin delineation. These steps represent a substantial portion of the attribute extraction workflow, and what remains to process with GEE is still subject to usage limits, namely for processing the very large set of polygons, even considering an aggregated polygon approach.

A benefit of the offline approach is generating a set of sub-basin polygons from the highest-resolution DEM available that is continuous and complete and ensuring that basin polygons match the DEM source from which terrain attributes are derived.

Expansion of the study region or addition of new attributes can be accomplished by following the processing methodology in the code repository provided. Four parameters derived from the Daymet daily precipitation data are processed in the code provided to demonstrate how computed parameters can be added to the BCUB from existing input data. The examples follow the CAMELS dataset (Addor et al., 2017) and include the following:

low precipitation frequency, frequency of days where precipitation <1 mm d⁻¹;
low precipitation duration, average duration of low-precipitation events, or the number of consecutive low-precipitation days <1 mm d⁻¹;
high precipitation frequency, frequency of days where precipitation is ≥5 times the mean daily precipitation; and
high precipitation duration, average duration of high-precipitation events, or the number of consecutive high-precipitation days ≥5 times mean daily precipitation.
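The four indices listed above can be sketched for a single daily precipitation series as follows. This is a minimal illustration under stated assumptions rather than the repository code. The dictionary keys and function names are hypothetical, frequency is expressed in days per year using a 365.25-day year (an assumption about units), and the series is taken to be a basin-averaged daily total in millimetres.

```python
from typing import Dict, List, Sequence, Tuple


def _event_lengths(flags: Sequence[bool]) -> List[int]:
    """Lengths of runs of consecutive True values."""
    runs, current = [], 0
    for flag in flags:
        if flag:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


def frequency_and_duration(flags: Sequence[bool],
                           days_per_year: float = 365.25
                           ) -> Tuple[float, float]:
    """Return (event days per year, mean event duration in days)."""
    runs = _event_lengths(flags)
    freq = sum(flags) / len(flags) * days_per_year
    duration = sum(runs) / len(runs) if runs else 0.0
    return freq, duration


def precipitation_indices(prcp_mm: Sequence[float]) -> Dict[str, float]:
    """Low and high precipitation frequency and duration for one series."""
    mean_daily = sum(prcp_mm) / len(prcp_mm)
    low = [p < 1.0 for p in prcp_mm]  # low: < 1 mm per day
    # high: >= 5 times mean daily precipitation (never true for a dry series)
    high = [mean_daily > 0 and p >= 5.0 * mean_daily for p in prcp_mm]
    low_freq, low_dur = frequency_and_duration(low)
    high_freq, high_dur = frequency_and_duration(high)
    return {"low_prcp_freq": low_freq, "low_prcp_duration": low_dur,
            "high_prcp_freq": high_freq, "high_prcp_duration": high_dur}
```

Durations are averaged over events rather than over days, so a single ten-day dry spell and ten isolated dry days give the same frequency but very different durations.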

2.2 Technical validation

The large number of geometries in the BCUB dataset requires an automated approach to validate the sub-basin polygons used to capture attributes. The representativeness of attributes is a function of the accuracy of the stream network derived from the DEM. Higher-resolution DEMs can better resolve lower-relief topographic features, resulting in better basin delineation performance, particularly for small basins (Zhang and Montgomery, 1994; Tarolli and Dalla Fontana, 2009; Woodrow et al., 2016).

It is important to emphasize that the 1 km² minimum drainage area threshold introduces significant uncertainty in the accuracy of the smallest sub-basins and those where topographic relief is low. Detailed validation of stream network accuracy is left to future work that the BCUB is intended to support, and validation of the smallest sub-basins used in studies is left to the user. Next we discuss indirect attribute validation methods and limitations of the dataset and methods.

2.2.1 Region boundary treatment

While the region polygons assembled from HydroBASINS are a helpful tool for organizing the data processing pipeline, the resulting bounds are different from those produced by independently delineating basins from the 1 arcsec DEM used in this study. These differences are comparable in size to the smallest sub-basins in the BCUB dataset, introducing uncertainty into the attributes of any catchment whose boundary touches or intersects them. Boundary deviations are defined as (i) gaps between region bounds where the DEM does not resolve an outlet and (ii) boundary overlaps between regions with shared boundaries.

The Caravan dataset (Kratzert et al., 2023) clearly describes the issue with aggregating attributes from catchment boundary polygons that do not precisely align with the HydroBASINS polygons. By independently deriving the region bounds from a single continuous DEM source (1 arcsec USGS 3DEP), we avoid the problem of misalignment with HydroBASINS polygons. This process does not guarantee perfect alignment of region bounds, but the mean size of deviations is significantly reduced.

To avoid restricting the catchment boundary delineation by the clipping mask, a (5 km) buffer was applied to the region boundaries aggregated from level 5 and 6 HydroBASINS polygons. The buffered polygons were used as clipping masks on the DEM before deriving the covering set of polygons (catchments) for each region. The covering set is defined as the smallest number of non-overlapping polygons covering a region. The exterior edges (of the union of intersecting geometries) were checked to verify that they do not touch the edge of each buffered region polygon. Where the edges intersect, the buffer (DEM clipping mask) was manually expanded in QGIS and the process repeated until the buffer was sufficient; i.e. the covering set of basins does not touch the edge of the clipping mask. The use of a buffer produces small peripheral catchments draining to adjacent region basins, and these are excluded by identifying that they are completely contained by the clipping mask of the adjacent regions.

Delineating region boundaries independently from the HydroBASINS polygons does not yield perfectly shared boundaries, but the resulting deviations are substantially smaller. The distribution of the size of deviations from shared sub-region boundaries is shown in Fig. 5. The red series represents differences between the BCUB region bounds and HydroBASINS-derived bounds (median area of 0.13 km²), while the blue series represents disagreement (overlaps and gaps) between the BCUB sub-region boundaries (median area 0.025 km²). Polygons smaller than 0.01 km², or 1 % of the smallest sub-basin in the BCUB dataset, were neglected. The boundary deviation polygons (gaps and overlaps) are included in the code repository.

Figure 5. Boundary uncertainties are significantly reduced relative to the smallest catchments in the dataset (1 km²) when region bounds are generated from the same DEM source as the catchments (blue, median uncertainty 0.025 km²), compared to HydroBASINS-derived regions (red, median uncertainty 0.13 km²).

The uncertainty introduced by missing or overlapping areas along sub-region bounds is addressed in the BCUB dataset in two ways. First, the “geometry_flag” attribute indicates that a catchment polygon intersects or touches an uncertain region bound if the total uncertain area represents at least 5 % of the catchment area. Where catchments derived from distinct basin outlets overlap, either catchment may overestimate the area, and where an area is not covered by any basin but is not necessarily endorheic, either bordering sub-basin may underestimate the catchment area. Second, where a catchment polygon touches or intersects with an uncertain boundary, the size of the uncertain area is represented by a positive integer value to indicate potential overestimation (“inside_pct_area_flag”) or underestimation (“outside_pct_area_flag”) of the catchment as a percentage of the catchment area. The purpose of including these quantities is to identify and express the significance of uncertain catchment bounds.
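The flag logic described above can be sketched in a few lines. This is an illustration only, not the code used to build the dataset: the function name, the treatment of the "total uncertain area" as the sum of overlap and gap areas, and the rounding to the nearest integer are all assumptions.

```python
def boundary_flags(catchment_area_km2, overlap_area_km2, gap_area_km2,
                   threshold_pct=5.0):
    """Return (geometry_flag, inside_pct_area_flag, outside_pct_area_flag).

    overlap_area_km2: uncertain area inside the polygon (possible overestimate).
    gap_area_km2: uncovered area bordering the polygon (possible underestimate).
    """
    if catchment_area_km2 <= 0:
        raise ValueError("catchment area must be positive")
    inside_pct = 100.0 * overlap_area_km2 / catchment_area_km2
    outside_pct = 100.0 * gap_area_km2 / catchment_area_km2
    # Assumption: the 5 % test applies to the combined uncertain area.
    geometry_flag = (inside_pct + outside_pct) >= threshold_pct
    # Assumption: percentages are stored rounded to the nearest integer.
    return geometry_flag, round(inside_pct), round(outside_pct)


# A 50 km² catchment with 2 km² of overlap and 1 km² of bordering gap.
flags = boundary_flags(50.0, 2.0, 1.0)
```

Here the combined uncertain area is 6 % of the catchment, so the sketch raises the flag and reports 4 % potential overestimation and 2 % potential underestimation.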

2.2.2 Vestigial effects of DEM resolution

In addition to the hydraulic conditioning process for stream network derivation, the grid representation of elevation introduces vestigial artifacts in the representation of basins and, consequently, catchment attribute estimates.

The stream network derived from the DEM does not capture permanent water bodies, resulting in spurious river confluences. These vestigial confluences were excluded by using the lakes geometry layer from HydroBASINS as a mask, as described in Sect. 2.1.1. Since HydroBASINS is derived from different sources, its hydrographic features do not align exactly with the stream network we derived from the 1 arcsec DEM.

The disk space required to store a polygon is a linear function of the number of vertices defining it and the precision of geographic coordinates describing the geometry. The sub-basin polygons are simplified (using the Shapely library (Gillies, 2021) “simplify” function) using a tolerance equal to one-half the diagonal length of the raster pixel resolution. Simplifying (or smoothing) polygons represents a trade-off between reducing the disk and bandwidth required to store and transmit large sets of geometries and the representativeness of attributes that are captured by intersecting each polygon with the various geospatial raster layers. The effect of polygon simplification is discussed in more detail in Sect. 2.2.3.
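As a rough illustration of the simplification tolerance, the sketch below computes one-half of the pixel diagonal and passes it to a polygon's `simplify` method. It assumes the pixel size is given in the same projected linear units as the polygon coordinates, and the `preserve_topology=True` setting is an assumption rather than a documented choice of the authors. The helper names are hypothetical.

```python
import math


def simplify_tolerance(pixel_width, pixel_height=None):
    """Half the diagonal length of one raster pixel (same units as the input)."""
    if pixel_height is None:
        pixel_height = pixel_width  # square pixels
    return 0.5 * math.hypot(pixel_width, pixel_height)


def simplify_basin(polygon, pixel_width, pixel_height=None):
    """Simplify a Shapely-style polygon using the half-diagonal tolerance.

    `polygon` is expected to expose Shapely's
    `simplify(tolerance, preserve_topology=...)` method.
    """
    tolerance = simplify_tolerance(pixel_width, pixel_height)
    return polygon.simplify(tolerance, preserve_topology=True)


# For a nominal 30 m square pixel the tolerance is about 21.2 m.
tol_30m = simplify_tolerance(30.0)
```

For a nominal 90 m pixel the same rule gives a tolerance of roughly 63.6 m, so the simplification scales with the raster resolution rather than with catchment size.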

The set of raster pixels representing each sub-basin is captured using the “crop-to-cutline” function from the open-source GDAL library (Rouault et al., 2023), which by default captures pixels whose centroid lies within the polygon (pixels are not points but quadrilaterals). Alternatively, the larger set of intersecting pixels can be selected by setting the “CUTLINE_ALL_TOUCHED=TRUE” keyword argument. As drainage area decreases (or as the raster becomes coarser), the edge pixels affected by the selection method represent an increasing proportion of the total, which may yield significant differences in attribute values depending upon the clipping method used. Figure 6 shows that the proportion of edge pixels increases with decreasing catchment area, and it compares the USGS 3DEP DEM (30 m grid at the Equator) with the EarthEnv DEM90 (EENV) DEM (90 m grid at the Equator) to show how this proportion changes with DEM resolution. The purpose of this exercise is to highlight one source of uncertainty introduced by the data processing methodology and to demonstrate the effect of the clipping method as a function of catchment scale.
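The scale dependence of the two pixel selection rules can be illustrated without any geospatial library. The sketch below is a toy model, not the procedure used for Fig. 6: it assumes an idealized circular catchment on a square grid and defines the edge fraction as the share of all touched pixels whose centre falls outside the polygon. Both the function names and that definition are assumptions made for illustration.

```python
import math


def pixel_counts(area_km2, pixel_m):
    """Count pixels selected for a circular catchment under two clipping rules.

    Returns (centre_in, touched): pixels whose centre lies inside the circle,
    and pixels that intersect the circle at all.
    """
    radius = math.sqrt(area_km2 * 1e6 / math.pi)  # metres
    r2 = radius * radius
    n = int(math.ceil(radius / pixel_m)) + 1  # half-width of the search window
    centre_in = touched = 0
    for i in range(-n, n):
        for j in range(-n, n):
            # Pixel (i, j) spans [x0, x1] x [y0, y1]; circle centred on origin.
            x0, x1 = i * pixel_m, (i + 1) * pixel_m
            y0, y1 = j * pixel_m, (j + 1) * pixel_m
            cx, cy = 0.5 * (x0 + x1), 0.5 * (y0 + y1)
            if cx * cx + cy * cy <= r2:
                centre_in += 1
            # Nearest point of the pixel to the circle centre.
            nx = min(max(0.0, x0), x1)
            ny = min(max(0.0, y0), y1)
            if nx * nx + ny * ny <= r2:
                touched += 1
    return centre_in, touched


def edge_fraction(area_km2, pixel_m):
    """Share of touched pixels that the centre-in-polygon rule would drop."""
    centre_in, touched = pixel_counts(area_km2, pixel_m)
    return (touched - centre_in) / touched


small_fine = edge_fraction(1.0, 30.0)     # 1 km² catchment, 30 m pixels
small_coarse = edge_fraction(1.0, 90.0)   # same catchment, 90 m pixels
large_fine = edge_fraction(100.0, 30.0)   # 100 km² catchment, 30 m pixels
```

In this toy model the edge fraction grows both when the catchment shrinks and when the grid coarsens, which is the qualitative behaviour described above. Real catchment boundaries are far less compact than a circle, so actual edge fractions would be expected to be larger.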

Figure 6. (a) The proportion of edge pixels increases as drainage area decreases, with points showing bin medians (equiprobable binning, N≈600 samples per bin) and whiskers representing the 90 % confidence interval. (b) The higher-resolution DEM reveals greater topographic relief, illustrated by comparing mean slope distributions from 30 m (USGS 3DEP) and 90 m (EarthEnv) DEMs across 10 000 randomly sampled basins.

Mean slope is a widely used attribute in large-sample hydrology (Addor et al., 2017; Alvarez-Garreton et al., 2018) to describe the degree of topographic relief of a catchment, defined in Arsenault et al. (2020) as “the average slope when considering the individual elevation differences between tiles” (raster pixels). We used WhiteboxTools to compute the slope of each DEM pixel using a third-order Taylor polynomial fit (Florinsky, 2016) with a kernel size of 5×5 pixels. Mean catchment slope increases with increasing resolution because topographic relief is better captured at higher resolution (Zhang and Montgomery, 1994). Figure 6 compares the mean slope between 30 and 90 m resolution DEM sources, where the higher-resolution DEM is able to resolve greater topographic detail. The comparison is based on a random sample of roughly 10 000 polygons in the BCUB dataset ranging in size from 1 km² to 2×10⁵ km². The sample of sub-basins in Fig. 6 shows a bias toward lower calculated mean slope from the lower-resolution DEM source using the same polygon mask to capture pixels. Further interpretation of these differences is left to future work.

2.2.3 Catchment attributes and self-similarity

Mandelbrot (1967) described the measurement of coastline length as a function of the scale of observation, and the lines describing features like catchment boundaries and stream networks also exhibit self-similarity. Perimeter, stream gradient, and shape factors like elongation or compactness are length-based attributes used in many LSH datasets (Arsenault et al., 2020; Klingler et al., 2021; Kratzert et al., 2023). The compactness coefficient is defined as the ratio of polygon perimeter to the circumference of a circle with equal area (Gravelius, 1914, as cited in Sassolas-Serrayet et al., 2018). Length-based attributes are not comparable without consistent input DEM resolution and data pre-processing.
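The compactness coefficient defined above follows directly from the perimeter and area, since a circle of area $A$ has circumference $2\sqrt{\pi A}$. The sketch below is a minimal illustration with a hypothetical function name. It assumes the perimeter and area are expressed in consistent units (for example km and km²).

```python
import math


def compactness_coefficient(perimeter, area):
    """Ratio of polygon perimeter to the circumference of an equal-area circle."""
    if area <= 0:
        raise ValueError("area must be positive")
    return perimeter / (2.0 * math.sqrt(math.pi * area))


# A circle of radius 1 is perfectly compact (coefficient of 1).
circle = compactness_coefficient(2.0 * math.pi, math.pi)
# A square with side 2 has a coefficient of 2 / sqrt(pi), about 1.128.
square = compactness_coefficient(8.0, 4.0)
```

Because the coefficient is linear in the perimeter, any resolution-driven change in measured perimeter at constant area carries through to the coefficient in the same proportion.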

Figure 7. An example edge detail of the same catchment boundary from three different sources where the intersecting area is over 98 % of the published value. The HYSETS dataset polygon (dashed–dotted black line) comes from an earlier revision published by the WSC representing the Kiskatinaw River near Farmington (WSC ID 07FD001), while a recent revision (July 2022) by the WSC (solid blue) shows a distinct difference in polygon edges. The polygon from the BCUB (dashed orange) derived from USGS 3DEP DEM is different from both. Key map and inset map © OpenStreetMap (https://www.openstreetmap.org/copyright) contributors 2024. Distributed under the Open Data Commons Open Database License (ODbL) v1.0.

Figure 8. Example visualization using the BCUB dataset to map the 2010–2020 change in forest cover (as a percentage of the catchment area) for catchments with drainage area between 20 and 25 km² on Vancouver Island (VCI) according to the NALCMS dataset (Latifovic et al., 2010). The number (N) of catchments in each category is indicated in the legend. Map tiles © Stamen Design 2024 (http://stamen.com/, last access: 29 November 2024) distributed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0, last access: 29 November 2024). Basemap data © OpenStreetMap contributors 2024 (http://openstreetmap.org/, last access: 29 November 2024). Distributed under the Open Data Commons Open Database License (ODbL) v1.0 (http://www.openstreetmap.org/copyright, last access: 29 November 2024).

Figure 9. Example visualization using the BCUB dataset to map the 2010–2020 change in perennial snow and ice cover (as a percentage of the catchment area) for catchments between 400 and 500 km² with ice cover in 2010 according to the NALCMS dataset (Latifovic et al., 2010). The number (N) of catchments in each category is indicated in the legend. Map tiles © Stamen Design 2024 (http://stamen.com/) distributed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0). Basemap data © OpenStreetMap contributors 2024 (http://openstreetmap.org/). Distributed under the Open Data Commons Open Database License (ODbL) v1.0 (http://www.openstreetmap.org/copyright).

Table 4. Summary of data repository contents.

In July 2022 the Water Survey of Canada (WSC) published updated catchment boundaries representing the majority of the streamflow monitoring network. These updated geometries can be accessed at the WSC National Water Data Archive (https://collaboration.cmc.ec.gc.ca/cmc/hydrometrics/www/HydrometricNetworkBasinPolygons/, last access: 5 September 2024). The README from this repository indicates that the input DEM used for catchment delineation is 30 m. We found all polygons common to both the HYSETS dataset and this revised polygon set and computed pairwise comparisons of perimeter and area. The sample used for the comparison includes 715 sub-basins where the original and updated polygons were a close match to control for significant changes in the polygon shape. A close match is defined as the ratio of intersecting area to union area (Jaccard similarity index) ≥95 %. There were 1035 sub-basin polygon revisions that did not meet the similarity criteria, reflecting the difficulty in retrospectively defining streamflow monitoring station locations from historical records (Arsenault et al., 2020).
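The close-match criterion can be sketched as follows. To keep the example free of external dependencies, axis-aligned rectangles stand in for catchment polygons. The `Rect` class and function names are hypothetical, and this is not the comparison code from the repository. The `jaccard_index` function only relies on an `area` attribute and an `intersection` method, which Shapely polygons also provide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle standing in for a catchment polygon."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    @property
    def area(self):
        # Degenerate (non-overlapping) rectangles have zero area.
        return max(0.0, self.xmax - self.xmin) * max(0.0, self.ymax - self.ymin)

    def intersection(self, other):
        return Rect(max(self.xmin, other.xmin), max(self.ymin, other.ymin),
                    min(self.xmax, other.xmax), min(self.ymax, other.ymax))


def jaccard_index(a, b):
    """Ratio of intersecting area to union area of two polygons."""
    inter = a.intersection(b).area
    union = a.area + b.area - inter
    return inter / union if union > 0 else 0.0


def is_close_match(a, b, threshold=0.95):
    """Apply the 95 % Jaccard similarity criterion."""
    return jaccard_index(a, b) >= threshold


original = Rect(0.0, 0.0, 10.0, 10.0)
revised = Rect(0.0, 0.0, 10.0, 9.8)       # small boundary revision
relocated = Rect(0.0, 0.0, 10.0, 9.0)     # larger change in shape

close = is_close_match(original, revised)
not_close = is_close_match(original, relocated)
```

Restricting the comparison to pairs that pass this test keeps the perimeter and area statistics from being dominated by polygons whose outlet location or shape changed substantially between revisions.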

For the revised WSC polygons based on the higher-resolution DEM, the catchment perimeter, expressed as a percent change, increased significantly ($\mu = 35$, $\sigma = 9$), while the area remained stable ($\mu = 0$, $\sigma = 1$). This difference highlights the need to ensure a consistent input DEM and data processing methodology if length-based attributes are included in attribute datasets. The BCUB dataset does not include a perimeter attribute, but it can be computed from the catchment polygon geometry.

Average stream gradient is a length-based attribute that is a function of both raster resolution and the assumed location of the channel head, which is usually set by a minimum drainage area threshold. Robinson et al. (2014) calculated the mean stream gradient as the ratio of the maximum total elevation change in the basin stream network to the length of the corresponding river reach. Stream length is a function of DEM resolution, and the length of reach is measured from the catchment outlet to an uncertain headwater location (Hafen et al., 2020, 2022). In the derivation of the stream network for the BCUB dataset, headwater locations are simply a vestige of the assumed minimum drainage area threshold, and as a result, an attribute representing average stream gradient is not included in the BCUB database.

3 Code and data availability

The BCUB dataset (Kovacek and Weijs, 2023) is accessible under a Creative Commons BY 4.0 license through the Borealis data repository at https://doi.org/10.5683/SP3/JNKZVT. A summary of the dataset contents and supporting information is presented in Table 4. The sub-basin polygon geometries are provided in the open-source, cross-language Apache Parquet format (https://parquet.apache.org/, last access: 24 June 2024), which has the convenience of supporting multiple geometries. The Parquet file format is supported by several widely used Python libraries, including Dask (https://docs.dask.org/, last access: 24 June 2024) and GeoPandas (https://geopandas.org/, last access: 20 January 2024), and the Arrow package features an interface for the R programming language (https://arrow.apache.org/docs/r/, last access: 24 June 2024). The Dask–GeoPandas library in Python (https://dask-geopandas.readthedocs.io/, last access: 24 June 2024) is recommended for performance with large datasets.

The catchment attributes are provided in two forms in the Borealis data repository. The larger form includes catchment boundary, centroid, and pour point geometries. These are saved in the Parquet file format under the “basin_polygons” folder (select the “tree” view for easier navigation). The Parquet file naming convention follows the sub-region codes shown in Sect. 3. A “light” format without geometries is provided in comma-delimited format in BCUB_attributes_20240630.csv. Sub-region geometries with their associated codes are provided for reference in BCUB_regions_4326.geojson (https://borealisdata.ca/file.xhtml?fileId=685474&version=2.0, last access: 10 July 2024). Metadata describing the dataset are provided in MetaData.pdf (https://borealisdata.ca/file.xhtml?fileId=685560&version=2.0, last access: 10 July 2024), and additional sub-basin attribute information, including descriptions and sources, is provided in the Readme.pdf (https://borealisdata.ca/file.xhtml?fileId=685561&version=2.0, last access: 10 July 2024).

The scripts used to derive the dataset and the validation results and figures shown in this paper are provided in an open-source GitHub repository (https://github.com/dankovacek/bcub, last access: 10 October 2024; DOI: https://doi.org/10.5281/zenodo.14708323, Kovacek, 2025a). The code to replicate the figures in Sect. 2.2 is provided in the “validation” folder of the repository. Figures 1, 3, 4, and 7 to 9 were prepared with the QGIS software (https://doi.org/10.5281/zenodo.5869838, QGIS Contributors, 2022), and all remaining figures were created using the Bokeh data visualization library (https://bokeh.org, Bokeh Development Team, 2023) in Python.

Interactive computing environment (ICE)

An example guide is provided (https://dankovacek.github.io/bcub_demo/0_intro.html, last access: 10 October 2024; DOI: https://doi.org/10.5281/zenodo.14709179, Kovacek, 2025b) through a set of Jupyter (Kluyver et al., 2016) Notebooks to demonstrate the complete process of data retrieval, pre-processing, sub-basin delineation, attribute extraction, data product usage, and plotting.

4 Usage notes

We hope that the BCUB dataset will serve a wide range of water resource research and practice where catchment-based attributes are integral to the methodology or, perhaps more importantly, where they help to express the limits of appropriate use and interpretation. Figures 8 and 9 provide two basic examples of the kind of sub-basin level querying the BCUB is designed to support. Figure 8 shows catchment-level changes in forest cover between 2010 and 2020 for Vancouver Island catchments in the range of 20 to 25 km², and Fig. 9 shows the change in perennial ice cover over the same period for catchments in all regions between 400 and 500 km² with ice cover in 2010. The changes in both figures are expressed as a percentage of the catchment area.
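A query of the kind behind Fig. 8 can be sketched against the attribute table using only the Python standard library. The column names (`id`, `region_code`, `drainage_area_km2`, `forest_pct_2010`, `forest_pct_2020`), the sample rows, and the `FRA` region code are all hypothetical placeholders. The actual attribute names are documented in the Readme.pdf of the data repository and should be substituted before use.

```python
import csv
import io


def forest_cover_change(rows, region, min_area_km2, max_area_km2):
    """Return (catchment id, change in forest cover) for matching catchments.

    The change is in percentage points of catchment area (2020 minus 2010).
    """
    selected = []
    for row in rows:
        area = float(row["drainage_area_km2"])
        if row["region_code"] != region:
            continue
        if not (min_area_km2 <= area <= max_area_km2):
            continue
        change = float(row["forest_pct_2020"]) - float(row["forest_pct_2010"])
        selected.append((row["id"], change))
    return selected


# Made-up rows for illustration; these are not values from the BCUB dataset.
SAMPLE_CSV = """id,region_code,drainage_area_km2,forest_pct_2010,forest_pct_2020
a1,VCI,22.5,80,74
a2,VCI,310.0,65,66
a3,FRA,21.0,55,50
"""

rows = csv.DictReader(io.StringIO(SAMPLE_CSV))
result = forest_cover_change(rows, "VCI", 20.0, 25.0)
```

With the published file, the in-memory sample would be replaced by an open handle to BCUB_attributes_20240630.csv. For the geometry-bearing Parquet files a library such as GeoPandas would be the more natural choice.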

Stream networks are unique to the input DEM, and they are affected by the choice of pre-processing steps. The greatest degree of uncertainty is associated with the smallest catchments with the lowest topographic relief. Zhang and Montgomery (1994) provide guidance about interpreting features at scales relative to DEM resolution. The representativeness of stream networks, and by extension the attributes captured by polygon masks generated from stream networks, is an important component of uncertainty analysis and data reliability assessment. This aspect of the analysis is left to future work that the BCUB dataset is designed to support, in particular the lower limit of basin scale that can be supported by a 1 arcsec DEM.

Author contributions

DK built the dataset, crafted the Jupyter Notebook tutorials, and wrote the manuscript. SW provided research supervision and manuscript review.

Competing interests

The contact author has declared that neither of the authors has any competing interests.

Disclaimer

Publisher’s note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors.

Acknowledgements

The authors wish to thank the editor and the three anonymous reviewers of this work for their thoughtful feedback, which significantly improved the manuscript. We are also grateful to Richard Arsenault for his responsiveness to our questions, which helped guide this work. Finally, we extend our gratitude to all those who contribute to open-source scientific software.

Financial support

This research has been supported by the British Columbia Ministry of Environment and Climate Change Strategy (grant no. TP23EPEMA0031MY).

Review statement

This paper was edited by Kang Yang and reviewed by three anonymous referees.

References

Addor, N., Newman, A. J., Mizukami, N., and Clark, M. P.: The CAMELS data set: catchment attributes and meteorology for large-sample studies, Hydrol. Earth Syst. Sci., 21, 5293–5313, https://doi.org/10.5194/hess-21-5293-2017, 2017.

Addor, N., Do, H. X., Alvarez-Garreton, C., Coxon, G., Fowler, K., and Mendoza, P. A.: Large-sample hydrology: recent progress, guidelines for new datasets and grand challenges, Hydrolog. Sci. J., 65, 712–725, 2020.

Alvarez-Garreton, C., Mendoza, P. A., Boisier, J. P., Addor, N., Galleguillos, M., Zambrano-Bigiarini, M., Lara, A., Puelma, C., Cortes, G., Garreaud, R., McPhee, J., and Ayala, A.: The CAMELS-CL dataset: catchment attributes and meteorology for large sample studies – Chile dataset, Hydrol. Earth Syst. Sci., 22, 5817–5846, https://doi.org/10.5194/hess-22-5817-2018, 2018.

Arsenault, R., Brissette, F., Martel, J.-L., Troin, M., Lévesque, G., Davidson-Chaput, J., Gonzalez, M. C., Ameli, A., and Poulin, A.: A comprehensive, multisource database for hydrometeorological modeling of 14,425 North American watersheds, Scientific Data, 7, 243, https://doi.org/10.1038/s41597-020-00583-2, 2020.

Beven, K. J.: Uniqueness of place and process representations in hydrological modelling, Hydrol. Earth Syst. Sci., 4, 203–213, https://doi.org/10.5194/hess-4-203-2000, 2000.

Bokeh Development Team: Bokeh: Python library for interactive visualization, https://bokeh.org (last access: 1 December 2024), 2023.

Chagas, V. B. P., Chaffe, P. L. B., Addor, N., Fan, F. M., Fleischmann, A. S., Paiva, R. C. D., and Siqueira, V. A.: CAMELS-BR: hydrometeorological time series and landscape attributes for 897 catchments in Brazil, Earth Syst. Sci. Data, 12, 2075–2096, https://doi.org/10.5194/essd-12-2075-2020, 2020.

Coulibaly, P., Samuel, J., Pietroniro, A., and Harvey, D.: Evaluation of Canadian National Hydrometric Network density based on WMO 2008 standards, Can. Water Resour. J., 38, 159–167, 2013.

Coxon, G., Addor, N., Bloomfield, J. P., Freer, J., Fry, M., Hannaford, J., Howden, N. J. K., Lane, R., Lewis, M., Robinson, E. L., Wagener, T., and Woods, R.: CAMELS-GB: hydrometeorological time series and landscape attributes for 671 catchments in Great Britain, Earth Syst. Sci. Data, 12, 2459–2483, https://doi.org/10.5194/essd-12-2459-2020, 2020.

Daigle, A., Caudron, A., Vigier, L., and Pella, H.: Optimization methodology for a river temperature monitoring network for the characterization of fish thermal habitat, Hydrolog. Sci. J., 62, 483–497, 2017.

Datta, S., Karmakar, S., Mezbahuddin, S., Hossain, M. M., Chaudhary, B. S., Hoque, M. E., Abdullah Al Mamun, M., and Baul, T. K.: The limits of watershed delineation: implications of different DEMs, DEM resolutions, and area threshold values, Hydrol. Res., 53, 1047–1062, 2022.

Do, H. X., Gudmundsson, L., Leonard, M., and Westra, S.: The Global Streamflow Indices and Metadata Archive (GSIM) – Part 1: The production of a daily streamflow archive and metadata, Earth Syst. Sci. Data, 10, 765–785, https://doi.org/10.5194/essd-10-765-2018, 2018.

Duan, Q., Schaake, J., Andréassian, V., Franks, S., Goteti, G., Gupta, H., Gusev, Y., Habets, F., Hall, A., Hay, L., Hogue, T., Huang, M., Leavesley, G., Liang, X., Nasonova, O. N., Noilhan, J., Oudin, L., Sorooshian, S., Wagener, T., and Wood, E. F.: Model Parameter Estimation Experiment (MOPEX): An overview of science strategy and major results from the second and third workshops, J. Hydrol., 320, 3–17, 2006.

Falcone, J. A.: GAGES-II: Geospatial attributes of gages for evaluating streamflow, Tech. Rep., US Geological Survey, https://doi.org/10.3133/70046617, 2011.

Florinsky, I.: Digital terrain analysis in soil science and geology, Academic Press, ISBN 0-12-804632-5, 978-0-12-804632-6, 2016.

Fowler, K. J. A., Acharya, S. C., Addor, N., Chou, C., and Peel, M. C.: CAMELS-AUS: hydrometeorological time series and landscape attributes for 222 catchments in Australia, Earth Syst. Sci. Data, 13, 3847–3867, https://doi.org/10.5194/essd-13-3847-2021, 2021.

Gauch, M., Mai, J., and Lin, J.: The proper care and feeding of CAMELS: How limited training data affects streamflow prediction, Environ. Modell. Softw., 135, 104926, https://doi.org/10.1016/j.envsoft.2020.104926, 2021.

Geobase: National Hydro Network Data Production Catalogue, https://open.canada.ca/data/en/dataset/a4b190fe-e090-4e6d-881e-b87956c07977 (last access: 20 January 2025), 2004.

Gillies, S.: Shapely: Manipulation and analysis of geometric objects, GitHub [code], https://github.com/Toblerity/Shapely (last access: 19 August 2024), 2021.

Gorelick, N., Hancher, M., Dixon, M., Ilyushchenko, S., Thau, D., and Moore, R.: Google Earth Engine: Planetary-scale geospatial analysis for everyone, Remote Sens. Environ., 202, 18–27, 2017.

Gravelius, H.: Grundriß der gesamten Gewässerkunde. Band I: Flußkunde, Compendium of Hydrology, vol. I. Rivers, Göschen, Berlin, 1914 (in German).

Gray, M.: Freshwater water atlas user guide, GeoBC Integrated Land Management Bureau, Victoria, BC, https://www2.gov.bc.ca/gov/content/data/geographic-data-services/topographic-data/freshwater (last access: 20 January 2025), 2010.

Guth, P. L.: Drainage basin morphometry: a global snapshot from the shuttle radar topography mission, Hydrol. Earth Syst. Sci., 15, 2091–2099, https://doi.org/10.5194/hess-15-2091-2011, 2011.

Hafen, K. C., Blasch, K. W., Rea, A., Sando, R., and Gessler, P. E.: The influence of climate variability on the accuracy of NHD perennial and nonperennial stream classifications, J. Am. Water Resour. As., 56, 903–916, https://doi.org/10.1111/1752-1688.12871, 2020.

Hafen, K. C., Blasch, K. W., Gessler, P. E., Sando, R., and Rea, A.: Precision of headwater stream permanence estimates from a monthly water balance model in the Pacific Northwest, USA, Water, 14, 895, https://doi.org/10.3390/w14060895, 2022.

Hershfield, D. M.: On the spacing of raingages, in: Proceedings of the WMO/IASH Symposium on Design of Hydrometerologic Networks, Int. Assoc. Sci. Hydrol. Publ., 67, 72–79, 1965.

Hrachowitz, M., Savenije, H., Blöschl, G., McDonnell, J., Sivapalan, M., Pomeroy, J., Arheimer, B., Blume, T., Clark, M., Ehret, U., Fenicia, F., Freer, J. E., Gelfan, A., Gupta, H. V., Hughes, D. A., Hut, R. W., Montanari, A., Pande, S., Tetzlaff, D., Troch, P. A., Uhlenbrook, S., Wagener, T., Winsemius, H. C., Woods, R. A., Zehe, E., and Cudennec, C.: A decade of Predictions in Ungauged Basins (PUB) – a review, Hydrolog. Sci. J., 58, 1198–1255, 2013.

Huscroft, J., Gleeson, T., Hartmann, J., and Börker, J.: Compiling and mapping global permeability of the unconsolidated and consolidated Earth: GLobal HYdrogeology MaPS 2.0 (GLHYMPS 2.0), Geophys. Res. Lett., 45, 1897–1904, 2018.

Klingler, C., Schulz, K., and Herrnegger, M.: LamaH-CE: LArge-SaMple DAta for Hydrology and Environmental Sciences for Central Europe, Earth Syst. Sci. Data, 13, 4529–4565, https://doi.org/10.5194/essd-13-4529-2021, 2021.

Kluyver, T., Ragan-Kelley, B., Pérez, F., Granger, B. E., Bussonnier, M., Frederic, J., Kelley, K., Hamrick, J. B., Grout, J., Corlay, S., Ivanov, P., Avila, D., Abdalla, S., and Willing, C.: Jupyter Notebooks – a publishing format for reproducible computational workflows, Elpub, 2016, 87–90, 2016.

Kovacek, D.: British Columbia Ungauged Basin – Replication Code Repository, v0.1.0-beta, Zenodo [code], https://doi.org/10.5281/zenodo.14708323, 2025a.

Kovacek, D.: BCUB Data Pipeline Jupyter Notebook Tutorials v0.1.0-beta, Zenodo [code], https://doi.org/10.5281/zenodo.14709179, 2025b.

Kovacek, D. and Weijs, S.: British Columbia Ungauged Basins Dataset, Borealis [data set], https://doi.org/10.5683/SP3/JNKZVT, 2023.

Kratzert, F., Klotz, D., Herrnegger, M., Sampson, A. K., Hochreiter, S., and Nearing, G. S.: Toward improved predictions in ungauged basins: Exploiting the power of machine learning, Water Resour. Res., 55, 11344–11354, 2019.

Kratzert, F., Nearing, G., Addor, N., Erickson, T., Gauch, M., Gilon, O., Gudmundsson, L., Hassidim, A., Klotz, D., Nevo, S., Shalev, G., and Matias, Y.: Caravan – A global community dataset for large-sample hydrology, Scientific Data, 10, 61, https://doi.org/10.1038/s41597-023-01975-w, 2023.

Krause, A., Guestrin, C., Gupta, A., and Kleinberg, J.: Near-optimal sensor placements: Maximizing information while minimizing communication cost, in: Proceedings of the 5th international conference on Information processing in sensor networks, Nashville, TN, USA, 19–21 April 2006, https://doi.org/10.1145/1127777.1127782, 2–10, 2006.

Latifovic, R., Homer, C., Ressl, R., Pouliot, D., Hossain, S., Colditz, R., Olthof, I., Giri, C., and Victoria, A.: North American land change monitoring system (NALCMS), in: Remote sensing of land use and land cover: principles and applications, CRC Press, Boca Raton, https://doi.org/10.1201/b11964-24, 2010.

Lehner, B., Roth, A., Huber, M., Anand, M., Grill, G., Osterkamp, N., Tubbesing, R., Warmedinger, L., and Thieme, M.: HydroSHEDS v2.0 – Refined global river network and catchment delineations from TanDEM-X elevation data, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-9277, https://doi.org/10.5194/egusphere-egu21-9277, 2021.

Li, X., Khandelwal, A., Jia, X., Cutler, K., Ghosh, R., Renganathan, A., Xu, S., Tayal, K., Nieber, J., Duffy, C., Steinbach, M., and Kumar, V.: Regionalization in a global hydrologic deep learning model: from physical descriptors to random vectors, Water Resour. Res., 58, e2021WR031794, https://doi.org/10.1029/2021WR031794, 2022.

Lindsay, J. B.: Whitebox GAT: A case study in geomorphometric analysis, Comput. Geosci., 95, 75–84, 2016.

Mandelbrot, B.: How long is the coast of Britain? Statistical self-similarity and fractional dimension, Science, 156, 636–638, 1967.

Mishra, A. K. and Coulibaly, P.: Hydrometric network evaluation for Canadian watersheds, J. Hydrol., 380, 420–437, 2010.

Mutzner, R., Tarolli, P., Sofia, G., Parlange, M. B., and Rinaldo, A.: Field study on drainage densities and rescaled width functions in a high-altitude alpine catchment, Hydrol. Process., 30, 2138–2152, 2016.

PostGIS Project: PostGIS 3.5.2 Manual, https://postgis.net/docs/manual-3.5/ (last access: 26 September 2024), 2018.

QGIS Contributors: QGIS (3.22.3), Zenodo [code], https://doi.org/10.5281/zenodo.5869838, 2022.

Robinson, N., Regetz, J., and Guralnick, R. P.: EarthEnv-DEM90: A nearly-global, void-free, multi-scale smoothed, 90m digital elevation model from fused ASTER and SRTM data, ISPRS J. Photogramm., 87, 57–67, 2014.

Rouault, E., Warmerdam, F., Schwehr, K., Kiselev, A., Butler, H., Łoskot, M., Szekeres, T., Tourigny, E., Landa, M., Miara, I., Elliston, B., Chaitanya, K., Plesea, L., Morissette, D., Jolma, A., Dawson, N., Baston, D., de Stigter, C., and Miura, H.: GDAL (v3.10.0), Zenodo [code], https://doi.org/10.5281/zenodo.5884351, 2023.

Rouhani, S.: Variance reduction analysis, Water Resour. Res., 21, 837–846, 1985.

Sassolas-Serrayet, T., Cattin, R., and Ferry, M.: The shape of watersheds, Nat. Commun., 9, 3791, https://doi.org/10.1038/s41467-018-06210-4, 2018.

Shavers, E. and Stanislawski, L. V.: Channel cross-section analysis for automated stream head identification, Environ. Modell. Softw., 132, 104809, https://doi.org/10.1016/j.envsoft.2020.104809, 2020.

Tarolli, P. and Dalla Fontana, G.: Hillslope-to-valley transition morphology: New opportunities from high resolution DTMs, Geomorphology, 113, 47–56, 2009.

Thornton, M., Shrestha, R., Wei, Y., Thornton, P., Kao, S., and Wilson, B.: Daymet: Monthly Climate Summaries on a 1-km Grid for North America, Version 4 R1, ORNL DAAC, Oak Ridge, Tennessee, USA [data set], https://doi.org/10.3334/ORNLDAAC/2131, 2022.

Thornton, P. E., Shrestha, R., Thornton, M., Kao, S.-C., Wei, Y., and Wilson, B. E.: Gridded daily weather data for North America with comprehensive uncertainty quantification, Scientific Data, 8, 190, https://doi.org/10.1038/s41597-021-00973-0, 2021.

U.S. Geological Survey: 1 Arc-second Digital Elevation Models (DEMs) USGS National Map 3D Elevation Program, U.S. Geological Survey [data set], https://data.usgs.gov/datacatalog/data/USGS:35f9c4d4-b113-4c8d-8691-47c428c29a5b, last access: 3 March 2024.

Werstuck, C. and Coulibaly, P.: Hydrometric network design using dual entropy multi-objective optimization in the Ottawa River Basin, Hydrol. Res., 48, 1639–1651, 2017.

Werstuck, C. and Coulibaly, P.: Assessing Spatial Scale Effects on Hydrometric Network Design Using Entropy and Multi-objective Methods, J. Am. Water Resour. As., 54, 275–286, 2018.

Wickel, B., Lehner, B., and Sindorf, N.: HydroSHEDS: A global comprehensive hydrographic dataset, in: AGU Fall Meeting Abstracts, vol. 2007, H11H–05, 2007.

Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J., et al.: The FAIR Guiding Principles for scientific data management and stewardship, Scientific Data, 3, 1–9, https://doi.org/10.1038/sdata.2016.18, 2016.

Woodrow, K., Lindsay, J. B., and Berg, A. A.: Evaluating DEM conditioning techniques, elevation source data, and grid resolution for field-scale hydrological parameter extraction, J. Hydrol., 540, 1022–1029, 2016.

Zhang, J., Condon, L. E., Tran, H., and Maxwell, R. M.: A national topographic dataset for hydrological modeling over the contiguous United States, Earth Syst. Sci. Data, 13, 3263–3279, https://doi.org/10.5194/essd-13-3263-2021, 2021.

Zhang, W. and Montgomery, D. R.: Digital elevation model grid size, landscape representation, and hydrologic simulations, Water Resour. Res., 30, 1019–1028, 1994.


Short summary

We made a dataset for British Columbia describing the terrain, soil, land cover, and climate of over 1 million watersheds. The attributes are often used in hydrology because they are related to the water cycle. The data are meant to be used for water resources problems that can benefit from lots of watersheds and their attributes. The data and instructions needed to build the dataset from scratch are freely available. The permanent home for the data is https://doi.org/10.5683/SP3/JNKZVT.