GeoDAR: Georeferenced global dam and reservoir dataset for bridging attributes and geolocations

Dams and reservoirs are among the most widespread human-made infrastructure on Earth. Despite their societal and environmental significance, spatial inventories of dams and reservoirs, even for the large ones, are insufficient. A dilemma of the existing georeferenced dam datasets is the polarized focus on either dam quantity and spatial coverage (e.g., GOODD) or detailed attributes for a limited dam quantity or region (e.g., GRanD and national inventories). One of the most comprehensive datasets, the World Register of Dams (WRD) maintained by the International Commission on Large Dams (ICOLD), documents nearly 60,000 dams with an extensive suite of attributes. Unfortunately, the WRD records provide no geographic coordinates, limiting the benefits of their attributes for spatially explicit applications. To bridge the gap between attribute accessibility and spatial explicitness, we introduce the Georeferenced global Dam

The current attributes are limited to obstruction types such as locks, weirs, and multiple types of dams. In addition, GROD is tailored for the forthcoming Surface Water and Ocean Topography (SWOT) satellite mission which is designed to observe river reaches wider than 50-100 m (Biancamaria et al., 2016). While these rivers are sufficiently captured by GRWL, the obstruction infrastructure identified along the river mask in GRWL excludes many large dams on rivers narrower than 30 m.
In the US, for instance, there are at least 5170 NID-registered dams higher than 15 m (i.e., large dams according to ICOLD criteria), but fewer than 8% of these dams intersect with GRWL (i.e., are located on rivers wider than 30 m).
Among the few global dam/reservoir datasets that provide both georeferenced locations and essential attributes are the United Nations Food and Agricultural Organization (FAO) AQUASTAT (Li et al., 2011) and the Global Reservoir and Dam database (GRanD) (Lehner et al., 2011). GRanD was constructed by harmonizing AQUASTAT and a wide range of regional gazetteers and inventories. Its latest version, v1.3, contains 7320 dams as well as their reservoir boundaries and approximately 50 attributes, with a cumulative storage capacity of 6881 km³. Since its publication, GRanD has been applied extensively by a variety of studies, although its focus is on the world's largest dams (e.g., >0.1 km³) and its quantity (7320 dams) is a fraction of the 59,000 dams documented in WRD. A spatially resolved inclusion of additional large dams, such as those in compliance with the ICOLD definition, has been increasingly desired by the hydrology community and encouraged by growing collaborations from multiple disciplines such as biogeochemistry, ecology, energy planning, and infrastructure management (Belletti et al., 2020; Boulange et al., 2021; Grill et al., 2019; Lin et al., 2019; Wada et al., 2017).
Here, we present the initial versions of the Georeferenced global Dam And Reservoir dataset, or GeoDAR. We built GeoDAR by utilizing multi-source dam and reservoir inventories and the Google Maps geocoding API. Our goal is to tackle the limitations of existing datasets by offering a dam inventory that is spatially resolved and offers extended access to important attributes. As summarized in Table 1, our GeoDAR product includes two successive versions. GeoDAR v1.0 is essentially a georeferenced subset of ICOLD WRD. It contains nearly 23,000 dam points, each indexed by an identifier (ID) that is associated with a WRD record, allowing for the potential retrieval of all its 40+ proprietary attributes from ICOLD. GeoDAR v1.1 consists of a) nearly 25,000 dam points, which harmonized v1.0 and GRanD for an expanded inclusion of the largest dams, and b) the reservoir boundaries for most (86%) of the dam points. Due to geocoding challenges, GeoDAR v1.0 spatially resolved about 40% of the individual dams in WRD. However, these georeferenced locations were quality controlled, and after the harmonization with GRanD, v1.1 captures a total storage capacity of 7297 km³, a magnitude comparable to the full storage capacity of WRD. While GeoDAR v1.1 can be considered as a version that supersedes v1.0, the latter was georeferenced independently from GRanD, and we opted to release both versions so that users have the flexibility to choose whichever works better for their cases and potentially improve the harmonization.
For proprietary reasons, neither GeoDAR version releases any WRD attributes. Instead, we offer an option for users who need to acquire the attributes: upon individual request we may assist a user who has purchased WRD (https://www.icold-cigb.org/GB/world_register/world_register_of_dams.asp) to associate the GeoDAR ID with the ICOLD "international code", through which WRD attributes can be linked to each GeoDAR feature (see Sections 3.3 and 7 for more details). Even without the proprietary WRD attributes, GeoDAR offers one of the most extensive and spatially resolved global inventories of dams and reservoirs, which may benefit a variety of applications in hydrology, hydropower planning, and ecology.

Definitions and overview
We aim to georeference (i.e., acquire the latitude and longitude of) each dam listed in ICOLD WRD, by using the nominal location (i.e., descriptive information) available in the WRD attributes. Examples of the attributes that are important for georeferencing include the names of the dam and reservoir, the administrative divisions the dam is affiliated with, and the name of the impounded river. Using such attribute information, spatial coordinates of a dam may be either a) queried from an existing register or inventory where dam records were already georeferenced and verified, or b) estimated through a geocoding service that can convert nominal locations to numeric spatial coordinates. Our preference was the former when possible to optimize the georeferencing accuracy.
The schematic procedure of GeoDAR production is illustrated in Fig. 1. We started by removing duplicate records from the 59,071 dams listed in the original ICOLD WRD (accessed in March 2019). Here "duplicates" are defined as the dams that are either a) repeatedly recorded with identical (or highly similar) attribute information or b) different dam structures but associated with the same reservoir. Examples of the second scenario include a reservoir's primary dam and secondary dyke, such as the Boonton Dam and its associated Parsippany Dike (40.884° N, 74.408° W) in New Jersey, and multiple controls for one reservoir, such as Veersedam and Zandkreekdam for Veerse Meer (51.549° N, 3.678° E) in the Netherlands. Although "duplicates" in this scenario refer to different dam bodies, including them could lead to double or multiple counting of the storage capacity of the same reservoir. After removing the identified duplicates, the cleaned WRD contains 56,850 unique dams/reservoirs with a total water storage capacity of 7334 km³ (based on WRD attribute values). Unless otherwise described, the ICOLD WRD mentioned in the following text refers to the version after duplicate removal. We acknowledge that owing to the lack of explicit spatial information and occasional attribute errors in WRD, our duplicate removal is not perfect and may have misidentified or missed some duplicate dams.
We then compared the unique ICOLD WRD records against a collection of georeferenced dam registers we acquired from regional water authorities and agencies. When the attribute information of a WRD dam matched that in a regional register, the spatial coordinates from the latter were "borrowed" by the WRD record. We term this process "geo-matching", which resulted in the georeferencing of 13,301 WRD dams. For the remaining dams in WRD, we applied the alternative approach "geocoding", which transforms a nominal location (such as the dam or reservoir address formulated by ICOLD attribute information) to a pair of spatial coordinates. The tool we used to implement geocoding was the Google Maps geocoding API (http://developers.google.com/maps). The geocoding process successfully retrieved the spatial coordinates of another 9410 WRD dams. The combined output from both geo-matching and geocoding was next collated with the spatial coordinates and reservoir storage capacities of 133 WRD dams larger than 10 km³ as documented in Wada et al. (2017). These processes resulted in GeoDAR v1.0, a total of 22,743 georeferenced WRD dam points with a cumulative storage capacity of 6436 km³ (accounting for more than 80% of that in ICOLD WRD). The Venn diagram in Fig. 2a provides an overview of the logical relations among the georeferencing sources and methods for GeoDAR v1.0.
Figure 1. Schematic flowchart of GeoDAR production. Text in roman indicates applied or produced datasets, and text in italics indicates methods or procedures. The number noted by asterisk excludes 27 dams that are also in Wada et al. (2017).
To further improve our spatial inventory of the world's largest dams, we performed a harmonization between the dam points in GeoDAR v1.0 and GRanD v1.3. The harmonization aimed at merging both datasets, removing duplicates between them, and when possible, associating each new dam supplemented by GRanD with the corresponding WRD record. This process identified another 2235 dam points, including 1425 associated with WRD but not georeferenced in GeoDAR v1.0. With removal of duplicates, this harmonization led to a total number of 24,978 georeferenced dam points, with a cumulative storage capacity of 7297 km³ (comparable to that in the original WRD). An overview of this harmonization process is illustrated by the Venn diagram in Fig. 2b. Finally, the reservoir polygons for each of the georeferenced dams were retrieved as thoroughly as possible from three global water body datasets: GRanD v1.3 reservoirs (Lehner et al., 2011), HydroLAKES v1.0 (Messager et al., 2016), and the Landsat-based UCLA Circa-2015 Lake Inventory (Sheng et al., 2016). These nearly 25,000 dam points and their associated reservoir polygons constitute GeoDAR v1.1. Details of production processes and their Quality Assurance and Quality Control (QA/QC) are included in the following method sections.
Figure 2. (a) GeoDAR v1.0 and (b) GeoDAR v1.1 (dams only). Boxes indicate final subsets in each GeoDAR version, and the arrows point to the georeferencing sources or methods. Topology of the shapes illustrates logical relations among the data/methods, but the sizes of the shapes were not drawn to the scale of the data volume.

Geo-matching regional registers
The ICOLD WRD was a joint contribution from more than 100 member nations, some of which also release detailed and publicly accessible dam registers that have been georeferenced. These regional/local registers, with reliable spatial coordinates already provided for each dam, were our preferred sources for georeferencing WRD. Since this type of register is not available for most countries, we searched multiple water authority and project websites, and collected seven georeferenced regional registers or inventories that are open access. Their names, sources, and numbers of documented dams are summarized in Table 2.

Table 2. Regional registers or inventories for geo-matching and the validation of geocoding.
Columns: Region; Register/Source; Dam count.
(NRLD) of India, and Japan Dam Foundation (JDF). Regional inventories were collected with partial reference to the Global Dam Watch project (http://globaldamwatch.org). Dam numbers for regional registers are based on the records with valid geographic coordinates, and numbers for ICOLD WRD are based on the records after duplicate removal. See full registers, references, and download links in the reference list.
These seven registers/inventories cover Brazil, Canada, the United States, 31 European countries (including part of Russia), South Africa, and part of Southeast Asia (Cambodia and Myanmar), with a total dam count of more than 126,000. Besides spatial coordinates, each of these registers also provides attributes for their documented dams, which were required by the geo-matching process. While other dam inventories could be available, our geo-matching effort for GeoDAR v1.0 was focused on these collected ones. However, we referred to additional registers from China, India, and Japan (Table 2) for the validation of our WRD geocoding (see Validation). For these additional regional registers, it was either inconvenient to bulk-download the dam records, or we were legally restricted from releasing their dam coordinates. Therefore, we only used these registers for the purpose of validation.
The procedure of geo-matching is illustrated in Fig. 3. Given each regional register, our goal was to find its matching records from the subset of ICOLD WRD for the same region, by cross-checking value similarities for several key attributes between the two datasets. On one hand, the compared attributes must be mutually available in both datasets. On the other hand, the attributes should cover various themes so that in combination, they are able to disambiguate records that represent different dams but may coincide in certain attributes. Taking both requirements into account, the key attributes used include the dam and reservoir names, multiple levels of administrative/political divisions for the dam, and the dam's completion year. The river on which the dam was constructed was also considered for all regions except Cambodia, as the register does not contain such an attribute. For each of the key attributes, we considered values in WRD and the regional register to agree with each other if the similarity score between the value sequences exceeded about 85% (meaning that there are more than 8 pairs of identical elements, with consideration of their orders, between two 10-element string sequences). This similarity threshold tolerated minor variations in spelling that often occur among different data sources. If an agreement was not reached between the two full sequences (e.g., "Maharashtra Pradesh" and "Maharashtra"), the similarity was then tested between the main subsets of the sequences in order to increase the matching success.
Figure 3. Schematic procedure of geo-matching regional registers. Text in roman indicates applied or produced datasets, and text in italics indicates methods or procedures.
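The attribute comparison described above can be sketched as follows. This is a minimal illustration, not the authors' script: it assumes Python's `difflib.SequenceMatcher` as the similarity measure (its ratio is 2M/T, so 9 matching characters between two 10-character strings give 0.9 and 8 give 0.8, consistent with the "about 85%" description), and it assumes that the "main subset" of a value is its first word. The function names and the fallback rule are hypothetical.

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.85  # "about 85%" in the text


def similarity(a: str, b: str) -> float:
    """Order-aware similarity in [0, 1] between two case-folded strings."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def values_agree(wrd_value: str, register_value: str) -> bool:
    """Return True if two attribute values agree under the threshold.

    If the full strings disagree, fall back to comparing the first word
    of each value (an assumed stand-in for the "main subsets" in the text).
    """
    a, b = wrd_value.strip(), register_value.strip()
    if not a or not b:
        return False  # null or empty values can never agree
    if similarity(a, b) > SIMILARITY_THRESHOLD:
        return True
    return similarity(a.split()[0], b.split()[0]) > SIMILARITY_THRESHOLD


# The full strings differ too much, but their first words are identical.
print(values_agree("Maharashtra Pradesh", "Maharashtra"))
# A one-letter spelling variation passes the threshold directly.
print(values_agree("Grand Coulee", "Grand Coule"))
```

A first-word fallback is deliberately crude; a real implementation would likely need language-specific handling of generic terms such as "dam", "lake", or "reservoir".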
One of the geo-matching challenges was that the levels of political/administrative divisions are not always comparable or consistent between WRD and the regional registers. In WRD, the divisions were provided at the levels of country, state/province, and the nearest town/city, which are inconsistent with some of the registers. For example, the register for Brazil (Dams Safety Report in 2017) provides the finest division at the county level, whereas the European inventory (from the MARS (Managing Aquatic ecosystems and water Resources under multiple Stress) project) documents no divisions below the national level. To improve the feasibility of division comparison, we performed a "reverse geocoding" for each georeferenced regional register using the Google Maps geocoding API. Opposite to regular (or "forward") geocoding, which converts a nominal location to numeric spatial coordinates, this reverse geocoding converted the spatial coordinates of each dam documented in the register to a parsed address that contains administrative divisions at consecutive levels. These multilevel divisions and subdivisions were appended to the original regional registers (Fig. 3), thus enabling a more flexible and complete comparison with the WRD attributes and an increased success rate of geo-matching.
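As a rough illustration of how a reverse-geocoded address can be parsed into division levels, the sketch below reads the `address_components` list that the Google Maps Geocoding API returns for each result. With the `googlemaps` Python client, a response of this shape comes from `Client.reverse_geocode((lat, lng))`. The sample response is hand-written with placeholder names, not real API output, and the choice of which component types to keep is an assumption.

```python
from typing import Dict, List

# Component types retained as division levels (an assumed selection).
DIVISION_TYPES = (
    "country",
    "administrative_area_level_1",
    "administrative_area_level_2",
    "locality",
)


def extract_divisions(results: List[dict]) -> Dict[str, str]:
    """Collect the first long_name found for each division type."""
    divisions: Dict[str, str] = {}
    for result in results:
        for component in result.get("address_components", []):
            for component_type in component.get("types", []):
                if component_type in DIVISION_TYPES:
                    divisions.setdefault(component_type, component["long_name"])
    return divisions


# Hand-written placeholder response (hypothetical, for illustration only).
sample_results = [
    {
        "formatted_address": "Sampletown, Example State, Exampleland",
        "address_components": [
            {"long_name": "Sampletown", "short_name": "Sampletown",
             "types": ["locality", "political"]},
            {"long_name": "Sample County", "short_name": "Sample County",
             "types": ["administrative_area_level_2", "political"]},
            {"long_name": "Example State", "short_name": "ES",
             "types": ["administrative_area_level_1", "political"]},
            {"long_name": "Exampleland", "short_name": "EX",
             "types": ["country", "political"]},
        ],
    }
]

divisions = extract_divisions(sample_results)
print(divisions["administrative_area_level_1"])  # Example State
```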
We considered a WRD record matched with a regional record if their agreements on the key attributes warranted a reasonable confidence that the two are the same dam. In principle, a high confidence would require a unanimous agreement on all key attributes. However, this ideal scenario was often unnecessary and sometimes impossible. One of the reasons is that the key attributes do not always have valid values. In WRD, for instance, the values of "nearest town" for nearly all (>99%) US dams are null. While this attribute is valid for most other dams, the nearest town/city in WRD is not necessarily the division that administrates or contains the dam, as is the case for the township in some regional registers. Another reason is that our collected multi-source datasets were not collated by a universal standard. As a result, inherent discrepancies in the attribute definitions and/or values may exist among the datasets. One example is the dam's "completion year", which could be ambiguous between the year when the dam construction was concluded and the year when the dam operation was initiated or commissioned. These two definitions do not necessarily lead to the same year. To address such inconsistencies, we defined a baseline scenario that required any pair of matched WRD and regional records to agree on the following:
In compliance with this baseline, we implemented an automated QA to filter out any matching errors and optimize the matching accuracy for each WRD record. In brief, any match that did not meet the baseline scenario was removed, and the remaining geo-matched pairs were ranked to three discrete QA levels (M1, M2, and M3) according to the quality of attribute agreements (see definitions in Supplementary Table S1). As the QA rank increases (from M3 to M1), agreements on the key attributes improve from the baseline to the ideal scenario (i.e., a unanimous agreement). If a WRD record was matched to more than one record in the regional register, the QA selected the match with the best rank. This way, each georeferenced WRD record was only matched to the best-ranking regional record. Users may refer to the provided QA ranks as a measure of the general reliability of each geo-matched location. It is worth noting that our geo-matching purpose was to acquire the spatial coordinates of any matched WRD record from the regional register, rather than collating or correcting any existing attribute values. In other words, some of the WRD and regional records may actually refer to the same dams but were matched unsuccessfully due to major discrepancies between their attribute values. This led to a conservative success rate in our automated geo-matching. More technical details about QA are given in our Python scripts at https://github.com/jidawang/georeferencing-ICOLD-dams-and-reservoirs.
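The selection of the best-ranking match per WRD record can be sketched as below. This is an illustrative sketch only: the tuple layout, the identifiers, and the use of `None` for a match that fails the baseline are assumptions, and the actual rank definitions are those in Supplementary Table S1 and the authors' scripts.

```python
from typing import Dict, Iterable, Optional, Tuple

RANK_ORDER = {"M1": 0, "M2": 1, "M3": 2}  # M1 is the best rank

# A candidate is (wrd_id, register_id, rank); rank is None when the pair
# does not meet the baseline scenario (hypothetical convention).
Candidate = Tuple[str, str, Optional[str]]


def select_best_matches(candidates: Iterable[Candidate]) -> Dict[str, Tuple[str, str]]:
    """Keep, for each WRD record, only the regional record with the best rank."""
    best: Dict[str, Tuple[str, str]] = {}
    for wrd_id, register_id, rank in candidates:
        if rank not in RANK_ORDER:
            continue  # below the baseline scenario: discard the match
        current = best.get(wrd_id)
        if current is None or RANK_ORDER[rank] < RANK_ORDER[current[1]]:
            best[wrd_id] = (register_id, rank)
    return best


candidates = [
    ("W1", "R1", "M3"),
    ("W1", "R2", "M1"),  # better rank for the same WRD record
    ("W2", "R3", "M2"),
    ("W3", "R4", None),  # fails the baseline and is dropped
]
best_matches = select_best_matches(candidates)
print(best_matches)
```

Ties between equally ranked candidates are resolved here by keeping the first one encountered; the source does not state how such ties were handled.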
Following the automated QA, we performed a manual QC to ensure the accuracy of the geo-matching results. We went through each geo-matched WRD record to examine whether its attributes (e.g., dam/reservoir name, administrative locations, river name, construction year, and storage capacity) indeed agreed with those of the regional source. If an evident discrepancy was identified, the match was removed from the final result. Although we made every endeavour to be as rigorous as possible, remnant matching errors are still possible due to the challenges of incompleteness and intrinsic errors in the attribute information (refer to Section 4 for accuracies). Our manual QC identified ~4% error in the geo-matched WRD records, most of which came from QA rank M3. After removing these errors, the geo-matching process concluded with a total of 13,301 WRD records georeferenced (Fig. 3), including 3275, 7039, and 2987 for QA ranks M1, M2, and M3, respectively (Supplementary Table S1). The success rate, i.e., the number of geo-matched dams as a percentage of the number of WRD records, varies from about 40% in Southeast Asia to about 80% in South Africa and the US (Table 2), with an overall success rate of 71% in all geo-matched regions (Fig. 3).

Geocoding via Google Maps
The subset of ICOLD WRD that was not geo-matched includes the remaining 5403 dams in the geo-matched regions and the entire 38,146 dams in the other regions of the world (Fig. 2a). For these dams, we applied the Google Maps geocoding API, a sophisticated cloud-based geocoding service, to retrieve the spatial coordinates of each dam as thoroughly and accurately as possible. To do so, we designed a recursive geocoding procedure that implemented three primary steps on each dam:
The forward geocoding (see Section 2.1 for definition) used the text address of each dam as the input, which we formatted by concatenating the WRD attribute values, to output the latitude and longitude of the dam. The WRD attributes used for address formatting include dam name, reservoir name, state/province, and country. "Nearest town" was excluded because it is not always the township administrating the dam or reservoir. Together with the spatial coordinates, the forward geocoding also output a Google Maps address associated with the coordinates, which was parsed to individual components including feature name, street name, and political divisions. These output address components, in return, provided valuable information for QA: if the geocoded coordinates are correct, the associated output address components should agree well with those of the WRD input. However, we noticed that address components from forward geocoding are often limited in terms of division levels. To compensate for this limitation, we utilized reverse geocoding (see Section 2.2 for definition) to convert the coordinates from forward geocoding to an updated address with more complete division levels. The address components from both forward and reverse geocoding were combined and are hereafter referred to as the "output address". Similar to geo-matching, we employed a QA filter to approach the optimal geocoding result.
This process first arranged the attributes of each WRD record to several address formats as they could result in different geocoding outputs. The address arrangements are listed in Supplementary Table S2, and their preference order is rationalized in Supplementary Text. Each of these WRD addresses was used iteratively for both forward and reverse geocoding (as described above). Their geocoded spatial coordinates were then ranked to five discrete QA levels based on how well the input and output addresses agree on individual components (C1 to C5, Supplementary Table S3). The iteration was terminated if the highest QA rank was achieved; otherwise, the coordinates that rendered the best possible QA rank were used as the geocoding result. As explained in Supplementary Table S3, the compared address components include the name of the dam or reservoir and its affiliated political divisions from town/city to country levels. Consistent with geo-matching, we considered that a component was agreed on if the similarity of its values from both input and output addresses exceeded about 85%. Since the nearest town in WRD was not used for forward geocoding, we treated it as an "independent reference" for validating the township component in the output address. Although the town or city near the dam (from WRD) does not always coincide with that administrating the dam (from the geocoding output), their occasional agreement would strengthen our confidence in the geocoded coordinates if other components were also well matched between the WRD input and the geocoding output. For this reason, we opted to include the township comparison as a supplementary criterion in the geocoding QA process. The highest QA rank (C1) corresponds to a unanimous agreement on all address components. However, the minimum rank (C5) only required the agreement on dam or reservoir name, which is a more flexible baseline in comparison with that for geo-matching. This was because some of the large reservoirs, particularly those on or near political boundaries, have shared or ambiguous divisions, and the ambiguity might be further amplified by the geocoded coordinates, which could fall anywhere from the dam to across the reservoir water surface. Since we aimed to maximize the quantity of georeferenced records, a flexible baseline scenario was purposely adopted to keep as many geocoded dams as possible. As a result, the automated geocoding procedure yielded a total of 16,088 WRD records, each with a pair of optimal spatial coordinates and the corresponding QA rank.
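The iterative geocoding with early termination can be sketched as follows. This is a hedged sketch, not the published procedure: the record field names, the three address arrangements, and the `geocode` and `rank_result` callables are all hypothetical. In practice `geocode` would wrap the forward and reverse geocoding calls, the real arrangements are those in Supplementary Table S2, and the rank rules are those in Supplementary Table S3.

```python
from typing import Callable, Dict, List, Optional, Tuple

RANKS = ["C1", "C2", "C3", "C4", "C5"]  # C1 is the best rank
Coordinates = Tuple[float, float]

# Illustrative address arrangements in preference order (assumed).
ARRANGEMENTS = [
    ("dam_name", "state_province", "country"),
    ("reservoir_name", "state_province", "country"),
    ("dam_name", "country"),
]


def build_addresses(record: Dict[str, str]) -> List[str]:
    """Concatenate attribute values into candidate input addresses."""
    addresses: List[str] = []
    for fields in ARRANGEMENTS:
        parts = [(record.get(f) or "").strip() for f in fields]
        if all(parts):  # skip arrangements with a missing attribute
            address = ", ".join(parts)
            if address not in addresses:
                addresses.append(address)
    return addresses


def geocode_record(
    record: Dict[str, str],
    geocode: Callable[[str], Optional[Tuple[Coordinates, dict]]],
    rank_result: Callable[[Dict[str, str], dict], Optional[str]],
) -> Optional[Tuple[Coordinates, str]]:
    """Try each address; stop at C1, otherwise keep the best-ranked result."""
    best: Optional[Tuple[int, Coordinates]] = None
    for address in build_addresses(record):
        result = geocode(address)
        if result is None:
            continue  # the service returned nothing for this address
        coords, output_address = result
        rank = rank_result(record, output_address)
        if rank is None:
            continue  # below the minimum (C5) baseline
        index = RANKS.index(rank)
        if best is None or index < best[0]:
            best = (index, coords)
        if index == 0:
            break  # highest rank reached: terminate the iteration
    if best is None:
        return None
    return best[1], RANKS[best[0]]
```

Passing the geocoder and the ranking rule as arguments keeps the loop testable without network access, since both can be replaced by stubs.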
To complement the automated QA process, we then performed a rigorous QC to manually identify and remove geocoding errors. In principle, we reviewed each of the geocoded points against high-resolution Google Earth and Esri images, and deleted any identified error where (a) no dam or reservoir could be visibly verified or (b) the WRD attribute information is inconsistent with the feature or division labels on Google Maps. The geo-matched coordinates from regional registers are usually on or close to the dam bodies, but the geocoded coordinates could be located on the reservoir. The latter case was not considered as an error. Due to China's GPS shift problem, geocoded points in mainland China tended to show a systematic offset of roughly 500 m from their actual dam or reservoir features. For such Chinese dams, we tried to reduce their geocoding offsets as much as possible, by manually relocating the coordinate points to their correct dams or reservoirs. Our QC process ended up removing about 42% of the originally geocoded dams, most of which stemmed from relatively lower QA ranks (see statistics in Supplementary Table S3). The complete geocoding procedure resulted in 9410 georeferenced and quality controlled WRD records, with an overall success rate of 22%.

Supplementation with other global inventories
The outputs from both geo-matching and geocoding, a total of 22,711 georeferenced ICOLD WRD records (Fig. 2a), were further supplemented or harmonized by two global dam/reservoir inventories to improve our inclusion of the world's largest dams. We considered this process necessary for two reasons. First, our georeferencing process, particularly geocoding via Google Maps API, did not warrant an exhaustive inclusion of the largest dams. This is particularly evident for regions where the address and label information in Google Maps is either lacking or difficult to pass the automated QA due to language ambiguity or naming discrepancies. Second, through cross-referencing we noted that the attribute values of reservoir storage capacity provided in ICOLD WRD are occasionally erroneous (also noted by Mulligan et al. (2020)), e.g., by a factor of 1000, probably caused by unit confusion in WRD compilation. As part of the supplementation/harmonization process, we therefore collated the ICOLD reservoir storage capacities with those in the two global inventories below and corrected any evident errors in ICOLD.

Supplementation with Wada et al. (2017)
Among them, 139 dams were provided with spatial coordinates. We verified each of the dam locations and made minor adjustments and corrections to further assure the quality. The attributes of these 139 dams were then manually compared with those in ICOLD WRD. We found that 133 of them were documented in WRD but 32 were georeferenced unsuccessfully in our geo-matching or geocoding procedure. Therefore, we borrowed the spatial coordinates of these 32 large dams in Wada et al. (2017) to supplement what we had georeferenced. The coordinates of the other 101 large dams, which we georeferenced successfully (41 from geo-matching and 60 from geocoding), were also overwritten by those in Wada et al. (2017) to double-assure and improve their spatial accuracies. This supplementation is illustrated by the Venn diagram in Fig. 2a.
We then compared the storage capacities of each of the 133 dams in Wada et al. (2017) with those in WRD and identified 22 of them exhibiting substantial discrepancies between the two datasets. We then collated their storage capacities with other documents (e.g., regional inventories, GRanD, and Wikipedia) and concluded that Wada et al. (2017) supersedes WRD in the accuracy of storage capacity for 16 of the 22 dams. Therefore, the storage capacities of these 16 dams in Wada et al. (2017) were used to replace the original WRD capacities. The entire supplementation process, including adding new dams, updating existing dam coordinates, and correcting reservoir storage capacities, increased the total storage capacity of our georeferenced dams by 15%, and 70% of the capacity increase comes from the 32 added large dams. For improved clarity, it is worth reiterating that all dams supplemented by Wada et al. (2017) were also documented in ICOLD WRD. The combined results of geo-matching and geocoding, after the supplementation from Wada et al. (2017), define GeoDAR v1.0, which contains 22,743 georeferenced records in ICOLD WRD.
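As a quick consistency check, the counts and rates reported for GeoDAR v1.0 can be reproduced with a few lines of arithmetic. All numbers below are taken from the text; only the variable names are introduced here.

```python
# Counts reported in the text
wrd_unique = 56_850            # WRD records after duplicate removal
geo_matched = 13_301           # georeferenced by geo-matching
unmatched_in_regions = 5_403   # not geo-matched, within geo-matched regions
other_regions = 38_146         # WRD dams in all other regions
auto_geocoded = 16_088         # geocoded records before manual QC
geocoded = 9_410               # geocoded records after manual QC
added_from_wada = 32           # large dams added from Wada et al. (2017)

# The three WRD subsets partition the cleaned register.
assert geo_matched + unmatched_in_regions + other_regions == wrd_unique

before_wada = geo_matched + geocoded
geodar_v10 = before_wada + added_from_wada
assert before_wada == 22_711
assert geodar_v10 == 22_743

geo_matching_rate = geo_matched / (geo_matched + unmatched_in_regions)
geocoding_rate = geocoded / (unmatched_in_regions + other_regions)
qc_removed_share = 1 - geocoded / auto_geocoded

print(round(100 * geo_matching_rate))  # 71
print(round(100 * geocoding_rate))     # 22
print(round(100 * qc_removed_share))   # 42
print(round(100 * geodar_v10 / wrd_unique))  # 40
```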

Harmonization with GRanD: forming GeoDAR v1.1
While GeoDAR v1.0 largely exceeds GRanD in dam count, a visual comparison of their spatial distributions revealed that the latter is often complementary to (instead of completely duplicated by) the former in many regions of the world. This motivated us to perform a systematic harmonization between the two datasets. The merged version, which we named GeoDAR v1.1, combines the merits of GRanD in accurately documenting the world's largest dams and GeoDAR v1.0 in providing extensive spatial details of smaller but more widespread dams.
We assumed that GRanD, by having collated multiple data sources, is superior to GeoDAR v1.0 in the accuracies of both spatial locations and attribute values (particularly reservoir storage capacity) of the world's largest dams. While this may be true for most cases, we identified 69 dams in GRanD that exhibit evident georeferencing or attribute errors. These dams were excluded from the harmonization process. Another 5 dams in GRanD that were documented as subsumed or replaced by new dams were also excluded. For user convenience, we released these 74 GRanD dams together with suggested new coordinates (if possible) in Supplementary Table S4.
Detailed processing for each of the objectives is given below.
First, when a dam in GeoDAR v1.0 also exists in GRanD, the spatial coordinates of the former were replaced by those of the latter. We implemented a two-step procedure to identify the overlapping dams between GeoDAR v1.0 and GRanD. Step 1 was based on attribute association while Step 2 utilized spatial query. Specifically, Step 1 detected matching records between ICOLD WRD and GRanD by assessing agreements on several attributes, including dam/reservoir names, administrative divisions, impounded rivers, and completion years. This step was essentially the same as the "geo-matching" that was used to link WRD records to regional registers for GeoDAR v1.0 (Section 2.2). The association results, after a meticulous manual QC, identified ~4660 dams in GRanD that were georeferenced in GeoDAR v1.0. For the remaining GRanD dams, Step 2 utilized their reservoir polygons to spatially intersect with the dam points in GeoDAR v1.0. A distance tolerance of ~5 km was applied to assist the spatial association and account for possible offsets in GeoDAR v1.0. As part of the QC, the attribute values of each pair (one from GRanD and the other from WRD) were manually compared to determine whether they are indeed the same dam. This step identified another ~350 overlapping dams between the two datasets. In total, we found that GeoDAR v1.0 overlaps 5011 out of the 7246 dams in GRanD, and their spatial coordinates were updated to be consistent with those in GRanD.
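The spatial query in Step 2 can be approximated with a simple proximity test. The sketch below is a simplification and not the authors' GIS workflow: instead of a true polygon intersection with a buffer, it measures the great-circle (haversine) distance from each dam point to the nearest polygon vertex and flags pairs within 5 km as candidates for manual comparison. A dam lying deep inside a very large reservoir polygon, far from any vertex, would be missed by this shortcut. The data structures and identifiers are hypothetical.

```python
import math
from typing import Dict, List, Tuple

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius
LatLon = Tuple[float, float]


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def candidate_pairs(
    dam_points: Dict[str, LatLon],
    reservoir_vertices: Dict[str, List[LatLon]],
    tolerance_km: float = 5.0,
) -> List[Tuple[str, str, float]]:
    """Return (reservoir_id, dam_id, distance_km) for pairs within tolerance."""
    pairs = []
    for reservoir_id, vertices in reservoir_vertices.items():
        if not vertices:
            continue  # nothing to compare against
        for dam_id, (lat, lon) in dam_points.items():
            nearest = min(haversine_km(lat, lon, vlat, vlon) for vlat, vlon in vertices)
            if nearest <= tolerance_km:
                pairs.append((reservoir_id, dam_id, nearest))
    return pairs


# Placeholder coordinates for illustration only.
dams = {"D1": (0.0, 0.0), "D2": (1.0, 1.0)}
reservoirs = {"G1": [(0.03, 0.0), (0.05, 0.02), (0.04, -0.01)]}
pairs = candidate_pairs(dams, reservoirs)
print([(r, d) for r, d, _ in pairs])  # [('G1', 'D1')]
```

Each flagged pair is only a candidate; as in the text, attribute values would still need to be compared before accepting a pair as the same dam.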
Second, for the remaining 2235 dams in GRanD that do not overlap GeoDAR v1.0, we assumed that at least part of them could be matched to the WRD records not georeferenced in GeoDAR v1.0. Therefore, we performed another round of 365 attribute association between the remaining subsets of GRanD and WRD. After QC, this process identified another 1425 WRD dams that are included by GRanD. These additional WRD dams, with a total storage capacity of 605 km 3 , were then added to our inventory using the spatial coordinates provided in GRanD. As a result of the first two objectives, GeoDAR v1.1 georeferenced 24,168 (43%) out of the 56,850 dams in ICOLD WRD, including 6436 that overlap GRanD.
Third, to reduce the impact of possible attribute errors in ICOLD WRD, we next merged the values of reservoir storage capacity from both WRD and GRanD to a single updated attribute, where the original values in WRD or Wada et al. (2017) were overwritten by those of the overlapping dams in GRanD. This correction led to a minor increase of 2.4 km³ (less than 0.1%) in the total reservoir storage capacity. Eventually, the remaining 810 dams in GRanD, which were not found in WRD, were appended to our georeferenced WRD so that the final inventory absorbed the entire dataset of GRanD. It is worth noting that similar to geo-matching (Section 2.2), our attribute association here could be conservative, meaning that some of the dams appended from GRanD might be documented in the remaining WRD (the subset not georeferenced successfully).
The complete harmonization process, combining the above three steps, led to a total of 24,978 georeferenced dams in GeoDAR v1.1 (Fig. 2b).
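The dam counts reported across the three steps can be cross-checked with simple bookkeeping. The sketch below uses only numbers stated in this paper (the v1.0 count of 22,743 is given in the product description); the variable names are ours.

```python
# Counts taken from the text.
geodar_v10 = 22_743   # WRD dams georeferenced in GeoDAR v1.0
grand_used = 7_246    # GRanD v1.3 dams retained for harmonization
overlap_v10 = 5_011   # GRanD dams already georeferenced in v1.0
added_wrd = 1_425     # additional WRD dams matched to the remaining GRanD dams

grand_not_in_v10 = grand_used - overlap_v10   # GRanD dams outside v1.0
wrd_georeferenced = geodar_v10 + added_wrd    # WRD dams georeferenced in v1.1
wrd_in_grand = overlap_v10 + added_wrd        # WRD dams shared with GRanD
grand_only = grand_used - wrd_in_grand        # GRanD dams not found in WRD
geodar_v11 = wrd_georeferenced + grand_only   # all dam points in v1.1

print(grand_not_in_v10, wrd_georeferenced, wrd_in_grand, grand_only, geodar_v11)
```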

Retrieving reservoir boundaries
Reservoir polygons of the georeferenced dam points were retrieved as thoroughly as possible from three global water body datasets: GRanD reservoirs (Lehner et al., 2011), HydroLAKES v1.0 (Messager et al., 2016), and UCLA Circa-2015 Lake Inventory (Sheng et al., 2016). These three water body datasets exhibit an increasing spatial resolution: from 7000+ polygons in GRanD reservoirs provided exclusively for GRanD's dam points, to millions of water body polygons, including both natural lakes and reservoirs, in the other two datasets. While HydroLAKES documents 1.4 million water bodies larger than 0.1 km² (10 ha), the Landsat-based UCLA Circa-2015 Lake Inventory further reduced the minimum size to only 0.004 km² (0.4 ha), resulting in another 7.7 million water bodies on the global continental surface. Accordingly, we implemented a hierarchical procedure, where the three water body datasets were applied in ascending order of spatial resolution to retrieve the reservoir boundaries with an overall decreasing size.
Specifically, GRanD v1.3 provides 7181 reservoir polygons for the 7246 collected dam points. The remaining 65 dams without reservoir polygons are either river barrages and thus have no proper reservoirs, or infrastructures that were too recent to have filled impoundments. Other rarer cases also include dams that were abandoned or to be constructed (Lehner et al., 2011). These 7181 reservoir polygons were assigned to their associated dam points in GeoDAR v1.1 through GRanD IDs.
Reservoirs of the remaining 17,732 dam points in GeoDAR v1.1, which were georeferenced from ICOLD alone, were next retrieved from HydroLAKES when possible. To avoid duplicates in the reservoirs retrieved from different data sources, we only used the subset of HydroLAKES that is spatially independent from (i.e., not intersecting with) GRanD reservoirs. Different from reservoir assignment using GRanD, there was no common attribute ID to pair HydroLAKES polygons with the remaining dam points, so their reservoir retrieval relied completely on spatial association. One major challenge in dam-reservoir spatial association was the ambiguity caused by the offsets between our georeferenced dam points and their actual reservoir polygons (see Section 2.3).
To tackle this ambiguity, we designed a procedure that consists of three rounds of iteration to progressively optimize reservoir-dam association. This procedure was based on two assumptions, both conditional on a reasonable spatial tolerance.
We started with 500 m to be roughly consistent with the georeferencing offset observed in China. The first assumption was that larger reservoirs are more likely to be documented than smaller ones, in both ICOLD WRD and Google Maps. Therefore, the first round of iteration assigned each of the dams to the largest water body within the tolerance. This assignment might, however, lead to a situation where multiple dams were assigned to the same reservoir. To untangle this situation, the remaining iterations assumed Tobler's First Law of Geography (Tobler, 1970): "everything is related to everything else, but near things are more related than distant things" (p. 236). Accordingly, for any water body mistakenly associated with multiple dams, the second round of iteration reassigned the water body to its closest dam, and the other dam(s) within the tolerance, as a result, was/were left unpaired. To reduce the number of such "orphan" dams, a final, third round of iteration assigned the remaining unpaired dams to the next closest water body that was within the spatial tolerance and had not been previously associated with any dams. If this led to multiple dams associated with one reservoir again, only the dam with the closest proximity to the reservoir was kept. Through experimentation, we opted to implement this three-iteration procedure twice, first using a conservative 500-m tolerance to maximize the accuracy for most associations, and then a 1-km tolerance to further minimize the number of orphan dams.
This multi-iteration procedure retrieved ~7600 reservoir polygons from HydroLAKES. For the remaining dam points left unpaired, we applied the same association procedure to continue retrieving their reservoirs from the high-resolution UCLA Circa-2015 Lake Inventory. Similarly, only the subset that does not intersect with the retrieved HydroLAKES polygons was considered, in order to avoid duplicates in the retrieved reservoirs from different datasets. The use of UCLA Circa-2015 Lake Inventory retrieved another ~6700 reservoirs. A manual QC was performed on the combined result to confirm that each retrieved reservoir polygon was matched to the correct dam point, and if not, we tried to adjust the association as thoroughly as possible.
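The three rounds of dam-reservoir association described above can be sketched as follows. This is an illustrative simplification rather than the released GeoDAR scripts: water bodies are reduced to a single representative point with an area (the actual procedure works with polygons), coordinates are assumed to be planar and in metres, and the function name, input layout, and toy data are hypothetical. "Next closest" is interpreted here as the closest water body that no dam has claimed yet.

```python
from math import hypot


def associate(dams, waters, tol):
    """Pair dams with water bodies in three rounds.

    dams:   {dam_id: (x, y)}
    waters: {water_id: (x, y, area)}  # representative point, an assumption
    tol:    spatial tolerance in the same units as x and y
    Returns {dam_id: water_id} with at most one dam per water body.
    """
    def dist(d, w):
        return hypot(dams[d][0] - waters[w][0], dams[d][1] - waters[w][1])

    def within(d, exclude=()):
        return [w for w in waters if w not in exclude and dist(d, w) <= tol]

    def keep_closest(claims):
        # claims: {water_id: [dam_ids]}; keep only the closest dam per water body.
        return {min(ds, key=lambda d: dist(d, w)): w for w, ds in claims.items()}

    # Round 1: each dam claims the largest water body within the tolerance.
    claims = {}
    for d in dams:
        candidates = within(d)
        if candidates:
            w = max(candidates, key=lambda w: waters[w][2])
            claims.setdefault(w, []).append(d)

    # Round 2: a water body claimed by several dams goes to its closest dam.
    pairs = keep_closest(claims)

    # Round 3: orphan dams try the closest water body not claimed so far;
    # remaining conflicts are again resolved in favour of the closest dam.
    taken = set(pairs.values())
    claims = {}
    for d in dams:
        if d in pairs:
            continue
        candidates = within(d, exclude=taken)
        if candidates:
            w = min(candidates, key=lambda w: dist(d, w))
            claims.setdefault(w, []).append(d)
    pairs.update(keep_closest(claims))
    return pairs


# Toy data: dams A and B both see the large water body W1; C sees nothing.
dams = {"A": (0, 0), "B": (300, 0), "C": (5000, 5000)}
waters = {"W1": (100, 0, 10.0), "W2": (400, 0, 1.0), "W3": (9000, 9000, 2.0)}
print(associate(dams, waters, tol=500))
```

In this toy case A keeps W1 because it is closer, B falls back to W2 in the third round, and C stays unpaired. Running the function a second time on the unpaired dams with a 1000 m tolerance would mimic the two-pass strategy described in the text.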

Product components and usage
We here provide a detailed documentation of the components and structure of the GeoDAR versions (v1.0 and v1.1). To facilitate the description, the two GeoDAR versions and their component statistics are explained in Table 1, and spatial distributions of the dam points and reservoir polygons are visualized in Figs. 6 and 7.

GeoDAR v1.0: dams
GeoDAR v1.0 is a collection of 22,743 dam points georeferenced exclusively for ICOLD WRD (Fig. 6a). Among them, 13,260 or 58% were retrieved from geo-matching regional dam registers, 9350 or 41% from Google Maps geocoding API, and the remaining 133 largest dams from the spatial inventory in Wada et al. (2017) (Fig. 6b). For improved accuracies, WRD storage capacities of most of these 133 large reservoirs were replaced by the values in Wada et al. (2017) (see Section 2.4.1), and unless stated otherwise, our following statistics on storage capacities were calculated after this replacement.
The total reservoir storage capacity of the 22,743 dams is 6435.5 km³, meaning that GeoDAR v1.0 georeferenced 40% of the 56,850 WRD records but included more than 80% of their cumulative reservoir storage capacity. The total storage capacity of the 133 largest dams from Wada et al. (2017), despite being limited in number, reaches 3900 km³ or 61% of the cumulative storage capacity in GeoDAR v1.0, and the other ~40% capacity was split almost equally between the remaining 22,000+ geo-matched and geocoded dams. Although the registers used for geo-matching are regional, the dams in GeoDAR v1.0, as shown in Fig. 6b, are distributed in 148 out of the 164 countries in WRD (including ICOLD member and nonmember countries), largely owing to our geocoding efforts through Google Maps API. Since GeoDAR v1.0 was produced independently from other global dam datasets such as GRanD, it can also be used to cross-compare, supplement, and potentially improve other dam datasets. Validation of our georeferencing accuracy for v1.0 is provided in Section 4.

GeoDAR v1.1: dams and reservoirs
GeoDAR v1.1 consists of a) 24,978 dam points (Fig. 6a) representing a full harmonization between GeoDAR v1.0 and GRanD v1.3, and b) 21,576 reservoir polygons (Fig. 7). Of these nearly 25,000 dam points, 17,732 or 71% come from GeoDAR v1.0 alone, 6436 or 26% are shared by ICOLD WRD and GRanD, and the other 810 or 3% come from GRanD alone (Table 1; Fig. 6c). With the harmonization, the spatial coverage of the dam points in GeoDAR v1.1 increased to 154 out of the 164 countries in WRD. As described in Section 2.4.2, we substituted the reservoir storage capacities in GRanD for the original capacity values of their overlapping WRD dams. As a result, the total reservoir storage capacity in GeoDAR v1.1 reaches 7296.6 km³, which is equivalent to 95% of the cumulative capacity in the entire ICOLD WRD (see Section 5.1 for more comparisons with ICOLD). As reported in Table 1, 82% (5996 km³) of the total storage capacity in GeoDAR v1.1 is explained by the 5011 relatively large dams georeferenced in both GeoDAR v1.0 and GRanD. The 17,732 smaller dams from GeoDAR v1.0 alone contribute only 6% (428 km³) of the total storage capacity, which is roughly comparable to the subset from GRanD alone (268 km³) or the subset from GRanD and other ICOLD WRD (605 km³). These capacity contributions suggest that compared to GRanD, the major improvement of GeoDAR lies in the increased number of relatively small dams, rather than the increase in total storage capacity of the dams (see Section 5.2 for more comparisons with GRanD).
Different from GeoDAR v1.0, version 1.1 also includes a component of reservoir polygons which represent water impoundment extents associated with 21,576 or 86% of the georeferenced dam points (Fig. 7).
Reservoir polygons for the remaining 14% of the dam points could not be retrieved due to a combination of factors, including limited spatial resolutions of the applied water masks, offsets in our georeferenced dam points, and the fact that some of the dams (e.g., river barrages) have no evident water impoundments. Nevertheless, the retrieved reservoir polygons have a cumulative area of 493,860 km², accounting for 96% of the total reservoir area of all georeferenced dams in GeoDAR v1.1 (reservoir areas without polygons are based on WRD attributes). These reservoir polygons also correspond to a cumulative storage capacity of 7117 km³, accounting for nearly 98% of the total storage capacity in v1.1. These statistics indicate that the reservoirs whose boundaries could not be retrieved were mostly small in area and storage.
The numbers of reservoir polygons retrieved from each of the three water body datasets are fairly comparable (roughly 7000 each), but the total reservoir storage capacity and area both decrease drastically with the increasing spatial resolution of the water body datasets (Table 1). As a result, the mean reservoir polygon size decreased from 66 km² for those identified from GRanD, to 2 km² from HydroLAKES and then less than 1 km² from the UCLA Circa-2015 Lake Inventory. This result is overall consistent with the design of our hierarchical procedure (Section 2.5), where smaller reservoirs were successively retrieved with the help of finer water masks. It is important to note that the retrieved polygons do not always represent the maximum water extents of the reservoirs because water boundaries in the retrieval sources were not necessarily mapped in the maximum inundation periods. For example, the UCLA Circa-2015 Lake Inventory contains approximately 9.5 million water bodies larger than 0.4 ha, which were mapped from Landsat images acquired during the "steady" climate periods (Lyons and Sheng, 2018) and thus represent the average seasonal extent of each water body (Sheng et al., 2016). Despite not always being the largest water extents, our retrieved reservoir polygons enhanced the spatial details of global reservoir locations, using which users can further expand or refine the water boundaries to their specific needs.

Attributes and usage
The GeoDAR dataset, including dam points for v1.0 and both dam points and reservoir polygons for v1.1, is provided as three separate shapefiles. For user convenience, we also duplicated the two dam point shapefiles in the comma-separated values (csv) format. The file names and attributes are explained in Table 3. Although most of our dam points were georeferenced using WRD records, our published GeoDAR complies with the proprietary rights of ICOLD and does not directly release any attribute from WRD. The attributes we provide in GeoDAR, as listed in Table 3, are limited to our georeferencing methods, QA/QC, validation, and other information (such as spatial coordinates and part of the reservoir storage capacities) that is already open source or has been permitted for use by the original producers.
Table 3. Attributes in the data products of GeoDAR
Note: Missing or inapplicable values are flagged by "Null" for text-type attributes and "-999" for numeric-type attributes.
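As a usage hint for the csv copies of the dam points, the missing-value flags in the note above can be converted to proper missing values on load. The following is a minimal sketch using only the Python standard library; the column names and the two data rows are hypothetical placeholders rather than the actual GeoDAR attribute names (see Table 3 for those).

```python
import csv
import io

NUMERIC_MISSING = -999.0   # flag for numeric-type attributes
TEXT_MISSING = "Null"      # flag for text-type attributes

# Stand-in for an opened GeoDAR csv file; columns and values are invented.
sample = io.StringIO(
    "dam_id,lat,lon,capacity_mcm,source\n"
    "1,10.0,20.0,-999,Null\n"
    "2,11.5,21.5,12.5,geocoded\n"
)


def clean(value: str):
    """Map the documented missing-value flags to None; parse numbers as float."""
    if value == TEXT_MISSING:
        return None
    try:
        number = float(value)
    except ValueError:
        return value  # ordinary text attribute
    return None if number == NUMERIC_MISSING else number


rows = [{key: clean(val) for key, val in row.items()}
        for row in csv.DictReader(sample)]
print(rows)
```

Treating the flags as missing before computing sums or means avoids counting -999 as a real capacity or area.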
Although WRD attributes are not directly available in GeoDAR, we suggest two possible ways for users to acquire at least some of the essential attributes. Upon the user's reasonable request and on a case-by-case basis, we may provide assistance in decrypting the association between GeoDAR IDs (Table 3) and ICOLD's International Codes, and using the International Codes, the user can link each of the dams/reservoirs in GeoDAR to the entire 40 or so proprietary attributes in WRD. This is also based on the premise that the user needs to acquire the WRD attribute data from ICOLD, and that the user agrees not to release the GeoDAR-WRD association or the WRD attributes to the public. Alternatively, since we imposed no usage restrictions on our spatial features (geometric dam points and reservoir polygons), users are free to integrate them with other datasets and tools, such as remote sensing observations and modelling, to acquire the needed attributes, particularly those not yet documented in ICOLD WRD. Acquisition methods have been exemplified for at least the following attributes: reservoir hypsometry and bathymetry (Li et al., 2020; Yigzaw et al., 2018), surface evaporation loss (Mady et al., 2020; Zhan et al., 2019; Zhao and Gao, 2019a), operation rules (Shin et al., 2019; Yassin et al., 2019), completion years (Zhang et al., 2019), storage capacities (Liu et al., 2020), and the changes in water area (Pekel et al., 2016; Yao et al., 2019; Zhao and Gao, 2019b), level (Cretaux et al., 2011; Schwatke et al., 2015), and storage or volume (Busker et al., 2019; Cretaux et al., 2016; Gao et al., 2012; Zhang et al., 2014).

Validation
Separate from the QA/QC during data production, we also performed a posterior validation to further assess the accuracy of the georeferenced ICOLD WRD records. The validation sample consists of about 1400 dam points (Fig. 8), which were selected worldwide from GeoDAR v1.0 and represent the results of our geo-matching and geocoding prior to GRanD harmonization. The collection of the validation points followed a stratified sampling method (Table 4). From the subset of GeoDAR v1.0 produced by geo-matching, we randomly selected about 40 dam points for each of the geo-matching regions (Brazil, Canada, Europe, South Africa, and United States), with the exception of Southeast Asia (Cambodia and Laos) where all 17 geo-matched WRD dams were included for validation. We allowed the sample to occasionally overlap with GRanD because dams in GeoDAR v1.0 were georeferenced independently from GRanD and those shared with GRanD reflect our georeferencing accuracy for the world's largest dams. However, for each regional sample, we limited the number of GRanD-overlapping dams to no more than 30% of the entire regional sample size if possible (Table 4). This was to comply with the size ratio between GRanD and GeoDAR v1.0 (about 1:3) so that our validation still emphasized smaller, newly georeferenced dams. We also randomly selected 40 out of the 133 large WRD dams supplemented by Wada et al. (2017), considering that they are part of GeoDAR v1.0 and the supplementation was based on attribute association similar to regional geo-matching. In total, 260 dams were selected for validating the geo-matching accuracy. For each dam, we manually checked whether its spatial coordinates in GeoDAR v1.0 are consistent with those documented in the geo-matching source (see source references in Table 2).
(left) and the number of dams in this sample that overlap with GRanD v1.3 (right), respectively. "Error source" lists error scenarios in decreasing order of frequency. "Mismatch" indicates geo-matching errors due to incorrect association between WRD and the source/reference register. "Register" indicates geo-matching errors due to inaccurate spatial coordinates in the source register (despite correct association). "Misplacement" indicates geocoding errors where the WRD attribute information disagrees with the Google Maps label. "Google Maps" indicates geocoding errors due to endogenous feature labelling mistakes in Google Maps (despite the WRD attribute information and the Google Maps label agreeing with each other). See Table 2 (column "Register/Source") for reference details.
From the remaining subset of GeoDAR v1.0 produced by geocoding, we followed the same stratified sampling scheme and selected 220 to 250 dam points for each of China, India, and Japan. Another 450 dam points were sampled from the other regions of the world (Table 4). Compared to geo-matching which was based on attribute association with georeferenced regional registers, the geocoding process was more complicated and relied largely on the geographic information repository in Google Maps and its embedded geocoding algorithms. To increase our confidence in the geocoding results, we therefore purposefully enlarged the sample size for each validation region. As described in Section 2.2, three additional georeferenced datasets from authoritative registries in China, India, and Japan were used exclusively for the purpose of geocoding validation (refer to Table 2 for register details). For the remaining regions of the world, the validation was based on a meticulous manual comparison between the WRD information of each sampled dam point and the associated Google Maps label, including the dam/reservoir name, administrative divisions, the nearest town/city, and the impounded river name if possible. When necessary, we also referred to other auxiliary information including open-source gazetteers and other literature. In total, we collected 1,152 dam points for validating the accuracy of geocoding, including all 232 Japanese dams in GeoDAR v1.0. The distribution of all sampled validation dams is shown in Fig. 8.
As reported in Table 4, our geo-matching accuracy ranges from 88% to 100% among different regions, with an overall accuracy of 97%. Causes of the identified geo-matching errors (see the last column in Table 4) were not necessarily mistakes in our attribute association between WRD and the georeferenced registers, but sometimes inaccurate spatial coordinates provided by the georeferenced registers themselves.
An example is Skutvik Dam (completion year 1991) in Norway (Fig. 8), where coordinates are documented to be 68.025° N and 15.345° E in MARS. However, inspected from high-resolution Google Maps imagery, no dam or reservoir could be conclusively verified at or near this coordinate point, except for three surrounding lakes that are all over 2 km away and labelled with other names (Vanbassenget, Lanstøvatnet, and Stenslandsvatnet). The documented coordinates for this dam are probably inaccurate.
The accuracies of our geocoded samples range from 90% for Japan to 98-99% for India and China, with an overall accuracy of 95%. As shown in Table 4, most of the errors were related to the misplacement of the dam/reservoir to another feature, typically a free-flowing river reach, which shares the name and administrative divisions with the dam/reservoir. One example is Nambiar Dam near the city of Tirunelveli in the state of Tamil Nadu, southern India (Fig. 8). The correct coordinates, according to NRLD, are 8.374° N and 77.738° E, where Google Maps labels the feature "Nambi Dam" instead of Nambiar Dam. Probably because of this spelling inconsistency, our geocoded coordinates were misplaced on a reach of the Nambi(y)ar River (8.435° N, 77.569° E, labelled as "Nambiyar") about 20 km upstream from the dam. Although our recursive geocoding procedure (Section 2.3) embedded an automated filter that examines the type of the feature at each returned point (scripts accessible through Code availability), this filter was designed to only eliminate the coordinates where feature types are clearly disparate from a dam or reservoir (such as commercial and residential buildings). Our experiments showed that dams/reservoirs and free-flowing river reaches could both be categorized as "establishment" or "natural feature", and a feature type that is more specific to dams/reservoirs was hardly seen. Thus, to avoid over-filtering, we allowed a certain ambiguity in the geocoded feature types, and then relied on manual QC to correct or remove mistaken coordinates as thoroughly as possible. The misplacement of dams to their upstream/downstream river reaches is a major cause of the relatively low geocoding accuracy in Japan. Through experimentation, we noticed that Google Maps labelling for some of the Japanese dams that are homonymous with their impounded rivers is either lacking or highly adapted to the Japanese language.
The latter further challenged our geocoding accuracy using English-based ICOLD information. For one of the errors in Japan, we verified from the JDF register that Google Maps mislabelled Myojin Dam in Hiroshima Prefecture (34.587° N, 132.505° E) as "Nabara Dam" whose correct location is 3 km downstream (34.563° N, 132.517° E; Fig. 8). As a result, our georeferenced coordinates for Nabara Dam were wrong although our geocoding process was correct. However, given what we have observed, such endogenous labelling errors in Google Maps are probably rare.
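The offsets quoted for the two geocoding error examples can be checked from the coordinates given in the text. The sketch below uses the haversine great-circle distance on a spherical Earth with a mean radius of 6371 km (an assumption that is adequate at this scale); it measures straight-line separation, not distance along the river.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * radius_km * asin(sqrt(a))


# Nambiar Dam (NRLD coordinates) vs. the misplaced geocoded point on the river.
nambiar_offset = haversine_km(8.374, 77.738, 8.435, 77.569)

# Myojin Dam (mislabelled as "Nabara Dam") vs. the correct Nabara Dam location.
nabara_offset = haversine_km(34.587, 132.505, 34.563, 132.517)

print(f"Nambiar offset: {nambiar_offset:.1f} km")
print(f"Nabara offset:  {nabara_offset:.1f} km")
```

Both results agree with the approximate 20 km and 3 km separations stated above.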
Integrating the validations for both geo-matching and geocoding, our overall georeferencing accuracy is 95.4% in terms of dam count or 99.1% in terms of total storage capacity based on the sampled 1,412 dams. While these statistics can be considered as an accuracy measure of our data product, the identified errors in the validation sample have been corrected wherever possible in our released GeoDAR v1.0 and v1.1. To reflect the accuracy of GRanD harmonization, we also randomly sampled another ~100 dams in v1.0 that were associated with GRanD in v1.1, and identified no association errors among them.
See Table 4 for detailed validation statistics.
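The count-based overall accuracy can be approximately reproduced as a sample-size-weighted mean of the two method-level accuracies. The sketch below uses the rounded percentages quoted in the text (97% for 260 geo-matched dams, 95% for 1,152 geocoded dams), so it is only a consistency check; the exact figure depends on the unrounded error counts in Table 4.

```python
# (sample size, reported accuracy) per georeferencing method, from the text.
samples = {
    "geo-matching": (260, 0.97),
    "geocoding": (1152, 0.95),
}

total_dams = sum(n for n, _ in samples.values())
correct_dams = sum(n * acc for n, acc in samples.values())
overall_accuracy = correct_dams / total_dams

print(total_dams, f"{overall_accuracy:.1%}")
```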

Comparisons with existing global datasets
To better understand the improvements and potential applications of GeoDAR, we compare it with three major global dam and reservoir datasets: the complete ICOLD WRD, GRanD (v1.3), and GOODD (V1). To recap the pros and cons of each dataset, ICOLD WRD documents over 56,000 unique dam records with a broad suite of attributes, but the provided dam records are not georeferenced. GOODD depicts the spatial details of more than 38,000 dam points and their catchments but does not include any other attribute. GRanD is georeferenced and provides multiple essential attributes, but the records are limited to 7320 large dams. Accordingly, our comparison first emphasized the aspects of dam quantity, reservoir area, and if applicable, the spatial pattern and distribution of the dams. These aspects are directly acquirable from the spatial features (i.e., dam points and reservoir polygons) in GeoDAR. Considering that each GeoDAR feature is explicitly linked to a WRD or GRanD record which contains detailed attributes, our comparison also includes two important attributes, i.e., reservoir storage capacity and catchment area, to help inform the extended capability of GeoDAR once the attributes are acquired.

Comparison with ICOLD WRD
Despite our efforts to integrate multi-source registers and the Google Maps geocoding API, georeferencing ICOLD WRD, particularly smaller dams in poorly documented regions, has proven to be challenging. This challenge was reflected by the proportion of WRD that was spatially resolved in GeoDAR. As compared in Table 5, GeoDAR v1.0 included 40% of the 56,850 records in the entire WRD. Although limited in number, these georeferenced records reflect a compromise between geocoding thoroughness and quality (see Sections 2.2 and 2.3), and account for 84% of the total reservoir storage capacity in WRD. The larger proportion in terms of storage capacity indicates that most of the sizable dams in WRD have been spatially resolved. This message is also corroborated by Fig. 9. Nearly 70% of the 12,425 WRD dams larger than 10 mcm, for example, have been georeferenced in GeoDAR v1.0 (Fig. 9a). While 80% of the 21,845 WRD dams smaller than 1 mcm were not georeferenced, these smaller dams account for less than 1% of the total WRD storage capacity (Fig. 9b). After harmonization with GRanD, the proportion of WRD georeferenced in GeoDAR v1.1 increased to 43% by count or 92% by storage capacity (Table 5), and these percentages represent our best result for georeferencing WRD. By absorbing the remaining dams in GRanD as well, v1.1 has a total dam count equivalent to 44% of WRD and a cumulative storage capacity less than 5% below that of the full WRD (Table 5; Fig. 9b). Compared to v1.0, the margin between the distribution curves of GeoDAR v1.1 and WRD, particularly for relatively large dams, was further reduced (Fig. 9a). As a result, the number of dams larger than 10 mcm in GeoDAR v1.1 exceeds 80% of that in WRD, and the number of dams larger than 1 mcm reaches 60% of that in WRD.
and "Catchment area" for GeoDAR v1.1 and "Entire WRD". When a dam has both a reservoir polygon and an area attribute, the polygon area took precedence for computing "Reservoir area".
Reservoir area statistics for GeoDAR v1.1 only include the dams whose reservoir polygons were successfully retrieved. Statistics for GRanD are based on the entire records in v1.3.
We reported in Section 3.1 that the georeferenced dams in GeoDAR v1.0 are distributed in 148 out of the 164 countries registered in ICOLD WRD, and the spatial coverage was further improved to 154 countries in v1.1. Since GeoDAR v1.1 represents an improved version of our spatial dam inventory, we compare it with WRD in terms of dam count and reservoir storage capacity for each of the registered countries worldwide (Fig. 10). Among the 164 WRD countries, the median proportion of the dam count covered by GeoDAR is 62%, with the first and third quartiles being 35% and 89%, respectively.
As shown in Fig. 10a, better coverages tend to occur in North America, Europe, Russia, Oceania, and part of South America and Africa, whereas poorer coverages are seen in East Asia, South Asia, and part of the Middle East. The coverages in China and India, for example, are only about 22-26% due to a large quantity of WRD records for these two countries (23,749 in China and 5074 in India) but relatively limited information on Google Maps. Despite lower percentages, the dam counts for China and India in GeoDAR are nearly six and four times those in GRanD, respectively (see Section 5.2 for details), suggesting that our improvements on the spatial details of dams for major emerging nations are substantial. Compared with dam counts, GeoDAR's coverage for reservoir storage capacity is higher overall (Fig. 10b). Among the 157 countries with documented reservoir storage capacities, the median coverage in GeoDAR reaches 97%, with the first and third quartiles being 86% and nearly 100%, respectively.
To assess the coverage in GeoDAR for leading dam contributors, we further highlight the top five countries by either dam count or total reservoir storage capacity. According to WRD, the top five countries by dam count are China (23,749), US (8886), India (5074), Japan (3089), and Brazil (1347). GeoDAR v1.1 covers the dam counts of these countries by 22%, 90%, 26%, 20%, and 59%, respectively. The top five countries by total reservoir storage capacity are Russia (917.8 Gigatons (Gt)), Canada (892.3 Gt), US (867.1 Gt), China (814.1 Gt), and Brazil (673.5 Gt). The coverage ranges from 88% for China, 92% for Brazil, 98% for Russia, to about 100% for the US and Canada.
Catchment areas of the reservoirs often indicate the stream order of the impounded river, and thus the scales of flow and sediment alterations by the dam.
Locating dams with an improved representation of catchment areas, particularly smaller ones, has been increasingly needed by hydrologic modelling and watershed management (Grill et al., 2019; Lin et al., 2019). To evaluate how GeoDAR spatially resolved WRD in this aspect, we directly used the values of the attribute "catchment area" provided in WRD. As many records in WRD are missing catchment areas, we combined the available values in both WRD and GRanD, and when a dam has catchment areas in both datasets, we preferred the value in GRanD. As shown in Table 5, the subset of WRD georeferenced in GeoDAR v1.1 has a total catchment area of 141 million km², which covers 94% of the total catchment area in WRD. The remaining 6% gap was largely closed by the inclusion of the remaining non-WRD dams from GRanD. It is worth mentioning that these statistics do not take into account the dams without valid catchment areas. While it is possible to retrieve catchment boundaries for GeoDAR dams (e.g., using high-resolution DEM as per Mulligan et al. (2020)), acquiring accurate catchment areas of the other WRD dams (which have not been georeferenced) is not feasible due to unknown pour point locations. Therefore, our comparison was only based on the attribute values that are already available. This explains why GeoDAR georeferenced less than half of the WRD records by count but included more than 90% of the total catchment area. Similar to the pattern of reservoir storage capacity, higher proportions of the WRD catchment area covered by GeoDAR are skewed towards the dams with larger catchment areas (Fig. 11a). For example, the number of dams with a catchment area larger than 10 km² in GeoDAR equals 89% of that in WRD, and the coverage increases to 95% for the dams with a catchment area larger than 100 km².
Although GeoDAR does not include reservoir catchment boundaries, it does provide reservoir polygons for 86% of the georeferenced dam points.
As reported in Section 3.2, the remaining 14% of the dam points without reservoir polygons, if inferred from their available attribute values, yield a reservoir area that is only 4% of the total reservoir area of all GeoDAR dams. For this reason, we focus on the retrieved reservoir polygons for comparing how GeoDAR v1.1 represents the reservoir areas in the entire ICOLD WRD. Among the 21,576 polygons, 20,778 (96%) are associated with the georeferenced WRD dams. These retrieved WRD reservoirs have a total area of 474 thousand km², accounting for 92% of the cumulative reservoir area in WRD (Table 5). After supplementation of the other 798 polygons from GRanD, the total reservoir area reached 494 thousand km², equivalent to 96% of the cumulative reservoir area in WRD. Like other attributes, the values of reservoir area are not always available in all WRD records, so our reported coverage percentages are theoretically overestimated. However, if a WRD record is missing its area attribute value but has a retrieved reservoir polygon, we used the area of the reservoir polygon as the de facto reservoir area in calculating WRD statistics, and the other WRD records still missing reservoir areas probably contribute a minuscule fraction of the aggregated area. Therefore, we consider our comparison to be overall reasonable. Keeping this limitation in mind, we showed in the distribution curves (Fig. 11b) that the number of GeoDAR reservoir polygons accounts for 68% of all WRD records that have reservoir area values (either documented or de facto), and consistent with the distributions of other attributes, higher coverages for reservoir area tend to occur for larger reservoirs. For example, GeoDAR retrieved 8111 reservoirs larger than 1 km², which account for 80% of those in WRD. The coverage increases to 92% for reservoirs larger than 10 km² although the reservoir polygon number decreases to 2543.

Improved spatial density over GRanD
While GRanD emphasized dams larger than 100 mcm (or 0.1 km³), GeoDAR aimed to georeference WRD records which, by definition, have a storage capacity of at least 3 mcm, or less if the dam is higher than 15 m (see Section 1). This reduced storage threshold entailed a substantial increase of the dam quantity in GeoDAR. As compared in Table 5, GeoDAR v1.0, which was generated independently from GRanD, already has more than triple the dam quantity in GRanD (7320) and accounts for 94% of the total reservoir storage capacity in GRanD (6881 Gt). After the harmonization with GRanD, the number of dams in GeoDAR v1.1 reaches 341% of that in GRanD, with a total reservoir storage capacity that exceeds that of GRanD by 6%. This comparison suggests that the improvement of GeoDAR is mainly manifested as increased dam quantity, rather than reservoir storage capacity.
The increased dam quantity in GeoDAR translates into a ubiquitous improvement in the spatial density of smaller dams worldwide (Fig. 12). Since GeoDAR v1.1 has absorbed GRanD v1.3, the global patterns for capacious reservoirs are similar overall between the two datasets. What differs noticeably is the proliferated density of thousands of smaller reservoirs, particularly those beyond the main focus of GRanD (such as those smaller than 100 mcm). The substantial increase of smaller dams and reservoirs is corroborated by the distribution curves in Fig. 9a, where the mode storage capacity (i.e., the capacity corresponding to the peak frequency) shifted from about 100 mcm in GRanD to about 3-5 mcm in GeoDAR (both v1.0 and v1.1). The area between the distribution curves is largely explained by the addition of ~16,600 dams smaller than 100 mcm in GeoDAR v1.1 (Fig. 9a), which correspond to a total storage increase of 126 Gt or 96% of the total storage of the dams smaller than 100 mcm in GRanD (Fig. 9b).
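As an illustration of what "mode storage capacity" means for a skewed distribution, the sketch below bins capacities on a logarithmic scale and returns the most populated bin. The binning scheme (two bins per decade), the function name `modal_log_bin`, and the sample values are assumptions for illustration only; the source does not state how the curves in Fig. 9a were binned.

```python
import math
from collections import Counter
from typing import Iterable, Tuple


def modal_log_bin(values_mcm: Iterable[float],
                  bins_per_decade: int = 2) -> Tuple[float, float]:
    """Return the (low, high) edges, in mcm, of the most populated
    log10 bin. Non-positive values are ignored. Ties are resolved in
    favour of the smaller-capacity bin."""
    counts = Counter(
        math.floor(math.log10(v) * bins_per_decade)
        for v in values_mcm if v > 0
    )
    if not counts:
        raise ValueError("no positive capacities supplied")
    index, _ = max(counts.items(), key=lambda kv: (kv[1], -kv[0]))
    low = 10 ** (index / bins_per_decade)
    high = 10 ** ((index + 1) / bins_per_decade)
    return low, high


# Made-up capacities in mcm: most values cluster at a few mcm.
capacities = [3.0, 4.0, 5.0, 4.5, 20.0, 150.0, 120.0, 900.0]
low, high = modal_log_bin(capacities)
```

With these sample values the modal bin spans roughly 3.2 to 10 mcm, so a dataset dominated by small reservoirs shows its peak frequency far below the 100 mcm focus of GRanD.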
It is important to note that the added reservoirs in GeoDAR still comply with ICOLD's definition of "large dams". Although their aggregated storage is limited, these relatively small reservoirs are geographically widespread, meaning that they are locally significant for filling service gaps between more sporadic larger dams. Examples include hundreds of smaller dams/reservoirs that provide irrigation from southern Europe (Fig. 13b) to north-western and central India (Fig. 13c), hydropower and water usage in central and southern China (Fig. 13a), and flood control across the Mississippi River Basin and southern Texas in the US (Fig. 13d). The sheer number of these added smaller dams and reservoirs accentuates the benefits of an improved knowledge of their spatial locations, such as what GeoDAR offers, for strategizing water and energy management and assessing fragmentation of river ecosystems (Belletti et al., 2020; Grill et al., 2019).
[Figure caption, partially recovered: ... America. Graduated symbols for GeoDAR (blue bubbles) are superimposed by symbols for GRanD (red bubbles).]
To assist regional applications, we further aggregated the improvements of GeoDAR over GRanD to national scales. As shown in Fig. 14, GeoDAR's improvements in either dam count or reservoir storage capacity pervade more than 120 countries, which occupy about 86% of the continental landmass (excluding Greenland and Antarctica). The increase of dam count occurs in 126 out of the 154 GeoDAR countries (Fig. 14a). These countries include 17 countries without any GRanD records (such as Haiti, United Arab Emirates, Yemen, and Bhutan), and the other 109 countries comprise 80% of the 137 countries with GRanD records. There are slightly fewer countries with a confirmed increase of reservoir storage capacity (Fig. 14b) because some of the added WRD records are missing storage capacity values. The number of these countries is 116, including 14 without any GRanD records.
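The national aggregation described above amounts to counting dams per country in each dataset and differencing the counts. A minimal sketch follows, assuming each dam record carries a country name; the function `national_gain` and the sample country lists are hypothetical and only illustrate the bookkeeping, not the authors' actual workflow.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def national_gain(geodar_countries: Iterable[str],
                  grand_countries: Iterable[str]
                  ) -> Tuple[Dict[str, int], List[str], List[str]]:
    """Return (gain per country, countries with more dams in GeoDAR,
    countries that gain dams and have no GRanD records at all)."""
    geodar = Counter(geodar_countries)
    grand = Counter(grand_countries)
    # A Counter returns 0 for missing keys, so countries absent from
    # GRanD are handled without special cases.
    gain = {country: geodar[country] - grand[country] for country in geodar}
    increased = sorted(c for c, d in gain.items() if d > 0)
    newly_covered = sorted(c for c in increased if grand[c] == 0)
    return gain, increased, newly_covered


# Illustrative input: one country name per dam record.
geodar_dams = ["Haiti", "Haiti", "India", "India", "India", "Chile"]
grand_dams = ["India", "Chile"]
gain, increased, newly_covered = national_gain(geodar_dams, grand_dams)
```

The same pattern applies to storage capacity by summing capacities per country instead of counting records, provided that records with missing capacity values are skipped.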
While GeoDAR's improvements are widespread, the improvement levels are not geographically uniform (Fig. 14). Globally, the spatial patterns of number and capacity increases are consistent overall, with the major hotspots coinciding with large or industrialized nations (e.g., US, China, Brazil, India, and European countries) and less pronounced increases in smaller, drier, and/or less developed nations (e.g., parts of Africa and South America). This is reasonable, as bigger and/or more developed nations usually possess a larger quantity of dam infrastructure and thus offer a greater potential for GeoDAR to improve. However, this pattern also reflects disparities due to several factors, such as a possible bias in WRD (as it is a volunteered dataset and not all member nations contributed equally), the accessibility of regional registers for geo-matching,
Further aggregating the national statistics to each continent (Fig. 14a) confirms that GeoDAR's major improvement lies in the quantity or spatial density of the dams, rather than their total reservoir storage capacity. However, this should not overshadow the fact that improvements of both dam count and storage capacity exist on all continents. As summarized in Fig. 14a, the continental improvement ranges from 173 more dams with a 6 km³ total capacity in Oceania, to 6000-7000 more dams with a 100-200 km³ capacity in North America or Asia. Unfortunately, because the total storage capacity is disproportionately dominated by the largest reservoirs and GRanD has already included most of them, the storage capacity added by GeoDAR relative to what already existed in GRanD appears limited, descending from 8-9% in North America, Asia, and South America, to 7% in Oceania, to only 1-2% in Europe and Africa. By contrast, GeoDAR's dam quantity ranges from almost double that of GRanD in Oceania and Africa to triple or quadruple in the other continents.
A derivative benefit of the increased dam quantity is a more complete representation of the regulated watersheds, which is critical to improving discharge estimates. As revealed by the distribution curves in Fig. 11a, GeoDAR improved on GRanD in the inclusion of reservoir catchment areas in two respects. First, GeoDAR exceeds GRanD in the number of reservoir catchments at nearly all area levels. This corresponds to a total increase of the regulated catchment area by 31,711 km² or 27% (Table 5). Second, the increase of reservoir catchments is skewed towards smaller catchments, signifying a more realistic inventory of human water regulation in basins of lower stream orders or closer to stream headwaters. As shown in the distribution curves (Fig. 11a), the average increase rises from about 30% for catchments larger than 1000 km², to 80% for catchments between 10 and 1000 km², to more than 600% for those smaller than 10 km². The mode of catchment areas decreases from about 200-400 km² in GRanD to 30-100 km² in GeoDAR, with the latter much closer to the mode of the entire WRD (15-50 km²). As a result, the number of dams with a catchment size smaller than 25 km², for example, which is the channelization threshold for the high-resolution MERIT Basins hydrography dataset (Lin et al., 2019; Yamazaki et al., 2017), is 3556 or 27% in GeoDAR in comparison to only 671 or 9% in GRanD. These small-catchment dams, once integrated into river networks, may substantially improve the performance of routing models. Consistent with our comparison with WRD (Section 5.1), these statistics are based only on the records with valid catchment areas. Considering that missing values are more likely for dams with smaller catchments, our reported improvement could be conservative.
The increased dam count in GeoDAR also enabled the retrieval of surface extents of another 14,000 or so smaller reservoirs (Fig. 7). These added reservoir polygons have a median size of 0.5 km² in comparison to 4.3 km² in GRanD. They aggregate to a total area of 19,667 km², a scale comparable to 30 Lake Meads. Although this area increase may appear substantial, it only expanded the global reservoir area in GRanD by a marginal proportion of 4%. Similar to the pattern of storage capacities, reservoir areas follow a quasi-Pareto distribution, meaning that smaller reservoirs tend to dominate the population (or number) whereas larger reservoirs dominate the area and storage. This explains why the relative increase in area is small, while the absolute increase in quantity is double the total number of reservoir polygons in GRanD. For example, 95% of the total reservoir area in GeoDAR comes from the mere 12% of reservoir polygons that are larger than 10 km², and about 90% of these large reservoirs are already included in GRanD (Fig. 11b). This pattern again suggests that the core value of GeoDAR is not to augment the global scale of reservoir area or storage, but to amplify the local details of smaller dams and reservoirs. Owing to the added details, the mode of reservoir area is on the order of 1-10 km² in GRanD but was refined by one order of magnitude to 0.1-1 km² in GeoDAR.
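The quasi-Pareto pattern described above can be checked with a short calculation: for a given size threshold, compare the share of polygons above it with the share of total area they hold. The sketch below uses made-up areas and a hypothetical helper `share_above`; it is not the authors' code and the numbers are not GeoDAR statistics.

```python
from typing import Sequence, Tuple


def share_above(areas_km2: Sequence[float],
                threshold_km2: float) -> Tuple[float, float]:
    """Return (share of polygon count, share of total area) contributed
    by polygons strictly larger than threshold_km2."""
    if not areas_km2:
        raise ValueError("no reservoir areas supplied")
    large = [a for a in areas_km2 if a > threshold_km2]
    return len(large) / len(areas_km2), sum(large) / sum(areas_km2)


# Illustrative population: many small reservoirs and two large ones.
areas = [0.5] * 8 + [40.0, 156.0]
count_share, area_share = share_above(areas, threshold_km2=10.0)
```

Here 20% of the polygons hold 98% of the area, so adding many more small polygons would raise the count sharply while barely changing the total area, which is the effect the text attributes to GeoDAR relative to GRanD.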
If we group the global dams by their documented main purpose, we observe in Fig. 15 that GeoDAR consistently improved on GRanD in both dam count and storage capacity for all main purposes. For the same reason as explained above (i.e., the added reservoirs are small), the increases of dam count appear more prominent than those of storage capacity, and the increases of storage capacity from GRanD to GeoDAR are overall more evident than those from GeoDAR to ICOLD WRD. The exception is the dams with "others" or "unknown" purposes, whose total storage capacity in GeoDAR is lower. This is because when GRanD and WRD records conflict with each other in the GeoDAR harmonization process, the attribute values in GRanD took precedence only if they are available and valid ("others" or "unknown" was considered an invalid reservoir purpose). Assuming that reservoir operations vary by purpose, this consistent improvement of the spatial inventory for all reservoir purposes, in conjunction with satellite-observed water budget variations, can help us better generalize reservoir operation rules, which are critical to improving water management.
For a dam with multiple purposes, its "main purpose" was considered as the one with the highest order of priority. The main purpose in GRanD took precedence if it differs from that in WRD.
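The precedence rule for the main purpose can be written as a small function. This is a sketch under stated assumptions: the purpose labels are free-text strings, validity is tested case-insensitively, and the behaviour when neither source has a valid purpose (returning "unknown") is our assumption, since the source does not specify it. The name `harmonized_purpose` is hypothetical.

```python
from typing import Optional

# Labels treated as invalid purposes, following the rule stated in the text.
INVALID_PURPOSES = {"", "others", "unknown"}


def is_valid(purpose: Optional[str]) -> bool:
    """A purpose is valid if it is present and not 'others'/'unknown'."""
    return purpose is not None and purpose.strip().lower() not in INVALID_PURPOSES


def harmonized_purpose(grand: Optional[str], wrd: Optional[str]) -> str:
    """GRanD takes precedence when its value is available and valid;
    otherwise the WRD value is used if valid."""
    if is_valid(grand):
        return grand
    if is_valid(wrd):
        return wrd
    # Assumption: the source does not say what happens in this case.
    return "unknown"
```

For example, `harmonized_purpose("unknown", "Flood control")` falls back to the WRD value, which is consistent with the text's explanation of why dams move out of the "others" or "unknown" group during harmonization.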

Spatially complementary to GOODD
The recently published GOODD (V1) dataset (Mulligan et al., 2020) includes 38,667 dam points worldwide, which were digitized by scanning through Google Earth imagery with support from regional inventories and the Shuttle Radar Topography Mission Water Body Dataset (SWBD, 2005). Despite lacking essential attributes, GOODD is thus far the most comprehensive global inventory of dam locations and catchments. The digitization was performed from 2007 to 2011 and was later updated in 2016. This means that reservoirs postdating 2016 were not yet included in the dataset. The completeness and accuracy of GOODD also depend on the sizes of the dams or reservoirs. According to Mulligan et al. (2020), the resolution and quality of available Google Earth imagery during the digitization period were low in some parts of the world (such as China), and an experiment in the US showed that detecting dams and reservoirs from low-resolution imagery (e.g., Landsat Geocover 2000) may require a reservoir length greater than 500 m and a dam width greater than 150 m. These minimum size criteria do not necessarily overlap with those of ICOLD WRD, which instead emphasize the reservoir storage capacity and dam height (see Section 1).
Because of these digitizing limitations and the difference in criteria, the dam points in GeoDAR are spatially complementary to, rather than always duplicated by, those in GOODD across many regions. Figure ...
To approximate how GeoDAR and GOODD complement each other globally, we intersected both dam datasets with the 30-m-resolution UCLA Circa-2015 Lake Inventory. We noticed that some of the points in GOODD, particularly in regions like China, India, and Brazil, exhibit substantial geographic offsets from the dams or reservoirs observed in the Google Earth imagery.
For a pilot experiment, we applied a 1-km tolerance when intersecting the UCLA lake inventory with GOODD, and kept the 500-m tolerance used in Section 2.5 for intersecting the lake inventory with GeoDAR. The result shows that among the 55,000 or so water bodies that intersect either dataset, 80% intersect GOODD and the other 20% intersect GeoDAR alone. These statistics imply that GeoDAR may be able to expand the number of dams in GOODD by about 25% (i.e., 20% divided by 80%). Since we applied a larger tolerance for GOODD, this estimated expansion by GeoDAR is likely conservative (considering that the number of GOODD-intersecting reservoirs may be overestimated). If a 500-m tolerance is used for both intersections, the expansion by GeoDAR increases to roughly 45%. In addition to the expanded spatial coverage, GeoDAR indexed each georeferenced dam point to a WRD and/or GRanD record and thus enabled access to multiple attributes, whereas GOODD carries no attribute information except the delineated reservoir catchments. These regional and global comparisons suggest that, even with the geometric dam points alone, GeoDAR is not a simple replication of GOODD, but instead complements GOODD for an improved spatial coverage and density of global dams.
... Table 3 as well as on the repository website. The data usage information is described in Section 3.3. Other citation courtesy and disclaimer information is given in the Disclaimer section and on the repository website. All released datasets and information are available under the Creative Commons Attribution 4.0 International (CC-BY 4.0) license (https://creativecommons.org/licenses/by/4.0). Users who would like to link GeoDAR records to the proprietary WRD attributes they have purchased in advance from ICOLD should contact the corresponding author.

Summary and applications
We have produced a comprehensive and spatially resolved dam and reservoir dataset, GeoDAR, which complements and improves on the existing global inventories of large dams. We demonstrated that the production of GeoDAR is not a direct compilation or collation of existing dam datasets. Instead, it involved the first known effort to georeference ICOLD WRD. This was jointly enabled by geo-matching (or table-associating) multi-source regional registers and geocoding descriptive attributes through the Google Maps API. This georeferencing effort resulted in GeoDAR v1.0, which contains 22,743 spatially resolved dam points, each associated with a WRD record, with an overall accuracy of 95%. Each of the georeferenced records was also labelled with a QA score, giving users a reference for the quality of individual dam locations. Our georeferencing process and accuracy validation, which we have described in detail, have methodological value for future expansions of spatial dam inventories using similar approaches, such as Geo-Wiki and OpenStreetMap.
To further ensure the optimal inclusion of the world's largest dams, we harmonized the georeferenced WRD (or GeoDAR v1.0) carefully with GRanD v1.3. Using the harmonized dam points as spatial identifiers, most of their reservoir boundaries were then retrieved from high-resolution water body datasets. This ICOLD-GRanD harmonization and the subsequent reservoir retrieval resulted in GeoDAR v1.1, our end product, which holds 24,978 dam points (including 24,168 linked to WRD) and 21,576 reservoir polygons. This product spatially resolved 42% of the entire ICOLD WRD by dam count and more than 90% by reservoir storage capacity. Since most of the world's largest reservoirs (e.g., >0.1 km³) are already included in GRanD, GeoDAR adds limited improvements (by 4-27%) to the total reservoir area, storage capacity, and catchment area. However, by including many smaller dams, particularly in lower and middle latitudes, GeoDAR is triple the size of GRanD in terms of dam and reservoir quantity. For this reason, one of the major improvements of GeoDAR is its unparalleled ability to capture relatively small dams, or in other words, to enhance the spatial detail of global dam and reservoir distributions.
Besides an improved quantity and spatial detail, another unique value of GeoDAR is its capability of bridging the locations of dams to a broad suite of attributes that are essential to scientific applications. A standing dilemma of existing global dam datasets is the divergence between the focus on dam quantity or spatial detail and the provision of detailed attributes for a limited dam quantity. This dilemma was partially ameliorated by GeoDAR because its georeferenced dams and reservoirs were explicitly indexed to WRD and/or GRanD records where many attributes are available. Since the original WRD is not georeferenced, our perception was that the task of georeferencing WRD to enable a spatially explicit application of the attribute information, even at regional scales, may fall on individual users. To avoid the duplication of efforts and to facilitate scientific applications, we performed this comprehensive georeferencing on the entirety of ICOLD WRD as thoroughly as possible, and hereby released the resultant dam coordinates and reservoir polygons to the public as part of GeoDAR. We would like to reiterate the disclaimer that GeoDAR does not directly contain, and neither do we intend to release, the original WRD attribute data, which are proprietary to ICOLD. In other words, the association between GeoDAR IDs and WRD IDs exists but was purposefully encrypted. However, if individual users need GeoDAR records to be linked to the WRD attributes that they have already purchased from ICOLD, they can contact us, and on a case-by-case basis we may provide this assistance, given that the users agree not to release the decryption key or the proprietary WRD attributes.
We envision that GeoDAR, with its enhanced spatial density and extended accessibility to essential attributes, will benefit a wide spectrum of disciplines and applications. It is worth noting that although most dams in GeoDAR are smaller than those in GRanD or AQUASTAT, they are still compliant with ICOLD's size criteria, which exclude countless tiny on-farm reservoirs and water storage tanks. Nevertheless, we have suggested from regional examples that GeoDAR partially complements some of the most extensive global dam inventories such as GOODD, despite GOODD containing a larger number of dams. In this sense, even with the 25,000 or so geometric dam points alone, GeoDAR contributes yet another fundamental extension to global water infrastructure databases. If these dam points are rectified to high-resolution hydrographic networks (such as MERIT Hydro (Lin et al., 2021; Yamazaki et al., 2019)), GeoDAR, together with other existing dam and barrier datasets, can help refine our understanding of how human water infrastructure fragmented global rivers and their ecosystems (Belletti et al., 2020; Grill et al., 2019; Kornei, 2020), especially with a more exhaustive inclusion of smaller and/or headwater catchments.
Alongside the detailed dam points, GeoDAR's reservoir boundaries provide thus far the most comprehensive global base maps for assessing reservoir dynamics and the impacts of human water regulation. In combination with the expanding constellation of satellite sensors (e.g., ICESat-2, Sentinel-6, and the forthcoming SWOT), this high-resolution base map will, for instance, enable a more complete and accurate monitoring of water storage variation and surface evaporation in global reservoirs (Biancamaria et al., 2016; Chen et al., 2021; Cretaux et al., 2016; Zhao and Gao, 2019a). Tracking the spatiotemporal balance between reservoir water storage and evaporative loss will help strategize regional water management under a warming climate (Cretaux et al., 2015). Since our knowledge and understanding improve as observations increase, the observed water storage dynamics for an increased quantity of reservoirs will inevitably entail a more realistic generalization of the reservoir operation rules. This is particularly true if attribute information such as reservoir purpose and storage capacity is also utilized. Considering that small but widespread reservoirs have a strong cumulative impact on discharge (Habets et al., 2018; Lin et al., 2019), the improved operation rules and the fine details of reservoir storage changes will benefit discharge estimations from hydrological models. From another perspective, GeoDAR's reservoir polygons can also help refine surface water typology, either by directly using them to mask artificial impoundments from natural lakes, or by expanding the training pool to enhance machine learning algorithms so that additional reservoirs can be detected (Fang et al., 2019). A refined water typology map will, in turn, assist other analysis tools in improving our assessments of how human footprints alter surface hydrology and its related biodiversity and ecosystem health.

Code availability
Python scripts for geo-matching, geocoding, and reservoir assignment are publicly available at https://github.com/jidawang/georeferencing-ICOLD-dams-and-reservoirs. We request users who adapt or use the scripts to cite Wang et al. (2021).

Competing interests
The authors declare no conflict of interest.