GeoDAR: Georeferenced global dams and reservoirs dataset for bridging attributes and geolocations

Dams and reservoirs are among the most widespread human-made infrastructure on Earth. Despite their societal and environmental significance, spatial inventories of dams and reservoirs, even for the large ones, are insufficient. A dilemma of the existing georeferenced dam datasets is the polarized focus on either dam quantity and spatial coverage (e.g., GOODD) or detailed attributes for a limited dam quantity or region (e.g., GRanD and national inventories). One of the most comprehensive datasets, the World Register of Dams (WRD) maintained by the International Commission on Large Dams (ICOLD), documents nearly 60,000 dams with an extensive suite of attributes. Unfortunately, the WRD records provide no geographic coordinates, limiting the benefits of their attributes for spatially explicit applications. To bridge the gap between attribute accessibility and spatial explicitness, we introduce the Georeferenced global Dams And Reservoirs (GeoDAR) tolerance used for both intersections, the expansion by will to roughly 45%. In to the spatial indexed each georeferenced dam point to and/or GRanD record thus access to multiple attributes, whereas GOODD carries no attribute information except the delineated reservoir catchments. These regional and global comparisons suggest that, even just with the geometric dam points, GeoDAR is not a simple replication of GOODD, but instead complements GOODD for an improved spatial coverage and density of global dams.

Despite such environmental and societal significance, our spatial inventory of global dams and reservoirs, even for the large ones (such as those with a surface area >1 km²), has been insufficient. We still lack a thorough and authoritative dataset that documents both geographic coordinates (latitude and longitude) and standard attributes (e.g., purpose, reservoir storage capacity, and hydropower capacity) of the existing large dams. One of the most comprehensive datasets, the World Register of Dams (WRD), is regularly updated by the International Commission on Large Dams (ICOLD; https://www.icoldcigb.org), a non-governmental organization dedicated to the global sharing of professional dam/reservoir information. The recent version of ICOLD WRD documents nearly 60,000 "large" dams, defined as those with a wall higher than 15 m or between 5 and 15 m but with a reservoir storage greater than 3 million m³ (mcm). These WRD records are considered to be "complete" to the extent of contributions from willing nations and water authorities.
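The "large dam" criterion quoted above can be written as a short predicate. The following is only an illustrative sketch: the function name and argument names are hypothetical, and the handling of a missing storage value is an assumption rather than an ICOLD rule.

```python
from typing import Optional


def is_large_dam(height_m: float, storage_mcm: Optional[float] = None) -> bool:
    """Return True if a dam meets the 'large dam' definition quoted in the text.

    height_m: dam wall height in metres.
    storage_mcm: reservoir storage in million cubic metres (mcm), if known.
    """
    # Any dam with a wall higher than 15 m qualifies regardless of storage.
    if height_m > 15:
        return True
    # Dams between 5 and 15 m qualify only with storage above 3 mcm.
    # Assumption: an unknown storage value is treated as not qualifying.
    if 5 <= height_m <= 15 and storage_mcm is not None:
        return storage_mcm > 3
    return False
```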
While ICOLD WRD provides more than 40 attributes (e.g., reservoir storage capacity, dam height, and reservoir purpose), the dam locations are, unfortunately, either not georeferenced or not publicly available. Despite the availability of many essential attributes, the missing geographic coordinates have severely limited the applications of WRD, including hydrological modelling and hydropower planning (Yassin et al., 2019), which require the dam records to be spatially explicit. This dilemma may be partially resolved by using georeferenced regional registers such as the United States National Inventory of Dams (US NID; https://nid.sec.usace.army.mil). Nevertheless, such regional registers are not always publicly available, especially in developing nations where dam construction is still booming (Zarfl et al., 2015).
Other global dam and reservoir datasets that are georeferenced, however, often lack essential attributes. An example is the recently published GlObal geOreferenced Database of Dams (GOODD V1) (Mulligan et al., 2020), which contains 38,667 dam points digitized from Google Earth imagery and their associated catchments delineated from digital elevation models (DEMs). Despite this dam quantity, GOODD provides no other attribute information. Another inventory, the Global River Obstruction Database (GROD) (Whittemore et al., 2020; Yang et al., 2022), located more than 30,500 flow obstructions along rivers wider than 30 m as mapped in the Global River Width from Landsat (GRWL) database (Allen and Pavelsky,
For proprietary reasons, neither GeoDAR version releases any WRD attributes. Instead, we offer an option for users who need to acquire the attributes: upon individual request, we may assist a user who has purchased WRD (https://www.icoldcigb.org/GB/world_register/world_register_of_dams.asp) in associating the GeoDAR ID with the ICOLD "international code", through which WRD attributes can be linked to each GeoDAR feature (see Sections 3.3 and 7 for more details). Even without the proprietary WRD attributes, GeoDAR offers one of the most extensive and spatially resolved global inventories of dams and reservoirs, which may benefit a variety of applications in hydrology, hydropower planning, and ecology.

Definitions and overview
We aim to georeference (i.e., acquire the latitude and longitude of) each dam listed in ICOLD WRD, by using the nominal location (e.g., a descriptive address for a dam or reservoir) available in the WRD attributes. Examples of the attributes that are important for georeferencing include the names of the dam and reservoir, the administrative divisions the dam is affiliated with, and the name of the impounded river. Using such attribute information, spatial coordinates of a dam may be either a) queried from an existing register or inventory where dam records were already georeferenced and verified, or b) estimated through a geocoding service that can convert nominal locations to numeric spatial coordinates. Our preference was the former when possible to optimize the georeferencing accuracy.
The schematic procedure of GeoDAR production is illustrated in Fig. 1. We started by removing duplicate records from the 59,071 dams listed in the original ICOLD WRD (accessed in March 2019). Here "duplicates" are defined as the dams that are either a) repeatedly recorded with identical (or highly similar) attribute information or b) different dam structures but associated with the same reservoir. Examples of the second scenario include a reservoir's primary and secondary/saddle dams such as the Boonton Dam and its associated Parsippany Dike (40.884° N, 74.408° W) in New Jersey and multiple controls for one reservoir such as Veersedam and Zandkreekdam for Veerse Meer (51.549° N, 3.678° E) in the Netherlands.
Although "duplicates" in this scenario refer to different dam bodies, including them could lead to double or multiple counting of the storage capacity of the same reservoir, and similar to the production of GRanD, our goal was to link one reservoir to one dam (if possible). After removing the identified duplicates, the cleaned WRD contains 56,815 unique dams/reservoirs. These dams/reservoirs have an accumulative storage capacity of 7,328 km³ based on the original WRD attribute values (which occasionally are missing or have unit errors) or 7,720 km³ after replacement/correction by Wada et al. (2017) and GRanD (see Section 2.4). Unless otherwise described, the ICOLD WRD mentioned in the following text refers to the version after duplicate removal. We acknowledge that owing to the challenges of lacking explicit spatial information and occasional attribute errors in WRD, our duplicate removal is not perfect and may have misidentified or missed some duplicate dams.
We then compared the unique ICOLD WRD records against a collection of georeferenced dam registers we acquired from regional water authorities and agencies. When the attribute information of a WRD dam matched that in a regional register, the spatial coordinates from the latter were "borrowed" by the WRD record. We term this process "geo-matching", which resulted in the georeferencing of 13,190 WRD dams. For the remaining dams in WRD, we applied the alternative approach "geocoding", which transforms a nominal location (such as the dam or reservoir address formulated by ICOLD attribute information) to a pair of spatial coordinates. The tool we used to implement geocoding was the Google Maps geocoding API (http://developers.google.com/maps). The geocoding process successfully retrieved the spatial coordinates of another 9,338 WRD dams. The combined output from both geo-matching and geocoding was next collated with the spatial coordinates and reservoir storage capacities of 133 WRD dams larger than 10 km³ as documented in Wada et al. (2017). These processes resulted in GeoDAR v1.0, a total of 22,560 georeferenced WRD dam points with an accumulative storage capacity of 6,441 km³ (accounting for more than 80% of that in ICOLD WRD). The Venn diagram in Fig. 2a provides an overview of the logical relations among the georeferencing sources and methods for GeoDAR v1.0. To further improve our spatial inventory of the world's largest dams, we performed a harmonization between the dam points in GeoDAR v1.0 and GRanD v1.3. The harmonization aimed at merging both datasets, removing duplicates (overlap) between them, and when possible, associating new dams supplemented by GRanD with the corresponding WRD records.
This process identified another 2,223 dam points, including 1,414 associated with WRD but not georeferenced in GeoDAR v1.0. With removal of duplicates, this harmonization led to a total number of 24,783 georeferenced dam points, with an accumulative storage capacity of 7,384 km³. An overview of this harmonization process is illustrated by the Venn diagram in Fig. 2b. Finally, the reservoir polygons for each of the georeferenced dams were retrieved as thoroughly as possible from three global water body datasets: GRanD v1.3 reservoirs (Lehner et al., 2011), HydroLAKES v1.0 (Messager et al., 2016), and the Landsat-based UCLA Circa 2015 Lake Inventory (Sheng et al., 2016). These nearly 25,000 dam points and their associated reservoir polygons constitute GeoDAR v1.1. Details of production processes and their Quality Assurance and Quality Control (QA/QC) are included in the following method sections.
point to the georeferencing sources or methods. Topology of the shapes illustrates logical relations among the data/methods, but sizes of the shapes were not drawn to scale of the data volume.

Geo-matching regional registers
The ICOLD WRD was a joint contribution from more than 100 member nations, some of which also release detailed and publicly accessible dam registers that have been georeferenced. These regional/local registers, with reliable spatial coordinates already provided for each dam, were our preferred sources for georeferencing WRD. Since this type of register is not available for most countries, we searched multiple water authority and project websites, and collected seven georeferenced regional registers or inventories that are open access. Their names, sources, and numbers of documented dams are summarized in Table 2.
(NRLD) of India, and Japan Dam Foundation (JDF). Regional inventories were collected with partial reference to the Global Dam Watch project (http://globaldamwatch.org). Dam numbers for regional registers are based on the records with valid geographic coordinates, and numbers for ICOLD WRD are based on the records after duplicate removal. See full registers, references, and download links in the reference list.
These seven registers/inventories cover Brazil, Canada, the United States, 31 European countries (including part of Russia), South Africa, and part of Southeast Asia (Cambodia and Myanmar), with a total dam count of more than 126,000. Besides spatial coordinates, each of these registers also provides attributes for their documented dams, which were required by the geo-matching process. While other dam inventories could be available, our geo-matching effort for GeoDAR v1.0 was focused on these collected ones. However, we referred to additional registers from China, India, and Japan (Table 2) for the validation of our WRD geocoding (see Validation). For these additional regional registers, it was either inconvenient to bulk-download the dam records, or we were legally restricted from releasing their dam coordinates. Therefore, we only used these registers for the purpose of validation.
The procedure of geo-matching is illustrated in Fig. 3. Given each regional register, our goal was to find its matching records from the subset of ICOLD WRD for the same region, by cross-checking value similarities for several key attributes between the two datasets. On one hand, the compared attributes must be mutually available in both datasets. On the other hand, the attributes should cover various themes so that in combination, they are able to disambiguate records that represent different dams but may coincide in certain attributes. Taking both requirements into account, the key attributes used include the dam and reservoir names, multiple levels of administrative/political divisions for the dam, and the dam's completion year. The river on which the dam was constructed was also considered for all regions except Cambodia, as the register does not contain such an attribute. For each of the key attributes, we considered values in WRD and the regional register agreeing with each other if the similarity score between the value sequences exceeded about 85% (meaning that there are more than 8 pairs of identical elements, with consideration of their orders, between two 10-character sequences). This similarity threshold tolerated minor variations in spelling that often occur among different data sources. If an agreement was not reached between the two full sequences (e.g., "Maharashtra Pradesh" and "Maharashtra"), the similarity was then tested between the main subsets of the sequences in order to increase the matching success.
Figure 3. Schematic procedure of geo-matching regional registers. Text in roman indicates applied or produced datasets, and text in italics indicates methods or procedures.
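The attribute comparison described above can be sketched with Python's standard `difflib.SequenceMatcher`, whose ratio for two 10-character strings exceeds 0.85 only when at least 9 characters match in order. This is a minimal sketch and not necessarily the scoring used in the released scripts; in particular, the "main subset" fallback is interpreted here, as an assumption, as a sliding window of whole words, and the function names are hypothetical.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.85  # "about 85%" in the text


def similarity(a: str, b: str) -> float:
    """Order-aware similarity in [0, 1] between two strings."""
    return SequenceMatcher(None, a, b).ratio()


def _normalize(value: str) -> str:
    # Lower-case and collapse whitespace so spelling variants compare fairly.
    return " ".join(value.lower().split())


def values_agree(a: str, b: str, threshold: float = THRESHOLD) -> bool:
    """True if two attribute values agree in full or on a word-level subset."""
    a, b = _normalize(a or ""), _normalize(b or "")
    if not a or not b:
        return False  # missing values cannot agree
    if similarity(a, b) > threshold:
        return True
    # Assumed fallback: compare the shorter value against every run of the
    # same number of words inside the longer value.
    short, long_ = sorted((a.split(), b.split()), key=len)
    target = " ".join(short)
    for start in range(len(long_) - len(short) + 1):
        window = " ".join(long_[start:start + len(short)])
        if similarity(target, window) > threshold:
            return True
    return False
```

With this sketch, "Maharashtra Pradesh" and "Maharashtra" fail the full-string test but agree through the word-level fallback.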
One of the geo-matching challenges was that the levels of political/administrative divisions are not always comparable or consistent between WRD and the regional registers. In WRD, the divisions were provided at the levels of country, state/province, and the nearest town/city, which are inconsistent with some of the registers. For example, the register for Brazil (Dams Safety Report in 2017) provides the finest division at the county level, whereas the European inventory (from the MARS (Managing Aquatic ecosystems and water Resources under multiple Stress) project) documents no divisions below the national level. To improve the feasibility of division comparison, we performed a "reverse geocoding" for each georeferenced regional register using the Google Maps geocoding API. Opposite to regular (or "forward") geocoding, which converts a nominal location to numeric spatial coordinates, this reverse geocoding converted the spatial coordinates of each dam documented in the register to a parsed address that contains administrative divisions at consecutive levels. These multilevel divisions and subdivisions were appended to the original regional registers (Fig. 3), enabling a more flexible and complete comparison with the WRD attributes and thus an increased success rate of geo-matching.
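To illustrate how a reverse-geocoded address can be split into division levels, the sketch below parses one result in the JSON shape returned by the Google Maps Geocoding API (`address_components` entries carrying `long_name`, `short_name`, and `types`). The choice of levels, the function name, and the hand-written sample record are assumptions for illustration only; no API call is made.

```python
# Division levels of interest, from coarse to fine (an assumed selection).
LEVELS = (
    "country",
    "administrative_area_level_1",
    "administrative_area_level_2",
    "locality",
)


def parse_divisions(result: dict) -> dict:
    """Map each division level to its long name for one geocoding result."""
    divisions = {}
    for component in result.get("address_components", []):
        for level in LEVELS:
            if level in component.get("types", []):
                # Keep the first component found for each level.
                divisions.setdefault(level, component["long_name"])
    return divisions


# Hand-written sample in the API's response shape (not real API output).
sample = {
    "address_components": [
        {"long_name": "Boonton", "short_name": "Boonton",
         "types": ["locality", "political"]},
        {"long_name": "Morris County", "short_name": "Morris County",
         "types": ["administrative_area_level_2", "political"]},
        {"long_name": "New Jersey", "short_name": "NJ",
         "types": ["administrative_area_level_1", "political"]},
        {"long_name": "United States", "short_name": "US",
         "types": ["country", "political"]},
    ]
}
divisions = parse_divisions(sample)
```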
We considered a WRD record matched with a regional record if their agreements on the key attributes warranted a reasonable confidence that the two are the same dam. In principle, a high confidence would require a unanimous agreement on all key attributes. However, this ideal scenario was often unnecessary and sometimes impossible. One of the reasons is that the key attributes do not always have valid values. In WRD, for instance, the values of "nearest town" for nearly all (>99%) US dams are null. While this attribute is valid for most other dams, the nearest town/city in WRD is not necessarily the division that administrates or contains the dam as is the case in the township in some regional registers. Another reason is that our collected multi-source datasets were not collated by a universal standard. As a result, inherent discrepancies of the attribute definitions and/or values may exist among the datasets. One example is the dam's "completion year", which could be ambiguous between the year when the dam construction was concluded and the year when the dam operation was initiated or commissioned. These two definitions do not necessarily lead to the same year. To address such inconsistencies, we defined a baseline scenario that required any pair of matched WRD and regional records to agree on the following:
• Dam or reservoir name,
• Country, state/province if values are valid, and
• At a minimum, (a) either completion year or river if the town/city values disagree or are invalid, or (b) town/city when completion years and rivers do not both disagree.
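The baseline scenario can be read as a small decision rule. The sketch below encodes one possible reading, using `True` for agreement, `False` for disagreement, and `None` for an invalid or missing value. Two points are assumptions rather than statements from the text: the country must agree outright, while the state/province only must not disagree when it is valid. The function name is hypothetical.

```python
from typing import Optional

Agreement = Optional[bool]  # True = agree, False = disagree, None = invalid


def meets_baseline(name: Agreement, country: Agreement, state: Agreement,
                   town: Agreement, year: Agreement, river: Agreement) -> bool:
    """Check one WRD/regional record pair against the baseline scenario."""
    # Dam or reservoir name must agree.
    if name is not True:
        return False
    # Assumption: country must agree; state/province is checked only if valid.
    if country is not True or state is False:
        return False
    if town is True:
        # (b) town/city agrees, unless completion year AND river both disagree.
        return not (year is False and river is False)
    # (a) town/city disagrees or is invalid: need completion year or river.
    return year is True or river is True
```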
In compliance with this baseline, we implemented an automated QA to filter out any matching errors and optimize the matching accuracy for each WRD record. In brief, any match that did not meet the baseline scenario was removed, and the remaining geo-matched pairs were ranked into three discrete QA levels (M1, M2, and M3) according to the quality of attribute agreements (see definitions in Supplementary Table S1). As the QA rank increases (from M3 to M1), agreements on the key attributes improve from the baseline to the ideal scenario (i.e., a unanimous agreement). If a WRD record was matched to multiple records in the regional register, the QA selected the match with the best rank. This way, each georeferenced WRD record was only matched to the best-ranking regional record. Users may refer to the provided QA ranks as a measure of the general reliability of each geo-matched location. It is worth noting that our geo-matching purpose was to acquire the spatial coordinates of any matched WRD record from the regional register, rather than collating or correcting any existing attribute values. In other words, some of the WRD and regional records may actually refer to the same dams but were matched unsuccessfully due to major discrepancies between their attribute values. This led to a conservative success rate in our automated geo-matching. More technical details about QA are given in our Python scripts at https://github.com/surf-hydro/georeferencing-ICOLD-dams-and-reservoirs.
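Selecting the best-ranked regional record for a WRD record could look like the following sketch. The rank definitions themselves are in Supplementary Table S1 and are not reproduced here; the function name, the tuple layout, and the tie-breaking behaviour (first candidate wins) are assumptions.

```python
from typing import List, Optional, Tuple

# Lower number = better rank (M1 is closest to unanimous agreement).
RANK_ORDER = {"M1": 0, "M2": 1, "M3": 2}


def best_match(candidates: List[Tuple[str, str]]) -> Optional[Tuple[str, str]]:
    """Pick the best-ranked (regional_id, qa_rank) pair for one WRD record.

    Candidates are assumed to have passed the baseline scenario already;
    pairs with an unknown rank label are ignored.
    """
    ranked = [c for c in candidates if c[1] in RANK_ORDER]
    if not ranked:
        return None
    # min() keeps the first candidate among equally ranked ones.
    return min(ranked, key=lambda c: RANK_ORDER[c[1]])
```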
Following the automated QA, we performed a manual QC to further assure the accuracy of the geo-matching results. We went through each geo-matched WRD record to examine whether its attributes (e.g., dam/reservoir name, administrative locations, river name, construction year, and storage capacity) indeed agreed with those of the regional source. If an evident discrepancy was identified, the "match" was removed or corrected in the final product. Although we made every endeavour to be as rigorous as possible, remnant matching errors are still possible due to the challenges of incompleteness and intrinsic errors in the attribute information (refer to Section 4 for accuracies). In occasional cases where a dam was matched correctly to the register attributes but misplaced due to poorer quality of the spatial coordinates in the register, we tried to adjust or, if possible, correct the register's spatial coordinates using the best possible resources (such as Google Maps and other open-source documents). If we were unable to observe any water infrastructure at the location of a correct match, we took a conservative action and removed the match. We admit that this might mistakenly delete some of the structures (e.g., small run-of-the-river hydropower stations, weirs, and diversions) that are too small to be visible from Google Maps imagery. Our manual QC identified ~4% error in the geo-matched WRD records, most of which came from QA rank M3. After removing these errors, the geo-matching process concluded with a total of 13,190 WRD records georeferenced (Fig. 3), including 3,238, 6,987, and 2,965 for QA ranks M1, M2, and M3, respectively (Supplementary Table S1). The success rate, i.e., the number of geo-matched dams as a percentage of the number of WRD records, varies from about 40% in Southeast Asia to about 80% in South Africa and the US (Table 2), with an overall success rate of 71% in all geo-matched regions (Fig. 3).

Geocoding via Google Maps
The subset of ICOLD WRD that was not geo-matched includes the remaining 5,481 (29%) dams in the geo-matched regions and all 38,144 dams in the other regions of the world (Fig. 2a). For these dams, we applied the Google Maps geocoding API, a sophisticated cloud-based geocoding service, to retrieve the spatial coordinates of each dam as thoroughly and accurately as possible. To do so, we designed a recursive geocoding procedure that implemented three primary steps on each dam: forward geocoding, reverse geocoding, and QA filtering. The purpose of each of the steps and their logical relations are illustrated in Fig. 4. The forward geocoding (see Section 2.1 for definition) used the text address of each dam as the input, which we formatted by concatenating the WRD attribute values, to output the latitude and longitude of the dam. The WRD attributes used for address formatting include dam name, reservoir name, state/province, and country. "Nearest town" was excluded because it is not always the township administrating the dam or reservoir. Together with the spatial coordinates, the forward geocoding also output a Google Maps address associated with the coordinates, which was parsed to individual components including feature name, street name, and political divisions. These output address components, in return, provided valuable information for QA: if the geocoded coordinates are correct, the associated output address components should agree well with those of the WRD input. However, we noticed that address components from forward geocoding are often limited in terms of division levels. To overcome this limitation, we utilized reverse geocoding (see Section 2.2 for definition) to convert the coordinates from forward geocoding to an updated address with more complete division levels. The address components from both forward and reverse geocoding were combined and hereafter referred to as the "output address".
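Formatting a text address from WRD attributes amounts to concatenating the non-empty values in a chosen order. The sketch below shows one such arrangement; the field names, their order, and the comma separator are hypothetical (the arrangements actually used are listed in Supplementary Table S2), and "nearest town" is deliberately left out as described above.

```python
from typing import Mapping, Optional, Sequence

# Hypothetical field names for the four attributes named in the text.
DEFAULT_FIELDS = ("dam_name", "reservoir_name", "state_province", "country")


def format_address(record: Mapping[str, Optional[str]],
                   fields: Sequence[str] = DEFAULT_FIELDS) -> str:
    """Join the non-empty attribute values of a record into one address."""
    parts = []
    for field in fields:
        value = (record.get(field) or "").strip()
        # Skip missing values and exact repeats (e.g., dam == reservoir name).
        if value and value not in parts:
            parts.append(value)
    return ", ".join(parts)


record = {
    "dam_name": "Boonton Dam",
    "reservoir_name": None,
    "state_province": "New Jersey",
    "country": "United States",
    "nearest_town": "Boonton",  # not used for forward geocoding
}
address = format_address(record)
```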
Similar to geo-matching, we employed a QA filter to approach the optimal geocoding result. This process first arranged the attributes of each WRD record into several address formats, as they could result in different geocoding outputs. The address arrangements are listed in Supplementary Table S2, and their preference order is rationalized in Supplementary Text. Each of these WRD addresses was used iteratively for both forward and reverse geocoding (as described above). Their geocoded spatial coordinates were then ranked into five discrete QA levels based on how well the input and output addresses agree with each other (C1 to C5, Supplementary Table S3). The iteration was terminated if the highest QA rank was achieved; otherwise, the coordinates that rendered the best possible QA rank were used as the geocoding result.
As explained in Supplementary Table S3, the compared address components include the name of the feature and its affiliated political divisions from town/city to country levels. Consistent with geo-matching, we considered that a component was agreed on if the similarity of its values from both input and output addresses exceeds about 85%. Since the nearest town in WRD was not used for forward geocoding, we treated it as an "independent reference" for validating the township component in the output address. Although the town or city near the dam (from WRD) does not always coincide with that administrating the dam (from the geocoding output), their occasional agreement would strengthen our confidence in the geocoded coordinates if other components were also well matched between the WRD input and the geocoding output. For this reason, we opted to include the township comparison as a supplementary criterion in the geocoding QA process. The highest QA rank (C1) corresponds to a unanimous agreement on all address components.
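The iterative logic described above (try each address arrangement in preference order, stop at C1, otherwise keep the best rank seen) can be sketched as follows. The geocoding and ranking steps are passed in as callables so that the sketch stays independent of any real service; both callables, the function name, and the return layout are hypothetical.

```python
from typing import Any, Callable, Iterable, Optional, Tuple

RANKS = ["C1", "C2", "C3", "C4", "C5"]  # best to worst


def geocode_with_qa(
    addresses: Iterable[str],
    geocode: Callable[[str], Optional[Any]],
    rank: Callable[[str, Any], Optional[str]],
) -> Optional[Tuple[str, Any]]:
    """Return (qa_rank, result) with the best rank over all address formats.

    geocode(address) returns a result or None when nothing is found;
    rank(address, result) returns 'C1'..'C5', or None if the result fails
    the baseline filters. Addresses are tried in their preference order.
    """
    best = None
    for address in addresses:
        result = geocode(address)
        if result is None:
            continue
        qa = rank(address, result)
        if qa is None:
            continue
        # Keep the result only if it improves on the best rank so far.
        if best is None or RANKS.index(qa) < RANKS.index(best[0]):
            best = (qa, result)
        if qa == "C1":
            break  # highest rank reached: stop iterating
    return best
```

Because earlier arrangements win ties, the preference order of the address formats is preserved whenever two arrangements reach the same rank.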
However, the minimum rank (C5) only required the agreement on the feature name, which is a more flexible baseline in comparison with that for geo-matching. This was because some of the large reservoirs, particularly those on/near political boundaries, have shared or ambiguous divisions, and the ambiguity might be further amplified by the output coordinates, which could fall anywhere from the dam to across the reservoir water surface. In addition, some of the outputs, regardless of agreement on the address components, are not dams or reservoirs. We therefore included another baseline filter which aimed to remove any error that is not water infrastructure by analysing the feature type information in the geocoding output (see scripts in Code Availability). Although the QA process was designed to be automated, we still manually enforced hundreds of the initial outputs, many of which had returned feature names in native languages, to pass the baseline filters. As a result, our QA yielded more than 16,000 geocoded WRD records, each with the optimal spatial coordinates and the corresponding QA rank.
To complement the QA process, we then conducted a rigorous QC to correct and/or remove the remaining geocoding errors.
We considered a geocoding error as a location where (a) no dam or reservoir could be visibly verified from Google Earth or Esri images, or (b) the WRD attribute information is inconsistent with the feature or division labels on Google Maps. In such cases, we usually first re-geocoded the dam manually (by directly using the Google Maps interface) and deleted the record only if the error was not correctable. While the geo-matched coordinates from regional registers are usually on or close to the dam bodies, the geocoded coordinates could be located on the reservoir. Note that the latter case was not considered an error. Due to China's GPS shift problem, geocoded points across mainland China often show a systematic offset of roughly 500 m from their actual dam or reservoir features. For such Chinese dams, we tried to reduce their geocoding offsets by manually relocating the coordinate points to their correct dams or reservoirs. Our QC process ended up removing about 42% of the originally geocoded dams, most of which stemmed from relatively lower QA ranks (see statistics in Supplementary Table S3). The complete geocoding procedure resulted in 9,338 georeferenced and quality controlled WRD records, with an overall success rate of 21%.

Supplementation with other global inventories
The outputs from both geo-matching and geocoding, a total of 22,528 georeferenced ICOLD WRD records (Fig. 2a), were further supplemented or harmonized by two global dam/reservoir inventories to improve our inclusion of the world's largest dams. We considered this process necessary for two reasons. First, our georeferencing process, particularly geocoding via the address and label information in Google Maps is either lacking or difficult to pass the automated QA due to language ambiguity or naming discrepancies. Second, through cross-referencing we noted that the attribute values of reservoir storage capacity provided in ICOLD WRD are occasionally erroneous (also noted by Mulligan et al. (2020)), e.g., by a factor of 1000 probably caused by unit confusion in WRD compilation. As part of the supplementation/harmonization process, we therefore collated the ICOLD reservoir storage capacities with those in the two global inventories below and corrected any evident errors in ICOLD.
Among them, 139 dams were provided with spatial coordinates. We verified each of the dam locations and made minor adjustments to further assure the quality. The attributes of these 139 dams were then manually compared with those in ICOLD WRD. We found that 133 of them were documented in WRD but 32 were georeferenced unsuccessfully in our geo-matching or geocoding procedure. Therefore, we borrowed the spatial coordinates of these 32 large dams in Wada et al. (2017) to supplement what we had georeferenced. The coordinates of the other 101 large dams, which we georeferenced successfully (41 from geo-matching and 60 from geocoding), were also overwritten by those in Wada et al. (2017) to double-assure and improve their spatial accuracies. This supplementation is illustrated by the Venn diagram in Fig. 2a.

Supplementation with
We then compared the storage capacities of each of the 133 dams in Wada et al. (2017) with those in WRD and identified 22 of them exhibiting substantial discrepancies between the two datasets. We then collated their storage capacities with other documents (e.g., regional inventories, GRanD, and Wikipedia) and concluded that Wada et al. (2017)

Harmonization with GRanD: forming GeoDAR v1.1
While GeoDAR v1.0 largely exceeds GRanD in dam count, a visual comparison of their spatial distributions revealed that the latter is often complementary to (instead of completely duplicated by) the former in many regions of the world. This motivated us to perform a systematic harmonization between the two datasets. The merged version, which we entitled GeoDAR v1.1, combines the merits of GRanD in accurately documenting the world's largest dams and GeoDAR v1.0 in providing extensive spatial details of smaller but more widespread dams.
We assumed that GRanD, by having collated multiple data sources, is superior to GeoDAR v1.0 in the accuracies of both spatial locations and attribute values (particularly reservoir storage capacity) of the world's largest dams. While this may be true for most cases, we identified at least 88 dams in GRanD with possible location errors. With the help of several references such as regional registers (Table 2) duplicates with other dams and 9 we were unable to correct the locations for, were excluded from the harmonization. Also excluded were another 5 dams in GRanD that were subsumed or replaced by newer dams. For user convenience, we released these ~90 GRanD dams together with the identified issues and suggested coordinates (if possible) in Supplementary Table S5 (full spreadsheet accessible Supplementary Table S5).
First, when a dam in GeoDAR v1.0 also exists in GRanD, the spatial coordinates of the former were replaced by those of the latter. We implemented a two-step procedure to identify the overlapping dams between GeoDAR v1.0 and GRanD.
Step 1 was based on attribute association while Step 2 utilized spatial query. Specifically, Step 1 detected matching records between ICOLD WRD and GRanD by assessing agreements on several attributes, including dam/reservoir names, administrative divisions, impounded rivers, and completion years. This step was essentially the same as "geo-matching" that was used to link WRD records to regional registers for GeoDAR v1.0 (Section 2.2). The association results, after a meticulous manual QC, identified ~4,670 dams in GRanD that were georeferenced in GeoDAR v1.0. For the remaining GRanD dams, Step 2 utilized their reservoir polygons to spatially intersect with the dam points in GeoDAR v1.0. A distance tolerance of ~5 km was applied to assist the spatial association and account for possible offsets in GeoDAR v1.0. As part of the QC, the attribute values of each pair (one from GRanD and the other from WRD) were manually compared to determine whether they are indeed the same dam. This step identified another 400 or so overlapping dams between the two datasets. In total, we found that GeoDAR v1.0 overlaps 5,080 out of the 7,303 dams in GRanD, and their spatial coordinates were updated to be consistent with those in GRanD.
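The spatial query in Step 2 can be approximated without a GIS library. The sketch below is a simplified stand-in under stated assumptions: a dam point is a candidate for a reservoir polygon if it lies inside the polygon (planar ray casting on latitude/longitude, which ignores polygons crossing the antimeridian) or within the tolerance of any polygon vertex (haversine distance). Measuring to vertices rather than edges is a deliberate simplification that can miss points near long edges; all names are hypothetical.

```python
import math
from typing import List, Tuple

EARTH_RADIUS_KM = 6371.0
Ring = List[Tuple[float, float]]  # polygon outline as (lat, lon) vertices


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = p2 - p1, math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def point_in_ring(lat: float, lon: float, ring: Ring) -> bool:
    """Ray-casting point-in-polygon test on planar lat/lon coordinates."""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        yi, xi = ring[i]
        yj, xj = ring[j]
        if (yi > lat) != (yj > lat):
            # Longitude where this edge crosses the point's latitude.
            x_cross = xi + (lat - yi) * (xj - xi) / (yj - yi)
            if lon < x_cross:
                inside = not inside
        j = i
    return inside


def is_candidate(lat: float, lon: float, ring: Ring, tol_km: float = 5.0) -> bool:
    """True if the dam point is inside the polygon or near one of its vertices."""
    if point_in_ring(lat, lon, ring):
        return True
    return min(haversine_km(lat, lon, y, x) for y, x in ring) <= tol_km
```

As in the text, a positive result here would only flag a candidate pair; the attribute values of each pair would still need manual comparison.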
Second, for the remaining 2,223 dams in GRanD that do not overlap GeoDAR v1.0, we assumed that at least part of them could be matched to the WRD records not georeferenced in GeoDAR v1.0. Therefore, we performed another round of attribute association between the remaining subsets of GRanD and WRD. After QC, this process identified another 1,414 WRD dams that are included by GRanD. These additional WRD dams, with a total storage capacity of 603 km³, were then added to our inventory using the spatial coordinates from GRanD. As a result of the first two objectives, GeoDAR v1.1 georeferenced 23,974 (42%) out of the 56,815 dams in ICOLD WRD, including 6,494 that overlap with GRanD.
Third, to reduce the impact of possible attribute errors in ICOLD WRD, we next merged the values of reservoir storage capacity from both WRD and GRanD to a single updated attribute, where the original values in WRD or Wada et al. (2017) were overwritten by those of the overlapping dams in GRanD (if the GRanD values are valid). This correction led to a minor increase of 86 km³ (1.2%) in the total reservoir storage capacity. Finally, the remaining 809 dams in GRanD, which were not found in WRD, were appended to our georeferenced WRD so that the final inventory absorbed the entire dataset of GRanD. It is worth noting that, similar to geo-matching (Section 2.2), our attribute association here could be conservative, meaning that some of the dams appended from GRanD might be documented in the remaining WRD (the subset not georeferenced successfully). The complete harmonization process, combining the above three steps, led to a total of 24,783 georeferenced dams in GeoDAR v1.1 (Fig. 2b).
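The dam counts quoted in the three harmonization steps can be cross-checked with simple bookkeeping. The sketch below only re-derives numbers already stated in the text (the GeoDAR v1.0 total of 22,560 is taken from the product description); the variable names are ours and purely illustrative.

```python
# Dam counts quoted in the harmonization steps.
GRAND_DAMS = 7_303      # GRanD v1.3 dams used for harmonization
GEODAR_V10 = 22_560     # dams georeferenced in GeoDAR v1.0
WRD_TOTAL = 56_815      # records in ICOLD WRD

overlap_v10 = 5_080     # GRanD dams already georeferenced in GeoDAR v1.0
extra_wrd = 1_414       # further WRD dams matched to GRanD by attribute association

grand_not_in_v10 = GRAND_DAMS - overlap_v10      # GRanD dams left after the first step
shared_wrd_grand = overlap_v10 + extra_wrd       # dams shared by WRD and GRanD
grand_only = grand_not_in_v10 - extra_wrd        # GRanD dams not found in WRD
wrd_georeferenced = GEODAR_V10 + extra_wrd       # WRD dams georeferenced in v1.1
v11_total = wrd_georeferenced + grand_only       # all dam points in v1.1
wrd_share = wrd_georeferenced / WRD_TOTAL        # fraction of WRD georeferenced

print(grand_not_in_v10, shared_wrd_grand, grand_only)
print(wrd_georeferenced, v11_total, f"{wrd_share:.0%}")
```

The derived values agree with the 2,223, 6,494, 809, 23,974, 24,783, and 42% reported above.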

Retrieving reservoir boundaries
Reservoir polygons of the georeferenced dam points were retrieved as thoroughly as possible from three global water body datasets: GRanD reservoirs (Lehner et al., 2011), HydroLAKES v1.0 (Messager et al., 2016), and UCLA Circa 2015 Lake Inventory (Sheng et al., 2016). These three water body datasets exhibit an increasing spatial resolution: from 7,000+ polygons in GRanD reservoirs provided exclusively for GRanD's dam points, to millions of water body polygons, including both natural lakes and reservoirs, in the other two datasets. While HydroLAKES documents 1.4 million water bodies larger than 0.1 km² (10 ha), the Landsat-based UCLA Circa 2015 Lake Inventory further reduced the minimum size to only 0.004 km² (0.4 ha), resulting in another 7.7 million water bodies on the global continental surface. Accordingly, we implemented a hierarchical procedure, where the three water body datasets were applied in ascending order of spatial resolution to retrieve the reservoir boundaries with an overall decreasing size.
Specifically, GRanD v1.3 provides 7,230 reservoir polygons for the 7,303 dam points used for harmonization. These GRanD polygons were first assigned to their associated dam points in GeoDAR v1.1 through GRanD IDs. Reservoirs of the remaining 17,480 dam points in GeoDAR v1.1, which were georeferenced from ICOLD alone, were next retrieved from HydroLAKES when possible. To avoid duplicates in the reservoirs retrieved from different data sources, we only used the subset of HydroLAKES that is spatially independent from (i.e., not intersecting with) GRanD reservoirs. Unlike reservoir assignment using GRanD, there was no common attribute ID to pair HydroLAKES polygons with the remaining dam points, so their reservoir retrieval relied completely on spatial association. One major challenge in dam-reservoir spatial association was the ambiguity caused by the offsets between our georeferenced dam points and their actual reservoir polygons (see Section 2.3).
To tackle this ambiguity, we designed a procedure that consists of three rounds of iteration to progressively optimize reservoir-dam association. This procedure was based on two assumptions, both conditional on a reasonable spatial tolerance. We started with 500 m to be roughly consistent with the georeferencing offset observed in China. The first assumption was that larger reservoirs are more likely to be documented than smaller ones, in both ICOLD WRD and Google Maps. Therefore, the first round of iteration assigned each of the dams to the largest water body within the tolerance. This assignment might, however, lead to a situation where multiple dams were assigned to the same reservoir. To untangle this situation, the remaining iterations assumed Tobler's First Law of Geography (Tobler, 1970): "everything is related to everything else, but near things are more related than distant things" (p.236). Accordingly, for any water body mistakenly associated with multiple dams, the second round of iteration reassigned the water body to its closest dam, and the other dam(s) within the tolerance, as a result, was/were left unpaired. To reduce the number of such "orphan" dams, a final, third round of iteration assigned the remaining unpaired dams to the next closest water body that was within the spatial tolerance and had not been previously associated with any dams. If this led to multiple dams associated with one reservoir again, only the dam with the closest proximity to the reservoir was kept. Through experimentation, we opted to implement this three-iteration procedure twice, first using a conservative 500-m tolerance to maximize the accuracy for most associations, and then a 1-km tolerance to further minimize the number of orphan dams.
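The three rounds described above can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the production workflow: dams and water bodies are reduced to planar points in metres (a real implementation would measure point-to-polygon distances on projected geometries), and all function names, identifiers, and sample coordinates are hypothetical.

```python
from math import hypot

def _dist(dam, water):
    # Planar distance between a dam (x, y) and a water body (x, y, area).
    return hypot(dam[0] - water[0], dam[1] - water[1])

def _keep_closest(claims, dams, waters):
    # claims maps dam id -> water id; keep only the closest dam per water body.
    best = {}
    for d, w in claims.items():
        dist = _dist(dams[d], waters[w])
        if w not in best or dist < best[w][0]:
            best[w] = (dist, d)
    return {d: w for w, (_, d) in best.items()}

def associate(dams, waters, tol):
    # Round 1: each dam claims the largest water body within the tolerance.
    claims = {}
    for d, dam in dams.items():
        near = [w for w, wb in waters.items() if _dist(dam, wb) <= tol]
        if near:
            claims[d] = max(near, key=lambda w: waters[w][2])
    # Round 2: a water body claimed by several dams goes to its closest dam.
    pairs = _keep_closest(claims, dams, waters)
    # Round 3: orphan dams try the closest water body that is still unclaimed.
    taken = set(pairs.values())
    retry = {}
    for d, dam in dams.items():
        if d in pairs:
            continue
        free = [w for w, wb in waters.items()
                if w not in taken and _dist(dam, wb) <= tol]
        if free:
            retry[d] = min(free, key=lambda w: _dist(dam, waters[w]))
    pairs.update(_keep_closest(retry, dams, waters))
    return pairs

def associate_two_pass(dams, waters, tolerances=(500.0, 1000.0)):
    # Run the three rounds with a 500-m tolerance, then again with 1 km
    # on the dams and water bodies that are still unpaired.
    pairs = {}
    for tol in tolerances:
        used = set(pairs.values())
        left_dams = {d: p for d, p in dams.items() if d not in pairs}
        left_waters = {w: p for w, p in waters.items() if w not in used}
        pairs.update(associate(left_dams, left_waters, tol))
    return pairs

# Hypothetical sample: (x, y) for dams, (x, y, area) for water bodies.
dams = {"A": (0, 0), "B": (300, 0), "C": (2000, 0)}
waters = {"big": (100, 0, 5.0), "small": (400, 0, 0.5), "far": (2800, 0, 1.0)}
print(associate_two_pass(dams, waters))
```

In this toy case dams A and B both claim the largest water body in the first round, B is orphaned in the second round and recovered by the smaller water body in the third, and C is only paired once the tolerance is relaxed to 1 km.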
This multi-iteration procedure retrieved roughly 7,600 reservoir polygons from HydroLAKES. For the remaining dam points left unpaired, we applied the same association procedure to continue retrieving their reservoirs from the high-resolution UCLA Circa 2015 Lake Inventory. Similarly, only the subset that does not intersect with the retrieved HydroLAKES polygons was considered, in order to avoid duplicates in the retrieved reservoirs from different datasets. The use of UCLA Circa 2015 Lake Inventory retrieved another 6,700 or so reservoirs.
We followed the automated reservoir retrieval by a manual QC to visually confirm that each retrieved reservoir polygon was matched to the correct dam point, and if not, we tried to adjust the association as thoroughly as possible. This visual QC was particularly necessary for lake-dense regions, including the case of cascade reservoirs immediately downstream/upstream to each other. While some of the dams, such as barrages, diversion infrastructure, and dams under construction, do not have visible impoundments (Lehner et al., 2011), we tried to be as meticulous as possible to verify and recover any missing reservoirs. For instance, we were able to manually retrieve 10 reservoirs (including 4 completed after 2000) from the UCLA Circa 2015 Lake Inventory for the ~70 dams in GRanD v1.3 without polygons. We also replaced hundreds of reservoirs initially retrieved from GRanD and HydroLAKES by the polygons in the UCLA inventory to improve the boundary accuracy and completeness.

Product components and usage
We here provide a detailed documentation of the components and structure of the GeoDAR versions (v1.0 and v1.1). To facilitate the description, the two GeoDAR versions and their component statistics are explained in Table 1, and spatial distributions of the dam points and reservoir polygons are visualized in Figs. 6 and 7.

GeoDAR v1.0: dams
GeoDAR v1.0 is a collection of 22,560 dam points georeferenced exclusively for ICOLD WRD (Fig. 6a). Among them, 13,149 or 58% were retrieved from geo-matching regional dam registers, 9,278 or 41% from Google Maps geocoding API, and the remaining 133 largest dams from the spatial inventory in Wada et al. (2017) (Fig. 6b). WRD storage capacities of most of these 133 large reservoirs were replaced by the values in Wada et al. (2017) (see Section 2.4.1), and unless stated otherwise, our following statistics on storage capacities were calculated after this replacement.
The total reservoir storage capacity of these dams is 6,441 km³, meaning that GeoDAR v1.0 georeferenced 40% of the 56,815 WRD records but included more than 80% of their cumulative reservoir storage capacity. The total storage capacity of the 133 largest dams from Wada et al. (2017), despite being limited in number, reaches 3,900 km³ or 61% of the cumulative storage capacity in GeoDAR v1.0, and the other ~40% capacity was split almost equally between the remaining 22,000+ geo-matched and geocoded dams. Although the registers used for geo-matching are regional, the dams in GeoDAR v1.0, as shown in Fig. 6b, are distributed in 151 out of the 165 countries or territories in WRD, largely owing to our geocoding efforts through Google Maps API. Since GeoDAR v1.0 was produced independently from other global dam datasets such as GRanD, it can also be used to cross-compare, supplement, and potentially improve other dam datasets.
Validation of our georeferencing accuracy for v1.0 is provided in Section 4.

GeoDAR v1.1: dams and reservoirs
GeoDAR v1.1 consists of a) 24,783 dam points (Fig. 6a) representing a full harmonization between GeoDAR v1.0 and GRanD v1.3, and b) 21,515 reservoir polygons (Fig. 7). Among these nearly 25,000 dam points, 17,480 or 71% come from GeoDAR v1.0 alone, 6,494 or 26% are shared by ICOLD WRD and GRanD, and the other 809 or 3% come from GRanD alone (Table 1; Fig. 6c). Among the 6,494 shared dams, 5,080 were georeferenced in both GeoDAR v1.0 and GRanD, and the remaining

As described in Section 2.4.2, we substituted the reservoir storage capacities in GRanD for the original capacity values of their overlapping WRD dams. As a result, the total reservoir storage capacity in GeoDAR v1.1 reaches 7,384 km³, which is equivalent to ~95% of the cumulative capacity in the entire ICOLD WRD (see Section 5.1 for more comparisons with ICOLD). As reported in Table 1, 81% (6,006 km³) of the total storage capacity in GeoDAR v1.1 is explained by the 5,080 relatively large dams georeferenced in both GeoDAR v1.0 and GRanD. The 17,480 smaller dams from GeoDAR v1.0 alone contribute only 7% (507 km³) of the total storage capacity, which is roughly comparable to the subset from GRanD alone (268 km³) or the subset from GRanD and other ICOLD WRD (603 km³). These capacity contributions suggest that, compared to GRanD, the major improvement of GeoDAR lies in the increased number of relatively small dams, rather than the increase in total storage capacity of the dams (see Section 5.2 for more comparisons with GRanD).

impoundment extents associated with 21,515 or 87% of the georeferenced dam points (Fig. 7). Reservoir polygons for the remaining 13% of the dam points were retrieved unsuccessfully due to a combination of factors, including limited spatial resolutions of the applied water masks, offsets in our georeferenced dam points, and the fact that some of the dams have no evident water impoundments.
Nevertheless, the retrieved reservoir polygons have a cumulative area of 496,314 km², accounting for 98% of the total reservoir area of all georeferenced dams in GeoDAR v1.1 (reservoir areas without polygons are based on documented attributes). These retrieved reservoirs correspond to a cumulative storage capacity of 7,216 km³, also accounting for nearly 98% of the total storage capacity in v1.1. These statistics indicate that the reservoirs whose boundaries were retrieved unsuccessfully were mostly small in area and storage.
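As a quick consistency check on the capacity and polygon statistics above, the following sketch recomputes the shares from the quoted subset totals. It uses only numbers stated in the text; the dictionary keys and variable names are ours.

```python
# Storage capacity (km^3) of each GeoDAR v1.1 subset, as quoted in the text.
capacity_km3 = {
    "GeoDAR v1.0 and GRanD (5,080 dams)": 6_006,
    "GeoDAR v1.0 alone (17,480 dams)": 507,
    "GRanD and other WRD (1,414 dams)": 603,
    "GRanD alone (809 dams)": 268,
}

total_km3 = sum(capacity_km3.values())
shares = {name: round(100 * value / total_km3)
          for name, value in capacity_km3.items()}

polygon_share = 21_515 / 24_783              # dam points with a reservoir polygon
capacity_with_polygons = 7_216 / total_km3   # capacity held by retrieved reservoirs

print(total_km3)
for name, pct in shares.items():
    print(f"{name}: {pct}%")
print(f"{polygon_share:.0%} of dam points, {capacity_with_polygons:.0%} of capacity")
```

The four subsets sum to the reported 7,384 km³, and the recomputed shares match the 81%, 7%, 87%, and nearly 98% given above.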
The numbers of reservoir polygons retrieved from each of the three water body datasets are fairly comparable (about 7,100-7,200 each), but the total reservoir storage capacity and area decrease with the increasing spatial resolution of the water body datasets (Table 1). As a result, the mean reservoir polygon size decreased from 63 km² for those retrieved from GRanD, to 5 km² from HydroLAKES and 2 km² from the UCLA Circa 2015 Lake Inventory. This result is overall consistent with the design of our hierarchical procedure (Section 2.5), where smaller reservoirs were successively retrieved with the help of finer water masks. It is important to note that the retrieved polygons do not always represent the maximum water extents of the reservoirs because water boundaries in the retrieval sources were not necessarily mapped in the maximum inundation periods. For example, the UCLA Circa 2015 Lake Inventory contains more than 9 million water bodies larger than 0.4 ha, which were mapped from Landsat images acquired during the "steady" climate periods (Lyons and Sheng, 2018) and thus represent the average seasonal extent of each water body (Sheng et al., 2016). Despite not always being the largest water extents, our retrieved reservoir polygons enhanced the spatial details of global reservoir locations, and users can further expand or refine the water boundaries to suit their specific needs.

Attributes and usage
The GeoDAR dataset, including dam points for v1.0 and both dam points and reservoir polygons for v1.1, is provided as three separate shapefiles. For user convenience, we also duplicated the two dam point shapefiles in the comma-separated values (csv) format. The file names and attributes are explained in Table 3. Although most of our dam points were georeferenced using WRD records, our published GeoDAR complies with the proprietary rights of ICOLD and does not directly release any attribute from WRD. The attributes we provide in GeoDAR, as listed in Table 3, are limited to our georeferencing methods, QA/QC, validation, and other information (such as spatial coordinates and part of the reservoir storage capacities) that is already open source or has been permitted for use by the original producers.

v1.1 dams (file name: GeoDAR_v11_dams; format: comma-separated values (csv) and point shapefile)
id_v11: Dam ID of GeoDAR version 1.1 (type: integer). Note this is not the "International Code" in ICOLD WRD but is associated with "International Code" through encryption.
id_v10: v1.0 ID of this dam/reservoir (as in ID_v10) if it is also included in v1.0 (type: integer).
id_grd_v13: GRanD ID of this dam if also included in GRanD v1.3 (type: integer).
Table 1 for more details.
pnt_src: Source(s) of the dam point spatial coordinates. Unique values include: "GeoDAR v1.0", "original GRanD", "adjusted GRanD" (meaning original points in GRanD adjusted to improve the locations), "corrected GRanD" (meaning misplaced dam points in GRanD corrected and relocated; also see Table S5).
qc: Roles and name initials of co-authors/personnel for data QC, validation, and other manual operations.
Note: Missing or inapplicable values are flagged by "Null" for text-type attributes and "-999" for numeric-type attributes.
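When reading the csv files, the missing-value flags noted above need to be handled explicitly. The sketch below shows one possible way to do so with the Python standard library. The attribute names come from Table 3, but the sample rows are invented for illustration, and the exact header spelling and numeric formatting in the released files are assumptions that should be checked against the actual data.

```python
import csv
import io

NUMERIC_MISSING = -999      # flag for missing numeric attributes
TEXT_MISSING = "Null"       # flag for missing text attributes
INTEGER_FIELDS = {"id_v11", "id_v10", "id_grd_v13"}

def clean_row(row):
    """Convert ID fields to int and replace missing-value flags with None."""
    cleaned = {}
    for key, value in row.items():
        if key in INTEGER_FIELDS:
            number = float(value)   # tolerates "-999" as well as "-999.0"
            cleaned[key] = None if number == NUMERIC_MISSING else int(number)
        else:
            cleaned[key] = None if value == TEXT_MISSING else value
    return cleaned

# Invented sample rows; real files contain more attributes (see Table 3).
SAMPLE = """id_v11,id_v10,id_grd_v13,pnt_src,qc
1,1,-999,GeoDAR v1.0,Null
2,-999,42,original GRanD,Null
"""

rows = [clean_row(r) for r in csv.DictReader(io.StringIO(SAMPLE))]
for row in rows:
    print(row)
```

Replacing the flags with `None` up front avoids accidentally treating -999 as a valid identifier or "Null" as a meaningful label in later joins and summaries.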
Although WRD attributes are not directly available in GeoDAR, we suggest two possible ways for users to acquire at least some of the essential attributes. Upon the user's reasonable request and on a case-by-case basis, we may provide assistance in decrypting the association between GeoDAR IDs (Table 3) and ICOLD's International Codes, and using the International Codes, the user can link each of the dams/reservoirs in GeoDAR to all 40 or so proprietary attributes in WRD. This is also based on the premise that the user needs to acquire the WRD attribute data from ICOLD, and that the user agrees not to release the GeoDAR-WRD association or the WRD attributes to the public. Alternatively, since we imposed no usage restrictions on our spatial features (geometric dam points and reservoir polygons), users are free to integrate them with other datasets and tools, such as remote sensing observations and modelling, to acquire the needed attributes, particularly those not yet documented in ICOLD WRD. Acquisition methods have been exemplified for at least the following attributes: reservoir hypsometry and bathymetry (Li et al., 2020; Yigzaw et al., 2018)

Validation
Separate from the QA/QC during data production, we performed a posterior validation to further assess the accuracy of the georeferenced ICOLD WRD records. The validation sample consists of about 1,400 dam points (Fig. 8), which were selected worldwide from GeoDAR v1.0 and represent the results of our geo-matching and geocoding prior to GRanD harmonization.
The collection of the validation points followed a stratified sampling method (Table 4). From the subset of GeoDAR v1.0 produced by geo-matching, we randomly selected about 40 dam points for each of the geo-matching regions (Brazil, Canada, Europe, South Africa, and United States), with the exception of Southeast Asia (Cambodia and Laos) where all 17 geo-matched WRD dams were included for validation. We allowed the sample to occasionally overlap with GRanD because dams in GeoDAR v1.0 were georeferenced independently from GRanD and those shared with GRanD reflect our georeferencing accuracy for the world's largest dams. However, for each regional sample, we limited the number of GRanD-overlapping dams to no more than 30% of the entire regional sample size if possible (Table 4). This was to comply with the size ratio between GRanD and GeoDAR v1.0 (about 1:3) so that our validation still emphasized smaller, newly georeferenced dams. We also randomly selected 40 out of the 133 large WRD dams supplemented by Wada et al. (2017), considering that they are part of GeoDAR v1.0 and the supplementation was based on attribute association similar to regional geo-matching. In total, 260 dams were selected for validating the geo-matching accuracy. For each dam, we manually checked whether its spatial coordinates in GeoDAR v1.0 are consistent with those documented in the geo-matching source (see source references in Table 2).

(left) and the number of dams in this sample that overlap with GRanD v1.3 (right), respectively. "Error source" lists error scenarios in decreasing order of frequency. "Mismatch" indicates geo-matching errors due to incorrect association between WRD and the source/reference register. "Register" indicates geo-matching errors due to inaccurate spatial coordinates in the source register (despite correct association). "Misplacement" indicates geocoding errors where the WRD attribute information disagrees with the Google Maps label. "Google Maps" indicates geocoding errors due to endogenous feature labelling mistakes in Google Maps (despite the WRD attribute information and the Google Maps label agreeing with each other). See Table 2 (column "Register/Source") for reference details.
From the remaining subset of GeoDAR v1.0 produced by geocoding, we followed the same stratified sampling scheme and selected 220 to 250 dam points for each of China, India, and Japan. Another 450 dam points were sampled from the other regions of the world (Table 4). Compared to geo-matching which was based on attribute association with georeferenced regional registers, the geocoding process was more complicated and relied largely on the geographic information repository in Google Maps and its embedded geocoding algorithms. To increase our confidence in the geocoding results, we therefore purposefully enlarged the sample size for each validation region. As described in Section 2.2, three additional georeferenced datasets from authoritative registries in China, India, and Japan were used exclusively for the purpose of geocoding validation (refer to Table 2 for register details). For the remaining regions of the world, the validation was based on a meticulous manual comparison between the WRD information of each sampled dam point and the associated Google Maps label, including the dam/reservoir name, administrative divisions, the nearest town/city, and the impounded river name if possible. When necessary, we also referred to other auxiliary information including open-source gazetteers and other literature. In total, we collected 1,152 dam points for validating the accuracy of geocoding, including all 232 Japanese dams in GeoDAR v1.0. The distribution of all sampled validation dams is shown in Fig. 8.

As reported in Table 4, our geo-matching accuracy ranges from 88% to 100% among different regions, with an overall accuracy of 97%. Causes of the identified geo-matching errors (see the last column in Table 4) were not necessarily mistakes in our attribute association between WRD and the georeferenced registers, but sometimes inaccurate spatial coordinates provided by the georeferenced registers themselves. An example is Skutvik Dam (completion year 1991) in Norway (Fig. 8), where coordinates are documented to be 68.025° N and 15.345° E in MARS. However, inspected from high-resolution Google Maps imagery, no dam or reservoir could be conclusively verified at or near this coordinate point, except for three surrounding lakes that are all over 2 km away and labelled with other names (Vanbassenget, Lanstøvatnet, and Stenslandsvatnet). The documented coordinates for this dam are probably inaccurate.

The accuracies of our geocoded samples range from 90% for Japan to 98-99% for India and China, with an overall accuracy of 95%. As shown in Table 4, most of the errors were related to the misplacement of the dam/reservoir to another feature, typically a free-flowing river reach, which shares the name and administrative divisions with the dam/reservoir. One example is Nambiar Dam near the city of Tirunelveli in the state of Tamil Nadu, southern India (Fig. 8). The correct coordinates, according to NRLD, are 8.374° N and 77.738° E, where Google Maps labels the feature "Nambi Dam" instead of Nambiar Dam. Probably because of this spelling inconsistency, our geocoded coordinates were misplaced on a reach of the Nambi(y)ar River (8.435° N, 77.569° E, labelled as "Nambiyar") about 20 km upstream from the dam. Although our recursive geocoding procedure (Section 2.3) embedded an automated filter that examines the type of the feature at each returned point, this filter was designed to only eliminate the coordinates where feature types are clearly disparate from a dam or reservoir (such as commercial and residential buildings). Our experiments showed that dams/reservoirs and free-flowing river reaches could both be categorized as "establishment" or "natural feature", and a feature type that is more specific to dams/reservoirs was hardly seen. Thus, to avoid over-filtering, we allowed a certain ambiguity in the geocoded feature types, and then relied on manual QC to correct or remove mistaken coordinates as thoroughly as possible.
The misplacement of dams to their upstream/downstream river reaches is a major cause of the relatively low geocoding accuracy in Japan.
Through experimentation, we noticed that Google Maps labelling for some of the Japanese dams that are homonymous to their impounded rivers is either lacking or highly adapted to the Japanese language. The latter further challenged our geocoding accuracy using English-based ICOLD information. For one of the errors in Japan, we verified from the JDF register that Google Maps mislabelled Myojin Dam in Hiroshima Prefecture (34.587° N, 132.505° E) as "Nabara Dam" whose correct location is 3 km downstream (34.563° N, 132.517° E; Fig. 8). As a result, our georeferenced coordinates for Nabara Dam were wrong although our geocoding process was correct. However, given what we have observed, such endogenous labelling errors in Google Maps are probably rare.
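The offsets quoted in the two examples above (about 20 km for Nambiar Dam and 3 km for Nabara Dam) can be reproduced from the stated coordinates with a great-circle distance. The sketch below uses the haversine formula on a spherical Earth with a mean radius of 6,371 km, which is our simplifying assumption; the function name is illustrative.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; spherical approximation (assumption)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# Nambiar Dam (NRLD coordinates) vs. the misplaced geocoded point on the river.
nambiar_offset = haversine_km(8.374, 77.738, 8.435, 77.569)
# Myojin Dam (mislabelled as "Nabara Dam") vs. the correct Nabara Dam location.
nabara_offset = haversine_km(34.587, 132.505, 34.563, 132.517)

print(f"Nambiar offset: {nambiar_offset:.1f} km")
print(f"Nabara offset: {nabara_offset:.1f} km")
```

Both distances come out close to the values reported in the text, which suggests the same kind of check could be used to screen geocoded points against an independent register.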
Integrating the validations for both geo-matching and geocoding, our overall georeferencing accuracy is 95.3% in terms of dam count or 99.1% in terms of total storage capacity based on the sampled 1,412 dams. While these statistics can be considered as an accuracy measure of our data product, the identified errors in the validation sample have been corrected wherever possible or otherwise removed in our released GeoDAR v1.0 and v1.1 (for simplicity, our reported statistics after QC have considered this additional correction). To reflect the accuracy of GRanD harmonization, we also randomly sampled another ~100 dams in v1.0 that were associated with GRanD in v1.1, and identified no association errors among them.

Table 4 for detailed validation statistics.

Comparisons with existing global datasets
To better understand the improvements and potential applications of GeoDAR, we compare it with three major global dam and reservoir datasets: the complete ICOLD WRD, GRanD (v1.3), and GOODD (V1). To recap the pros and cons of each dataset, ICOLD WRD documents over 56,000 unique dam records with a broad suite of attributes, but the provided dam records are not georeferenced. GOODD depicts the spatial details of more than 38,000 dam points and their catchments but does not include any other attribute. GRanD is georeferenced and provides multiple essential attributes, but the records are limited to 7,320 large dams. Accordingly, our comparison first emphasized the aspects of dam quantity, reservoir area, and if applicable, the spatial pattern and distribution of the dams. These aspects are openly available from the spatial features (i.e., dam points and reservoir polygons) in GeoDAR. Considering that each GeoDAR feature is also linked to a WRD or GRanD record which contains detailed attributes, our comparison also includes two important attributes, i.e., reservoir storage capacity and catchment area, to help inform the extended capability of GeoDAR once the attributes are acquired.

Comparison with ICOLD WRD
Despite our efforts to integrate multi-source registers and the Google Maps geocoding API, georeferencing ICOLD WRD, particularly smaller dams in poorly documented regions, has proven to be challenging. This challenge was reflected by the proportion of WRD that was spatially resolved in GeoDAR. As compared in Table 5, GeoDAR v1.0 included 40% of the 56,815 records in the entire WRD. Although limited in number, these georeferenced records struck a balance between geocoding thoroughness and quality (see Sections 2.2 and 2.3), and account for 84% of the total reservoir storage capacity in WRD. The larger proportion in terms of storage capacity indicates that most of the sizable dams in WRD have been spatially resolved. This message is also corroborated by Fig. 9. Nearly 70% of the 12,412 WRD dams larger than 10 mcm, for example, have been georeferenced in GeoDAR v1.0 (Fig. 9a). While 80% of the 21,849 WRD dams smaller than 1 mcm were not georeferenced, these smaller dams account for less than 1% of the total WRD storage capacity (Fig. 9b). After harmonization with GRanD, the proportion of WRD georeferenced in GeoDAR v1.1 increased to 42% by count or 92% by storage capacity (Table 5), and these percentages represent our best result for georeferencing WRD. By absorbing the remaining dams in GRanD as well, v1.1 has a total dam count equivalent to 44% of WRD and a cumulative storage capacity less than 5% below that of the full WRD (Table 5; Fig. 9b). Compared to v1.0, the margin between the distribution curves of GeoDAR v1.1 and WRD, particularly for relatively large dams, was further reduced (Fig. 9a). As a result, the number of dams larger than 10 mcm in GeoDAR v1.1 exceeds 80% of that in WRD, and the number of dams larger than 1 mcm reaches 60% of that in WRD.

Table S5). When a dam has both a reservoir polygon and an area attribute, the polygon area took precedence for calculating "Reservoir area" statistics.
Reservoir area statistics for GeoDAR v1.1 only considered the dams whose reservoir polygons were successfully retrieved. Statistics for GRanD are based on the entire records in v1.3.

The spatial coverage of GeoDAR, in comparison with WRD, was summarized for each of the 165 countries with registered WRD records (Fig. 10). Our comparison focused on GeoDAR v1.1 as it represents an improved version of our spatial dam inventory. Among these 165 countries, the median proportion of the dam count covered by GeoDAR v1.1 is 62%, with the first and third quartiles being 35% and 89%, respectively. As shown in Fig. 10a, better coverages tend to occur in North America, Europe, Russia, Australia, and part of South America and Africa, whereas poorer coverages are seen in East Asia, South Asia, and part of the Middle East. The coverages in China and India, for example, are only about 22-26% due to a large quantity of WRD records for these two countries (23,749 in China excluding Taiwan, and 5,074 in India) but relatively limited information on Google Maps. Compared with dam counts, GeoDAR's coverage for reservoir storage capacity is higher overall (Fig. 10b). Among the 157 countries with documented reservoir storage capacities, the median coverage in GeoDAR reaches 98%, with the first and third quartiles being 87% and 100%, respectively. If we exclude the 809 dams supplemented by GRanD alone and only consider the WRD portion of GeoDAR v1.1, the coverage becomes overall lower but by a limited extent. Among these countries, the median proportion of the WRD dams covered by the WRD portion of GeoDAR v1.1 is 59% (with 33% and 83% as the first and third quartiles) in terms of dam count and 95% (82% and over 99% as first and third quartiles) in terms of reservoir storage capacity (Supplementary Fig. S1), suggesting that a substantial proportion of WRD had been georeferenced in many of the register countries before the additional supplementation from GRanD.
More detailed comparisons (among ICOLD, GRanD v1.3, and GeoDAR v1.1) for each of the 165 countries are given in Supplementary Table S6.

Catchment areas of the reservoirs often indicate the stream order of the impounded river, and thus the scales of flow and sediment alterations by the dam. Locating dams with an improved representation of catchment areas, particularly smaller ones, has been increasingly needed by hydrologic modelling and watershed management (Grill et al., 2019; Lin et al., 2019). To evaluate how GeoDAR spatially resolved WRD in this aspect, we directly used the values of the attribute "catchment area" provided in WRD. As many records in WRD are missing catchment areas, we combined the available values in both WRD and GRanD, and when a dam has catchment areas in both datasets, we preferred the value in GRanD. As reported in Table 5, the subset of WRD georeferenced in GeoDAR v1.1 has a total catchment area of 140 million km², which covers 94% of the total catchment area in WRD. The remaining 6% gap was largely closed by the inclusion of the remaining non-WRD dams from GRanD. It is worth mentioning that these statistics do not take into account the dams without documented catchment areas. While it is possible to retrieve catchment boundaries for GeoDAR dams (e.g., using high-resolution DEM as per Mulligan et al. (2020)), acquiring accurate catchment areas of the other WRD dams (which have not been georeferenced) is precluded by unknown pour point locations. Therefore, our comparison was only based on the attribute values that are already available. This explains why GeoDAR georeferenced less than half of the WRD records by count but included more than 90% of the total catchment area. Similar to the pattern of reservoir storage capacity, higher proportions of the WRD catchment area covered by GeoDAR are skewed towards the dams with larger catchment areas (Fig. 11a).
For example, the number of dams with a catchment area larger than 10 km² in GeoDAR equals 88% of that in WRD, and the coverage increases to 95% for the dams with a catchment area larger than 100 km². Although GeoDAR does not include reservoir catchment boundaries, it does provide reservoir polygons for 87% of the georeferenced dam points. As reported in Section 3.2, the remaining 13% of the dam points without reservoir polygons, if inferred from their available attribute values, yield a reservoir area that is only 2% of the total reservoir area of all GeoDAR dams. For this reason, we focus on the retrieved reservoir polygons for comparing how GeoDAR v1.1 represents the reservoir areas in the entire ICOLD WRD. Among the 21,515 polygons, 20,718 (96%) are associated with the georeferenced WRD dams. These retrieved WRD reservoirs have a total area of 477 thousand km², accounting for 92% of the cumulative reservoir area in WRD (Table 5). After supplementation of the other 797 polygons from GRanD, the total reservoir area reached 496 thousand km², equivalent to 96% of the cumulative reservoir area in WRD. Like other attributes, the values of reservoir area are not available in all WRD records, so our reported coverage percentages are theoretically overestimated. However, if a WRD record is missing its area attribute value but has a retrieved reservoir polygon, we used the area of the reservoir polygon as the de facto reservoir area in calculating WRD statistics, and the other WRD records still missing reservoir areas probably contribute a minuscule fraction of the aggregated area. Therefore, we consider our comparison to be overall reasonable. Keeping this limitation in mind, we showed in the distribution curves (Fig. 11b) that the number of GeoDAR reservoir polygons accounts for 68% of all WRD records that have reservoir area values (either documented or de facto), and consistent with the distributions of other attributes, higher coverages for reservoir area tend to occur for larger reservoirs. For example, GeoDAR retrieved 8,263 reservoirs larger than 1 km², which account for 80% of those in WRD. The coverage increases to 92% for reservoirs larger than 10 km², although the reservoir polygon number decreases to 2,570.
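The catchment-area comparison above combines WRD and GRanD values, prefers GRanD when both exist, and skips dams with no documented value. A minimal Python sketch of that bookkeeping is given here; the record layout, function names, and numbers are illustrative assumptions and are not taken from the authors' scripts or data.

```python
from typing import Iterable, Optional, Set, Tuple

# Hypothetical record layout: (dam_id, WRD catchment km2, GRanD catchment km2)
Record = Tuple[str, Optional[float], Optional[float]]


def preferred_catchment(wrd_km2: Optional[float],
                        grand_km2: Optional[float]) -> Optional[float]:
    """Return the GRanD value when present, otherwise fall back to WRD."""
    return grand_km2 if grand_km2 is not None else wrd_km2


def area_coverage(records: Iterable[Record], georeferenced: Set[str]) -> float:
    """Share of documented catchment area held by georeferenced dams.

    Dams with no documented catchment area in either source are skipped,
    mirroring the caveat stated in the text.
    """
    total = 0.0
    covered = 0.0
    for dam_id, wrd_km2, grand_km2 in records:
        area = preferred_catchment(wrd_km2, grand_km2)
        if area is None:
            continue
        total += area
        if dam_id in georeferenced:
            covered += area
    return covered / total if total else float("nan")


# Made-up values, for illustration only.
records = [
    ("a", 1200.0, 1000.0),  # both sources: GRanD value (1000) is used
    ("b", 50.0, None),      # WRD only
    ("c", None, None),      # undocumented: ignored in the area statistics
    ("d", 30.0, None),      # WRD only
]
georeferenced = {"a", "c"}  # half of the dams by count
coverage = area_coverage(records, georeferenced)
print(f"count share: {len(georeferenced) / len(records):.0%}, "
      f"area share: {coverage:.1%}")
```

In this toy case half of the dams by count hold more than 90% of the documented area, the same kind of contrast between count coverage and area coverage reported for GeoDAR and WRD.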

Improved spatial density over GRanD
While GRanD emphasized dams larger than 100 mcm (or 0.1 km³), GeoDAR aimed to georeference WRD records which, by definition, have a minimum storage capacity of 3 mcm, or smaller if the dam is higher than 15 m (see Section 1). This reduced storage threshold entailed a substantial increase of the dam quantity in GeoDAR. As compared in Table 5, GeoDAR v1.0, which was generated independently from GRanD, already holds more than triple the dam quantity in GRanD (7,320) and accounts for 95% of the total reservoir storage capacity in GRanD (6,881 km³). After the harmonization with GRanD, the number of dams in GeoDAR v1.1 reaches 339% of that in GRanD, with a total reservoir storage capacity exceeding that in GRanD by 7%. This comparison suggests that the improvement of GeoDAR is mainly manifested as the increased dam quantity, rather than reservoir storage capacity. The increased dam quantity in GeoDAR is manifested as a ubiquitous improvement of the spatial density of smaller dams worldwide (Fig. 12). Since GeoDAR v1.1 has absorbed GRanD v1.3, the global patterns for capacious reservoirs are overall similar between the two datasets. What is noticeably different is the proliferated density of thousands of smaller reservoirs, particularly those beyond the main focus of GRanD (such as smaller than 100 mcm). The substantial increase of smaller dams and reservoirs is corroborated by the distribution curves in Fig. 9a, where the mode storage capacity (i.e., the capacity corresponding to the peak frequency) shifted from about 100 mcm in GRanD to about 3-5 mcm in GeoDAR (both v1.0 and v1.1). The area between the distribution curves is largely explained by the addition of ~16,500 dams smaller than 100 mcm in GeoDAR v1.1 (Fig. 9a), which correspond to a total storage increase of 124 km³ or 95% of the total storage of the dams smaller than 100 mcm in GRanD (Fig. 9b).
It is important to note that the added reservoirs in GeoDAR still comply with ICOLD's definition of "large dams" (see Section 1). Although their aggregated storage is limited, these relatively small reservoirs are geographically widespread, meaning that they are locally significant for filling service gaps between more sporadic larger dams. Examples include hundreds of smaller dams/reservoirs that provide irrigation from southern Europe (Fig. 13b) to north-western and central India (Fig. 13c), hydropower and water usage in central and southern China (Fig. 13a), and flood control across the Mississippi River Basin and southern Texas in the US (Fig. 13d). The sheer number of these added smaller dams and reservoirs accentuates the benefits of an improved knowledge of their spatial locations, such as what GeoDAR offers, for strategizing water and energy management and assessing fragmentation of the river ecosystems (Belletti et al., 2020; Grill et al., 2019). To assist regional applications, we further aggregated the improvements of GeoDAR over GRanD into national scales. As shown in Fig. 14, GeoDAR's improvements in either dam count or reservoir storage capacity pervade more than 120 countries occupying 86% of the continental landmass (excluding Antarctica). The increase of dam count occurs in 127 out of the 155 GeoDAR countries (Fig. 14a). These countries include 18 countries without GRanD records at all (such as Haiti, United Arab Emirates, Yemen, and Bhutan), and the other 109 countries comprise 80% of the 137 countries with GRanD records. There are slightly fewer countries with a confirmed increase of reservoir storage capacity (Fig. 14b). While GeoDAR's improvements are widespread, the improvement levels are not geographically uniform (Fig. 14).
Globally speaking, the spatial patterns of number and capacity increases are overall consistent, with the major hotspots concurring with large or industrialized nations (e.g., US, China, Brazil, India, and European countries) and less impressive increases in smaller, drier, and/or less developed nations (e.g., part of Africa and South America). This is reasonable as bigger and/or more developed nations usually possess a larger quantity of dam infrastructure and thus a greater potential for GeoDAR to improve. However, this pattern also reflects the disparities due to several factors, such as a possible bias in WRD (as it is a volunteered dataset and not all member nations contributed equally), the accessibility of regional registers for geo-matching, and geocoding challenges for different countries. The top five countries in terms of dam count increase are the US (an increase of 6,039 or 314%), China (4,352 or 474%), India (963 or 290%), South Africa (667 or 248%), and Brazil (575 or 219%) (Supplementary Table S6). These five countries cover nearly three quarters of the global dam count increase (17,463). Similarly, the top five countries in terms of storage capacity increase are the US (123 km³ or 16%), Canada (73 km³ or 8%), Brazil (66 km³ or 12%), China (44 km³ or 7%), and India (33 km³ or 12%), which together comprise 68% of the global storage capacity increase (503 km³).
Figure 14. [...] total reservoir storage capacity for each country or territory. Aggregated statistics for dam count and storage capacity were also compared for each continent. For convenience of comparison, both statistics were displayed on Panel a.
By further aggregating national statistics to each continent (Fig. 14a), the result echoes that GeoDAR's major improvement lies in the quantity or spatial density of the dams, rather than their total reservoir storage capacity.
However, this should not overshadow the fact that improvements of both dam count and storage capacity do exist on all continents. As summarized in Fig. 14a, the continental improvement ascends from 173 more dams with a 6 km³ total capacity in Oceania, to a scale of 6000-7000 more dams with a 100-200 km³ capacity in North America or Asia. Unfortunately, because the total storage capacity is disproportionately dominated by the largest reservoirs and GRanD has already included most of them, the added storage capacity by GeoDAR relative to what has existed in GRanD appears limited and descends from 9-11% in Asia and North America, 7-8% in Oceania and South America, to 1-3% in Africa and Europe. By contrast, GeoDAR's dam quantity ranges from being almost double that of GRanD in Oceania and Africa, to being triple to quadruple in the other continents.
A derivative benefit of the increased dam quantity is a more complete representation of the regulated watersheds, which is critical to improving discharge estimates. As revealed by the distribution curves in Fig. 11a, GeoDAR improved GRanD in the inclusion of reservoir catchment areas in two respects. First, the number of reservoir catchments in GeoDAR exceeds that in GRanD at almost all area levels. This corresponds to a total increase of the regulated catchment area by 31,502 km² or 27% (Table 5). Second, the increase of reservoir catchments is skewed towards smaller catchments, signifying a more realistic inventory of human water regulations in the basins of lower stream orders or closer to stream headwaters. As shown in the distribution curves (Fig. 11a), the average rate of increase rises from about 30% for catchments larger than 1000 km², through 80% for catchments between 10 and 1000 km², to more than 600% for those smaller than 10 km². The mode of catchment areas decreases from about 200-400 km² in GRanD to 30-100 km² in GeoDAR, with the latter much closer to the mode of the entire WRD (15-50 km²). As a result, the number of dams with a catchment size smaller than 25 km², for example, which is the channelization threshold for the high-resolution MERIT Basins hydrography dataset (Lin et al., 2019; Yamazaki et al., 2017), is 3,570 or 27% in GeoDAR in comparison to 695 or 10% in GRanD. These small-catchment dams, once integrated into river networks, may substantially improve the performance of routing models. Consistent with our comparison with WRD (Section 5.1), these statistics are only based on the records with valid catchment areas. Considering that missing values are more likely to occur for dams with smaller catchments, our reported improvement could be theoretically conservative.
The increased dam count in GeoDAR also enabled the retrieval of surface extents of another 14,000 or so smaller reservoirs (Fig. 7). These added reservoir polygons have an average size of 1.4 km² in comparison to 65 km² in GRanD. They aggregate to a total area of 19,880 km², a scale comparable to 30 Lake Meads. Although this area increase may appear substantial, it only expanded the global reservoir area in GRanD by a marginal proportion of 4%. Similar to the pattern of storage capacities, reservoir areas follow a quasi-Pareto distribution, meaning that smaller reservoirs tend to dominate the population (or number) whereas larger reservoirs dominate the area and storage. This explains why the increase of relative area is small, but the increase of absolute quantity is double the total number of reservoir polygons in GRanD. For example, 95% of the total reservoir area in GeoDAR comes from only 12% of the reservoir polygons (those larger than 10 km²), and about 90% of these large reservoirs are already included by GRanD (Fig. 11b). This pattern again suggests that the core value of GeoDAR is not to augment the global scale of reservoir area or storage, but to amplify the local details of smaller dams and reservoirs. Owing to the added details, the mode of reservoir area is on the order of 1-10 km² in GRanD but was refined by one order of magnitude to 0.1-1 km² in GeoDAR.
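The quasi-Pareto pattern described above (few large reservoirs holding most of the area) can be checked on any list of polygon areas with a few lines of Python. The sketch below is illustrative only: the function name and the area values are assumptions made up for this example, not GeoDAR data.

```python
from typing import Sequence, Tuple


def shares_above(areas_km2: Sequence[float],
                 threshold_km2: float) -> Tuple[float, float]:
    """Return (count share, area share) of polygons larger than the threshold."""
    large = [a for a in areas_km2 if a > threshold_km2]
    count_share = len(large) / len(areas_km2)
    area_share = sum(large) / sum(areas_km2)
    return count_share, area_share


# Hypothetical reservoir areas in km2: many small polygons and two large ones.
areas = [0.2, 0.5, 0.8, 1.5, 2.0, 3.0, 4.0, 6.0, 400.0, 900.0]
count_share, area_share = shares_above(areas, 10.0)
print(f"{count_share:.0%} of polygons hold {area_share:.1%} of the area")
```

With a heavy-tailed sample like this, adding many more small polygons changes the count share substantially while the total area barely moves, which is the behaviour reported for GeoDAR relative to GRanD.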
If we group the global dams by their documented main purpose, we observe in Fig. 15 that GeoDAR improved GRanD consistently in both dam count and storage capacity for all main purposes. For the same reason as explained above (i.e., the added reservoirs are small), the increases of dam count appear more prominent than those of storage capacity, and the increases of storage capacity from GRanD to GeoDAR are overall more evident than those from GeoDAR to ICOLD WRD. The exception is the dams with "others" or "unknown" purposes, whose total storage capacity in GeoDAR is lower. This is because when GRanD and WRD records conflict with each other in the GeoDAR harmonization process, the attribute values in GRanD took precedence only if they are available or valid ("others" or "unknown" was considered as an invalid reservoir purpose). Assuming that reservoir operations vary by purpose, this consistent improvement of the spatial inventory for all reservoir purposes, in conjunction with satellite-observed water budget variations, can help us better generalize reservoir operation rules, which are critical to improving water management.
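The precedence rule stated above (GRanD wins a conflict only when its value is available and valid, with "others" and "unknown" treated as invalid purposes) can be written as a short Python sketch. The function names and the purpose labels used in the calls are hypothetical; this is one possible reading of the rule, not the authors' harmonization code.

```python
from typing import Optional

# Purpose labels treated as invalid, following the rule stated in the text.
INVALID_PURPOSES = {"", "others", "unknown"}


def is_valid_purpose(value: Optional[str]) -> bool:
    """A purpose is valid if it is present and not 'others' or 'unknown'."""
    return value is not None and value.strip().lower() not in INVALID_PURPOSES


def harmonize_purpose(grand: Optional[str], wrd: Optional[str]) -> Optional[str]:
    """Prefer the GRanD purpose when it is valid, otherwise keep the WRD one."""
    return grand if is_valid_purpose(grand) else wrd


print(harmonize_purpose("Hydroelectricity", "Irrigation"))
print(harmonize_purpose("Unknown", "Irrigation"))
print(harmonize_purpose(None, "Flood control"))
```

Under this rule, a dam labelled "unknown" in GRanD but "Irrigation" in WRD is counted under irrigation, which is consistent with the lower "others"/"unknown" storage total noted for GeoDAR.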

Spatially complementary to GOODD
The recently published GOODD (V1) dataset (Mulligan et al., 2020) includes 38,667 dam points in the world, which were digitized by scanning through Google Earth imagery with support from regional inventories and the Shuttle Radar Topography and accuracy of GOODD also depend on the sizes of the dams or reservoirs. According to Mulligan et al. (2020), the resolution and quality of available Google Earth imagery during the digitization period were low in some parts of the world (such as China), and an experiment in the US showed that detecting dams and reservoirs from low-resolution imagery (e.g., Landsat Geocover 2000) may require a reservoir length greater than 500 m and a dam width greater than 150 m. These minimum size criteria do not necessarily overlap with those of ICOLD WRD, which instead emphasize the reservoir storage capacity and dam height (see Section 1).
Because of these digitizing limitations and criterion differences, the dam points in GeoDAR are spatially complementary to, rather than always duplicated by, those in GOODD across many regions. Figure 16 identifies four examples in Cerrado Brazil, northern China, southwestern France, and northern Pakistan, where a large proportion of the GeoDAR dams were not digitized by GOODD. Some of the dams that only appear in GeoDAR also comply with the minimum size criteria of GOODD, and examples are those enlarged in the right panels except the Duber Khwar Dam in Pakistan (35.119° N, 72.927° E; Fig. 16j), which was completed more recently in 2014. Since the area of the Duber Khwar Reservoir (about 0.05 km²) is smaller than the resolution of HydroLAKES (0.1 km²) and the dam completion year overlaps with the image acquisition period of the UCLA Circa 2015 Lake Inventory (from May 2013 to August 2015 (Sheng et al., 2016)), GeoDAR georeferenced the dam point but did not successfully retrieve the reservoir polygon.
To approximate how GeoDAR and GOODD complement each other globally, we intersected both dam datasets with the 30-m-resolution UCLA Circa 2015 Lake Inventory. As a result of manual snapping to the 30-arc-second HydroSHEDS streamflow network (Lehner et al., 2008), some of the points in GOODD ended up having substantial geographic offsets from the actual locations. For a pilot experiment, we applied a 1-km tolerance (about 30 arc-seconds on the equator) when intersecting the UCLA lake inventory with GOODD, and kept a 500-m tolerance as used in Section 2.5 for intersecting the lake inventory with GeoDAR. The result shows that among the 55,000 or so water bodies that intersect either dataset, 80% intersect with GOODD and the other 20% with GeoDAR alone. These statistics imply that GeoDAR may be able to expand the number of dams in GOODD by roughly 25% (i.e., 20% divided by 80%). Since we applied a larger tolerance for GOODD, this estimated expansion by GeoDAR is likely conservative (considering that the number of GOODD-intersecting reservoirs may be overestimated). If a 500-m tolerance is used for both intersections, the expansion by GeoDAR will increase to roughly 45%. In addition to the expanded spatial coverage, GeoDAR indexed each georeferenced dam point to a WRD and/or GRanD record and thus enabled access to multiple attributes, whereas GOODD carries no attribute information except the delineated reservoir catchments. These regional and global comparisons suggest that, even just with the geometric dam points, GeoDAR is not a simple replication of GOODD, but instead complements GOODD for an improved spatial coverage and density of global dams.
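The 25% expansion estimate above is the ratio of water bodies intersected by GeoDAR alone to those intersected by GOODD. A short Python sketch of that arithmetic follows; the function name is hypothetical, and the counts are rounded from the approximate figures quoted in the text (about 55,000 water bodies, split 80% and 20%).

```python
def expansion_ratio(n_goodd: int, n_geodar_only: int) -> float:
    """Relative expansion of GOODD obtained by adding GeoDAR-only water bodies."""
    return n_geodar_only / n_goodd


# Approximate counts derived from the percentages reported in the text.
total_water_bodies = 55_000
n_goodd = round(0.80 * total_water_bodies)    # intersect GOODD (1-km tolerance)
n_geodar_only = total_water_bodies - n_goodd  # intersect GeoDAR alone (500-m tolerance)

ratio = expansion_ratio(n_goodd, n_geodar_only)
print(f"Estimated expansion of GOODD by GeoDAR: {ratio:.0%}")
```

Because the denominator is the GOODD-intersecting count, any overestimate of that count from the larger 1-km tolerance lowers the ratio, which is why the text regards 25% as conservative.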

Data availability
GeoDAR v1.0 (dam points) and v1.1 (both dam points and reservoir polygons) are available for download from the Zenodo repository https://doi.org/10.5281/zenodo.6163413. The dam points are stored in both csv and shapefile formats, and the reservoir polygons are provided in shapefile format. Their attributes and values are described in Table 3 as well as on the repository website. The data usage information is described in Section 3.3. Other citation courtesy and disclaimer information is given in the Disclaimer section and on the repository website. All released datasets and information are available under the Creative Commons Attribution 4.0 International (CC-BY 4.0) license (https://creativecommons.org/licenses/by/4.0). Users who would like to link GeoDAR records to the proprietary WRD attributes they have purchased in advance from ICOLD should contact the corresponding author.

Summary and applications
We have produced a comprehensive and spatially resolved dam and reservoir dataset, GeoDAR, which complements and improves the existing global inventories of large dams. We demonstrated that the production of GeoDAR is not a direct compilation or collation of existing dam datasets. Instead, it involved the first known effort to georeference ICOLD WRD. This was jointly enabled by geo-matching (or table-associating) multi-source regional registers and geocoding descriptive attributes through the Google Maps API. This georeferencing effort resulted in GeoDAR v1.0, which contains 22,560 spatially resolved dam points, each associated with a WRD record, with an overall accuracy of 95%. Each of the georeferenced records was also labelled with a QA score, providing users with a reference to the quality of individual dam locations. Our georeferencing process and accuracy validation, as we have elaborated in substantive detail, have important methodological value for future expansions of spatial dam inventories using similar approaches, such as Geo-Wiki and OpenStreetMap.
To further ensure the optimal inclusion of the world's largest dams, we harmonized the georeferenced WRD (or GeoDAR v1.0) carefully with GRanD v1.3. Using the harmonized dam points as spatial identifiers, most of their reservoir boundaries were then retrieved from high-resolution water body datasets. This ICOLD-GRanD harmonization and the subsequent reservoir retrieval resulted in GeoDAR v1.1, our end product, which holds 24,783 dam points (including 23,974 linked to WRD) and 21,515 reservoir polygons. This product spatially resolved 44% of the entire ICOLD WRD by dam count and more than 90% by reservoir storage capacity. Since most of the world's largest reservoirs (e.g., >0.1 km³) are already included in GRanD, GeoDAR adds limited improvements (by 4-27%) to the total reservoir area, storage capacity, and catchment area. However, by including many smaller dams, particularly in lower and middle latitudes, GeoDAR is triple the size of GRanD in terms of dam and reservoir quantity. For this reason, one of the major improvements of GeoDAR is its unparalleled ability to capture relatively small dams, or in other words, to enhance the spatial detail of global dam and reservoir distributions.
Besides an improved quantity and spatial detail, another unique value of GeoDAR is its capability of bridging the locations of dams to a broad suite of attributes that are essential to scientific applications. A standing dilemma of existing global dam datasets is the divergence between the focus on dam quantity or spatial detail and the provision of detailed attributes for a limited dam quantity. This dilemma was partially ameliorated by GeoDAR because its georeferenced dams and reservoirs were explicitly indexed to WRD and/or GRanD records where many attributes are available. Since the original WRD is not georeferenced, our perception was that the task of georeferencing WRD to enable a spatially explicit application of the attribute information, even at regional scales, may fall on individual users. To avoid the duplication of efforts and to facilitate scientific applications, we performed this comprehensive georeferencing on the entirety of ICOLD WRD as thoroughly as possible, and hereby release the resultant dam coordinates and reservoir polygons to the public as part of GeoDAR. We would like to reiterate the disclaimer that GeoDAR does not directly contain, and neither do we intend to release, the original WRD attribute data, which are proprietary to ICOLD. In other words, the association between GeoDAR IDs and WRD IDs exists but was purposefully encrypted. However, if individual users need GeoDAR records to be linked to the WRD attributes that they have already purchased from ICOLD, they can contact us, and on a case-by-case basis we may provide this assistance given that the users agree not to release the decryption key or the proprietary WRD attributes.
We envision that GeoDAR, with its enhanced spatial density and extended accessibility to essential attributes, will benefit a wide spectrum of disciplines and applications. It is worth noting that although most dams in GeoDAR are smaller than those in GRanD or AQUASTAT, they are still compliant with ICOLD's size criteria, which exclude countless tiny on-farm reservoirs and water storage tanks. Nevertheless, we have suggested from regional examples that GeoDAR partially complements some of the most extensive global dam inventories such as GOODD, despite GOODD having a larger number of dams. In this sense, even just with the 25,000 or so geometric dam points, GeoDAR contributes yet another fundamental extension to global water infrastructure databases. If these dam points are rectified to high-resolution hydrographic networks (such as MERIT Hydro (Lin et al., 2021; Yamazaki et al., 2019)), GeoDAR, together with other existing dam and barrier datasets, can help refine our understanding of how human water infrastructure has fragmented global rivers and their ecosystems (Belletti et al., 2020; Grill et al., 2019; Yang et al., 2022), especially with a more exhaustive inclusion of smaller and/or headwater catchments.
Alongside the detailed dam points, GeoDAR's reservoir boundaries provide thus far the most comprehensive global base map for assessing reservoir dynamics and the impacts of human water regulation. In combination with the expanding constellation of satellite sensors (e.g., ICESat-2, Sentinel-6, and the forthcoming SWOT), this high-resolution base map will, for instance, enable a more complete and accurate monitoring of water storage variation and surface evaporation in global reservoirs (Biancamaria et al., 2016; Chen et al., 2021; Cretaux et al., 2016; Zhao and Gao, 2019a). Tracking the spatiotemporal balance between reservoir water storage and evaporative loss will help strategize regional water management under a warming climate (Cretaux et al., 2015). Since our knowledge and understanding improve as observations increase, the observed water storage dynamics for an increased quantity of reservoirs will inevitably entail a more realistic generalization of the reservoir operation rules. This is particularly true if attribute information such as reservoir purpose and storage capacity is also utilized. Considering that small but widespread reservoirs have a strong cumulative impact on discharge (Habets et al., 2018; Lin et al., 2019), the improved operation rules and the fine details of reservoir storage changes will benefit discharge estimations from hydrological models. From another perspective, GeoDAR's reservoir polygons can also help refine surface water typology, either by directly using them to mask artificial impoundments from natural lakes, or by expanding the training pool to enhance machine learning algorithms so that additional reservoirs can be detected (Fang et al., 2019). A refined water typology map will, in turn, assist other analysis tools in improving our assessments of how human footprints alter surface hydrology and its related biodiversity and ecosystem health.

Code availability
Python scripts for geo-matching, geocoding, and reservoir assignment are publicly available at https://github.com/surfhydro/georeferencing-ICOLD-dams-and-reservoirs. We request users who adapt or use the scripts to cite Wang et al. (2021).

Competing interests
The authors declare no conflict of interest.