Comment on essd-2021-468

The glacial lake dataset presented in 'Landsat and Sentinel-derived glacial lake dataset in the China-Pakistan Economic Corridor from 1990 to 2020' identifies and classifies lakes in the CPEC region of High Mountain Asia at three time steps: 1990, 2000 and 2020. Lakes are identified from Landsat and Sentinel-2 optical satellite imagery using a semi-automated approach that utilises the well-established NDWI (Normalised Difference Water Index) method. Statistics and analysis of lake abundance and size distribution are presented, along with changes in lakes over the course of the time series and an intercomparison of the Landsat- and Sentinel-derived lake outlines. The dataset is then compared to similar glacial lake datasets from the same region, in order to examine and evaluate discrepancies.
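For reference, the NDWI approach summarised above can be sketched in a few lines of Python. This is a minimal illustration using the commonly used green/near-infrared form of the index, not the authors' workflow: the function names, the toy reflectance values and the fixed threshold of 0.0 are assumptions made here (the manuscript derives its threshold from a histogram).

```python
def ndwi(green, nir):
    """Return NDWI = (green - nir) / (green + nir) for a single pixel."""
    total = green + nir
    if total == 0:
        return 0.0  # no signal in either band; treated as non-water here
    return (green - nir) / total


def water_mask(green_band, nir_band, threshold=0.0):
    """Flag pixels whose NDWI exceeds the threshold.

    The default threshold is a placeholder assumption, not a value
    taken from the manuscript.
    """
    return [
        [ndwi(g, n) > threshold for g, n in zip(g_row, n_row)]
        for g_row, n_row in zip(green_band, nir_band)
    ]


# Toy 2x2 reflectance grids (hypothetical values): top row water-like,
# bottom row land-like.
green = [[0.10, 0.08], [0.05, 0.30]]
nir = [[0.02, 0.03], [0.25, 0.35]]
mask = water_mask(green, nir)
```

Water absorbs strongly in the near-infrared, so water pixels give positive index values and most land surfaces give negative ones; the semi-automated part of such a workflow lies in how the threshold is chosen and how the resulting outlines are edited.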

This is a valuable dataset that I foresee will be readily used by the cryosphere and hydrology research communities. In particular, the use of two highly detailed lake classification systems (based on Gardelle et al., 2011, and a modified version of Yao, 2018) is a unique aspect of the dataset that is insightful alongside the general size and abundance information. This type of classification is seldom seen in glacial lake datasets, and reflects the thoroughness of the dataset.
The manuscript is structured in a clear and concise manner, guiding the reader through the dataset methods and description, results and statistics from the dataset, followed by an evaluation of the dataset scope and certainty.
Several key points need to be addressed, which are detailed below, largely regarding the dataset itself and the definition of a glacial lake. The comparison to similar datasets is flawed, given that many of the discrepancies are due to differing definitions of a glacial lake rather than the classification method itself. Once these major revisions have been addressed, the manuscript and associated dataset will be a great addition to ESSD.

Dataset transparency
A large part of the presented dataset is manually derived (metadata generation, georeferencing, and outline modifications). This can make reproducibility challenging. I would like to see a version of this dataset provided in the supplementary material that presents the dataset before manual intervention/inputs. Readers could then see the dataset before and after manual modifications, and tangibly distinguish the automated and user-defined components of the methodology presented.

Definition of a glacial lake
A large focus of the manuscript is glacier-related hazards, specifically GLOFs and draining lakes that are either on or share a boundary with a glacier. However, the dataset includes lakes that neither are influenced by nor affect glacier dynamics, such as lakes that are hydrologically disconnected from a glacier. The abundance of glaciologically unconnected lakes markedly influences the identified trends in the dataset, such as the visible influence on lake abundance in GLCS1. In addition, the dataset lacks a spatial filter relative to the ice margin. If indeed the aim of the dataset is to inform on glacier-related hazards, this dataset should focus exclusively on active glacial lakes, rather than active and ancient lakes.
Another aspect is the small threshold size used for the glacial lake dataset. Again, if the focus of this paper is glacier-related hazards then small lakes (<0.05 sq km; Shugar et al., 2021) are largely irrelevant to this study as they have limited GLOF impact. These small lakes make up over 80% of the dataset and heavily influence the identified temporal trends.
In order to overcome this, I would suggest shifting the focus of the manuscript away from glacier-related hazards and framing the manuscript around the importance of freshwater transfer and storage in the region. Whilst Section 4.1 adequately outlines the definition of a glacial lake, I think a brief definition should be given early in the manuscript to help frame its focus. Additionally, I would like to see a passage in the results/discussion that analyses active glacial lakes, under which their relation to glacier hazards and GLOFs can be addressed. The comparison to other glacial lake datasets should be revisited to provide an adequate examination that focuses on discrepancies in the classification methodologies rather than the definition of a glacial lake.
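As a rough illustration of the subsetting suggested in this section, the Python sketch below filters a lake inventory by minimum area and by distance to the nearest glacier margin, and reports the share of lakes below the size threshold. It is only a sketch: the record fields (`area_km2`, `dist_to_glacier_km`) and the toy values are hypothetical, and the distance to the ice margin is assumed to have been computed beforehand. The 0.05 sq km threshold and the 10 km buffer are the values referred to elsewhere in this review.

```python
MIN_AREA_KM2 = 0.05   # size threshold referred to above (Shugar et al., 2021)
MAX_DIST_KM = 10.0    # buffer attributed to Wang et al. in the specific comments

# Hypothetical inventory records; field names are assumptions for this sketch.
lakes = [
    {"id": "A", "area_km2": 0.01, "dist_to_glacier_km": 0.0},
    {"id": "B", "area_km2": 0.03, "dist_to_glacier_km": 2.5},
    {"id": "C", "area_km2": 0.20, "dist_to_glacier_km": 0.0},
    {"id": "D", "area_km2": 0.80, "dist_to_glacier_km": 15.0},
    {"id": "E", "area_km2": 0.06, "dist_to_glacier_km": 4.0},
]


def share_below(records, min_area):
    """Fraction of lakes smaller than min_area (0.0 for an empty inventory)."""
    if not records:
        return 0.0
    small = sum(1 for lake in records if lake["area_km2"] < min_area)
    return small / len(records)


def hazard_subset(records, min_area, max_dist):
    """Keep lakes that are large enough and close enough to the ice margin."""
    return [
        lake
        for lake in records
        if lake["area_km2"] >= min_area and lake["dist_to_glacier_km"] <= max_dist
    ]


small_share = share_below(lakes, MIN_AREA_KM2)
kept = hazard_subset(lakes, MIN_AREA_KM2, MAX_DIST_KM)
kept_ids = [lake["id"] for lake in kept]
```

Re-running the abundance and area statistics on a subset like `kept`, alongside the full inventory, would show how much of the reported trend is driven by small or glaciologically unconnected lakes.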

Broader overview of remote sensing classification methods
The introduction (L86-103) focuses solely on optical classification methods, which falsely presents them as the only classification methods readily used in remote sensing. I would like to see the overview include other remote sensing classification methods, namely SAR backscatter classification, but also alternative approaches such as hydrological sink analysis and land surface temperature.
I am not sure if there are any studies in this region where alternative classification methods are used to detect water bodies, but if there are, I think they would be a great addition to the dataset comparison section, serving as an inter-comparison of methodologies beyond similar optical classification approaches.

Specific comments
L41-66: I think this is a detailed and concise overview of the importance of glacial lakes and GLOFs in a regional context. However, I think a global perspective is needed to thoroughly illustrate the significance of this study, especially if you are referring to global studies of glacial lakes, such as Shugar et al. (2021). Please include a sentence or two near the beginning about glacial lakes and GLOFs globally (i.e. importance, general trends etc.)
L67-85: You largely focus on remote sensing efforts in HMA regional studies, but there are also references to papers from other regions such as Greenland and the Alps. Either open up this section as an overview of remote sensing studies from all regions, or keep it confined to the HMA region. Many regional studies have been published recently (e.g. Alaska, Rick et al., 2022; Greenland, How et al., 2021), not just in HMA, so I would recommend widening this section to outline the methods in a general context, rather than focusing on HMA.
L92: What exactly do you mean by object-oriented classification here? This term is generally used in programming rather than in reference to a classification approach. Please change this, or clarify what is meant here; preferably with a more suitable term.
L117-119: Are these sub-basins divided by catchments and/or watershed? What determines these sub-basins?
L132-170: Great outline of data sources.
L178: Why are landslide-dammed lakes irrelevant to glaciation? Can some glacial lakes also be landslide-dammed lakes?
L199: Change 'the method automatically generated the histogram...' to 'the method calculated the histogram...'
L201: Change 'interactively' to 'manually'. In reference to this comment and the last, I think it needs to be clear in the methodology how this approach is 'semi-automated'.
L224-228: False classifications from cloud and topographic shadows can be eliminated with cloud and terrain masking, which are well-established remote sensing methods in land classification. Why did you choose not to include this in the automated component of your workflow?
Table 1: The characteristics of a proglacial lake should specify that these lakes share a boundary with the ice margin, according to your definition. 'Shared boundary' is a better description than 'connected with glaciers', as the latter could be interpreted as hydrologically connected instead of physically adjacent.
Table 2: There must be occurrences where a lake's formation and/or dam material properties are ambiguous (especially in relation to GLCS2), even from Google Earth imagery. I see in the dataset that there are no instances where a lake's classification is determined as uncertain, even though you state later on that occasional misclassifications are inevitable (L561). In such instances of ambiguous lake types, how do you decide the classification?
L272-273: Please provide references to studies that use lake perimeter and displacement error to estimate uncertainty.
L339: 'proglacier' >> 'proglacial'
Figure 5: The four maps are somewhat repetitive and it is difficult to see differences between the Sentinel and Landsat lake sizes/abundance from this. I would suggest changing this figure to have an overview map on the left showing all detected lakes from both methods, and a series of inset maps to the right displaying a closer look at certain regions of interest, divided into Sentinel and Landsat lakes. Also, maybe change the outline colour of the lake points to a darker shade, as it is hard to identify the lake points in the current figure.
L352: This is a hanging line, and I am not sure which panels and sub-graphs are being referred to here. Does this belong somewhere else or is this a fault with the journal formatting?
Table 5: Can you include some statistics on the link between lake size and % overlap between the Sentinel-2 and Landsat counts? This would help gauge how much spatial resolution (as distinct from image acquisition) affects lake classification in this study.
L441-443: Are these lakes persistently large or just at a particular time step? Do you have evidence as to why they are disproportionately large?
L481-485: Can you state here the number of instances where overlapping acquisitions were acquired?
L492: 'approximate' >> 'approximately'
L503-504: Are there any studies that present glacial lake datasets derived from Sentinel-2? If so, please reference them here. If not, then change this to state that there are no comparable datasets, rather than a scarce number of datasets.
L515-521: I think, similar to your suggestion regarding landslide-dammed lakes, a likely answer is that Wang et al. focus more on glacier-connected lakes, given that they adopt a 10 km buffer to filter out unconnected lakes. They therefore identify an increasing trend, possibly reflective of a subset of your lake types. Could you subset your lake dataset to match the lakes identified by Wang et al., and examine whether you also see this increasing trend in your subsetted dataset? (And perhaps also include landslide-dammed lakes for the purpose of this comparison?)
L538-544: Discrepancies in glacial lake datasets can be because of minimum lake size, classification method (i.e. not just optical), image acquisition and post-filtering. However, if the purpose of this dataset is to 'further promote the capacity of GLOF risk assessment and predicting glacier evolutions' then I am unsure why there is no spatial filter (relative to ice margin position) adopted to remove lakes that are unconnected to the glacial system. I think the focus of this study needs to be shifted (as stated earlier in major comments), and further analysis needs to be presented that demonstrates changes in GLOF and glacier-fed lakes specifically (i.e. filtered by lake type and size). See major comments for full details.
L545-564: This is a valuable section to include in the study. The temporal range of these datasets and limited image availability (especially in formative years) will not adequately capture the dynamic nature of draining glacial lakes; such datasets therefore serve as a gauge of long-term, regional trends rather than individual lake change.
L565-575: It is great to hear that this work will be continued, and new time steps will be included in the dataset in the future.