High-resolution dataset of thermokarst lakes on the Qinghai-Tibetan Plateau

High-resolution dataset of thermokarst lakes on the Qinghai-Tibetan Plateau Xu Chen , Cuicui Mu a, b, c, , Lin Jia , Zhilong Li , Chengyan Fan , Mei Mu , Xiaoqing Peng , Xiaodong Wu b, e a Key Laboratory of Western China's Environmental Systems (Ministry of Education), College of Earth 5 and Environmental Sciences, Lanzhou University, Lanzhou, 730000, China b Cryosphere Research Station on Qinghai-Tibetan Plateau, State Key Laboratory of Cryospheric Science, Northwest Institute of Eco-Environment and Resource, Chinese Academy of Sciences, Lanzhou, 730000, China c Southern Marine Science and Engineering Guangdong Laboratory (Zhuhai), 519000, China 10 d University Cooperation of Polar Research, Beijing, 100875, China e University of Chinese Academy of Sciences, Beijing, 100049, China

As the largest middle-low latitude and high-altitude permafrost region, the Qinghai-Tibetan Plateau (QTP) occupies a vast area underlain by permafrost, which is estimated to be as high as 1.06 × 10 6 m² km² (Zou et al., 2017). Owing to its middle-low latitude, the permafrost on the QTP is characterized as being relatively thin, but with thick active layers and high ground temperatures (Ran et al., 2018;Ran et al., 2020). In recent decades, permafrost on the QTP has experienced obvious degradation, as is indicated by 65 the increasing ground temperatures  and further thickening of active layers . Meanwhile, accelerated formation of thermokarst terrains, including permafrost collapse, thaw slump, and thermokarst lakes, has also been observed (Luo et al., 2019;Huang et al., 2020;Mu et al., 2020b). Among these features, thermokarst lakes are of the highest concern. On the QTP, most thermokarst lakes have been reported from the middle plateau (Niu et al., 2011), and in previous reports, 70 much attention has been paid to thermokarst lakes for their ability to cause serious thermal erosion and permafrost thawing, leading to the instability of road embankments (Niu et al., 2011). Recently, it has also been recognized that these lakes can release considerable greenhouse gas into the air (Wu et al., 2014;Mu et al., 2016;Mu et al., 2020a). Climate warming is expected to lead to an increase in the number of thermokarst lakes forming in continuous permafrost areas (Wang and Mi, 1993), yet the identification 75 and investigation of thermokarst lakes has mainly been conducted only at the local scale (Niu et al., 2008).
For example, in the Beiluhe River basin, located in the middle plateau with an area of 2513.6 m² km² , it was found that thermokarst lakes showed an increasing trend from 1969 to 2010 (Luo et al., 2015). Thus far, the distribution and changes of thermokarst lakes at the larger plateau scale remain unknown, and there is an urgent need to establish a high-resolution dataset of thermokarst lakes on the plateau in order 80 to provide better scientific data for Earth System Models.
In this study, we extracted water bodies using the normalized difference water index (NDWI) from a large number of Sentinel-2 data based on the Google Earth Engine (GEE) platform (Mcfeeters and S., 1996;Ouma and Tateishi, 2006;Xu, 2006). Because this method has been associated with the overestimation of water bodies, we also used visual interpretation to calibrate the automatically extracted 85 water vectors and imagery. In addition, field-based unmanned aerial vehicle (UAV) imagery and field data retrieved from the literature were used to verify the thermokarst lake distribution. We further examined the relationships between the distribution of thermokarst lakes and temperature, precipitation, active layer thickness, vegetation coverage, and normalized difference vegetation index (NDVI). The https://doi.org/10.5194/essd-2020-378 main aims of this study were to: 1) establish a high-resolution dataset of thermokarst lake distribution on 90 the QTP, and 2) explore the effects of environmental factors on the distribution of thermokarst lakes.

Study area
The QTP is the largest and highest plateau in the world, with an area underlain by permafrost of approximately 1.06 × 106 m² km² (Zou et al., 2017). From 1980 to 2018, the air temperature, precipitation,

95
and soil water content in the permafrost region showed a significant increasing trend (Yang et al., 2019;Zhao et al., 2019). The largest permafrost thickness was found to be ~128 m, and the storage of underground ice is estimated to be ~1.27 × 10 4 km3 of water equivalent (Cheng et al., 2019). The active layer thickness on the QTP ranges from 100 to 400 cm, and the active layer thickness along the Qinghai-Tibet Highway has increased by 19.5 cm/10a from 1982 to 2018 . In relation to the 100 mean annual ground temperature (MAGT) in the permafrost area of the QTP, it has been found that -3 °C < MAGT < -1.5 °C accounted for 30.4%, -1.5 °C < MAGT < -0.5°C accounted for 22.1%, and -0.5 °C < MAGT < 0.5 °C accounted for 22.6% of the permafrost regions Ran et al., 2020). The total area of lakes (> 1 m² km² ) for the entire QTP is 5 × 10 4 m² km² , accounting for 1.67% of the total land area Zhang et al., 2020). In recent decades, the area and number of 105 lakes on the QTP have increased extensively (Zhang et al., 2017); for example, the number and total area of lakes greater than 1 m² km² expanded from 1081 and 4 × 10 4 m² km² in the 1970s to 1236 and 4.74 × 10 4 m² km² in the 2010s, respectively , owing to increased precipitation, glacier melting, permafrost degradation, and other changes in additional components of terrestrial water .

Data resources
Sentinel-2A, launched by the European Space Agency (ESA) on June 23, 2015, carries the sensor known as Multi-Spectral Instrument (MSI). MSI has 13 spectral bands covering the visible spectrum (VIS), near-infrared (NIR), and short-wave infrared (SWIR) parts. Sentinel-2A has three spatial resolutions of 115 10, 20, and 60 m, and a revisit time of 10 days (Drusch et al., 2012;Li and Roy, 2017   Center for Tropical Agriculture (CIAT) using the interpolation algorithm (Reuter et al., 2007).

140
The MAGT of the QTP was retrieved from Ran et al. (2020), of which the MAGT data were established using remote sensing data and the field-measured MAGT data of 233 boreholes (Ran et al., 2020). This dataset has a spatial resolution of 1 × 1 km. Vegetation type data with a spatial resolution of 1 × 1 km in permafrost areas of the QTP were obtained from Wang et al. (Wang et al., 2016), of which the land cover types were classified into five types: swamp meadow, meadow, steppe, desert steppe, desert, and barren

Research framework
The framework in this study comprises a collation of knowledge and formulation of the thermokarst lake inventory specifications, as well as the data preprocessing completed using GEE, manual vectorization of the thermokarst lakes, visual interpretation, and environmental factor extraction ( Figure. 2).

155
1. Specifications of the thermokarst lake inventory. Literature relevant to the investigation and recording of thermokarst lakes were collected. Various definitions and classifications of thermokarst lakes, as well as the methods adopted previously for lake boundary extraction and assessment of the extent of lake distributions, were summarized.

Thermokarst lakes identification
When ground ice or ice-rich permafrost thaws, thermokarst lakes or ponds gradually form due to the surface water accumulation following ground subsidence. Our field investigation showed that more than 90% of lakes along the Qinghai-Tibet Highway have an area of less than 5,000 m² , with an average area of 5,039 m² , and the largest thermokarst lake had an area of 4.49 × 10 5 m² (Niu et al., 2014). The Sentinel-

175
2A imagery is applicable to identify water bodies over 350 m² (Freitas et al., 2019). Therefore, we assumed that the area of thermokarst lakes in the permafrost regions ranged from 350 to 5.0 × 10 5 m² .
Although there is a possibility that additional water bodies outside of this area were also thermokarst lakes, this assumption does represent the most likely thermokarst lake distribution in permafrost regions.

GEE processing
GEE is a geospatial processing platform which utilizes Google's cloud computing resources and large datasets, making it possible to process, compute, and analyze large and useful data sets from MODIS data and Sentinel satellite data, as well as climatic and hydrological data, and other reanalysis products (Gorelick et al., 2017). Through the GEE platform, to automatically extract the total water body

Normalized Difference Water Index (NDWI)
Based on the GEE platform and Sentinel 2A L1C data, the NDWI was used to extract the water bodies (Mcfeeters and S., 1996).The calculation formula is as follows Eq. (1): where GREEN is the green light band and NIR is in the near-infrared band. NDWI is effective in extracting water information from images by inhibiting vegetation and highlighting water bodies.
Sentinel-2 MSI images include SWIR bands with a resolution of 20 m and green and near-infrared bands with a resolution of 10 m, making it possible to extract water bodies with a spatial resolution of 10 m.

Extraction of water body boundary
The water index can highlight the difference between the water body and other terrestrial features, while the threshold value should be established to extract the water body boundary. It has been suggested that threshold values should be adjusted to achieve the optimal segmentation effect (Lei Ji, 2009;Huang et 200 al., 2018). In order to determine the optimal threshold, other data such as high-resolution remote sensing images and field investigative data from the same area can be combined to reduce the errors of water bodies (Liu et al., 2012). Generally, lakes are always at a relatively stable water level with a flat surface, and the gradients of lake surfaces are relatively slight. Thermokarst lakes in the QTP are mainly distributed in high plains or flat low-lying intermontane basins and valleys, where the slopes are less than 205 3° (Pan et al., 2014;Qin et al., 2016). The threshold values for a large number of lake sample images were studied, and it was found that 0.1, as the threshold during water extraction, could more accurately extract the exact area of the potential lake area (Li and Sheng, 2012). Therefore, the threshold value of water body index extraction in this study was set to 0.1.

Visual interpretation
Visual interpretation is the process that obtains the information of specific objects from remote sensing images through direct observation or auxiliary interpretation instruments. Due to the vast area of the QTP, The complex environmental conditions of the QTP, such as massive clouds, snow cover, and glaciers, 215 make the data process relatively inaccurate. In addition, many lakes and rivers are interconnected, and are thus difficult to separate using automatic methods. Therefore, we used the visual digitization method to create the final thermokarst lake map. Although this is a time-consuming process, especially for such a large area, this method allowed for lake boundary inspection with the highest quality control, and ensured consistency. Therefore, on the basis of online water body extraction, images of the corresponding 220 year and the same period were downloaded, and the visual interpretation method was used to correct the extraction results and eliminate the influence of rivers. The Sentinel-2A images used in the study comprised more than one hundred scenes over three months of visual interpretation, while the large structural lakes, glacial, and river water bodies, which were automatically extracted by the Google Earth Engine platform, were accordingly removed, thereby correcting the locations of the thermokarst lakes.

230
where NIR is the reflectance value of the near-infrared band and RED is the reflectance value of the red band. Based on the GEE platform and Landsat8 L1T data, NDVI data for the QTP in 2018 was extracted.
To obtain the NDVI values in a given catchment, we set a buffer zone around the thermokarst lake, and then extracted the NDVI values in the buffer areas, and further calculated the average value in the buffer zones.

Accuracy verification
The thermokarst lakes along the Qinghai-Tibet Highway were surveyed using an unmanned aerial vehicle kinematic (RTK) positioning sites were also used for correction and accuracy evaluation, and the calculated mean ground sampling distance (GSD, equivalent to the ground resolution in satellite remote sensing) was 2.60 cm. The accuracy assessment showed that some relative errors for small thermokarst lakes were present, while the relative error was close to 0 for large lakes (Table 1).

Distribution of thermokarst lakes
A total of 121151 thermokarst lakes were identified on the QTP, comprising a total thermokarst lake area 255 of 1730.34 m² km² , and accounting for 0.16% of the permafrost area ( Figure. 4). The lakes were mainly distributed in the central and western regions, with an overall density of 12/100 m² km² . Thermokarst lakes less than 1,000 m² accounted for 24% of the total thermokarst lake numbers, while those larger than 150,000 m² accounted for 50% of the total thermokarst lake area ( Figure. 5).

Relationship between thermokarst lakes and environmental factors
Thermokarst lakes on the QTP were closely related to environmental factors ( Figure. 8), as most lakes were distributed in the areas with an MAGT of -2 to 0 ℃ and an active layer thickness of 250-300 mm.
According to the distribution of ground ice content, 78% of thermokarst lakes on the QTP were 290 distributed in areas with ground ice contents higher than 20%. The areas with a mean annual air temperature of -10 to 5 ℃ and mean annual precipitation of 400-600 mm had the highest probability of potentially featuring thermokarst lakes. Soil texture was also associated with thermokarst lakes, as the highest occurrence of lakes appeared in areas of loamy sand. Although alpine wet meadow featured the https://doi.org/10.5194/essd-2020-378 lowest area and number of thermokarst lakes, alpine swamp meadow had the highest probability.

295
Thermokarst lakes were also related to the NDVI values, as most lakes were distributed in the areas with NDVI values less than 0.1 ( Figure. 9).

Comparison and limitations
In general, lakes larger than 1 m² km² on the QTP have been well documented Wan et al., 2016;Zhai et al., 2017;Zhang et al., 2017), while fewer reports exist on smaller water bodies.
However, in permafrost regions, omitting small lakes and ponds leads to large underestimations of water 310 body count and water surface area (Muster et al., 2017). In recent years, the increased availability of high-resolution satellite imagery, such as the Sentinel-2A satellite, made it possible to study remote thermokarst lakes in large areas of the unpopulated zone (Kokelf and Jorgenson, 2013). Theoretically, Sentinel-2 images at a spatial resolution of 10 m can identify ponds smaller than 400 m² . In many cases, ponds smaller than 400 m² can be reliably mapped, and it has been suggested that the Sentinel-2A data

315
can identify the minimum water body of 350 m² (Freitas et al., 2019). In our study, we extracted water bodies larger than 400 m² on the QTP, and the UAV multispectral image and ground survey data showed high accuracy. Notably, the relative error of validation for the thermokarst lakes was related to the https://doi.org/10.5194/essd-2020-378 thermokarst lake area, as the smaller lake areas had larger relative errors. This may be explained by the fact that the small thermokarst lake areas are typically relatively dynamic, showing strong seasonal 320 changes. Overall, our results showed that the dataset of thermokarst lakes was reliable.
Moreover, our results show that the total area of thermokarst lakes on the QTP is 173,0.34 m² km² , accounting for approximately 0.2% of the permafrost area and 4% of the total water area of lakes and ponds. The average density of thermokarst lakes was 12/100 m² km² , which is lower than the circumarctic region (28/100 m² km² ) (Nitze et al., 2018). This can be explained by the fact that the QTP has 325 lower underground ice content than the Arctic (Mackay, 2015). Thermokarst lakes on the QTP less than 1000 m² accounted for 24% of total thermokarst lake numbers, while it has been found that thermokarst lakes less than 1,000 m² in the Siberian also accounted for a large proportion of the total lakes (Grosse et al., 2008). These findings confirm that thermokarst lakes usually have smaller areas. In this study, the area of thermokarst lakes larger than 150,000 m² accounted for 50% of the total thermokarst lake area,

330
suggesting that large thermokarst lakes play an important role in carbon and water cycling (Olefeldt et al., 2016). According to the QTP lake survey data from 2000, the density of lakes over 1 m² km² on the QTP was 1.1/100 m² km² . Compared with the density of thermokarst lakes in this study, it can be seen that thermokarst lake density has increased on the QTP .
According to our results, thermokarst lakes on the QTP are mainly distributed in places where the mean 335 annual air temperature ranges from -10 to -5 °C , and the area of MATGs of -2 to 0 ℃ occupied the most thermokarst lakes. Previous results have shown that the average active layer thickness on the QTP was approximately 2.3 m (ranging from 2.2-2.4 m), 80% of which was concentrated in the depth range of 0.8-3.5 m (Qin et al., 2018). Although these factors are associated with the development of thermokarst areas, it is difficult to draw robust conclusions about the relationships between thermokarst lake areas 340 and factors related to the area of larger lakes.
Lower permafrost elevations occur largely in the eastern part of the plateau with higher precipitation. For the middle and western parts of the plateau, permafrost largely exists in areas with elevations higher than 4,000 m (Zou et al., 2017). Areas higher than 5,000 m largely belong to mountain areas, with rugged topography and steep slopes (Dong et al., 2010), which makes it difficult to form lakes. has higher NDVI values, accounts for only a small proportion of the vegetative land (4.18%) (Wang et al., 2016). As a result, wet meadows feature the lowest number of thermokarst lakes, yet wet meadows also feature the highest percentage of thermokarst lakes. This can be explained by the fact that the wet 350 meadows are mainly distributed in the eastern part of the plateau, which usually has an annual precipitation between and 400-600 mm (Gao et al., 2016). Although thermokarst lake areas have not been directly linked to precipitation in western Siberia (Karlsson et al., 2012), it has been suggested that precipitation could affect the total area of thermokarst lakes in Alaska (Swanson, 2019). Considering this on the QTP, precipitation is the main determinant of lake areas larger than 1 m² km² ,

355
thus it can be seen that thermokarst lakes on the QTP are mostly located in regions with ground ice content higher than 20%. This result is reasonable because the formation of thermokarst lakes is mainly due to ground ice melting (Grosse et al., 2008). In terms of the different soil textures, the highest thermokarst lake probability was found in the loamy sand-type soil. In general on the QTP, the soil textures are largely characterized by coarse particles (Li et al., 2015), and it should be noted that sandy 360 soils have little probability of thermokarst lakes because such soils are easily eroded or exhibit strong infiltration processes (Wakindiki and Ben-Hur, 2002).
Assessing the impact of climate change on the lake area is of great significance for water resource management and ecological protection . In the past 50 years, the average precipitation on the QTP has shown a slight increasing trend, while the average annual temperature has increased 365 significantly . Meanwhile, the NDVI values have increased significantly since the 1980s (Shen et al., 2015). Based on these trends and the formation mechanisms of thermokarst lakes, it could be inferred that thermokarst lakes will likely increase both in numbers and total area in the future, which may greatly affect land surface processes on the QTP.

Conclusions
In the QTP permafrost regions, approximately 121,000 thermokarst lakes larger than 400 m² were identified, comprising a total lake area of 1730.34 km² and accounting for 0.20% of the total permafrost area. Most of these thermokarst lakes are smaller than 50,000 m² and are mainly distributed at altitudes 380 of 4,750-5,250 m, with slopes of less than 5°. These lakes were mainly recorded in areas with an active layer thickness of 250-300 cm, mean annual air temperatures of -10 to 5 ℃, MAGTs of -4 to 0 ℃, and annual precipitation of 400-600 mm. The alpine desert steppe land type was found to feature the largest number of thermokarst lakes, followed by the alpine meadow, while the percentage of thermokarst lakes was highest in the wet meadow area. Owing to the current technical limitations, it was difficult to 385 investigate thermokarst lakes less than 400 m² in area, thus future work is required to create a dataset that includes these smaller water bodies.

Acknowledgements.
This work was supported by the National Natural Science Foundation of China (