A new global dataset of mountain glacier centerlines and lengths

The length of a glacier is a key determinant of its geometry and is an important parameter in glacier inventories and modeling; glacier centerlines are the lines along which the main flow of glaciers takes place and, thus, are crucial inputs for many glaciological applications. In this study, the centerlines and maximum lengths of global glaciers were extracted using a self-designed automatic extraction algorithm based on the latest global glacier inventory data, digital elevation model (DEM), and Euclidean allocation theory. The accuracy of the dataset was evaluated through random visual assessments and comparisons with the Randolph Glacier Inventory (RGI) version 6.0. A total of 8.25 % of the outlines of the RGI were excluded, including 10 764 erroneous glacier polygons, 7174 ice caps, and 419 nominal glaciers. A total of 198 137 glacier centerlines were generated, accounting for 99.74 % of the input glaciers. The accuracy of glacier centerlines was 89.68 %. A comparison between the dataset and the previous dataset suggested that most glacier centerlines were slightly longer than those in RGI v6.0, meaning that the maximum lengths of some glaciers had likely been underestimated.

The most noticeable distinction between glaciers and other natural ice bodies is their property of moving towards lower altitudes under the influence of gravity. Glacier flow lines correspond to a glacier's motion trajectories, and the main flow line is the key trajectory. Due to the lack of glacier velocity field data, the main flow lines cannot be obtained on a large scale. The glacier centerline, generated via the axis line method (Le Bris and Paul, 2013; Machguth and Huss, 2014; Kienholz et al., 2014; Zhang et al., 2021), is typically used to represent the main flow line. The glacier centerline is a critical parameter for analyzing the ice velocity field (Heid and Kääb, 2012; Melkonian et al., 2017), estimating glacier volumes (Li et al., 2012; Gao et al., 2018), and developing glacier models (Oerlemans, 1997; Sugiyama et al., 2007; Maussion et al., 2019).
Glacier length usually refers to the maximum length of a glacier centerline (main flow line) and represents the longest motion trajectory of a glacier, which is among the key determinants of glacier geometry and a basic parameter of glacier inventories (RGI Consortium, 2017) and modeling (Maussion et al., 2019). Glacier length fluctuations can be used to quantify glacier changes (Zhou et al., 2021a), such as by identifying glacier advancement, surge, or retreat. Glacier length fluctuations (e.g., Leclercq et al., 2014) have also been used to study the relationships with changes in glacier area (Winsvold et al., 2014) and the geometric structure of a glacier (Herla et al., 2017), estimate glacier volume in combination with the glacier area (Lüthi et al., 2010), and reconstruct annual averaged surface temperatures over the past 400 years on hemispherical and global scales (Leclercq and Oerlemans, 2011).
A complete global inventory of glacier outlines (RGI Consortium, 2017) was created following the Fifth Assessment Report of the Intergovernmental Panel on Climate Change (IPCC AR5). Three types of automatic and semiautomatic methods have been proposed to meet the demand for large-scale acquisitions of glacier lengths. First, there are typical hydrological analysis methods (Schiefer et al., 2008), but they result in lengths that are longer than equivalent maximum distances taken along typical longitudinal centerline profiles. The second type is a simplified algorithm based on the skeleton theory (Le Moine and Gsell, 2015), but it has not been widely used. Third, there are centerline methods based on the axis concept proposed by Le Bris and Paul (2013) and first applied to calculating global glacier length by Machguth and Huss (2014). However, with this type of algorithm, the glacier centerlines tend to be noticeably deflected by their tributaries (Le Bris and Paul, 2013). The cost grid-least cost route approach of Kienholz et al. (2014), based on the axis concept, is more accurate but also more labor intensive and time-consuming, which limits its application to global glaciers. The tradeoff function approach of Machguth and Huss (2014), which is based on the axis concept, has been applied to almost all global mountain glaciers but excludes the centerlines of the branches of glaciers. Despite many attempts to overcome these limitations in recent years (Yao et al., 2015; Yang et al., 2016; Ji et al., 2017; Hansen et al., 2020; Xia, 2020; Zhang et al., 2021), to date, global datasets of the centerline and length of mountain glaciers are rare. Based on our recent study on successfully extracting the glacier centerline using the Euclidean allocation method (Zhang et al., 2021), we aim to combine publicly available digital elevation data into one global digital elevation model (DEM), at 30 m resolution and extending from 90° N to 90° S, to check and correct the global glacier outlines and obtain a new graphic dataset of the centerline and length of global mountain glaciers.

Study region and data
The glacier dataset used in this study was the Randolph Glacier Inventory version 6.0 (RGI v6.0; http://www.glims.org/RGI/randolph60.html, last access: 15 November 2021) released via the Global Land Ice Measurements from Space initiative (GLIMS), which is a globally complete collection of digital glacier outlines, excluding ice sheets (Pfeffer et al., 2014). RGI v6.0 includes 216 502 global glaciers (215 547 glaciers described in the product handbook), with a total area of 705 738.793 km² (RGI Consortium, 2017). All glaciers can be divided into 19 first-order glacier regions (Radić and Hock, 2010), which were used in our study (Fig. 1).
In total, five DEM products (Table 1) were used in this study. The National Aeronautics and Space Administration (NASA) DEM (NASADEM; https://lpdaac.usgs.gov/news/release-nasadem-data-products/, last access: 17 November 2021) was released by the Land Processes Distributed Active Archive Center (LP DAAC) in January 2020. NASADEM is the reprocessed version of the NASA Shuttle Radar Topography Mission (SRTM) data (Farr et al., 2007), with a low mean absolute error (MAE; Carrera-Hernández, 2021) and improved root mean square error (RMSE; Uuemaa et al., 2020). Covering the zonal extent from 56° S to 61° N, NASADEM was used as the preferred DEM in this study because of its superior performance. The Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) is a 14-channel imaging instrument that has been operating on the Terra satellite of NASA since 1999. The ASTER Global Digital Elevation Model (GDEM) version 3 (https://lpdaac.usgs.gov/news/nasa-and-meti-release-aster-global-dem-version-3/, last access: 17 November 2021; Abrams et al., 2020) was released by Japan's Ministry of Economy, Trade, and Industry (METI) and NASA in July 2019. Using the Ice, Cloud, and Land Elevation Satellite (ICESat) Geoscience Laser Altimeter System (GLAS) data, Carabajal and Boy (2016) found that ASTER GDEM v3 displayed smaller means, similar medians, and less scatter than the ASTER GDEM v2 in Greenland and Antarctica. ASTER GDEM v3 was used as the second priority DEM to cover the zonal extents of 56-83° S and 61-83° N.
Earth Syst. Sci. Data, 14, 3889-3913, 2022, https://doi.org/10.5194/essd-14-3889-2022
NASADEM and ASTER GDEM v3 do not cover all glacierized regions, as they are missing parts of the polar regions and the Kamchatka Peninsula. Because of their high temporal and spatial resolutions at high latitudes, the Reference Elevation Model of Antarctica (REMA; Howat et al., 2019) and ArcticDEM (https://www.pgc.umn.edu/data/arcticdem/, last access: 17 November 2021) were preferred as the supplementary data of our preliminary studies in these glacier regions. Nevertheless, ArcticDEM and REMA were found to be inadequate because of insufficient coverage and sporadic data. Some other DEMs in high-latitude areas (Fan et al., 2022; Zhang et al., 2022) were also not considered because their spatial resolutions are very different from that required in this study. Therefore, the wide-coverage Copernicus DEM (https://spacedata.copernicus.eu/web/cscda/cop-dem-faq, last access: 17 November 2021) was finally selected as the supplementary dataset for glacier regions not entirely covered by NASADEM and ASTER GDEM v3. The Copernicus DEM was released in November 2020. The accuracy assessment undertaken by its development team (the product handbook), comparing TanDEM-X/WorldDEM data (TanDEM-X is a TerraSAR-X add-on for digital elevation measurements; TerraSAR-X is an X-band imaging radar Earth observation satellite) with ICESat GLAS reference points, found an absolute vertical accuracy of approximately 10 m at the periphery of Antarctica and Greenland. In summary, NASADEM, ASTER GDEM v3, and Copernicus DEM were compiled to create a 30 m DEM that covered the study area completely.
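The DEM priority described above can be sketched as a simple latitude rule. This is only an illustration under a strong simplifying assumption: the function name `choose_dem` is hypothetical, and latitude alone does not capture the real coverage gaps (for example, the Kamchatka Peninsula), which were filled with the Copernicus DEM in this study.

```python
def choose_dem(lat_deg: float) -> str:
    """Pick a DEM by latitude, following the stated priority order.

    Assumption: coverage is decided by latitude only. In the study itself,
    regions not entirely covered by the first two products (e.g. Kamchatka)
    were also assigned to the Copernicus DEM.
    """
    if -56.0 <= lat_deg <= 61.0:
        return "NASADEM"          # preferred DEM, 56 S to 61 N
    if -83.0 <= lat_deg <= 83.0:
        return "ASTER GDEM v3"    # second priority, 56-83 S and 61-83 N
    return "Copernicus DEM"       # supplementary DEM for the remainder


for lat in (28.0, 70.0, -85.0):
    print(lat, choose_dem(lat))
```

A production workflow would more likely test tile availability per glacier than rely on a latitude band, so the rule above should be read as a summary of the stated priorities rather than as the selection logic actually used.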
In addition, graphical data (Machguth and Huss, 2014) of the glacier length in *.xy format (the Universal Transverse Mercator, UTM, projection), which correspond to the attribute of the glacier maximum length ($L_{max}$) in RGI v6.0, were collected in the High Mountain Asia (HMA) area. Because these data were obtained from an unofficial source, we could not access their documentation and recovered only the coordinates of points matching some of the glaciers in RGI v6.0. The registration of each *.xy file depends on matching its filename with the feature identifier (FID) of the glacier polygon of RGI v6.0 in the same glacier area. The glacier lengths that were successfully registered (MHMLDS) were used as the graphical validation data for this study.

Outline of workflow
This study relied on the following two key input datasets: the global glacier inventory and the compiled global glacier elevation. An outline of the workflow for establishing a new dataset of global graphic glacier centerlines and lengths is shown in Fig. 2. The process was divided into six parts, i.e., (1) design an algorithm to check all glacier outlines and exclude defective glacier polygons, (2) buffer glaciers to produce a mask containing global glaciers and their buffers, (3) mosaic the compiled global DEMs according to the mask in step 2 to prepare the global glacier elevation data, (4) determine the automatic extraction parameters of global glacier centerlines by repeated testing in each region, (5) input the global DEM, glacier outline dataset, and all parameters into the designed automatic extraction software (Zhang et al., 2021) to generate global centerlines and lengths, and (6) verify and compare with existing centerline results to evaluate the accuracy of the new datasets.
Note for Table 1: the interval in the "Extent" column represents all landmasses within the zonal range, although not all areas within the range may be covered.

Pre-processing of glacier outlines
This study had strict requirements for glacier outlines, and therefore, all glacier complexes were divided into individual glaciers prior to the centerline extraction. However, because of the limited semi-automatic glacier segmentation approach (Kienholz et al., 2013) and the high-priority strategy of completeness of coverage adopted by RGI v6.0 (RGI Consortium, 2017), some glaciers were not supported by our algorithm. These unsupported glaciers included three categories, namely glacier complexes with/without inaccurate segmentation (Fig. 3a-b), erroneous glacier outlines (Fig. 3c) resulting from the vectorization, and flawed glaciers (Fig. 3d-f) generated by the automatic extraction algorithm. For the third category, we designed an identification algorithm to mark and screen them (described in the last paragraph of this section). The flaws in these glacier outlines were mainly caused by topological errors of polylines/polygons, such as unclosed, sawtooth, and overlapping polygons. The first two categories did not affect the algorithm's normal operation; however, the extraction accuracy is not always guaranteed. We could not identify the source of the problem at the time of the study, and a solution is needed to improve the quality of the global glacier inventory.
In this study, we defined the external contour of a glacier ($P_{gec}$), namely the polygon corresponding to the longest closed polyline of the glacier, to reduce the storage of DEMs and improve the efficiency of batch processing. The buffer masks of all glaciers (buffer distance of approximately 100 m) were generated by their $P_{gec}$ to meet the requirement for the extent of input DEMs to be slightly larger than the $P_{gec}$. The buffer masks generated initially were partially broken because there were overlaps or gaps between adjacent polygons of the buffer zone; thus, polygons with a perimeter of less than 12 times the buffer mask distances of each region were removed.
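The perimeter rule used to clean the buffer masks can be illustrated with a short sketch. The function names, the ring representation (a list of projected coordinates in meters), and the example polygons are assumptions made for illustration; only the factor of 12 and the buffer distance of roughly 100 m come from the text.

```python
import math


def perimeter(ring):
    """Perimeter of a closed ring given as [(x, y), ...] in projected meters.

    The ring may or may not repeat its first vertex at the end; a repeated
    vertex only adds a zero-length segment.
    """
    n = len(ring)
    return sum(math.dist(ring[i], ring[(i + 1) % n]) for i in range(n))


def drop_buffer_slivers(rings, buffer_m=100.0, factor=12.0):
    """Keep rings whose perimeter is at least factor * buffer distance."""
    threshold = factor * buffer_m
    return [ring for ring in rings if perimeter(ring) >= threshold]


# Hypothetical example: one plausible mask polygon and one sliver.
mask = [(0, 0), (1000, 0), (1000, 1000), (0, 1000)]   # perimeter 4000 m
sliver = [(0, 0), (50, 0), (50, 50), (0, 50)]         # perimeter 200 m
cleaned = drop_buffer_slivers([mask, sliver])
print(len(cleaned))
```

With a 100 m buffer, the threshold is 1200 m, so the sliver is removed and the mask polygon is kept.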
The third category of flawed glaciers (Fig. 3d-f) was identified by obtaining $P_{gec}$. The most common error type was a glacier outline with two or more closed polylines with the same endpoint. These flawed glacier outlines were identified by assessing whether there were multiple polylines sharing endpoints after converting the glacier from a polygon to a polyline. However, these outlines did not include the unclosed types. There were a few glacier outlines that appeared to be closed polylines but had geometric flaws such as non-coinciding head and tail endpoints of the polylines.
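The two checks described above (polylines sharing an endpoint, and head and tail endpoints that do not coincide) can be sketched as follows. This is a simplified illustration, not the screening algorithm used in this study: the function names, the return labels, and the coordinate tolerance are hypothetical.

```python
from collections import Counter


def is_closed(line, tol=0.0):
    """True if the first and last vertices of a polyline coincide within tol."""
    (x0, y0), (x1, y1) = line[0], line[-1]
    return abs(x0 - x1) <= tol and abs(y0 - y1) <= tol


def classify_outline(polylines, tol=0.0):
    """Label an outline that has been converted from a polygon to polylines.

    Returns "unclosed" if any polyline has non-coinciding endpoints,
    "shared_endpoint" if two or more closed polylines start at the same
    vertex, and "ok" otherwise (e.g. an exterior ring plus separate holes).
    """
    if any(not is_closed(line, tol) for line in polylines):
        return "unclosed"
    starts = Counter(line[0] for line in polylines)
    if any(count > 1 for count in starts.values()):
        return "shared_endpoint"
    return "ok"


outer = [(0, 0), (10, 0), (10, 10), (0, 10), (0, 0)]
hole = [(4, 4), (6, 4), (6, 6), (4, 4)]
pinched = [(0, 0), (-5, 0), (-5, -5), (0, 0)]   # second ring touching at (0, 0)
open_ring = [(0, 0), (10, 0), (10, 10), (0, 10), (0, 0.3)]

print(classify_outline([outer, hole]))
print(classify_outline([outer, pinched]))
print(classify_outline([open_ring]))
```

The sketch only compares the start vertex of each ring, which is sufficient for the examples here; a real check would also need to handle rings that touch at an interior vertex.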

Preparation of input datasets
All the data associated with the dataset production were processed in units of first-order glacier regions. The input glacier outlines excluded all the defective glacier outlines. Similarly, the nominal glaciers (represented by an ellipse) and ice caps flagged in RGI v6.0, which are distinguished by two attributes, i.e., status (nominal glacier) and form (ice cap), were also excluded. The inspection of glacier outlines showed that there are 10 764 defective glacier outlines (FGODS) in RGI v6.0, accounting for approximately 4.97 % of the total (216 502; Table 2). After excluding nominal glaciers (461) and ice caps (7174), 198 646 glaciers remained as input glacier outlines (IGODS), accounting for 91.75 % of the global mountain glaciers.
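The percentages quoted above follow directly from the counts given in the text, as the short check below shows. The helper name `share` is illustrative; the excluded share is computed from the difference between the total and the input outlines rather than by summing the three excluded categories.

```python
TOTAL_RGI = 216_502   # glaciers in RGI v6.0
FLAWED = 10_764       # defective glacier outlines (FGODS)
INPUT = 198_646       # input glacier outlines (IGODS)


def share(part, whole):
    """Percentage of `part` in `whole`, rounded to two decimals."""
    return round(100.0 * part / whole, 2)


flawed_pct = share(FLAWED, TOTAL_RGI)
input_pct = share(INPUT, TOTAL_RGI)
excluded_pct = share(TOTAL_RGI - INPUT, TOTAL_RGI)
print(flawed_pct, input_pct, excluded_pct)
```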
The $P_{gec}$ of all glaciers in RGI v6.0 comprises the global glacier external contour dataset (GGECDS), which generated the global mountain glacier's buffer mask dataset (GGBMDS). The collected DEMs were extracted using GGBMDS, and 43 035 DEM tiles were generated. They were then mosaicked according to different first-order glacier regions to generate a global glacier elevation dataset (GGEDS). The details of the two input datasets are presented in Table 2.

Generation of centerline and glacier length
Glacier centerlines and lengths were automatically extracted with the GlacierCenterlines_Py27 tool (updated to version 5.2.1), which is based on the axis concept and Euclidean allocation (Zhang et al., 2021). The principle is briefly explained as follows: the highest and lowest points of the external outline of a glacier were extracted as two endpoints that divide the glacier outline into two parts. In the glacier polygon, points that have equal shortest distances to the two parts were identified as other vertices. The line formed by the two endpoints and these other vertices was regarded as the glacier centerline. The maximum length of glaciers was calculated using an algorithm similar to the critical path. The update focused on formulating the parameterization scheme (Appendix A; Table A1) for extracting global glacier centerlines and repairing some newly discovered bugs, such as an infinite loop in the process of auxiliary line extraction. All glacier outlines included in the IGODS were divided into 10 levels based on the proportion of cumulative area after ranking the area of all input glacier polygons from small to large (Table 3). The Albers projection (see the Supplement for detailed parameter files) with WGS1984 was used as a unified projection coordinate system for each glacier region. The empirical values of the other parameters were determined in repeated attempts, and their values were significantly correlated with the glacier scale. The generated glacier centerlines were merged according to the glacier regions. Then, the graphics and attribute information of glacier length were exported as corresponding independent Esri shapefiles. In addition, other data associated with the dataset production were exported, such as the segmentation results of glacier outlines, the lengths in the accumulation and ablation region of each glacier, the lowest points, the local highest points ($P_{max}$), the extracted failed glacier outlines, and logs.
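The equidistance principle described above can be illustrated on a toy example. The sketch below is not the GlacierCenterlines_Py27 implementation: it uses a hypothetical rectangular "glacier" on an integer grid, takes the highest and lowest outline points as given indices (in the real workflow they come from the DEM), and marks interior cells whose shortest distances to the two outline parts agree within an assumed tolerance.

```python
import math


def rectangle_outline(w, h):
    """Ordered vertices of a w x h rectangle outline, first vertex not repeated."""
    bottom = [(x, 0) for x in range(0, w)]
    right = [(w, y) for y in range(0, h)]
    top = [(x, h) for x in range(w, 0, -1)]
    left = [(0, y) for y in range(h, 0, -1)]
    return bottom + right + top + left


def split_outline(outline, idx_a, idx_b):
    """Split a closed outline at two vertices; both parts share the endpoints."""
    i, j = sorted((idx_a, idx_b))
    return outline[i:j + 1], outline[j:] + outline[:i + 1]


def nearest(point, vertices):
    """Shortest Euclidean distance from a point to a set of vertices."""
    return min(math.dist(point, v) for v in vertices)


def centerline_cells(cells, part_a, part_b, tol=0.25):
    """Cells whose shortest distances to the two outline parts are (nearly) equal."""
    return [c for c in cells
            if abs(nearest(c, part_a) - nearest(c, part_b)) <= tol]


outline = rectangle_outline(4, 10)
lowest = outline.index((2, 0))     # assumed terminus (lowest point)
highest = outline.index((2, 10))   # assumed head (highest point)
part_a, part_b = split_outline(outline, lowest, highest)
interior = [(x, y) for x in range(1, 4) for y in range(1, 10)]
centerline = centerline_cells(interior, part_a, part_b)
print(centerline)
```

For this symmetric shape, the selected cells form the middle column between the two endpoints. The tolerance is tied to the grid spacing and would need tuning on real rasters, which is consistent with the parameterization by glacier scale mentioned in the text.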

Accuracy assessment
A random assessment was prioritized to assess the accuracy of the extracted centerlines. We randomly selected 100 glaciers in each of the 19 glacier regions, obtaining a total of 1900 glacier centerlines. These glacier centerlines were divided into three first-level categories (Zhang et al., 2021), namely correct (I), inaccurate (II), and incorrect (III). Type II mostly contained glaciers with accurate glacier maximum lengths but missing, redundant, or unreasonable branches of glacier centerlines. When calculating the dataset accuracy, types I and II were regarded as correct, and only type III was considered incorrect. Finally, the proportion of type III glaciers in the sample was counted, and the evaluation result ($R$) was calculated using Eq. (1):

$$R = \frac{1}{N_G} \sum_{i=1}^{19} S_i N_{T_i}, \qquad (1)$$

where $N_G$ is the total number of glacier centerlines, and $S_i$ and $N_{T_i}$ are the verification accuracy and the number of glaciers in the $i$th glacier region ($i = 1, 2, 3, \ldots, 18, 19$), respectively.
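Under the assumption that the evaluation result is a weighted average of the regional accuracies, with each region weighted by its number of glaciers and $N_G$ equal to the sum of the regional counts, the calculation can be sketched as follows. The two-region input is invented purely to show the arithmetic and does not reproduce any value from this study.

```python
def weighted_accuracy(accuracies, counts):
    """Count-weighted mean of regional accuracies.

    accuracies: verification accuracy S_i per region (fractions in 0..1)
    counts:     number of glaciers N_Ti per region
    Assumption: N_G is the sum of the regional counts.
    """
    if len(accuracies) != len(counts):
        raise ValueError("accuracies and counts must have the same length")
    n_total = sum(counts)
    return sum(s * n for s, n in zip(accuracies, counts)) / n_total


# Hypothetical two-region example.
r = weighted_accuracy([0.98, 0.50], [3000, 1000])
print(round(r, 4))
```

Because the weights are glacier counts, a region with few glaciers and low sample accuracy lowers the overall result less than a simple mean of the regional accuracies would suggest.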
This study's maximum glacier lengths ($G_{Lmax}$) were compared with the $L_{max}$ (Machguth and Huss, 2014) in RGI v6.0 using linear correlation and ratio analysis. The correlations between $G_{Lmax}$ and $L_{max}$ were established according to different glacier regions and glacier levels, and the length ratio $R_r$ was calculated using Eq. (2):

$$R_r = \frac{G_{Lmax}}{L_{max}}. \qquad (2)$$

In addition, considering the differences between the graphics, we also collected the graph data of the glacier length extracted by Machguth and Huss (2014). Considering the limited availability of the data (obtained for R13-R15), we only compared two glacier-covered regions in the Himalayas, namely Mount Qomolangma and Kangchenjunga (the world's third-highest mountain) and their surrounding areas.
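The two comparison measures named above, the length ratio and the coefficient of determination of a linear fit, can be computed with a few lines of standard-library Python. The function names and the sample lengths are illustrative assumptions; the study's own regression settings are not specified in the text.

```python
def length_ratio(g_lmax, l_max):
    """Length ratio R_r = G_Lmax / L_max for one glacier."""
    return g_lmax / l_max


def r_squared(xs, ys):
    """Coefficient of determination of a simple linear regression of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)


# Hypothetical lengths in meters: this study's values are 10 % longer.
l_max = [1000.0, 2000.0, 3000.0, 4000.0]
g_lmax = [1100.0, 2200.0, 3300.0, 4400.0]
ratios = [length_ratio(g, l) for g, l in zip(g_lmax, l_max)]
print(ratios, r_squared(l_max, g_lmax))
```

The example shows why both measures are useful together: a uniform 10 % offset leaves the correlation perfect, so only the ratio reveals the systematic difference in length.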

Centerline and length of glaciers
Taking the IGODS, GGEDS, and other model parameters (Appendix A; Table A1) as input data, 198 137 glacier centerlines were automatically generated using the centerline extraction tool of GlacierCenterlines_Py27 v5.2.1, with an overall success rate of 99.74 %. The number and proportion of flawed glacier outlines, nominal glaciers, ice caps, input glacier outlines, and extraction results for distinct glacier regions are shown in Fig. 4. Except for Antarctica and the Subantarctic (R19), the success rate of extracting glacier centerlines in other glacier regions was greater than 99 %, which indicates that the automatic extraction algorithm for glacier centerlines is robust. A small number of glacier outlines with falsely closed boundaries and unidentified ice caps were the main reasons for the failure of the automatic extraction of glacier centerlines; however, it is difficult to establish rules for accurately identifying these glacier polygons. In total, 510 unsuccessful glacier outlines were identified, of which Antarctica and the Subantarctic (R19) accounted for 71.57 %, the southern Andes (R17) and Greenland periphery (R05) for 5.29 % and 5.1 %, respectively, Arctic Canada, north (R03), and Alaska (R01) for 4.71 % and 2.94 %, respectively, and other glacier regions for less than 2 %.
Overall, the global glacier centerline dataset (GGCLDS) constructed in this study contained 91.52 % of the total glaciers in RGI v6.0. The lengths of each branch of the glacier centerline were derived, and the longest branch length of the glacier centerline was defined as the glacier maximum length ($G_{Lmax}$); these values were used to form the global glacier maximum length dataset (GGMLDS). The average centerline length of all branches of a glacier is called the glacier mean length ($G_{Lmean}$). In addition, the median glacier altitude was regarded as the equilibrium line altitude (ELA; Machguth and Huss, 2014). The part of $G_{Lmax}$ that was higher than the ELA was regarded as the length of the glacier accumulation zone ($G_{Lacc}$), and the part that was lower than the ELA was regarded as the length of the glacier ablation zone ($G_{Labl}$), which formed the glacier accumulation zone length dataset (GACLDS) and glacier ablation zone length dataset (GABLDS). The key process data corresponding to GGCLDS were also output to form the glacier outline segmentation results (GOSRDS), lowest points (GLPDS), local highest points (GLHPDS), and unsuccessful glacier outlines (GUGODS). The fields involved in all datasets are listed in Table 4.
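Splitting the longest centerline at the ELA can be sketched as below. The profile format (ordered vertices with projected coordinates and elevation), the use of planimetric length, and the linear interpolation of the crossing point are assumptions made for illustration; the ELA value is simply passed in, whereas the study takes it from the median glacier altitude.

```python
import math


def split_length_at_ela(profile, ela):
    """Return (length_above, length_below) of a centerline relative to the ELA.

    profile: [(x, y, z), ...] vertices ordered along the centerline,
             x and y in projected meters, z in meters above sea level.
    Segments crossing the ELA are divided by linear interpolation in z.
    """
    above = below = 0.0
    for (x0, y0, z0), (x1, y1, z1) in zip(profile, profile[1:]):
        seg = math.hypot(x1 - x0, y1 - y0)   # planimetric segment length
        if z0 >= ela and z1 >= ela:
            above += seg
        elif z0 < ela and z1 < ela:
            below += seg
        else:
            t = (ela - z0) / (z1 - z0)       # fraction from vertex 0 to crossing
            if z0 >= ela:
                above += t * seg
                below += (1.0 - t) * seg
            else:
                below += t * seg
                above += (1.0 - t) * seg
    return above, below


# Hypothetical profile descending from 5000 m to 4200 m, ELA at 4500 m.
profile = [(0.0, 0.0, 5000.0), (1000.0, 0.0, 4600.0), (2000.0, 0.0, 4200.0)]
acc_len, abl_len = split_length_at_ela(profile, 4500.0)
print(acc_len, abl_len)
```

In this example, the two parts correspond to $G_{Lacc}$ and $G_{Labl}$, and they sum to the full centerline length.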
The glacier outlines of RGI v6.0 without centerlines obtained in this study were limited by the quality of the glacier polygons, which mainly correspond to the flawed glacier outlines (FGODS) and the identified ice caps in RGI v6.0 (Table 2). Among the FGODS (10 764), the southern Andes (R17) had the most, followed by southwestern Asia (R14), western Canada and USA (R02), and Greenland periphery (R05), with slightly more than 1500, and low latitudes (R16) and Alaska (R01), with slightly more than 700. There were 451 in other glacier regions, including two regions with 0 defective glacier outlines, the Russian Arctic (R09) and New Zealand (R18). Among the ice caps (7174) identified by RGI v6.0, slightly more than 1500 were in R05 and central Asia (R13), between 500 and 1000 were in Arctic Canada, south (R04), Arctic Canada, north (R03), and the southern Andes (R17), and fewer than 500 were in other glacier regions. Nominal glaciers (461) existed in three glacier regions, i.e., Caucasus and Middle East (R12), northern Asia (R10), and Scandinavia (R08).

Random assessment results
The evaluation results using random samples from the glacier centerline dataset suggested that the average verification accuracy of the glacier centerline dataset was 89.68 %. There were significant differences across the accuracies of the 19 glacier regions around the world (Fig. 5). R11 had the highest accuracy (98 %), R15 and R10 the second highest (95 %), R09 the second lowest (78 %), and R19 the lowest (50 %). In terms of types, the average proportions of types I and II were 83.53 % and 6.16 %, respectively. The proportions of type I in R07 and R09 were relatively low, at 79 % and 73 %, respectively, and the lowest, in R19, was only 50 %. Type II had the highest proportion in R19 at 16 %, followed by R07 (10 %). Moreover, type II accounted for more than 5 % in seven regions, including R11, R13, R17, R18, R16, R01, and R06.
The above results indicate that, with the exception of the three glacier regions of R07, R09, and R19, the random samples of the glacier centerline dataset show excellent performance in terms of accuracy, particularly in R02, R12, and R14. Unmarked ice caps and locally low-quality DEMs were the main reasons for the poor quality of the glacier centerlines in R07 and R09. Owing to glacier complexes and small altitude differences in low-quality DEMs at the glacier tongues, the glacier centerlines obtained in R19 were of poor quality but were included for completeness.

Comparison with previous results
We compared the glacier lengths ($G_{Lmax}$) automatically obtained in this study with those ($L_{max}$) obtained by Machguth and Huss (2014; Fig. 6). After eliminating 5408 glaciers with the $L_{max}$ value of −9 (missing value), the length values of the other 192 728 glaciers in the global glacier length dataset were compared directly. The $G_{Lmax}$ and $L_{max}$ were generally comparable (Fig. 6a). The glaciers in grades $L_4$-$L_{10}$ showed excellent agreement, while those of $L_1$-$L_3$ determined the linear correlation coefficient owing to their large number. There were approximately 35 000 glaciers with a length ratio ($R_r$) between $G_{Lmax}$ and $L_{max}$ that was greater than 1.55, and these were excluded from the histogram in Fig. 6b because there was a high probability that the length of at least one of the two datasets was wrong. The peak value of the histogram of $R_r$ is in the interval 1.05-1.15, and $R_r$ in the interval 0.95-1.25 accounts for 64.55 % (Fig. 6b). The glacier length $G_{Lmax}$ determined in this study was generally 10 % longer than $L_{max}$, which suggests that the glacier centerline lengths were probably underestimated in previous studies. In addition, the length ratio of glacier $L_1$ was the highest, and the median value was high (Fig. 6c). The $R_r$ values of glaciers $L_4$-$L_{10}$ fluctuated greatly. The $R_r$ distributions of glaciers $L_2$ and $L_3$ were relatively concentrated. The reason for this is that the length of glacier $L_1$ was affected by the DEM, while glaciers $L_4$-$L_{10}$ were mainly impacted by differences in glacier scale and the accuracy of the auxiliary line.
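The filtering steps described above (dropping the missing value of −9, excluding ratios above 1.55, and measuring the share of ratios between 0.95 and 1.25) can be sketched as follows. The function name and the sample pairs are hypothetical, and the sketch assumes that the share is taken relative to the glaciers kept after the 1.55 cutoff; the text does not state the denominator explicitly.

```python
MISSING = -9   # missing-value code of L_max in RGI v6.0, as stated in the text


def ratio_summary(pairs, lo=0.95, hi=1.25, cutoff=1.55):
    """Summarize length ratios for (G_Lmax, L_max) pairs.

    Returns (n_compared, n_kept, share_in_band), where share_in_band is the
    fraction of kept ratios within [lo, hi]. Assumption: the share is
    relative to the ratios remaining after the cutoff.
    """
    ratios = [g / l for g, l in pairs if l != MISSING and l > 0]
    kept = [r for r in ratios if r <= cutoff]
    in_band = [r for r in kept if lo <= r <= hi]
    return len(ratios), len(kept), len(in_band) / len(kept)


# Hypothetical (G_Lmax, L_max) pairs in meters.
pairs = [(1100, 1000), (2000, MISSING), (3000, 1500), (1000, 1000), (1300, 1000)]
n_compared, n_kept, share_in_band = ratio_summary(pairs)
print(n_compared, n_kept, round(share_in_band, 4))
```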
Comparisons between $G_{Lmax}$ and $L_{max}$ for each first-order glacier region and all random samples are shown in Appendix B. The fit between $G_{Lmax}$ and $L_{max}$ was better in seven glacier regions, including R01, R04, R07, and R12-R15, in which the $R^2$ was larger than 0.95 (Fig. B1). The $R^2$ values in R17 ($R^2$ = 0.8174), R05 ($R^2$ = 0.8136), and R03 ($R^2$ = 0.6311) were lower, whereas that in R19 ($R^2$ = 0.5487) was the worst. The $R^2$ values of the other eight glacier regions were between 0.85 and 0.95. The histograms (Fig. B2) suggest that $G_{Lmax}$ and $L_{max}$ fitted well in R04, R06, R07, R09, and R12-R15 because they had recognizable single peak values. The peak values of R03, R05, R17, and R19 were not prominent, and the proportion of glaciers with $R_r$ > 1.55 was extremely high, further increasing the uncertainty in glacier length estimates in these four regions. R01, R07, R08, R11-R15, and R18 performed well in the box plot (Fig. B3), whereas the results for R09 were not good. Moreover, the fit of all random samples was poor (Fig. B1; $R^2$ = 0.7547), the peak value was more prominent (Fig. B2), and the length ratio distribution of glaciers of different grades was relatively scattered (Fig. B3). In general, the glacier lengths of R07 and R12-R15 were the closest, while there were significant differences in R03, R05, R17, and R19. Furthermore, the graphic results, which were collected for the maximum length of glaciers in parts of HMA (Machguth and Huss, 2014), were compared with those in this study. Two parts of R15 are shown, namely Mount Qomolangma and its surrounding area (Fig. 7a) and Kangchenjunga and its surrounding area (Fig. 7b). A visual comparison suggested that the extraction approach used in this study was robust (Fig. 7a) and that its sensitivity to topography was lower than that of Machguth and Huss (2014; Fig. 7b). Large differences in glacier length extraction schemes are present only in a few glaciers or in certain types of glaciers, such as slope glaciers and ice caps.

Uncertainties and possibilities for improvement
Although we compared the two current global length datasets, it is still difficult to accurately characterize the quality of the dataset produced in this study. For glaciers for which centerlines were not provided in this dataset, users need to update the corresponding glacier outlines and could use the automatic extraction tool provided in this study to generate their centerlines, including the defective glacier outlines (FGODS), nominal glaciers, and ice caps of the RGI v6.0. Specifically, the centerlines of the FGODS rely on glacier outlines that meet the requirements of this study. These glacier outlines include glacier inventory data from other sources or the FGODS that were repaired by some algorithm or manual process. Nominal glaciers are similar to FGODS and also require users to obtain corresponding glacier outlines. Automatic approaches to dividing ice caps from glacier complexes into individual glaciers are currently limited, and data users can only use their own criteria to separate the ice caps and then use our tool to generate the centerlines. In addition, prioritizing the coverage of this dataset, we designed a geometry-based algorithm to repair FGODS and provided users with their centerlines in the form of a supplementary dataset. Corresponding codes and results can be seen in subdatasets CODES and SUP_220707.
The automatic extraction algorithm in this study is more suitable for single-outlet glaciers, particularly valley glaciers; it is not suitable for ice caps, flat-top glaciers, and tidewater glaciers, which tend to be widely distributed in the Antarctic, Subantarctic, and northern Canadian Arctic, among other areas. In short, the uncertainties in this dataset probably come from the centerlines of some slope glaciers and the ice caps that are not identified in RGI v6.0 or a few centerlines with unpredictable quality due to the input data, such as the incorrect glacier polygons and erroneous DEMs. In future work, improved glacier inventories and more accurate DEMs will contribute to improving centerline quality. Furthermore, optimizing the automatic glacier segmentation approach, the DEM-based extraction algorithm of glacier feature lines, and the centerline tradeoff algorithm will also likely further improve the accuracy of glacier centerlines. In addition, centerline accuracy will probably benefit from further improving the classification type of each glacier in the glacier inventory.

Conclusions
In [...]ing errors were controllable. Furthermore, the pre-processing algorithm we designed accurately identified 10 764 erroneous glacier polygons from RGI v6.0, which formed the defective glacier dataset (FGODS).
The global dataset contains 17 sub-datasets, including two basic input datasets (IGODS and GGEDS), two key result datasets (GGCLDS and GGMLDS), four process datasets, six derived result datasets, and three supplementary datasets. Ice caps, nominal glaciers, and erroneous glacier polygons were eliminated from most sub-datasets, accounting for approximately 8.25 % of the total RGI v6.0. The poor status of these glacier polygons did not support the automatic extraction of glacier centerlines, which needs to be improved in future work. Inevitably, some defects in the algorithm or datasets will also need to be addressed in future research. For instance, the glacier regions R19 and R03 had the worst results but were nevertheless added to the dataset to prioritize data coverage integrity. The global glacier DEM dataset (GGEDS), global glacier external outline dataset (GGECDS), and global glacier buffer mask dataset (GGBMDS) cover all glaciers in RGI v6.0. Accordingly, they will help design more efficient automated extraction algorithms to produce datasets containing all types of glacier centerlines and lengths worldwide, which is our next goal.

Author contributions. All authors contributed to writing and editing the paper. DZ processed the data, performed all calculations, created all figures, and wrote most of the paper. SZ contributed significantly to the development of the analyses, figures, and writing. XY contributed to the development of the data production strategy and writing. GZ and WL contributed to the initial data production. SW participated in writing Sect. 4.

Competing interests. The contact author has declared that none of the authors has any competing interests.

Disclaimer. Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Special issue statement. This article is part of the special issue "Extreme environment datasets for the three poles". It is not associated with a conference.

Figure 2. Workflow of the centerline and length of the dataset production.

Figure 3. Schematic of three types of flawed glacier outlines. (a-b) Glacier complexes with/without inaccurate segmentation. (c) Erroneous glacier outline caused by vectorization. (d-f) Common problems in flawed glaciers, with defects in the automatic algorithm, defects in post-processing, and artificial errors. The auxiliary line represents lower-grade ice divides in an individual glacier, which is part of the ridge lines.

Figure 4. Extraction of glacier centerlines in different glacier regions. The pie charts indicate the proportions of input glaciers and of the three types of excluded glaciers in the region. The pie charts represent the correct rate, which is the proportion of correctly extracted glaciers. The size of the pie chart represents the grade of the glaciers in the region.

Figure 5. Statistical chart of random evaluation results. The pie charts show the proportion of each type with the total number of samples in the region. The pie charts show the correct rate, which is the proportion of types I and II in each region. The size of the pie chart represents the grade of the correct rate in the region. Types I, II, and III (see Sect. 3.2.4) are correct, inaccurate, and incorrect centerlines, respectively.

Figure 6. Comparison between longest centerlines calculated in this study and by Machguth and Huss (2014). (a) Linear regression of maximum length for all input glaciers (IGODS), determined as $G_{Lmax}$, calculated in this study, and $L_{max}$, obtained in Machguth and Huss (2014). (b) Histogram of the length ratio ($R_r$; $G_{Lmax}/L_{max}$) for distinct grades of glaciers. (c) Box plots of length ratio ($R_r$) for different scales of glaciers.

Figure 7. Visual comparison of the longest centerlines calculated in this study and by Machguth and Huss (2014) for two glacier-covered regions in the Himalayas, covering Mount Qomolangma (a) and Kangchenjunga (b, the world's third-highest mountain) and their surrounding areas. In the background is the DEM used for the calculation.

Figure B2. Histograms of the length ratio ($R_r$; $G_{Lmax}/L_{max}$) of distinct glacier grades in glacier-covered regions and all samples.

Figure B3. Box plots of the length ratio ($R_r$; $G_{Lmax}/L_{max}$) of glaciers of distinct grades in every glacier-covered region and whole sample.

Table 1. All DEMs collected in this study.

Table 2. Pre-processing results of different glacier regions and information of input datasets.
Note that GDEM and COP DEM refer to ASTER GDEM v3 and Copernicus DEM, respectively.

Table 3. Global glaciers stratified by area.

Table 4. Description of the attributes contained in all datasets.
[...] Tag of the same segment in a glacier
RASTERVALU; Long int.; 4; Altitude of a $P_{max}$ (m)
Note that Char. is for character, int. is for integer, and (m) is the unit in meters.

Table 5. Description of the sub-datasets contained in this dataset.