This work is distributed under the Creative Commons Attribution 4.0 License.
Shelf-Bench: A benchmark dataset for Antarctic ice shelf front and coastline delineation from multi-sensor radar satellite data
Abstract. Continuous monitoring of Antarctic ice shelf fronts is essential for understanding ice sheet dynamics, detecting iceberg calving events, supporting operational logistics, and generating up-to-date continental maps. However, the automated and continuous delineation of ice shelf fronts has been held back by a lack of suitable training data for deep learning models. We present Shelf-Bench, a comprehensive benchmark dataset comprising 161 manually annotated SAR scenes from three satellite sensors (ERS, Envisat, and Sentinel-1), providing spatial coverage of the Antarctic coastline with multi-temporal seasonal acquisitions spanning 1992–2021. The dataset features manually delineated masks paired with pre-processed imagery at moderate spatial resolution. Through complexity analysis, we characterize delineation challenges, including fast ice, crevassed surfaces, dense iceberg mélange, and limited spatial context. We evaluate five state-of-the-art semantic segmentation architectures, establishing baseline performance metrics. Baseline models showed strongly contrasting behaviour on Shelf-Bench: architectures that achieved higher pixel-wise accuracy tended to produce larger boundary errors, while models with better geometric precision obtained lower semantic scores. This trade-off indicates that the dataset jointly stresses ice-ocean classification and fine-scale calving front delineation, two complementary challenges that make it a demanding benchmark for automated ice front mapping. By providing this open-access, standardized benchmark, Shelf-Bench enables accelerated development of deep learning methodologies for automated Antarctic coastline detection and supports continuous monitoring across current and future SAR satellite missions. The Shelf-Bench dataset is available at https://doi.org/10.5281/zenodo.17610870.
Status: final response (author comments only)
- RC1: 'Comment on essd-2025-758', Anonymous Referee #1, 17 Mar 2026
- RC2: 'Comment on essd-2025-758', Sepideh Jalayer, 18 Apr 2026
- The authors have introduced a new dataset which is very valuable, but the gap and novelty could be stated more clearly. Please explain which limitations of existing datasets are addressed and which new capabilities this dataset provides.
- It would be helpful to add benchmark datasets from the Arctic, such as the AutoICE dataset and its associated challenge, to Section 2 (Background and Related Work) and to cite references including https://doi.org/10.5194/tc-18-3471-2024, https://doi.org/10.1016/j.rsase.2025.101538, and https://doi.org/10.5194/tc-18-1621-2024. This would provide a more complete overview of existing benchmark efforts in polar regions.
- Please clarify how the inputs are combined before being fed into the models. The authors mention resampling to a common resolution, but more detail on the interpolation method and data representation is needed to improve clarity. In addition, it is not clear whether any fusion strategy (such as early or late fusion across sensors) is used; a brief discussion or a simple comparison would improve the study (see the fusion sketch after this comment list). Exploring different fusion approaches has been shown to improve performance, and relevant work on multi-resolution data fusion for sea ice mapping (https://doi.org/10.1109/igarss55030.2025.11243075) could be cited here and used as a basis for additional experiments.
- (Lines ~155–160) The authors report values such as 221 m, 75 m, and 38 m, but it is not clear how these should be interpreted. It would help to explain at first mention that these correspond to the mean distance error between predicted and ground-truth fronts (see the distance-metric sketch after this comment list), and to relate them to human annotation variability. This would make it easier for readers to judge what counts as good performance.
- It would be helpful to clarify why the evaluation is performed at the patch level rather than on full scenes. Patch-based evaluation can introduce edge effects and may not reflect real performance. Please either include scene-level results or justify why patch-based evaluation is sufficient. It would also be useful to clarify whether the other metrics (IoU, F1, accuracy, etc.) are computed at the patch level or aggregated at the scene level.
- Please include training/validation curves of loss and F1 score to show convergence behaviour. The statement about “stable fine-tuning of model heads” also needs clarification, because training length alone does not ensure stability. Please clarify whether all models converged within 150 epochs and whether only the heads were fine-tuned or the full networks were updated. This is important for a fair comparison, especially between different models.
- The authors mention a temporal/geographic separation between training and test sets (“The entire benchmark dataset guarantees that there exists either a temporal or geographical distinction between the test and training datasets to prevent overlaps”), but generalization is not clearly evaluated. It would be useful to include a brief analysis of performance across different regions and seasons and to discuss how well the models generalize spatially and temporally.
- Please clarify how areas outside the 50 km buffer are handled during training (masked or zero-valued) and whether this leads to patches with limited useful content, especially given the patch-based setup.
- It would be helpful to clarify whether there is any spatial overlap between training and test scenes, especially given the use of patch-based sampling.
- The experiments should be run with multiple random seeds (e.g., 5 or 10), reporting the average (with standard deviation) of the results, or an ensemble (see the multi-seed sketch after this comment list). This would provide a fairer comparison between the models.
- It would be useful to include some form of uncertainty estimation (such as prediction confidence, variability across runs, standard deviation, variance, or entropy).
- The explanation of the loss function is unclear. Since Dice loss already addresses class imbalance, it would be helpful to explain why Focal loss is also used and how the two work together; a short explanation of how they are combined would make this clearer (see the loss sketch after this comment list). It would also be useful to report performance when using each loss separately to better understand the benefit of combining them.
- The authors mention using multiple GPUs with a per-GPU batch size, which suggests parallel training, but this is not clearly explained. Please briefly clarify how the model was trained across GPUs for better reproducibility.
Citation: https://doi.org/10.5194/essd-2025-758-RC2
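To illustrate the fusion question raised above, the following is a minimal sketch of an early-fusion approach, assuming two single-band sensor inputs on different grids and bilinear resampling to a common target grid. The function name, target shape, and channel layout are illustrative and are not taken from the Shelf-Bench pipeline.

```python
# Minimal early-fusion sketch (assumed approach, not the authors' pipeline):
# resample two sensor bands to a common grid and stack them along the channel axis.
import numpy as np
from scipy.ndimage import zoom

def early_fuse(band_a: np.ndarray, band_b: np.ndarray, target_shape, order=1):
    """Resample both bands to `target_shape` (bilinear when order=1) and stack them
    into a (2, H, W) array that can be fed to a segmentation network."""
    def resample(band):
        factors = (target_shape[0] / band.shape[0], target_shape[1] / band.shape[1])
        return zoom(band, factors, order=order)
    return np.stack([resample(band_a), resample(band_b)], axis=0)
```

A late-fusion alternative would instead run separate encoders (or full models) per sensor and merge their predictions, which is the kind of comparison the comment suggests.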
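For the distance values discussed above, the sketch below shows one common way to compute a mean distance error between predicted and ground-truth fronts represented as binary line masks on the same grid. The 40 m pixel spacing and the function names are assumptions for illustration, not values from the paper.

```python
# Minimal sketch (not the authors' implementation) of a mean front-distance error.
import numpy as np
from scipy.ndimage import distance_transform_edt

def mean_front_distance(pred_front: np.ndarray, true_front: np.ndarray,
                        pixel_size_m: float = 40.0) -> float:
    """Average distance (in metres) from predicted front pixels to the nearest
    ground-truth front pixel. `pixel_size_m` is an assumed grid spacing."""
    if pred_front.sum() == 0 or true_front.sum() == 0:
        return float("nan")  # undefined if either front is empty
    # Distance (in pixels) from every grid cell to the nearest true-front pixel.
    dist_to_true = distance_transform_edt(~true_front.astype(bool))
    return float(dist_to_true[pred_front.astype(bool)].mean() * pixel_size_m)

def symmetric_front_distance(pred_front, true_front, pixel_size_m=40.0):
    """Symmetrised variant, averaging both directions."""
    d1 = mean_front_distance(pred_front, true_front, pixel_size_m)
    d2 = mean_front_distance(true_front, pred_front, pixel_size_m)
    return 0.5 * (d1 + d2)
```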
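For the multi-seed and uncertainty points, a minimal sketch of aggregating a metric over independent runs and computing a simple per-pixel predictive entropy is given below. The example F1 values are hypothetical.

```python
# Minimal sketch (assumed workflow, not from the paper): report mean ± standard
# deviation over seeds and use binary predictive entropy as an uncertainty proxy.
import numpy as np

def summarise_runs(scores_per_seed):
    """scores_per_seed: list of metric values (e.g., F1) from independent runs."""
    scores = np.asarray(scores_per_seed, dtype=float)
    return {"mean": scores.mean(), "std": scores.std(ddof=1), "n_runs": len(scores)}

def predictive_entropy(probs, eps=1e-8):
    """probs: array of shape (n_runs, H, W) with per-run ice probabilities.
    Returns the per-pixel entropy of the run-averaged binary distribution."""
    p = np.clip(probs.mean(axis=0), eps, 1 - eps)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

# Example with F1 scores from five hypothetical seeds.
print(summarise_runs([0.81, 0.79, 0.83, 0.80, 0.82]))
```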
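For the loss-function comment, the sketch below shows one standard way of combining Dice and Focal losses as a weighted sum for binary ice/ocean segmentation in PyTorch. The weights and exact formulation are assumptions and may differ from the authors' implementation.

```python
# Minimal sketch (assumed formulation): weighted sum of Dice and Focal losses.
import torch
import torch.nn.functional as F

def dice_loss(logits, targets, eps=1.0):
    """Region-overlap loss; robust to class imbalance. Tensors of shape (N,1,H,W)."""
    probs = torch.sigmoid(logits)
    inter = (probs * targets).sum(dim=(1, 2, 3))
    union = probs.sum(dim=(1, 2, 3)) + targets.sum(dim=(1, 2, 3))
    return 1.0 - ((2.0 * inter + eps) / (union + eps)).mean()

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Pixel-wise loss that down-weights easy examples via the (1 - p_t)^gamma term."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)  # probability assigned to the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

def combined_loss(logits, targets, w_dice=0.5, w_focal=0.5):
    # Dice targets region overlap under class imbalance; Focal concentrates training
    # on hard pixels such as the thin front region. The weights are illustrative.
    return w_dice * dice_loss(logits, targets) + w_focal * focal_loss(logits, targets)
```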
Data sets
The Shelf-Bench Dataset: A benchmark dataset for Antarctic ice shelf front and coastline delineation from multi-sensor radar satellite data C. A. Baumhoer and A. B. Morgan https://doi.org/10.5281/zenodo.17610870
Model code and software
Shelf-Bench Baselines A. B. Morgan, X. Hou, and J. L. Fromentin https://github.com/amymorgan01/Shelf-Bench
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 299 | 200 | 25 | 524 | 33 | 39 |