This work is distributed under the Creative Commons Attribution 4.0 License.
Shelf-Bench: A benchmark dataset for Antarctic ice shelf front and coastline delineation from multi-sensor radar satellite data
Abstract. Continuous monitoring of Antarctic ice shelf fronts is essential for understanding ice sheet dynamics, detecting iceberg calving events, supporting operational logistics, and generating up-to-date continental maps. However, the automated and continuous delineation of ice shelf fronts has been held back by a lack of suitable training data for deep learning models. We present Shelf-Bench, a comprehensive benchmark dataset comprising 161 manually annotated SAR scenes from three satellite sensors (ERS, Envisat, and Sentinel-1), providing spatial coverage of the Antarctic coastline with multi-temporal seasonal acquisitions spanning 1992–2021. The dataset features manually delineated masks paired with pre-processed imagery at moderate spatial resolution. Through complexity analysis, we characterize delineation challenges, including fast ice, crevassed surfaces, dense iceberg mélange, and limited spatial context. We evaluate five state-of-the-art semantic segmentation architectures, establishing baseline performance metrics. Baseline models showed strongly contrasting behaviour on Shelf-Bench: architectures that achieved higher pixel-wise accuracy tended to produce larger boundary errors, while models with better geometric precision obtained lower semantic scores. This trade-off indicates that the dataset jointly stresses ice-ocean classification and fine-scale calving front delineation, two complementary challenges that make it a demanding benchmark for automated ice front mapping. By providing this open-access, standardized benchmark, Shelf-Bench enables accelerated development of deep learning methodologies for automated Antarctic coastline detection and supports continuous monitoring across current and future SAR satellite missions. The Shelf-Bench dataset is available at https://doi.org/10.5281/zenodo.17610870.
Status: final response (author comments only)
- RC1: 'Comment on essd-2025-758', Anonymous Referee #1, 17 Mar 2026
- RC2: 'Comment on essd-2025-758', Sepideh Jalayer, 18 Apr 2026
- The authors have introduced a new dataset which is very valuable, but the gap and novelty could be stated more clearly. Please explain which limitations of existing datasets are addressed and which new capabilities this dataset provides.
- It would be helpful to add benchmark datasets from the Arctic, such as the AutoICE dataset and its associated challenge, to Section 2 (Background and Related Work) and to cite references including https://doi.org/10.5194/tc-18-3471-2024, https://doi.org/10.1016/j.rsase.2025.101538, and https://doi.org/10.5194/tc-18-1621-2024. This would provide a more complete overview of existing benchmark efforts in polar regions.
- Please clarify how the inputs are combined before being fed into the models. The authors mention resampling to a common resolution, but more detail on the interpolation method and data representation is needed to improve clarity. In addition, it is not clear whether any fusion strategy (such as early or late fusion across sensors) is used; a brief discussion or a simple comparison would improve the study (see the fusion sketch after this comment list). Exploring different fusion approaches has been shown to improve performance, and relevant work on multi-resolution data fusion for sea ice mapping (https://doi.org/10.1109/igarss55030.2025.11243075) could be cited here and used as a basis for additional experiments.
- (Lines ~155–160) The authors report values such as 221 m, 75 m, and 38 m, but it is not clear how these should be interpreted. It would help to explain at first mention that these correspond to the mean distance error between predicted and ground-truth fronts (see the distance-metric sketch after this comment list), and to relate them to human annotation variability. This would make it easier for readers to judge what counts as good performance.
- It would be helpful to clarify why the evaluation is performed at the patch level rather than on full scenes. Patch-based evaluation can introduce edge effects and may not reflect real performance. Please either include scene-level results or justify why patch-based evaluation is sufficient. It would also be useful to clarify whether the other metrics (IoU, F1, accuracy, etc.) are computed at the patch level or aggregated at the scene level.
- Please include training/validation curves of loss and F1 score to show convergence behaviour. The statement about “stable fine-tuning of model heads” also needs clarification, because training length alone does not ensure stability. Please clarify whether all models converged within 150 epochs and whether only the heads were fine-tuned or the full networks were updated. This is important for a fair comparison, especially between different models.
- The authors mention a temporal/geographic separation between training and test sets (“The entire benchmark dataset guarantees that there exists either a temporal or geographical distinction between the test and training datasets to prevent overlaps”), but generalization is not clearly evaluated. It would be useful to include a brief analysis of performance across different regions and seasons and to discuss how well the models generalize spatially and temporally.
- Please clarify how areas outside the 50 km buffer are handled during training (masked or zero-valued) and whether this leads to patches with limited useful content, especially given the patch-based setup.
- It would be helpful to clarify whether there is any spatial overlap between training and test scenes, especially given the use of patch-based sampling.
- The experiments should be run with multiple random seeds (e.g., 5 or 10), reporting the average (with standard deviation) of the results, or an ensemble (see the multi-seed sketch after this comment list). This would provide a fairer comparison between the models.
- It would be useful to include some form of uncertainty estimation (such as prediction confidence, variability across runs, standard deviation, variance, or entropy).
- The explanation of the loss function is unclear. Since Dice loss already addresses class imbalance, it would be helpful to explain why Focal loss is also used and how the two work together; a short explanation of how they are combined would make this clearer (see the loss sketch after this comment list). It would also be useful to report performance when using each loss separately to better understand the benefit of combining them.
- The authors mention using multiple GPUs with a per-GPU batch size, which suggests parallel training, but this is not clearly explained. Please briefly clarify how the model was trained across GPUs for better reproducibility.
Citation: https://doi.org/10.5194/essd-2025-758-RC2
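To illustrate the fusion question raised above, the following is a minimal sketch of an early-fusion approach, assuming two single-band sensor inputs on different grids and bilinear resampling to a common target grid. The function name, target shape, and channel layout are illustrative and are not taken from the Shelf-Bench pipeline.

```python
# Minimal early-fusion sketch (assumed approach, not the authors' pipeline):
# resample two sensor bands to a common grid and stack them along the channel axis.
import numpy as np
from scipy.ndimage import zoom

def early_fuse(band_a: np.ndarray, band_b: np.ndarray, target_shape, order=1):
    """Resample both bands to `target_shape` (bilinear when order=1) and stack them
    into a (2, H, W) array that can be fed to a segmentation network."""
    def resample(band):
        factors = (target_shape[0] / band.shape[0], target_shape[1] / band.shape[1])
        return zoom(band, factors, order=order)
    return np.stack([resample(band_a), resample(band_b)], axis=0)
```

A late-fusion alternative would instead run separate encoders (or full models) per sensor and merge their predictions, which is the kind of comparison the comment suggests.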
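For the distance values discussed above, the sketch below shows one common way to compute a mean distance error between predicted and ground-truth fronts represented as binary line masks on the same grid. The 40 m pixel spacing and the function names are assumptions for illustration, not values from the paper.

```python
# Minimal sketch (not the authors' implementation) of a mean front-distance error.
import numpy as np
from scipy.ndimage import distance_transform_edt

def mean_front_distance(pred_front: np.ndarray, true_front: np.ndarray,
                        pixel_size_m: float = 40.0) -> float:
    """Average distance (in metres) from predicted front pixels to the nearest
    ground-truth front pixel. `pixel_size_m` is an assumed grid spacing."""
    if pred_front.sum() == 0 or true_front.sum() == 0:
        return float("nan")  # undefined if either front is empty
    # Distance (in pixels) from every grid cell to the nearest true-front pixel.
    dist_to_true = distance_transform_edt(~true_front.astype(bool))
    return float(dist_to_true[pred_front.astype(bool)].mean() * pixel_size_m)

def symmetric_front_distance(pred_front, true_front, pixel_size_m=40.0):
    """Symmetrised variant, averaging both directions."""
    d1 = mean_front_distance(pred_front, true_front, pixel_size_m)
    d2 = mean_front_distance(true_front, pred_front, pixel_size_m)
    return 0.5 * (d1 + d2)
```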
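For the multi-seed and uncertainty points, a minimal sketch of aggregating a metric over independent runs and computing a simple per-pixel predictive entropy is given below. The example F1 values are hypothetical.

```python
# Minimal sketch (assumed workflow, not from the paper): report mean ± standard
# deviation over seeds and use binary predictive entropy as an uncertainty proxy.
import numpy as np

def summarise_runs(scores_per_seed):
    """scores_per_seed: list of metric values (e.g., F1) from independent runs."""
    scores = np.asarray(scores_per_seed, dtype=float)
    return {"mean": scores.mean(), "std": scores.std(ddof=1), "n_runs": len(scores)}

def predictive_entropy(probs, eps=1e-8):
    """probs: array of shape (n_runs, H, W) with per-run ice probabilities.
    Returns the per-pixel entropy of the run-averaged binary distribution."""
    p = np.clip(probs.mean(axis=0), eps, 1 - eps)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

# Example with F1 scores from five hypothetical seeds.
print(summarise_runs([0.81, 0.79, 0.83, 0.80, 0.82]))
```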
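For the loss-function comment, the sketch below shows one standard way of combining Dice and Focal losses as a weighted sum for binary ice/ocean segmentation in PyTorch. The weights and exact formulation are assumptions and may differ from the authors' implementation.

```python
# Minimal sketch (assumed formulation): weighted sum of Dice and Focal losses.
import torch
import torch.nn.functional as F

def dice_loss(logits, targets, eps=1.0):
    """Region-overlap loss; robust to class imbalance. Tensors of shape (N,1,H,W)."""
    probs = torch.sigmoid(logits)
    inter = (probs * targets).sum(dim=(1, 2, 3))
    union = probs.sum(dim=(1, 2, 3)) + targets.sum(dim=(1, 2, 3))
    return 1.0 - ((2.0 * inter + eps) / (union + eps)).mean()

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Pixel-wise loss that down-weights easy examples via the (1 - p_t)^gamma term."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)  # probability assigned to the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

def combined_loss(logits, targets, w_dice=0.5, w_focal=0.5):
    # Dice targets region overlap under class imbalance; Focal concentrates training
    # on hard pixels such as the thin front region. The weights are illustrative.
    return w_dice * dice_loss(logits, targets) + w_focal * focal_loss(logits, targets)
```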
Data sets
The Shelf-Bench Dataset: A benchmark dataset for Antarctic ice shelf front and coastline delineation from multi-sensor radar satellite data C. A. Baumhoer and A. B. Morgan https://doi.org/10.5281/zenodo.17610870
Model code and software
Shelf-Bench Baselines A. B. Morgan, X. Hou, and J. L. Fromentin https://github.com/amymorgan01/Shelf-Bench
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 299 | 200 | 25 | 524 | 33 | 39 |