the Creative Commons Attribution 4.0 License.
GlobalGeoTree: A Multi-Granular Vision-Language Dataset for Global Tree Species Classification
Abstract. Global tree species mapping using remote sensing data is vital for biodiversity monitoring, forest management, and ecological research. However, progress in this field has been constrained by the scarcity of large-scale, labeled datasets. To address this, we introduce GlobalGeoTree – a comprehensive global dataset for tree species classification. GlobalGeoTree comprises 6.3 million geolocated tree occurrences, spanning 275 families, 2,734 genera, and 21,001 species across hierarchical taxonomic levels. Each sample is paired with Sentinel-2 image time series and 27 auxiliary environmental variables, encompassing bioclimatic, geographic, and soil data. The dataset is partitioned into GlobalGeoTree-6M, a large subset for model pretraining, and curated evaluation subsets, primarily GlobalGeoTree-10kEval, a benchmark for zero-shot and few-shot classification. To demonstrate the utility of the dataset, we introduce a baseline model, GeoTreeCLIP, which leverages paired remote sensing data and taxonomic text labels within a vision-language framework pretrained on GlobalGeoTree-6M. Experimental results show that GeoTreeCLIP achieves substantial improvements in zero-shot and few-shot classification on GlobalGeoTree-10kEval over existing advanced models. By making the dataset, models, and code publicly available, we aim to establish a benchmark to advance tree species classification and foster innovation in biodiversity research and ecological applications. The code is publicly available at https://github.com/MUYang99/GlobalGeoTree, and the GlobalGeoTree dataset is available at https://huggingface.co/datasets/yann111/GlobalGeoTree.
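The abstract describes GeoTreeCLIP as pairing remote sensing embeddings with taxonomic text labels in a CLIP-style vision-language framework. A minimal sketch of the core contrastive matching step is shown below; the embedding dimensions and the use of nearly-matched toy embeddings are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def contrastive_logits(img_emb, txt_emb, temperature=0.07):
    """CLIP-style similarity: L2-normalize both embedding sets, then
    compute a temperature-scaled cosine-similarity matrix. Diagonal
    entries correspond to matched (image, taxonomic-label) pairs."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    return img @ txt.T / temperature

# Toy batch: 4 samples, 8-dim embeddings (shapes are illustrative).
rng = np.random.default_rng(0)
img_emb = rng.standard_normal((4, 8))
txt_emb = img_emb + 0.01 * rng.standard_normal((4, 8))  # near-matched pairs

logits = contrastive_logits(img_emb, txt_emb)
# Matched pairs score highest along the diagonal.
print(np.argmax(logits, axis=1))
```

During pretraining, a symmetric cross-entropy over the rows and columns of this logit matrix pulls each sample toward its own taxonomic label and away from the other labels in the batch.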
Status: final response (author comments only)
- RC1: 'Comment on essd-2025-613', Anonymous Referee #1, 14 Nov 2025
- RC2: 'Comment on essd-2025-613', Anonymous Referee #2, 29 Nov 2025
This manuscript presents a large-scale, multimodal dataset for global tree species mapping and provides baseline models built on vision–language architectures. The study is novel and well-executed, with clear potential impact. I particularly appreciate the innovative data integration and modeling approach, and I think this paper makes a strong contribution to the AI4forest and AI4ecology communities. I have a few points for discussion regarding methodological choices, data integration strategies, validation procedures, and practical applicability, which could benefit from further clarification.
- The authors convert all non–Sentinel-2 environmental layers (e.g., bioclimatic, soil, and SRTM data) into single scalar values and feed them into an additional MLP branch. Given that these auxiliary layers are nearly constant across each Sentinel-2 patch, it remains unclear whether a dedicated MLP module is efficient. A more straightforward and parameter-efficient alternative would be to treat these raster layers as additional bands and integrate them directly into the Sentinel-2 data cube. I suggest the authors simply discuss: Why not merge these coarse-resolution raster layers into the Sentinel-2 cube and encode them jointly with the visual encoder? What are the advantages and disadvantages of the current “scalar + MLP” design compared with the “extended raster cube” approach? How do these two strategies differ in terms of model capacity, parameter efficiency, and representational effectiveness?
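The reviewer's "scalar + MLP" versus "extended raster cube" question can be made concrete with a parameter-count comparison. The dimensions below (a 10-band 5×5 Sentinel-2 patch, a 128-unit hidden layer, a 3×3 first-layer kernel, a 64-dim output) are illustrative assumptions, not values from the manuscript.

```python
# Illustrative dimensions (assumptions, not from the manuscript):
# 27 auxiliary scalars; embedding dimension 64.
N_AUX, D = 27, 64

def mlp_branch_params(n_in=N_AUX, hidden=128, d_out=D):
    """'Scalar + MLP' design: the auxiliary scalars get a dedicated
    two-layer MLP (weights + biases); the parameter count is
    independent of the Sentinel-2 patch size."""
    return (n_in * hidden + hidden) + (hidden * d_out + d_out)

def extended_cube_extra_params(kernel=3, d_out=D):
    """'Extended raster cube' design: the 27 layers become extra input
    channels of the visual encoder's first convolution, adding only
    extra kernel weights (the biases already exist)."""
    return N_AUX * kernel * kernel * d_out

print(mlp_branch_params())           # parameters added by the MLP branch
print(extended_cube_extra_params())  # extra first-conv weights from 27 channels
```

Under these assumptions neither design dominates on raw parameter count; the more substantive difference is representational: the MLP branch keeps the (spatially constant) auxiliary variables out of the convolutional pathway, while the extended-cube design lets the visual encoder learn joint spatial–environmental features at the cost of replicating constant values across every pixel.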
- The validation set is derived from the same volunteer-contributed datasets (e.g., GBIF, iNaturalist) that were used for training. These datasets share similar sources of error and observation bias. I recommend incorporating independent validation sources, such as national forest inventories, ecological monitoring networks, or field-based regional datasets. If additional data cannot be included, a brief discussion is needed on how shared annotation noise and observation bias may affect validation reliability and the overall model evaluation.
- The manuscript states that 21,001 species are included. I suggest adding a structured summary table that includes representative families, genera, and species; their geographic distribution or dominant regions; brief ecological descriptions; and, optionally, sample images from public datasets. This would greatly enhance the readability and interpretability of the dataset.
- The current framework relies on multispectral imagery and environmental variables but lacks structural canopy information, which is highly relevant for separating woody species. I suggest simply discussing the potential benefits of incorporating global 1-m canopy height maps (e.g., Meta’s global CHM), global-scale LiDAR-derived products, or other structural/vertical metrics.
- Products such as GLC_FCS10 provide 10-m vegetation functional-type information and could be valuable for excluding non-forested regions or constraining the candidate species space. A short discussion on the potential integration of global land-cover products would be beneficial.
- The proposed dataset is highly valuable for global-scale tree species mapping. However, the manuscript does not sufficiently address how the dataset can be used in real-world inference and operational mapping. I recommend simply adding a dedicated section that discusses practical workflows for deploying the dataset in large-scale tree species mapping, potential inference pipelines, and challenges in global deployment (e.g., domain shifts, spatial biases, and seasonal variability). Such a discussion would help bridge the gap between dataset creation and applied remote-sensing or ecological use cases.
Citation: https://doi.org/10.5194/essd-2025-613-RC2
Data sets
GlobalGeoTree: A Multi-Granular Vision-Language Dataset for Global Tree Species Classification Y. Mu et al. https://doi.org/10.15468/dd.9qxqyy
Model code and software
GlobalGeoTree: A Multi-Granular Vision-Language Dataset for Global Tree Species Classification Y. Mu et al. https://github.com/MUYang99/GlobalGeoTree
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 338 | 137 | 29 | 504 | 29 | 25 |
General Comments:
The authors present a global multi-modal dataset for tree species classification, integrating diverse data sources and offering both a large-scale pretraining dataset and a separate evaluation set. They also propose GeoTreeCLIP, a model that leverages hierarchical label structures and demonstrates improvements over baseline methods. The experimental setup is comprehensive, including comparisons with CLIP-style models and supervised learning approaches. All code and data are publicly available.
Specific Comments:
1. Dataset Construction:
1.1 The authors use the JRC Forest Cover Map v1 for filtering. Given that version 2 has been publicly released with documented improvements, is there a reason for not using the updated version?
1.2 The GlobalGeoTree-10kEval set includes 90 species out of over 21,000. Could the authors clarify the selection criteria? Were any sampling or filtering strategies applied to ensure the reliability of the evaluation set, particularly given the inclusion of citizen science sources like iNaturalist?
1.3 While the evaluation set is constructed as a separate test set, there appears to be no explicit validation process to assess its quality. Given the integration of heterogeneous data sources, some form of validation (manual or automated) would greatly enhance the trustworthiness and utility of the dataset.
2. Model (GeoTreeCLIP):
2.1 The authors attribute the performance improvements of GeoTreeCLIP to domain-specific pretraining. However, it’s difficult to isolate the effects of pretraining alone, as other baseline models lack temporal fusion and may differ in how auxiliary data are handled. A more controlled ablation or discussion would strengthen this claim.
3. Evaluation Metrics and Reporting:
3.1 The paper mentions addressing class imbalance by grouping species into frequent, common, and rare categories. However, results are not reported per group. Including group-specific performance would align with common practices in imbalanced classification tasks.
3.2 Given the global scope of the dataset and the known regional biases, regional performance breakdowns would be informative and important for understanding model generalizability.
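The group-wise reporting requested in 3.1 is straightforward to compute once each species is assigned a frequency group. The sketch below is a minimal illustration with hypothetical species labels, not the authors' evaluation code.

```python
from collections import defaultdict

def grouped_accuracy(y_true, y_pred, species_group):
    """Accuracy per frequency group (e.g. frequent/common/rare).
    `species_group` maps each species label to its group name."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        g = species_group[t]
        total[g] += 1
        correct[g] += int(t == p)
    return {g: correct[g] / total[g] for g in total}

# Toy example with hypothetical species-to-group assignments.
groups = {"Quercus robur": "frequent", "Picea abies": "frequent",
          "Sorbus torminalis": "rare"}
y_true = ["Quercus robur", "Picea abies", "Sorbus torminalis", "Quercus robur"]
y_pred = ["Quercus robur", "Picea abies", "Quercus robur", "Quercus robur"]
print(grouped_accuracy(y_true, y_pred, groups))
# → {'frequent': 1.0, 'rare': 0.0}
```

The same pattern extends to the regional breakdowns requested in 3.2 by mapping each sample to a region instead of a frequency group.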
Additional comments:
The authors’ effort in assembling such a large-scale, publicly available dataset and developing a strong benchmark model is highly appreciated. However, since this is a data description paper, the dataset itself should be the focal point. At present, the lack of validation of the dataset is a significant limitation. While the work offers valuable contributions for machine learning research, particularly within benchmark or workshop tracks at venues like CVPR or NeurIPS, it may not yet meet the expectations for a journal like ESSD, which prioritizes data quality.