https://doi.org/10.5194/essd-18-1379-2026
© Author(s) 2026. This work is distributed under the Creative Commons Attribution 4.0 License.
GlobalGeoTree: a multi-granular vision-language dataset for global tree species classification
Download
- Final revised paper (published on 24 Feb 2026)
- Preprint (discussion started on 05 Nov 2025)
Interactive discussion
Status: closed
Comment types: AC – author | RC – referee | CC – community | EC – editor | CEC – chief editor
- RC1: 'Comment on essd-2025-613', Anonymous Referee #1, 14 Nov 2025
- RC2: 'Comment on essd-2025-613', Anonymous Referee #2, 29 Nov 2025
- AC1: 'Comment on essd-2025-613', Xiao Xiang Zhu, 14 Jan 2026
Peer review completion
AR – Author's response | RR – Referee report | ED – Editor decision | EF – Editorial file upload
AR by Xiao Xiang Zhu on behalf of the Authors (14 Jan 2026)
Author's response
Author's tracked changes
Manuscript
ED: Publish as is (25 Jan 2026) by Birgit Heim
AR by Xiao Xiang Zhu on behalf of the Authors (28 Jan 2026)
Manuscript
General Comments:
The authors present a global multi-modal dataset for tree species classification, integrating diverse data sources and offering both a large-scale pretraining dataset and a separate evaluation set. They also propose GeoTreeCLIP, a model that leverages hierarchical label structures and demonstrates improvements over baseline methods. The experimental setup is comprehensive, including comparisons with CLIP-style models and supervised learning approaches. All code and data are publicly available.
Specific Comments:
1. Dataset Construction:
1.1 The authors use the JRC Forest Cover Map v1 for filtering. Given that version 2 has been publicly released with documented improvements, is there a reason for not using the updated version?
1.2 The GlobalGeoTree-10kEval set includes 90 species out of over 21,000. Could the authors clarify the selection criteria? Were any sampling or filtering strategies applied to ensure the reliability of the evaluation set, particularly given the inclusion of citizen science sources like iNaturalist?
1.3 While the evaluation set is constructed as a separate test set, there appears to be no explicit validation process to assess its quality. Given the integration of heterogeneous data sources, some form of validation (manual or automated) would greatly enhance the trustworthiness and utility of the dataset.
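To make this suggestion concrete: even a lightweight automated check would help, for example resolving each species label against the GBIF taxonomic backbone and keeping only research-grade iNaturalist observations. A minimal sketch of such a filter, assuming a pandas DataFrame with hypothetical species, source, and quality_grade columns (pygbif's name_backbone is an existing endpoint; the schema here is illustrative, not the dataset's actual one):
```python
import pandas as pd
from pygbif import species as gbif_species  # pip install pygbif

def validate_labels(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only records whose label resolves exactly in the GBIF backbone,
    and only research-grade iNaturalist records. Column names ('species',
    'source', 'quality_grade') are placeholders, not the released schema."""
    # Drop iNaturalist records that are not flagged research-grade.
    inat = df["source"].eq("iNaturalist")
    df = df[~inat | df["quality_grade"].eq("research")]

    # Resolve each unique label once against the GBIF taxonomic backbone.
    exact = {}
    for name in df["species"].unique():
        match = gbif_species.name_backbone(name=name, rank="species")
        exact[name] = match.get("matchType") == "EXACT"
    return df[df["species"].map(exact)]
```
Reporting how many records such a check removes, per source, would already give readers a useful quality signal.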
2. Model (GeoTreeCLIP):
2.1 The authors attribute the performance improvements of GeoTreeCLIP to domain-specific pretraining. However, it’s difficult to isolate the effect of pretraining alone, as the baseline models lack temporal fusion and may differ in how auxiliary data are handled. A more controlled ablation, or at least a discussion of these confounds, would strengthen this claim.
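One way to make the ablation controlled would be to hold the GeoTreeCLIP architecture fixed and vary only initialization, temporal fusion, and auxiliary inputs in a full factorial grid. A minimal sketch (factor names are illustrative, not taken from the paper):
```python
from itertools import product

# Factorial ablation grid: hold the architecture fixed and vary only the
# three factors whose effects are currently confounded in the comparison.
factors = {
    "init": ["domain_pretrained", "clip_weights", "random"],
    "temporal_fusion": [True, False],
    "auxiliary_data": [True, False],
}

configs = [dict(zip(factors, combo)) for combo in product(*factors.values())]
for cfg in configs:
    print(cfg)  # each of the 12 configs is trained and evaluated identically
```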
3. Evaluation Metrics and Reporting:
3.1 The paper mentions addressing class imbalance by grouping species into frequent, common, and rare categories. However, results are not reported per group. Including group-specific performance would align with common practice in imbalanced classification tasks (see the sketch below).
3.2 Given the global scope of the dataset and the known regional biases, regional performance breakdowns would be informative and important for understanding model generalizability.
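Both breakdowns reduce to the same grouped-accuracy computation, so reporting them should be inexpensive. A minimal sketch with pandas, assuming per-sample predictions and hypothetical frequency_tier and region columns (none of these names come from the paper):
```python
import pandas as pd

def grouped_accuracy(results: pd.DataFrame, by: str) -> pd.Series:
    """Top-1 accuracy per group. 'y_true', 'y_pred', and the grouping
    columns are placeholder names, not the paper's actual outputs."""
    correct = results["y_true"].eq(results["y_pred"])
    return correct.groupby(results[by]).mean()

# Usage (hypothetical file and columns):
# results = pd.read_csv("predictions.csv")
# print(grouped_accuracy(results, by="frequency_tier"))  # frequent / common / rare
# print(grouped_accuracy(results, by="region"))          # e.g. continental breakdown
```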
Additional comments:
The authors’ effort in assembling such a large-scale, publicly available dataset and developing a strong benchmark model is highly appreciated. However, since this is a data description paper, the dataset itself should be the focal point. At present, the lack of validation for the dataset is a significant limitation. While the work offers valuable contributions to machine learning research, particularly within benchmark or workshop tracks at venues like CVPR or NeurIPS, it may not yet meet the expectations of a journal like ESSD, which prioritizes data quality.