Articles | Volume 17, issue 3
https://doi.org/10.5194/essd-17-1245-2025
© Author(s) 2025. This work is distributed under
the Creative Commons Attribution 4.0 License.
ChatEarthNet: a global-scale image–text dataset empowering vision–language geo-foundation models
Zhenghang Yuan
Data Science in Earth Observation, Technical University of Munich, 80333 Munich, Germany
Zhitong Xiong
Data Science in Earth Observation, Technical University of Munich, 80333 Munich, Germany
Lichao Mou
Data Science in Earth Observation, Technical University of Munich, 80333 Munich, Germany
Xiao Xiang Zhu
Data Science in Earth Observation, Technical University of Munich, 80333 Munich, Germany
Munich Center for Machine Learning, 80333 Munich, Germany
Xiao Xiang Zhu, Sining Chen, Fahong Zhang, Yilei Shi, and Yuanyuan Wang
Earth Syst. Sci. Data Discuss., https://doi.org/10.5194/essd-2025-327, 2025
Preprint under review for ESSD
Short summary
We introduce GlobalBuildingAtlas, a publicly available dataset offering global and complete coverage of building polygons (GBA.Polygon), heights (GBA.Height), and Level-of-Detail-1 3D models (GBA.LoD1). It is the first open dataset to offer high-quality, consistent, and complete building data in 2D and 3D at the individual-building level on a global scale. With more than 2.75 billion buildings worldwide, it surpasses the most comprehensive database to date by more than 1 billion buildings.
Viola Steidl, Jonathan Louis Bamber, and Xiao Xiang Zhu
The Cryosphere, 19, 645–661, https://doi.org/10.5194/tc-19-645-2025, 2025
Short summary
Glacier ice thickness is difficult to measure directly but is essential for glacier evolution modelling. In this work, we employ a novel approach combining physical knowledge and data-driven machine learning to estimate the ice thickness of multiple glaciers in Spitsbergen, Barentsøya, and Edgeøya in Svalbard. We identify challenges for the physics-aware machine learning model and opportunities for improving the accuracy and physical consistency that would also apply to other geophysical tasks.
Yifan Tian, Yao Sun, and Xiao Xiang Zhu
Abstr. Int. Cartogr. Assoc., 7, 171, https://doi.org/10.5194/ica-abs-7-171-2024, 2024
Erik Loebel, Mirko Scheinert, Martin Horwath, Angelika Humbert, Julia Sohn, Konrad Heidler, Charlotte Liebezeit, and Xiao Xiang Zhu
The Cryosphere, 18, 3315–3332, https://doi.org/10.5194/tc-18-3315-2024, 2024
Short summary
Comprehensive datasets of calving-front changes are essential for studying and modeling outlet glaciers. Current records are limited in temporal resolution due to manual delineation. We use deep learning to automatically delineate calving fronts for 23 glaciers in Greenland. Resulting time series resolve long-term, seasonal, and subseasonal patterns. We discuss the implications of our results and provide the cryosphere community with a data product and an implementation of our processing system.
Weiyan Lin, Jiasong Zhu, Yuansheng Hua, Qingyu Li, Lichao Mou, and Xiao Xiang Zhu
Int. Arch. Photogramm. Remote Sens. Spatial Inf. Sci., XLVIII-1-2024, 371–378, https://doi.org/10.5194/isprs-archives-XLVIII-1-2024-371-2024, 2024
Tian Li, Konrad Heidler, Lichao Mou, Ádám Ignéczi, Xiao Xiang Zhu, and Jonathan L. Bamber
Earth Syst. Sci. Data, 16, 919–939, https://doi.org/10.5194/essd-16-919-2024, 2024
Short summary
Our study uses deep learning to produce a new high-resolution calving front dataset for 149 marine-terminating glaciers in Svalbard from 1985 to 2023, containing 124 919 terminus traces. This dataset offers insights into understanding calving mechanisms and can help improve glacier frontal ablation estimates as a component of the integrated mass balance assessment.
Y. Sun, A. Kruspe, L. Meng, Y. Tian, E. J. Hoffmann, S. Auer, and X. X. Zhu
Int. Arch. Photogramm. Remote Sens. Spatial Inf. Sci., XLVIII-1-W2-2023, 225–232, https://doi.org/10.5194/isprs-archives-XLVIII-1-W2-2023-225-2023, 2023
J. Zhao, F. Roth, B. Bauer-Marschallinger, W. Wagner, M. Chini, and X. X. Zhu
ISPRS Ann. Photogramm. Remote Sens. Spatial Inf. Sci., X-1-W1-2023, 911–918, https://doi.org/10.5194/isprs-annals-X-1-W1-2023-911-2023, 2023
Yao Sun, Stefan Auer, Liqiu Meng, and Xiao Xiang Zhu
Abstr. Int. Cartogr. Assoc., 6, 250, https://doi.org/10.5194/ica-abs-6-250-2023, 2023
Jingliang Hu, Rong Liu, Danfeng Hong, Andrés Camero, Jing Yao, Mathias Schneider, Franz Kurz, Karl Segl, and Xiao Xiang Zhu
Earth Syst. Sci. Data, 15, 113–131, https://doi.org/10.5194/essd-15-113-2023, 2023
Short summary
Multimodal data fusion is an intuitive strategy for overcoming the limitations of individual data sources in Earth observation. Here, we present a multimodal data set, named MDAS, consisting of synthetic aperture radar (SAR), multispectral, hyperspectral, digital surface model (DSM), and geographic information system (GIS) data for the city of Augsburg, Germany, along with baseline models for three typical remote sensing applications: resolution enhancement, spectral unmixing, and land cover classification.
S. Zhao, S. Saha, and X. X. Zhu
Int. Arch. Photogramm. Remote Sens. Spatial Inf. Sci., XLIII-B3-2022, 1407–1413, https://doi.org/10.5194/isprs-archives-XLIII-B3-2022-1407-2022, 2022
S. Saha, J. Gawlikowski, and X. X. Zhu
Int. Arch. Photogramm. Remote Sens. Spatial Inf. Sci., XLIII-B3-2022, 423–428, https://doi.org/10.5194/isprs-archives-XLIII-B3-2022-423-2022, 2022
T. Beker, H. Ansari, S. Montazeri, Q. Song, and X. X. Zhu
ISPRS Ann. Photogramm. Remote Sens. Spatial Inf. Sci., V-3-2022, 85–92, https://doi.org/10.5194/isprs-annals-V-3-2022-85-2022, 2022
K. R. Traoré, A. Camero, and X. X. Zhu
ISPRS Ann. Photogramm. Remote Sens. Spatial Inf. Sci., V-3-2022, 217–224, https://doi.org/10.5194/isprs-annals-V-3-2022-217-2022, 2022
Y. Xie, K. Schindler, J. Tian, and X. X. Zhu
Int. Arch. Photogramm. Remote Sens. Spatial Inf. Sci., XLIII-B2-2021, 247–254, https://doi.org/10.5194/isprs-archives-XLIII-B2-2021-247-2021, 2021
P. Ebel, S. Saha, and X. X. Zhu
Int. Arch. Photogramm. Remote Sens. Spatial Inf. Sci., XLIII-B3-2021, 243–249, https://doi.org/10.5194/isprs-archives-XLIII-B3-2021-243-2021, 2021
S. Saha, L. Kondmann, and X. X. Zhu
ISPRS Ann. Photogramm. Remote Sens. Spatial Inf. Sci., V-3-2021, 311–316, https://doi.org/10.5194/isprs-annals-V-3-2021-311-2021, 2021
Short summary
ChatEarthNet is an image–text dataset that provides high-quality, detailed natural language descriptions for global-scale satellite data. It consists of 163 488 image–text pairs with captions generated by ChatGPT-3.5 and an additional 10 000 image–text pairs with captions generated by ChatGPT-4V(ision). This dataset has significant potential for training and evaluating vision–language geo-foundation models in remote sensing.