https://doi.org/10.5194/essd-2024-140
27 Jun 2024
Status: this preprint is currently under review for the journal ESSD.

ChatEarthNet: A Global-Scale Image-Text Dataset Empowering Vision-Language Geo-Foundation Models

Zhenghang Yuan, Zhitong Xiong, Lichao Mou, and Xiao Xiang Zhu

Abstract. The rapid development of remote sensing technology has led to an exponential growth in satellite images, yet their inherent complexity often makes them difficult for non-expert users to understand. Natural language, as a carrier of human knowledge, can bridge the gap between common users and complicated satellite imagery. Additionally, when paired with visual data, natural language can be used to train large vision-language foundation models, significantly improving performance on various tasks. Despite these advancements, the remote sensing community still faces a challenge due to the lack of large-scale, high-quality vision-language datasets for satellite images. To address this challenge, we introduce a new image-text dataset providing high-quality natural language descriptions for global-scale satellite data. Specifically, we utilize Sentinel-2 data, chosen for its global coverage, as the foundational image source and employ semantic segmentation labels from the European Space Agency’s WorldCover project to enrich the descriptions of land cover. By conducting in-depth semantic analysis, we formulate detailed prompts to elicit rich descriptions from ChatGPT. We then apply a manual verification process, involving inspection and correction, to further refine the dataset’s quality. Finally, we offer the community ChatEarthNet, a large-scale image-text dataset characterized by global coverage, high quality, wide-ranging diversity, and detailed descriptions. ChatEarthNet consists of 163,488 image-text pairs with captions generated by ChatGPT-3.5 and an additional 10,000 image-text pairs with captions generated by ChatGPT-4V(ision). This dataset has significant potential for both training and evaluating vision-language geo-foundation models for remote sensing. The code is publicly available at https://doi.org/10.5281/zenodo.11004358 (Yuan et al., 2024b), and the ChatEarthNet dataset is at https://doi.org/10.5281/zenodo.11003436 (Yuan et al., 2024c).
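To illustrate the kind of prompt construction the abstract describes, the minimal Python sketch below derives a land cover summary from a WorldCover-style segmentation mask and turns it into a caption request for ChatGPT. The class codes follow the ESA WorldCover legend, but the function names, prompt wording, and toy mask are illustrative assumptions, not the authors' actual pipeline.

    # Minimal sketch (not the authors' exact pipeline): build a caption-request
    # prompt from the land cover composition of a Sentinel-2 patch, using a
    # WorldCover-style label mask. Prompt wording is an illustrative assumption.

    from collections import Counter

    # ESA WorldCover legend (class code -> class name)
    WORLDCOVER_CLASSES = {
        10: "tree cover", 20: "shrubland", 30: "grassland", 40: "cropland",
        50: "built-up", 60: "bare / sparse vegetation", 70: "snow and ice",
        80: "permanent water bodies", 90: "herbaceous wetland",
        95: "mangroves", 100: "moss and lichen",
    }

    def land_cover_summary(label_mask):
        """Return per-class pixel fractions for a 2-D WorldCover label mask."""
        counts = Counter(int(v) for row in label_mask for v in row)
        total = sum(counts.values())
        return {
            WORLDCOVER_CLASSES.get(cls, f"class {cls}"): n / total
            for cls, n in counts.items()
        }

    def build_prompt(label_mask):
        """Compose a caption-request prompt from the land cover composition."""
        fractions = land_cover_summary(label_mask)
        parts = ", ".join(
            f"{name} covers about {100 * frac:.0f}% of the image"
            for name, frac in sorted(fractions.items(), key=lambda x: -x[1])
        )
        return (
            "Describe this satellite image in detail. "
            f"According to the land cover map, {parts}. "
            "Focus on the spatial arrangement of these land cover types."
        )

    # Example with a tiny toy mask (values are WorldCover class codes):
    toy_mask = [[10, 10, 40], [10, 40, 40], [80, 80, 40]]
    print(build_prompt(toy_mask))

In this toy example, the prompt reports roughly 44% cropland, 33% tree cover, and 22% water; a prompt built this way would then be sent to ChatGPT to elicit a free-text description of the patch.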

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this preprint. The responsibility to include appropriate place names lies with the authors.

Discussion status: open (until 03 Aug 2024)


Data sets

ChatEarthNet Zhenghang Yuan, Zhitong Xiong, Lichao Mou, and Xiao Xiang Zhu https://zenodo.org/records/11003436

Model code and software

ChatEarthNet Zhenghang Yuan, Zhitong Xiong, Lichao Mou, and Xiao Xiang Zhu https://github.com/zhu-xlab/ChatEarthNet


Viewed

Total article views: 50 (including HTML, PDF, and XML)
  • HTML: 34
  • PDF: 10
  • XML: 6
  • Total: 50
  • BibTeX: 3
  • EndNote: 3
Cumulative views and downloads (calculated since 27 Jun 2024)

Viewed (geographical distribution)

Total article views: 50 (including HTML, PDF, and XML). Thereof 50 with geography defined and 0 with unknown origin.
Latest update: 29 Jun 2024
Short summary
ChatEarthNet is an image-text dataset that provides high-quality, detailed natural language descriptions for global-scale satellite data. It consists of 163,488 image-text pairs with captions generated by ChatGPT-3.5, and an additional 10,000 image-text pairs with captions generated by ChatGPT-4V(ision). This dataset has significant potential for training and evaluating vision-language geo-foundation models in remote sensing.