This work is distributed under the Creative Commons Attribution 4.0 License.
ChatEarthNet: A Global-Scale Image-Text Dataset Empowering Vision-Language Geo-Foundation Models
Abstract. The rapid development of remote sensing technology has led to an exponential growth in satellite images, yet their inherent complexity often makes them difficult for non-expert users to understand. Natural language, as a carrier of human knowledge, can serve as a bridge between common users and complicated satellite imagery. Additionally, when paired with visual data, natural language can be utilized to train large vision-language foundation models, significantly improving performance in various tasks. Despite these advancements, the remote sensing community still faces a challenge due to the lack of large-scale, high-quality vision-language datasets for satellite images. To address this challenge, we introduce a new image-text dataset, providing high-quality natural language descriptions for global-scale satellite data. Specifically, we utilize Sentinel-2 data for its global coverage as the foundational image source, employing semantic segmentation labels from the European Space Agency’s WorldCover project to enrich the descriptions of land covers. By conducting in-depth semantic analysis, we formulate detailed prompts to elicit rich descriptions from ChatGPT. We then apply a manual verification process, involving inspection and correction, to further enhance the dataset’s quality. Finally, we offer the community ChatEarthNet, a large-scale image-text dataset characterized by global coverage, high quality, wide-ranging diversity, and detailed descriptions. ChatEarthNet consists of 163,488 image-text pairs with captions generated by ChatGPT-3.5 and an additional 10,000 image-text pairs with captions generated by ChatGPT-4V(ision). This dataset has significant potential for both training and evaluating vision-language geo-foundation models for remote sensing. The code is publicly available at https://doi.org/10.5281/zenodo.11004358 (Yuan et al., 2024b), and the ChatEarthNet dataset is available at https://doi.org/10.5281/zenodo.11003436 (Yuan et al., 2024c).
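For readers who want a concrete picture of the caption-generation pipeline outlined in the abstract, a minimal sketch is given below. The WorldCover class subset and the prompt wording are illustrative assumptions, not the paper's exact Algorithms 1-3, and the OpenAI client call simply stands in for the ChatGPT API usage described in the paper.

```python
import numpy as np
from openai import OpenAI  # assumes the official openai>=1.0 Python client

# Illustrative subset of ESA WorldCover class codes -> names (assumption).
WORLDCOVER_CLASSES = {10: "tree cover", 30: "grassland", 40: "cropland",
                      50: "built-up", 80: "permanent water bodies"}

def describe_mask(mask: np.ndarray) -> str:
    """Turn a WorldCover label patch into a short textual summary of class shares."""
    total = mask.size
    parts = []
    for code, name in WORLDCOVER_CLASSES.items():
        share = (mask == code).sum() / total
        if share > 0.01:  # ignore negligible classes
            parts.append(f"{name} covers about {share:.0%} of the patch")
    return "; ".join(parts)

def caption_from_chatgpt(mask: np.ndarray, model: str = "gpt-3.5-turbo") -> str:
    """Ask ChatGPT for a land cover description (hypothetical prompt wording)."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    prompt = (
        "You are describing a Sentinel-2 satellite image. "
        f"Land cover statistics: {describe_mask(mask)}. "
        "Write a detailed, fluent description of the land cover in this scene."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```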
Status: final response (author comments only)
RC1: 'Comment on essd-2024-140', Anonymous Referee #1, 28 Jul 2024
The authors propose a land cover dataset, ChatEarthNet, built by pairing Sentinel-2 patches with their corresponding WorldCover masks, which contain 12 land cover classes.
The originality comes from providing the land cover data not directly as a bitmap, but as a textual description extracted from the WorldCover map by means of a large language model (LLM).
Specifically, they use two different models: ChatGPT-3.5, an LLM that can only receive text as input, and ChatGPT-4V, a vision LLM (VLLM) that is able to understand both text and images. Due to cost, they provide 163k images with captions generated by GPT-3.5 and 10k by GPT-4V.
The Sentinel-2 patches are obtained from the dataset SatlasPretrain.
Main comments:
1) The paper describes the prompting process, which differs for GPT-3.5 and GPT-4V. Although the prompt is provided, some details are missing regarding the exact construction of the outputs of Algorithms 1 to 3, since the exact wording of the prompts produced by these algorithms is not given.
2) Section 2.5 briefly mentions that manual verification is applied in order to check that the LLM correctly followed the prompt instructions. However, it is not clear how many times the prompt had to be modified, and the kind of modifications that were required.
3) Although the authors claim that “10k high-quality image-text pairs using ChatGPT-4V are sufficient for fine-tuning large vision-language models”, they do not provide any evidence for this. There is no evaluation of the properties of a model trained with the proposed dataset, making it impossible to judge the quality of the representation that can be learned with it, in comparison with a model trained directly for land cover mapping using the WorldCover data.
4) The authors conclude that “ChatEarthNet is a valuable resource for training and evaluating vision-language geo-foundation models for remote sensing”. However, it is not fully clear how this evaluation would work. To be able to conclude this, I suggest the authors use the dataset to evaluate existing models, such as RemoteCLIP [A], RS-CLIP [B] and others (a retrieval-evaluation sketch follows the references below).
[A] Liu, Fan, et al. "Remoteclip: A vision language foundation model for remote sensing." IEEE Transactions on Geoscience and Remote Sensing (2024).
[B] Li, Xiang, et al. "RS-CLIP: Zero shot remote sensing scene classification via contrastive vision-language supervision." International Journal of Applied Earth Observation and Geoinformation 124 (2023): 103497.
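As a concrete illustration of the evaluation suggested in comment 4, the sketch below computes text-to-image Recall@k with a generic CLIP checkpoint from Hugging Face. The checkpoint name is a placeholder; RemoteCLIP [A] or RS-CLIP [B] weights could be substituted where available, and the paired image/caption lists are assumed to come from ChatEarthNet.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

def retrieval_recall_at_k(images, captions, k=5,
                          checkpoint="openai/clip-vit-base-patch32"):
    """Text-to-image Recall@k for paired (image_i, caption_i) lists.

    `images` is a list of PIL images and `captions` the matching strings;
    swapping in a remote sensing CLIP checkpoint is a one-line change.
    """
    model = CLIPModel.from_pretrained(checkpoint).eval()
    processor = CLIPProcessor.from_pretrained(checkpoint)

    inputs = processor(text=captions, images=images,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)

    # Similarity matrix: rows = captions, columns = images.
    sim = out.logits_per_text

    # A caption counts as a hit if its own image is among the top-k matches.
    topk = sim.topk(min(k, sim.shape[-1]), dim=-1).indices
    targets = torch.arange(sim.shape[0]).unsqueeze(-1)
    return (topk == targets).any(dim=-1).float().mean().item()
```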
Minor comments:
In Section 3.2, the authors write that “ChatGPT-3.5 is more dense, covering a wider range of areas”. However, aren’t both datasets obtained by randomly sampling SatlasPretrain? Shouldn’t they therefore have roughly the same distribution? If I understand it well, the only difference should be the number of images.
In Section 3.3, the authors explore word frequency in the generated captions.
In line 50, “few pairs in the website” should be “few pairs on the web” or “online”.
The authors often refer to “land covers”, although “land cover types” or “classes” would perhaps be more appropriate.
Citation: https://doi.org/10.5194/essd-2024-140-RC1
- CC2: 'Reply on RC1', Zhenghang Yuan, 20 Sep 2024
RC2: 'Comment on essd-2024-140', Anonymous Referee #2, 12 Aug 2024
1. Throughout the paper, the authors mention image-text datasets many times. Image-text datasets cover multiple different types of annotations, such as image captioning, VQA, and visual grounding. Since this paper focuses on image captioning, the writing should be modified accordingly.
2. The authors use land cover labels from WorldCover products to formulate prompts. The information carried by the image captions mainly covers land cover information, which limits the usage of the proposed dataset. This is a big drawback when compared to previous datasets (e.g., RSICap) that provide more diverse information (such as object counting, position, size, and complex reasoning).
3. Information Overlap. To generate image captions, the proposed method divides each 256x256 image into five 128x128 patches: top-left, top-right, bottom-left, bottom-right, and center. The center patch overlaps with the other patches, which causes two issues: 1) duplicated object descriptions; 2) duplicated object counting (see the cropping sketch after this list).
5. "Moreover, considering the API request limit of ChatGPT-4V, we put four images into one request to generate descriptions more efficiently". By putting four images into one request, do you mean concate the images into one? Merging multiple images will cause undesired interactions between image features caused by self-attention in transformer architecture. As far as I know, GPT-4V allows 10,000 requests per day, it's therefore not necessary to put four images into one request.
5. Missing experimental verification. In the current version, it is unclear how this dataset can be used to boost the development of LVLMs in remote sensing. As a benchmark dataset, it would be better to show the image captioning performance of existing well-known methods on the proposed dataset.
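As referenced in comment 3, the five-crop layout can be written down explicitly. The sketch below assumes a 256x256 input and shows that the center 128x128 crop overlaps each corner crop by 64x64 pixels, which is the source of the duplication described above.

```python
import numpy as np

def five_crops(image: np.ndarray, size: int = 128):
    """Return the four corner crops plus the center crop of a 2D/3D array.

    For a 256x256 input, the center crop shares a 64x64 region with each
    corner crop, so objects in those regions can be described (or counted) twice.
    """
    h, w = image.shape[:2]
    offsets = {
        "top_left": (0, 0),
        "top_right": (0, w - size),
        "bottom_left": (h - size, 0),
        "bottom_right": (h - size, w - size),
        "center": ((h - size) // 2, (w - size) // 2),
    }
    return {name: image[r:r + size, c:c + size] for name, (r, c) in offsets.items()}

# Example: quantify the overlap between the center crop and one corner crop.
img = np.zeros((256, 256), dtype=np.uint8)
crops = five_crops(img)
overlap = 128 - 64  # center starts at row/col 64; top-left ends at row/col 128
print(f"center/top-left overlap: {overlap}x{overlap} pixels")
```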
Minors:
1. ChatGPT-3.5 is not a widely used term. Instead, ChatGPT and gpt-3.5-turbo are more frequently used.
2. In line 67, referring image segmentation belongs to visual grounding and therefore should be merged.
3. In line 44, when mentioning large vision-language foundation models, the authors fail to cover popular models such as MiniGPT-4 and Qwen-VL.
4. In Table I, it's unclear whether the 10,000 images used with GPT-4V are included in those 163,488 images used with GPT-3.5. If included, the second column can be removed.
5. In Fig. 15, it would be better to show the y-axis as a probability distribution instead of the number of images, for a fair comparison between GPT-3.5 and GPT-4 (a normalization sketch follows this list).
6. Section 3.3 can be compressed.
Citation: https://doi.org/10.5194/essd-2024-140-RC2
- CC3: 'Reply on RC2', Zhenghang Yuan, 20 Sep 2024
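Regarding minor comment 5, the requested normalization is a small change in matplotlib. The sketch below uses synthetic caption-length samples as hypothetical stand-ins for the Fig. 15 statistics; density=True rescales each histogram so the 163k- and 10k-sample distributions become directly comparable.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical caption-length samples standing in for the Fig. 15 statistics.
rng = np.random.default_rng(0)
lengths_gpt35 = rng.normal(120, 20, size=163_488)
lengths_gpt4v = rng.normal(150, 25, size=10_000)

# density=True rescales counts so each histogram integrates to 1, making
# differently sized samples directly comparable on the same axes.
bins = np.linspace(50, 250, 41)
plt.hist(lengths_gpt35, bins=bins, density=True, alpha=0.5, label="ChatGPT-3.5")
plt.hist(lengths_gpt4v, bins=bins, density=True, alpha=0.5, label="ChatGPT-4V")
plt.xlabel("caption length (words)")
plt.ylabel("probability density")
plt.legend()
plt.show()
```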
RC3: 'Comment on essd-2024-140', Anonymous Referee #3, 06 Sep 2024
The authors propose an image-text dataset for remote sensing vision-language geo-foundation models. In detail, the image source is Sentinel-2 data, and the descriptions of land covers are obtained from the semantic segmentation labels of the European Space Agency’s WorldCover project. Moreover, ChatGPT and a manual verification process are introduced to enhance the dataset. The presented work focuses on considerable data collection and processing; however, the experimentation could be further improved. The reviewer has the following comments:
Main comments:
- In this work, a global image-text dataset is presented in the field of remote sensing. There are some existing image-text datasets, and the authors are encouraged to specifically compare the proposed dataset with those that exist, such as SkySenseGPT (https://arxiv.org/pdf/2406.10100), SkyScript (https://ojs.aaai.org/index.php/AAAI/article/view/28393), and RemoteCLIP (https://ieeexplore.ieee.org/document/10504785).
- For the designed dataset, how is the imbalance between foreground and background in the remote sensing segmentation task taken into account?
- The authors are advised to explain the reasons for choosing the land cover maps from WorldCover. In addition, how is the accuracy of the labelling in these land cover maps measured? (A minimal accuracy-check sketch follows this list.)
- The authors mention that the proposed dataset has many high-quality and detailed descriptions; has this been validated by quantitative comparison experiments with other datasets?
- The authors are encouraged to discuss which attributes are more important for multimodal vision-language learning than for vision representation, e.g., the relative size or relative position described in the text.
- It seems that there is a lot of textual information in the proposed dataset; does this introduce interfering information? How can negative learning due to interfering information be avoided?
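Regarding the question about label accuracy raised above, a simple patch-level check is sketched below. It assumes an independently labelled reference mask with the same WorldCover class codes, which is an assumption rather than something provided in the paper.

```python
import numpy as np

def label_accuracy(worldcover: np.ndarray, reference: np.ndarray, classes):
    """Overall accuracy and per-class recall of a WorldCover patch against a
    reference mask with the same class codes (hypothetical reference source)."""
    assert worldcover.shape == reference.shape
    overall = (worldcover == reference).mean()
    per_class = {}
    for c in classes:
        ref_pixels = reference == c
        if ref_pixels.any():
            per_class[c] = (worldcover[ref_pixels] == c).mean()
    return overall, per_class

# Toy example with two classes (10 = tree cover, 50 = built-up).
ref = np.array([[10, 10], [50, 50]])
wc  = np.array([[10, 50], [50, 50]])
overall, per_class = label_accuracy(wc, ref, classes=[10, 50])
print(overall)     # 0.75
print(per_class)   # {10: 0.5, 50: 1.0}
```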
Minors:
- Please rephrase the description of “Image-Text Dataset”; could the proposed dataset be used for other vision-language tasks, such as image-to-text and text-to-image synthesis?
- The “2.5 Manual verification” section should add details of the manual adjustments, such as under what circumstances manual verification is required and what information is adjusted; a visual example would help.
- The y-axes of Figs. 9-10 should be set to the same scale to make the contrast clearer.
- The authors claim that “it stands out as the first dataset offering high-quality detailed land cover descriptions on a global scale” on line 230 of page 14. Please replace this expression with a more accurate description.
- Page 10 has gaps, and the authors are encouraged to reformat the article.
Citation: https://doi.org/10.5194/essd-2024-140-RC3
- CC4: 'Reply on RC3', Zhenghang Yuan, 20 Sep 2024
- CC1: 'Comment on essd-2024-140', Zhenghang Yuan, 20 Sep 2024
- AC1: 'Response on essd-2024-140', Xiao Xiang Zhu, 04 Oct 2024
Data sets
ChatEarthNet Zhenghang Yuan, Zhitong Xiong, Lichao Mou, and Xiao Xiang Zhu https://zenodo.org/records/11003436
Model code and software
ChatEarthNet Zhenghang Yuan, Zhitong Xiong, Lichao Mou, and Xiao Xiang Zhu https://github.com/zhu-xlab/ChatEarthNet
Viewed
HTML | PDF | XML | Total | BibTeX | EndNote
---|---|---|---|---|---
471 | 139 | 440 | 1,050 | 23 | 19