This work is distributed under the Creative Commons Attribution 4.0 License.
A large-scale image-text dataset benchmark for farmland segmentation
Abstract. Understanding and mastering the spatiotemporal characteristics of farmland is essential for accurate farmland segmentation. The traditional deep learning paradigm that relies solely on labeled data has limitations in representing the spatial relationships between farmland elements and the surrounding environment, and it struggles to effectively model the dynamic temporal evolution and spatial heterogeneity of farmland. Language, as a structured knowledge carrier, can explicitly express the spatiotemporal characteristics of farmland, such as its shape, distribution, and surrounding environmental information. A language-driven learning paradigm can therefore effectively alleviate the challenges posed by the spatiotemporal heterogeneity of farmland. However, in the field of farmland remote sensing imagery, there is currently no comprehensive benchmark dataset to support this research direction. To fill this gap, we introduce language-based descriptions of farmland and develop the FarmSeg-VL dataset, the first fine-grained image-text dataset designed for spatiotemporal farmland segmentation. First, this article proposes a semi-automatic annotation method that accurately assigns a caption to each image, ensuring high data quality and semantic richness while improving the efficiency of dataset construction. Second, FarmSeg-VL exhibits significant spatiotemporal characteristics. In the temporal dimension, it covers all four seasons; in the spatial dimension, it covers eight typical agricultural regions across China, with a total area of approximately 4,300 km². In addition, its captions cover rich spatiotemporal characteristics of farmland, including its inherent properties, phenological characteristics, spatial distribution, topographic and geomorphic features, and the distribution of the surrounding environment. Finally, we present a performance analysis of vision-language models and of deep learning models that rely solely on labels, both trained on FarmSeg-VL, demonstrating its potential as a standard benchmark for farmland segmentation. The FarmSeg-VL dataset will be publicly released at https://doi.org/10.5281/zenodo.15099885 (Tao et al., 2025).
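The abstract describes FarmSeg-VL as pairs of remote sensing images, farmland segmentation labels, and natural-language captions. As a rough illustration only, the following Python sketch shows how such image-mask-caption triplets could be wrapped for model training; the directory layout (images/, masks/), file extensions, the captions.json format, and the ImageTextSegDataset class are hypothetical assumptions for this sketch, not the authors' actual data structure.

```python
# Minimal sketch (not the authors' code) of loading image-mask-caption triplets.
# Directory layout, file names, and caption file format are assumed for illustration.
import json
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset


class ImageTextSegDataset(Dataset):
    """Pairs a remote sensing image tile with its farmland mask and text caption."""

    def __init__(self, root, captions_file="captions.json", transform=None):
        self.root = Path(root)
        # Hypothetical JSON mapping, e.g. {"tile_0001": "Flat farmland bordered by ...", ...}
        with open(self.root / captions_file, encoding="utf-8") as f:
            self.captions = json.load(f)
        self.ids = sorted(self.captions.keys())
        self.transform = transform

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, idx):
        tile_id = self.ids[idx]
        image = Image.open(self.root / "images" / f"{tile_id}.png").convert("RGB")
        mask = Image.open(self.root / "masks" / f"{tile_id}.png")
        caption = self.captions[tile_id]
        if self.transform is not None:
            image, mask = self.transform(image, mask)
        return image, mask, caption
```

Such a wrapper would feed either a label-only segmentation model (ignoring the caption) or a vision-language model (using image, mask, and caption together), which is the comparison the abstract mentions.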
Status: open (until 19 Jun 2025)
RC1: 'Comment on essd-2025-184', Anonymous Referee #1, 19 May 2025
This paper proposes FarmSeg-VL, the first large-scale image-text benchmark dataset for farmland segmentation, which fills the gap left by the lack of high-quality multimodal farmland data in remote sensing. The research has significant innovation and application value, the experimental design is systematic, the results are analyzed in detail, and the data are open and transparent; it meets the journal's publication criteria. However, some of the methodological details, the scope of applicability, and the writing need further refinement.
- The specific application of the model in annotation, such as parameter settings and the proportion of manual correction, needs to be further explained. In addition, the improvement in annotation efficiency should be quantified, for example by comparing the time consumed with that of traditional manual annotation.
- It is suggested to add the selection basis for the 11 key elements, for example whether they have been validated by experts or are supported by the literature, in order to enhance the scientific rigor of the description framework.
- The current dataset mainly covers China. It is suggested that the authors discuss whether the dataset applies to other regions with significantly different climates or cropping patterns (e.g., Africa, Europe, and the United States). In addition, the authors should consider whether the dataset can be expanded to global coverage in the future.
- The text mentions that the data cover four seasons, but it is not clear whether the full growth cycle of different crops is covered. It is suggested that additional clarification be provided.
- It is suggested to add a performance comparison of the model on the training and test sets, or to analyze the influence of data distribution on model robustness through cross-validation.
- Tables 4-11 and Figures 9-16 clearly show the situation in each agricultural region, but they occupy considerable space; I suggest the authors move this part to the supplementary materials.
- Considering the wide international readership, I suggest the authors add some non-Chinese references.
Citation: https://doi.org/10.5194/essd-2025-184-RC1
RC2: 'Comment on essd-2025-184', Anonymous Referee #2, 02 Jun 2025
Understanding the spatiotemporal characteristics of farmland is essential for accurate farmland segmentation. This study introduces language-based descriptions of farmland and develops the FarmSeg-VL dataset, the first fine-grained image-text dataset for farmland segmentation. The proposed method is innovative and the dataset is highly accurate, so it has great potential as a standard benchmark for farmland segmentation. However, there are still some problems that should be addressed before publication.
(1) In the introduction, the authors mention that the label-driven paradigm has some disadvantages for farmland segmentation and that vision-language models (VLMs) can capture more contextual and background information from imagery. More information about the construction of the VLM could be included.
(2) The image-text dataset is also the core of the VLM; what is the difference from the traditional label-driven deep learning method?
(3) In Table 1, the abbreviations should be explained, such as SR for spatial resolution.
(4) In Section 2.2, the authors mention that most image-text datasets describe only scene-level or object-level characteristics rather than task-specific ones such as farmland segmentation. I wonder about the inherent differences among these textual descriptions; the authors could provide some detailed examples.
(5) China has a vast territory and different regions have different agricultural conditions. The construction of the FarmSeg-VL dataset and the related textual descriptions should take this into account.
(6) In Figure 2, the authors should provide an example of the 5 types of text characteristics for farmland segmentation. In future studies, more quantitative descriptions could be included.
(7) Tables 4-11 could be converted into figures to improve readability. What is the LoveDA dataset mentioned in Section 4.5?
(8) In the abstract and conclusion, the authors should add more information and quantitative results for the farmland segmentation experiments in this study.
(9) I wonder whether the authors added any negative samples for farmland segmentation.
The table shows the comparison between models trained solely on labels and VLMs. The deep learning models for these two paradigms are different, and their training samples are also different. How can these be compared fairly?
(10) There are still some grammatical and linguistic problems, and the authors should make a thorough revision; for example, "annotationand" in line 115.
Citation: https://doi.org/10.5194/essd-2025-184-RC2
Data sets
A large-scale image-text dataset benchmark for farmland segmentation Chao Tao, Dandan Zhong, Weiliang Mu, Zhuofei Du, and Haiyang Wu https://doi.org/10.5281/zenodo.15099885
Viewed
HTML | PDF | XML | Total | BibTeX | EndNote
---|---|---|---|---|---
289 | 76 | 9 | 374 | 7 | 5