Articles | Volume 17, issue 9
https://doi.org/10.5194/essd-17-4835-2025
Data description paper | 26 Sep 2025

A large-scale image–text dataset benchmark for farmland segmentation

Chao Tao, Dandan Zhong, Weiliang Mu, Zhuofei Du, and Haiyang Wu
Abstract

Understanding and mastering the spatiotemporal characteristics of farmland are essential for accurate farmland segmentation. The traditional deep learning paradigm, which relies solely on labeled data, has limitations in representing the spatial relationships between farmland elements and the surrounding environment and struggles to model the dynamic temporal evolution and spatial heterogeneity of farmland. Language, as a structured knowledge carrier, can explicitly express the spatiotemporal characteristics of farmland, such as its shape, distribution, and surrounding environmental information. A language-driven learning paradigm can therefore effectively alleviate the challenges posed by the spatiotemporal heterogeneity of farmland. However, in the field of remote sensing imagery of farmland, there is currently no comprehensive benchmark dataset to support this research direction. To fill this gap, we introduce language-based descriptions of farmland and develop the FarmSeg-VL dataset – the first fine-grained image–text dataset designed for spatiotemporal farmland segmentation. First, this article proposes a semi-automatic annotation method that can accurately assign captions to each image, ensuring high data quality and semantic richness while improving the efficiency of dataset construction. Second, FarmSeg-VL exhibits significant spatiotemporal characteristics. In the temporal dimension, it covers all four seasons; in the spatial dimension, it covers eight typical agricultural regions across China, with a total area of approximately 4300 km². In addition, its captions cover rich spatiotemporal characteristics of farmland, including its inherent properties, its phenological characteristics, its spatial distribution, its topographic and geomorphic features, and the distribution of surrounding environments. Finally, we analyze the performance of vision language models and of deep learning models that rely solely on labels, both trained on FarmSeg-VL. The vision language models outperform the label-only deep learning models by 10 %–20 %, demonstrating the dataset's potential as a standard benchmark for farmland segmentation. The FarmSeg-VL dataset will be publicly released at https://doi.org/10.5281/zenodo.15860191 (Tao et al., 2025).

1 Introduction

Farmland is the foundation of agricultural food security, and accurate farmland monitoring is crucial for implementing policies such as farmland improvement, enhanced supervision, and planning and control (Sishodia et al., 2020). Currently, the intelligent interpretation of remote sensing images of farmland based on deep learning has become a primary method for farmland monitoring (Li et al., 2023; Tu et al., 2024).

However, existing farmland remote sensing image segmentation methods mainly follow a label-driven deep learning paradigm, which faces significant bottlenecks in both data and models. In terms of datasets, although existing benchmark datasets have advanced farmland segmentation technology to some extent, they rely solely on a label-driven deep learning paradigm, which has two main limitations. First, a single label can only drive the model to learn shallow visual features of farmland and fails to reveal the underlying mechanisms driving the spatial distribution and temporal evolution of farmland. Second, labels struggle to represent the spatiotemporal heterogeneity of complex agricultural environments: the surface cover of farmland shifts seasonally among complete coverage, partial coverage, and no coverage with the crop growth cycle, while diverse terrain leads to significant geographical differentiation in the spatial distribution of farmland and in its associations with surrounding features such as waterbodies, buildings, and vegetation. Because existing datasets cannot represent this spatiotemporal heterogeneity, models find it difficult to establish the inherent relationships between farmland and its surrounding environment. In terms of models, although technologies such as convolutional neural networks (CNNs), graph convolutional networks (GCNs), and Transformers have significantly enhanced feature representation capabilities, the label-driven paradigm has inherent theoretical flaws. First, it relies excessively on visual cues and neglects the logical connections between farmland and its surrounding environment in complex farmland scenarios. Second, labels struggle to reflect the evolution of farmland across seasons and growth stages, severely limiting the model's generalization ability in spatiotemporally dynamic scenarios. Therefore, there is an urgent need to move beyond the theoretical framework of the traditional label-driven deep learning paradigm and explore a new paradigm capable of uncovering the deep semantic logic of farmland.

With the emergence of vision language models (VLMs) and their expanding applications across various fields, studies (Devlin et al., 2019; Wu et al., 2025b, a) have shown that language can reveal deeper semantic clues behind visual information. These VLMs typically follow a general construction process. First, feature representations are extracted from images through a visual encoder to capture the key visual cues in the images. For example, in the LLaVA model (Liu et al., 2023), the image representations generated by the fixed visual encoder lay the foundation for subsequent processing. Next, to establish a connection between vision and language, the model maps the extracted visual features into the representation space of the language model, enabling visual content to be understood and translated into natural language descriptions. LLaVA uses precisely this method, mapping image representations into the prompt space of a large language model so that the model can relate visual representations to linguistic expressions and perform downstream tasks efficiently. Furthermore, to enable the model to handle complex tasks, integrating visual perception with language understanding becomes a key step. LISA (Lai et al., 2023) is a typical example: it combines visual perception capabilities with in-depth language understanding, allowing it to perform reasoning-based tasks such as reasoning-driven segmentation. This multimodal information processing capability is one of the defining characteristics of VLMs, enabling them to consider visual context while understanding and generating language. These breakthroughs compensate for the shortcomings of models guided solely by labels in handling complex, spatiotemporally heterogeneous farmland scenes, making it possible to mine the complex semantic information in farmland remote sensing images and to model the deep logical relationships between farmland and its surroundings. Specifically, language can guide models to capture farmland features across multiple dimensions, including shapes and boundaries, phenological characteristics that reflect seasonal changes and crop growth states, spatial layout based on latitude and longitude, and geographical features such as terrain and landscape morphology. Language can also describe the relative positional relationships between farmland and surrounding features such as waterbodies, buildings, and vegetation. By integrating these rich semantic cues, VLMs can better understand and interpret the complexity of farmland.

However, in remote sensing, many existing image–text datasets struggle to provide detailed captions and precise annotations for specific land features such as farmland. As a result, they often fall short of the requirements for high-accuracy farmland segmentation. For example, the first large-scale remote sensing image–text pair dataset, RS5M (Zhang et al., 2024), and the SkyScript dataset (Wang et al., 2024), which contains millions of image–text pairs, provide only rough descriptions of farmland despite their scale and fail to describe the specific characteristics of farmland in depth. In addition, although the manually annotated RSICap dataset (Hu et al., 2025) provides scene-level semantic descriptions, it lacks a refined depiction of the characteristics of farmland itself, making it difficult to meet the model's need for deep semantic information about farmland. In contrast to the methods mentioned above, ChatEarthNet (Yuan et al., 2025) seeks to enrich the semantic captions for land cover types by employing detailed prompt strategies with ChatGPT and leveraging semantic segmentation labels from the WorldCover project. However, due to the inherent randomness of automatically generated captions, these captions tend to emphasize the spatial location of farmland within the image while often lacking detailed information about its inherent attributes. Although these datasets have contributed significantly to advancing image–text understanding in remote sensing, most focus on general remote sensing tasks, with only a small portion dedicated to farmland captions. Moreover, these captions are often neither comprehensive nor in-depth. Existing datasets do not fully reflect the complexity of farmland and its changing characteristics over time and space. This is particularly evident in high-precision farmland segmentation tasks, where deep analysis of farmland characteristics and their behavior in different scenarios is lacking.

Table 1. Detailed information on non-image–text datasets of farmland.

The symbol “/” indicates unavailable.

To address the above issues, this paper constructs the FarmSeg-VL dataset, a dedicated image–text dataset focused on farmland segmentation, which fully reflects the spatiotemporal characteristics of farmland. FarmSeg-VL covers eight typical agricultural regions in China and includes data samples from four seasons, filling the gap of spatial and temporal imbalance in existing datasets. With its extensive geographical coverage and seasonal variations, this dataset ensures effective support for the learning of various forms of farmland.

The contributions of this paper are as follows:

  1. This study constructed the first farmland image–text benchmark dataset, filling the gap in remote sensing image–text datasets for the farmland-dedicated domain. This dataset includes various types of farmland and covers a wide spatial and temporal range, providing a high-value data foundation for the application research of vision language models in the field of farmland segmentation.

  2. We summarize 11 key elements for describing farmland's inherent properties and its surrounding environment, offering a comprehensive framework for characterizing farmland from multiple perspectives. Additionally, a text template for describing farmland images was designed, providing an important reference for constructing a language dataset focused on farmland.

  3. This study developed a semi-automated annotation method based on the caption templates constructed in this paper. We utilize this semi-automated annotation approach to generate masks and rich captions, significantly reducing labor time while enhancing the authenticity and reliability of the annotations.

  4. Extensive experiments have demonstrated that the model trained on the image–text farmland dataset proposed in this paper improves significantly in terms of farmland segmentation performance and exhibits strong transferability, providing a performance baseline for vision language models in farmland segmentation.

2 Review of existing remote sensing datasets for farmland segmentation

2.1 Non-image–text dataset

Traditional remote sensing datasets for farmland segmentation are mainly annotated with a single label and can be divided into two categories: dedicated datasets and non-dedicated datasets. Detailed information is provided in Table 1. Non-dedicated datasets, such as the scene-level dataset BigEarthNet (Sumbul et al., 2019), are not well suited for pixel-level farmland segmentation. Pixel-level datasets, such as WorldCover (ESA) (Zanaga et al., 2022), DynamicWorld (DyWorld) (Brown et al., 2022), and LandCover (Karra et al., 2021), primarily focus on large-scale mapping and macro-level analysis, making them less suitable for fine-grained farmland segmentation. Moreover, Evlab-SS (Wang et al., 2017) focuses on pixel-level classification, but the proportion of farmland pixels is relatively low, and it remains limited in terms of data scale and coverage area. Although GID (Tong et al., 2020), DeepGlobe-LandCover (Demir et al., 2018), and LoveDA (Wang et al., 2022a) cover large farmland areas with relatively high pixel proportions, the farmland samples lack diversity. For example, the farmland forms in DeepGlobe-LandCover and LoveDA are mostly regular and contiguous. While these non-dedicated datasets provide large amounts of data for farmland segmentation, their annotations are relatively coarse. In particular, for pixel-level farmland segmentation, they struggle to fully cover the complex shapes; distribution patterns; and finer details, such as crop growth stages.

In contrast, dedicated datasets such as GFSAD30 (Phalke and Özdoğan, 2018), WEIMIN (Hou et al., 2023), VACD (Li et al., 2024), and FGFD (Li et al., 2025) are specifically designed for farmland segmentation. These datasets offer high-precision farmland annotation and cover a broader range of farmland forms, crop distributions, and other relevant information. The GFSAD30 dataset has a spatial resolution of 30 m, making it suitable for large-scale farmland monitoring but not for fine-grained farmland segmentation. By contrast, WEIMIN and VACD offer higher resolutions; however, since WEIMIN only covers Hebei and VACD only covers Guangdong in China, the diversity of farmland samples is limited. The Fine-Grained Farmland Dataset (FGFD) includes farmland samples from multiple geographic regions. However, it does not account for the phenological characteristics of farmland, limiting its ability to capture seasonal variations and crop growth stages. Although these dedicated datasets offer high annotation accuracy and support fine-grained regional monitoring, their reliance solely on labels to represent farmland's visual characteristics across different spatiotemporal conditions overlooks its inherent complexity and diversity. As a result, they struggle to capture the subtle differences and dynamic changes in farmland driven by seasonal variations and environmental factors.

Table 2. Detailed information on the image–text datasets.

2.2 Image–text datasets

Existing remote sensing image–text paired datasets, such as UCM-Captions (Qu et al., 2016), RSICD (Lu et al., 2018), RS5M, NWPU-Captions (Cheng et al., 2022), RSICap, SkyScript, and ChatEarthNet, have been widely used in remote sensing research (see Table 2, where CGM denotes caption generation method). However, these datasets are primarily designed for tasks such as image captioning, scene classification, or image–text retrieval, with limited applicability to farmland segmentation. This limitation stems from their insufficient in-depth semantic representations of farmland morphological characteristics, spatial distribution patterns, and contextual relationships with surrounding features. Consequently, these datasets cannot meet the requirements of the fine-grained semantic understanding that is essential for high-precision farmland segmentation.

Specifically, most of these datasets focus on high-level descriptions of images, such as scene-level or object-level characteristics, rather than the detailed semantic annotations needed for fine-grained tasks like farmland segmentation. For example, in SkyScript, the image caption “land use of farmland” provides only broad classification information without offering specific details about farmland characteristics, such as shape, boundaries, crop growth stages, or surrounding environmental features. Similarly, the RS5M dataset provides only brief titles for images, primarily indicating the image source and land cover categories, without offering detailed descriptions of farmland. Additionally, while some datasets use automated methods to generate large-scale image–text pairs, these automatically generated datasets often suffer from inconsistent quality. The generated text frequently lacks detail and contains redundant information, reducing its effectiveness for fine-grained farmland analysis. For example, in ChatEarthNet, image captions divide each image into four sections, namely top, bottom, left, and right, focusing on the proportions of primary and secondary land cover types in each section rather than providing a dedicated description of farmland. Manually annotated datasets, such as UCM-Captions, RSICD, and NWPU-Captions, provide five captions per farmland image. However, these descriptions are often repetitive and lack specificity. For example, in UCM-Captions, farmland is described simply as “There is a piece of farmland”, while the remaining four descriptions merely rephrase this sentence without adding meaningful details. In RSICD, captions are limited to color and location, such as “green” or “between two forests.” NWPU-Captions expands on this slightly by incorporating shape descriptions, like “rectangular”, but it still lacks deeper insights into farmland characteristics. Although RSICap includes descriptions related to image quality, its farmland annotations remain focused on landscape features and surrounding environments, overlooking inherent farmland attributes. This limited descriptive approach fails to capture farmland's spatiotemporal complexity, making it hard to achieve precise farmland semantic segmentation.

Although these image–text datasets have achieved certain results in large-scale pre-training tasks, their application in the semantic segmentation of farmland remote sensing images is greatly limited due to the lack of pixel-level annotation for semantic segmentation and in-depth description of specific tasks such as farmland segmentation. Therefore, to better support farmland segmentation, the dataset needs to be enhanced by including more fine-grained semantic annotations and comprehensively covering the complex features of farmland.

Figure 1. Dataset construction.

3 FarmSeg-VL: a large-scale image–text dataset benchmark for farmland segmentation

3.1 Construction of FarmSeg-VL

The construction process of FarmSeg-VL is shown in Fig. 1 and is mainly divided into three parts: remote sensing image acquisition and processing, caption construction, and semi-automatic annotation. In part 1, we collected high-resolution images (with a resolution of 0.5–2 m) from various typical agricultural regions in China across four seasons to ensure that the dataset covers farmland with diverse spatiotemporal features. In part 2, we synthesized the spatiotemporal characteristics of farmland and summarized 11 key factors related to its inherent properties and the distribution of surrounding environments. These factors were then used to generate detailed captions, covering aspects such as farmland shape; terrain; sowing situation; and the distribution of surrounding waterbodies, vegetation, and buildings. In part 3, a semi-automated annotation method was employed to generate a binary mask and a caption for each remote sensing image sample, thus completing the dataset construction.

Figure 2. Attribute annotation and spatiotemporal distribution of FarmSeg-VL.

The FarmSeg-VL dataset, as shown in Fig. 2, consists of three key components: image, mask, and text. Specifically, FarmSeg-VL includes image data from eight major agricultural regions across four seasons, and the image features include diversity under different imaging conditions. The caption focuses on five attributes of farmland remote sensing images, with a total of 11 key features: inherent properties (such as shape and boundary pattern), phenological characteristics (such as season and sowing situation), spatial distribution (such as distribution and geographic location information), topographic and geomorphic features (such as terrain and landscape), and distribution of surrounding environments (such as buildings, waterbodies, and vegetation).

3.1.1 (1) RS image acquisition and processing

China's vast territory, diverse landforms, and complex climate result in significant regional variations in agricultural conditions, leading to highly heterogeneous texture features and distribution patterns of farmland in remote sensing imagery. As a result, farmland exhibits significant spatiotemporal dynamics and fragmented distribution characteristics, presenting diverse spatial patterns due to these regional differences. For example, the land in the Northeast China Plain is flat and fertile, and its farmland is characterized by a concentrated distribution and regular shapes, while the Yungui Plateau has complex terrain and a diverse climate, and its farmland is characterized by a dispersed distribution and fragmented shapes. The farmland appearance and characteristics of these agricultural areas are distinctive, posing different challenges and opportunities for farmland segmentation. This study selected representative agricultural regions based on the spatial distribution and morphological characteristics of farmland. Specifically, based on the spatial aggregation and morphological regularity of farmland, the Northeast China Plain and the Huang-Huai-Hai Plain were selected as typical regions characterized by concentrated and regular-shaped farmland. For areas with a sloped farmland distribution, the northern arid and semi-arid region and the Loess Plateau were chosen as study areas. At the same time, given distinctive farmland morphologies such as narrow and elongated, striped, and sporadic and fragmented forms, the South China areas, the Sichuan basin, the Yungui Plateau, and the Yangtze River Middle and Lower Reaches Plain were selected as research areas. The study covers 13 provincial-level administrative regions, including Heilongjiang, Jilin, Ningxia, Hebei, Henan, Shandong, Shaanxi, Anhui, Hunan, Jiangsu, Guangdong, Sichuan, and Yunnan. These regions provide broad spatial coverage, highlight distinct regional characteristics, and are highly representative and typical of China's diverse agricultural landscapes.

Figure 3. Demonstration of the diversity of data samples. (a) Farmland samples from different agricultural regions. (b) Farmland samples with different shapes. (c) Farmland samples with varying distribution patterns.

The data sample diversity is shown in Fig. 3. Specifically, we utilized Bigemap software to acquire high-resolution Google satellite imagery covering China, including the eight major agricultural regions previously mentioned. The spatial resolution of the images ranges from 0.5 to 2 m, and the software also provides the shooting time of each image. The total coverage area is approximately 4300 km², ensuring that the dataset spans a broad geographic region and reflects the diverse characteristics of farmland. The images underwent a series of detailed pre-processing steps, including calibration and cropping. During image calibration, we corrected geometric distortions caused by the shooting angle and Earth's curvature, ensuring spatial consistency across all images. In the cropping process, irrelevant areas were removed, focusing solely on extracting farmland regions. Additionally, to enhance the dataset's quality, we manually filtered out images affected by cloud or fog cover, stitching artifacts, or overall poor quality, ensuring that only high-quality samples remained for analysis. To achieve an optimal balance between retaining the detailed features of high-resolution images and improving the efficiency of model training, this study adopted a standardized pre-processing procedure: all images that passed the quality screening were uniformly normalized, and a standardized cropping strategy of 512 pixels × 512 pixels was applied. This size was selected for two reasons. First, to preserve spatial resolution and detail, the 512 × 512 cropping unit effectively balances the complete expression of local ground features (such as farmland boundaries and vegetation textures) against the efficient allocation of computing resources. Second, to preserve the integrity of spectral information, the cropped images strictly retain the three visible-light bands – red, green, and blue – ensuring the effective transmission of spectral features to the model. This normalization scheme significantly improves the efficiency of batch data processing during model training by unifying the input data dimensions while avoiding the feature-learning bias caused by image size differences. After these pre-processing steps, a total of 22 605 image samples were selected. These samples span various seasons, regions, and cropping statuses and feature diverse farmland distributions and shapes, ensuring the comprehensiveness and diversity of the dataset. This provides a rich and varied training dataset for subsequent farmland segmentation.
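As an illustration of this tiling step, the following minimal Python sketch crops a quality-screened scene into non-overlapping 512 pixel × 512 pixel RGB tiles. The file paths, naming scheme, and use of Pillow are assumptions for illustration; the authors' actual pre-processing code may differ.

```python
# Minimal sketch of the 512 x 512 RGB tiling step (illustrative, not the authors' code).
from pathlib import Path
from PIL import Image

TILE = 512  # crop unit balancing local detail against compute cost

def tile_image(src_path: Path, out_dir: Path) -> int:
    """Crop one scene into non-overlapping 512 x 512 RGB tiles; returns the tile count."""
    out_dir.mkdir(parents=True, exist_ok=True)
    img = Image.open(src_path).convert("RGB")  # keep only the red, green, and blue bands
    width, height = img.size
    count = 0
    for top in range(0, height - TILE + 1, TILE):
        for left in range(0, width - TILE + 1, TILE):
            img.crop((left, top, left + TILE, top + TILE)).save(
                out_dir / f"{src_path.stem}_r{top}_c{left}.png")
            count += 1
    return count
```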

3.1.2 (2) Caption construction

For the caption construction of each farmland sample, this study summarizes 11 key elements for describing farmland: shape, boundary morphology, shooting time, sowing conditions, the macro-level distribution of farmland, geographic location information, topographical features, landscape, and the distribution of buildings, waterbodies, and vegetation. The spatiotemporal characteristics of farmland result from the interaction of multiple factors (Wang et al., 2022b). Temporally, variations in crop growth stages lead to distinct visual texture differences in farmland across seasons (Zhu et al., 2022). Spatially, farmland exhibits significant spatial differentiation: different regions are affected by factors such as topography, terrain, and water–heat conditions, resulting in noticeable variations in farmland morphology and layout (Pan and Zhang, 2022). Therefore, this study considers the issue at multiple spatial scales. At the macro-regional scale, typical farmland images were collected from various agricultural regions across China. These regions are not only located at different latitudes and longitudes but also have different terrain and topography. For instance, farmland in the Northeast China Plain is flat and typically follows a concentrated distribution pattern with regular shapes, which is reflected in descriptions such as “the farmland primarily exhibits a concentrated contiguous distribution” and “the shape of the farmland is characterized as blocky.” In contrast, the terrain of South China is predominantly hilly and mountainous, leading to a more dispersed farmland distribution and irregular shapes, which is described in the text as “with the farmland primarily in a dispersed distribution” and “the terrain is undulating.” Similarly, farmland in regions like the Loess Plateau and the arid and semi-arid northern areas often displays terraced or sloping patterns. At the same time, the spatial coupling relationships between farmland and surrounding features, such as waterbodies and buildings, are key factors influencing the distribution and accuracy of farmland identification (Duan et al., 2022; Zheng et al., 2022). The relationship between farmland and surrounding environmental features is expressed, for example, as “the waterbodies surrounding the farmland mainly consist of scattered blocky ponds” and “the vegetation around the farmland mainly consists of scattered trees and scattered forests.” Finally, since farmland segmentation relies on boundary and texture information, the shape of the farmland and its boundary morphology are also crucial for accurate identification (Xie et al., 2023).

Figure 4. Farmland description keywords.

In summary, as shown in Fig. 4, this study categorizes farmland-related attributes into five major aspects: inherent properties, phenological characteristics, spatial distribution, topographic and geomorphic features, and distribution of surrounding environments. The inherent properties include the shape of the farmland and its boundary patterns. Phenological characteristics encompass the season and the sowing situation of the farmland. The spatial distribution of farmland not only reflects geographic location information but also includes the macro-level distribution of farmland in the image, such as a concentrated contiguous distribution or a dispersed distribution. Farmland shape is an intuitive and important feature in visual interpretation, closely related to factors such as terrain, topography, and landscape, with typical forms including blocky, striped, or broken. The farmland boundary pattern refers to the spatial shape characteristics of the farmland boundary, primarily whether its contour lines are relatively straight or exhibit a curved form.
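To make the template idea concrete, the sketch below assembles a caption from the 11 keyword slots. The template wording and slot names are illustrative approximations of the quoted phrases above, not the exact template shipped with FarmSeg-VL.

```python
# Illustrative caption assembly over the 11 descriptive elements (hypothetical template).
TEMPLATE = (
    "The image was taken in {month} at approximately {location}. "
    "The farmland primarily exhibits a {distribution} distribution, and the shape "
    "of the farmland is characterized as {shape} with {boundary} boundaries. "
    "The terrain is {terrain}, and the landscape is {landscape}. "
    "The farmland is {sowing}. "
    "The buildings around the farmland are {buildings}; the waterbodies surrounding "
    "the farmland mainly consist of {waterbodies}; and the vegetation around the "
    "farmland mainly consists of {vegetation}."
)

keywords = {  # keywords an annotator would select in labelme (example values)
    "month": "July", "location": "45.2° N, 126.5° E",
    "distribution": "concentrated contiguous", "shape": "blocky",
    "boundary": "relatively straight", "terrain": "flat", "landscape": "plain",
    "sowing": "fully covered by growing crops",
    "buildings": "sparsely scattered",
    "waterbodies": "scattered blocky ponds",
    "vegetation": "scattered trees and scattered forests",
}

caption = TEMPLATE.format(**keywords)
print(caption)
```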

3.1.3 (3) Semi-automated annotation

Currently, there are two main approaches for constructing remote sensing image–text datasets: one involves automatically generating textual annotations using large language models, while the other relies on manual visual annotation by humans. However, both methods face significant challenges in meeting the high-precision requirements of farmland segmentation. Relying solely on automatic annotations generated by large language models has clear limitations. This approach often struggles to capture the nuanced and accurate correspondence between images and text. The granularity of captions is often insufficient, resulting in suboptimal accuracy and completeness in the annotation process. While manual annotation can ensure high-quality data, it has significant drawbacks. This approach requires domain experts to invest substantial time and effort, draining valuable resources and leading to extremely low efficiency. To address these challenges, this study proposes and develops a semi-automatic farmland image–text annotation framework. It is important to highlight that this semi-automatic annotation framework differs from previous methods. In addition to enabling text annotation, it also generates high-quality masks, offering more effective data support for farmland segmentation.

Figure 5. Farmland semi-automated annotation framework.

The semi-automated annotation framework is illustrated in Fig. 5. Based on keywords related to farmland descriptions, this study first developed a set of farmland caption templates, providing a standardized reference for annotating image samples. To enable semi-automatic text annotation, the constructed caption templates and corresponding keywords were integrated into the open-source annotation software labelme. In this way, when annotating farmland remote sensing images, semi-automatic text annotation can be completed by visually inspecting the images and selecting the appropriate summarized keywords. In particular, the shooting month and the longitude and latitude of each farmland remote sensing image are automatically extracted from the original data. In addition, due to the limitation of the cropping size, some images may not contain any land cover categories other than farmland. Therefore, when annotating the surrounding environmental attributes with the semi-automated framework, this study requires that the presence of the relevant land cover types be verified first to ensure the accuracy of the captions. Finally, to obtain high-quality farmland masks quickly and accurately, this paper integrates the Segment Anything Model (SAM) into labelme and performs semi-automatic mask annotation to obtain the image labels. Through semi-automatic annotation, humans only need to correct and verify part of the results, which significantly reduces labor and time costs compared to traditional fully manual annotation. At the same time, the semi-automated process combines the consistency of algorithms with the precision of manual verification, effectively minimizing the subjective errors that can occur in manual annotation and thereby enhancing the accuracy and reliability of the labels.
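The mask half of the pipeline can be approximated as follows, assuming the official segment-anything package and a downloaded ViT-H checkpoint. In the actual framework the prompt points come from annotator clicks inside labelme; here a single hard-coded point stands in for that interaction.

```python
# Sketch of SAM-assisted mask proposal for one tile (annotator clicks replaced by a fixed point).
import numpy as np
from PIL import Image
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = np.array(Image.open("farmland_tile.png").convert("RGB"))
predictor.set_image(image)

masks, scores, _ = predictor.predict(
    point_coords=np.array([[256, 256]]),  # one positive click inside a parcel
    point_labels=np.array([1]),           # 1 = foreground (farmland)
    multimask_output=True,
)
candidate = masks[np.argmax(scores)]  # binary mask handed to a human for verification
```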

3.2 The spatiotemporal-characteristic analysis of FarmSeg-VL based on multidimensional statistics

FarmSeg-VL, as the first large-scale farmland image–text dataset covering multiple regions and seasons in China, is valuable for reflecting the dynamic characteristics of geographical zoning differences, crop growth cycle variations, and tillage practices. This section uses multidimensional statistical methods to analyze the ability of FarmSeg-VL to collaboratively represent spatial breadth and temporal continuity, providing a theoretical basis for evaluating its applicability in cross-regional and cross-seasonal farmland segmentation.

Figure 6. Diversity of data samples. (a) Sample distribution ratio across different agricultural regions. (b) Sample distribution ratio for different seasons in each agricultural region. (c) Sample distribution ratio based on different farmland distribution patterns in each agricultural region. (d) Sample distribution ratio for different farmland shapes in each agricultural region. A represents the Northeast China Plain, B represents the northern arid and semi-arid region, C represents the Huang-Huai-Hai Plain, D represents the Loess Plateau, E represents the Yangtze River Middle and Lower Reaches Plain, F represents the South China areas, G represents the Sichuan basin, and H represents the Yungui Plateau.

Figure 6 reveals the spatiotemporal characteristics of FarmSeg-VL from both a spatial and a temporal perspective. In the spatial dimension, the sample distribution of agricultural areas in Fig. 6a shows that FarmSeg-VL fully covers eight agricultural areas, ranging from the Northeast China Plain to the southwestern mountains. Notably, the sample count in the Yangtze River Middle and Lower Reaches Plain is significantly higher than in other regions, accurately reflecting the geographical characteristics of the area, which is marked by a high degree of farmland fragmentation and notable terrain complexity. In the temporal dimension, the seasonal distribution in Fig. 6b shows that samples in the northern agricultural regions are concentrated in summer and fall, while the southern agricultural regions exhibit a more balanced distribution throughout the year. This pattern closely aligns with the differences in crop growth cycles driven by latitude gradients in China. In addition, Fig. 6c and d illustrate the distribution patterns and shape characteristics of farmland across the eight agricultural regions, highlighting the variations between them. Among these, the Yangtze River Middle and Lower Reaches Plain exhibits the greatest diversity, featuring four distinct distribution patterns and six different shape characteristics of farmland. In the Northeast China Plain and the Huang-Huai-Hai Plain, farmland is primarily distributed in concentrated areas, with a predominantly blocky form. In the other agricultural regions, there is a clear correlation between the distribution patterns and the shape characteristics of farmland. The diversity and richness of farmland samples across different agricultural regions fully reflect the spatiotemporal variability captured by FarmSeg-VL, underscoring its advantages in farmland segmentation.

Figure 7. Word cloud of farmland captions.

To further reveal the spatiotemporal characteristics of FarmSeg-VL, we extracted keywords from its captions and generated a word cloud of farmland-related attributes, shown in Fig. 7. High-frequency spatial attributes (e.g., latitude and longitude) show strong semantic associations with temporal attributes (e.g., January), indicating that the captions in the FarmSeg-VL dataset effectively link temporal and spatial concepts. The spatial differentiation of morphological descriptors such as concentrated contiguous and dispersed aligns closely with the statistical results shown in Fig. 6c and d, indicating that the text annotations effectively reflect and convey the geographical patterns of farmland morphology. Notably, the prominent presence of non-farmland attributes such as ponds and forests among the keywords suggests that FarmSeg-VL not only reflects the characteristics of farmland itself but also emphasizes the logical connections between farmland and its surrounding environment. In summary, the composite captions in FarmSeg-VL, at both temporal and spatial levels, not only reflect the fundamental characteristics of farmland but also reveal the external driving factors behind its spatiotemporal evolution.
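For reference, a word cloud like Fig. 7 can be reproduced in a few lines, assuming the captions have been exported to a plain-text file (the paths below are hypothetical):

```python
# Minimal word-cloud sketch over exported captions (file paths are hypothetical).
from wordcloud import WordCloud

with open("farmseg_vl_captions.txt", encoding="utf-8") as f:
    text = f.read()

cloud = WordCloud(width=1200, height=600, background_color="white").generate(text)
cloud.to_file("farmland_caption_wordcloud.png")
```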

3.3 Why is FarmSeg-VL more suitable as a dataset benchmark for farmland segmentation?

Comprehensive spatiotemporal coverage with rich seasonal and regional diversity. FarmSeg-VL offers extensive coverage across both temporal and spatial dimensions, spanning all four seasons – spring, summer, fall, and winter – while also including eight typical agricultural regions of China. The dataset reflects the seasonal differences in agricultural landscapes, as well as the unique geographic features of each region, such as variations in farmland characteristics and surrounding environments. These factors enhance the diversity of the dataset.

Rich semantic captions capturing comprehensive farmland attributes. Unlike traditional datasets with simple image annotations, FarmSeg-VL incorporates detailed language captions summarizing the spatiotemporal features of farmland images. Specifically, it covers 11 key descriptive points, including farmland-inherent properties, phenological characteristics, spatial distribution, topographic and geomorphic features, and the distribution of surroundings. The rich semantic captions significantly enhance the model's accuracy in farmland segmentation.

Comprehensive seasonal–regional coverage enhances model robustness. Seasonal and climatic variations significantly influence farmland morphology and distribution. Unlike traditional datasets, which typically focus on a single season and limit model adaptability, FarmSeg-VL spans all four seasons, enabling models to better capture seasonal dynamics and varying crop growth conditions. Additionally, FarmSeg-VL covers diverse agricultural regions across China, reflecting distinct differences in farmland characteristics due to climate and geographic variation. The dataset's extensive seasonal and regional coverage enhances the model's robustness, ensuring accurate and efficient farmland segmentation under diverse seasonal and climatic conditions.

4 Experiments

This section outlines the experimental setup in Sect. 4.1. Section 4.2 evaluates the effectiveness of FarmSeg-VL for farmland segmentation by comparing a model fine-tuned on FarmSeg-VL with a vision language model (VLM) trained on a general image–text dataset. This comparison aims to verify whether a dedicated farmland image–text dataset can enhance model performance in farmland segmentation. In Sect. 4.3, we assess segmentation performance across different agricultural regions, comparing VLMs trained on FarmSeg-VL with deep learning models that rely solely on labels, including U-Net, DeepLabV3, FCN, and SegFormer. We also analyze the generalization capability of models trained on FarmSeg-VL in diverse agricultural landscapes and their adaptability to spatiotemporal heterogeneity. Section 4.4 investigates the transferability of VLMs trained on FarmSeg-VL through comparative experiments with traditional models on public datasets, evaluating their cross-dataset generalization and cross-domain potential. Finally, Sect. 4.5 compares FarmSeg-VL with existing farmland datasets in the context of farmland segmentation applications.

4.1 Experimental setup

Dataset partitioning. To avoid the influence of sample similarity between the training, testing, and validation sets on the reliable evaluation of the model's generalization ability and domain transferability, this paper selects samples from different agricultural regions for each set. This approach helps reduce spatial homogeneity and ensures a more robust assessment of the model's performance. The dataset is divided into training, validation, and test sets in a 7:2:1 ratio. Specifically, the training set comprises 15 821 samples, the validation set contains 4512 samples, and the test set includes 2272 samples. The distribution of test set samples across different agricultural regions is as follows: 363 samples from the Northeast China Plain, 531 samples from the Huang-Huai-Hai Plain, 146 samples from the northern arid and semi-arid region, 16 samples from the Loess Plateau, 587 samples from the Yangtze River Middle and Lower Reaches Plain, 152 samples from South China, 156 samples from the Sichuan basin, and 171 samples from the Yungui Plateau.
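One plausible reading of this partitioning is a per-region 7 : 2 : 1 split, sketched below; the exact grouping logic used by the authors may differ, and the "region" field is an assumed sample attribute.

```python
# Sketch of a region-aware 7:2:1 split (assumes each sample carries a "region" tag).
import random
from collections import defaultdict

def split_by_region(samples, ratios=(0.7, 0.2, 0.1), seed=0):
    rng = random.Random(seed)
    by_region = defaultdict(list)
    for sample in samples:
        by_region[sample["region"]].append(sample)
    train, val, test = [], [], []
    for group in by_region.values():
        rng.shuffle(group)
        n_train = int(ratios[0] * len(group))
        n_val = int(ratios[1] * len(group))
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]  # remainder (about 10 %) goes to the test set
    return train, val, test
```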

Table 3. Comparison of fine-tuning results of the FarmSeg-VL dataset.

The symbol “/” indicates that testing cannot be conducted, resulting in a null value.

Evaluation metrics. To assess model performance, this study uses four widely adopted metrics in farmland segmentation: mean accuracy (mACC), mean intersection over union (mIoU), mean dice coefficient (mDice), and recall. Specifically, mACC represents the average pixel classification accuracy across all categories, while mIoU quantifies the mean ratio of intersection over union, a standard metric in semantic segmentation. mDice measures the similarity between predicted and ground-truth segmentation results, and recall evaluates the proportion of correctly identified positive samples, reflecting the model's ability to capture relevant farmland regions.
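Under the usual binary (farmland vs. background) formulation, these four metrics can be computed from per-class confusion counts as sketched below; this is a generic implementation for clarity, not the authors' evaluation code.

```python
# Generic metric sketch for binary farmland segmentation (not the authors' code).
import numpy as np

def segmentation_metrics(pred: np.ndarray, gt: np.ndarray, num_classes: int = 2):
    accs, ious, dices = [], [], []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        accs.append(tp / max(tp + fn, 1))                # per-class pixel accuracy
        ious.append(tp / max(tp + fp + fn, 1))           # per-class IoU
        dices.append(2 * tp / max(2 * tp + fp + fn, 1))  # per-class Dice
    return {
        "mACC": float(np.mean(accs)),
        "mIoU": float(np.mean(ious)),
        "mDice": float(np.mean(dices)),
        "recall": float(accs[1]),  # recall of the positive (farmland) class
    }
```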

4.2 Fine tuning general VLMs with FarmSeg-VL: bridging domain gaps and enhancing semantic comprehension for farmland segmentation

In order to verify the advantages of models trained on FarmSeg-VL for farmland segmentation over models trained on general image–text datasets, this study systematically evaluates the impact of FarmSeg-VL-based fine tuning on farmland segmentation accuracy across three mainstream vision language segmentation models: LISA (Lai et al., 2023), PixelLM (Ren et al., 2024), and LaSagnA (Wei et al., 2024). Among them, LISA is a model that integrates a large language model (LLM) with segmentation mask generation capabilities, enabling reasoning-driven segmentation based on complex textual prompts. LaSagnA extends LISA's architecture by adopting a unified sequence format to handle more complex queries while enhancing perceptual ability through the incorporation of semantic segmentation; this design demonstrates superior performance in processing intricate prompts and improving reasoning capability. PixelLM, in contrast, is a multimodal model specialized for pixel-level reasoning. It addresses the challenge of generating pixel-wise masks for multiple objects by introducing a lightweight pixel decoder and a segmentation codebook, which improves both efficiency and granularity in segmentation tasks.

Figure 8. Visualization of partial experimental results fine-tuned based on the FarmSeg-VL dataset. (a) Original image. (b) Ground truth. (c) Test results without fine tuning. (d) Test results after fine tuning.

The experimental results are shown in Table 3. Clearly, after fine tuning with FarmSeg-VL, model performance in farmland segmentation improves significantly, by nearly 30 % to 40 %. Specifically, across all methods, the fine-tuned models consistently achieve higher mIoU scores than their non-fine-tuned counterparts, highlighting the effectiveness of FarmSeg-VL in improving segmentation accuracy. This result demonstrates that fine tuning significantly enhances the model's ability to capture and accurately segment relevant features. Notably, the PixelLM model produces no results in its non-fine-tuned state, as it was not exposed to farmland-related semantic information during pre-training and is therefore incapable of generating effective predictions without fine tuning. However, after being trained on FarmSeg-VL, PixelLM becomes capable of accurately predicting farmland, with performance approaching that of the other two VLMs. This further underscores the importance of fine tuning with a domain-dedicated dataset to enhance model performance on specialized tasks. To analyze the experimental results more intuitively, this study visualized the segmentation outcomes. As shown in Fig. 8, models that have not undergone fine tuning tend to misclassify large areas of buildings and forests as farmland. This suggests that non-fine-tuned models struggle to accurately capture the inherent properties of farmland, leading to high uncertainty and significant errors in segmentation results, as well as a lack of stability and consistency.

In summary, FarmSeg-VL offers more precise domain-dedicated knowledge for farmland segmentation, allowing models to better capture fine-grained features of farmland. Specifically, FarmSeg-VL contains high-quality farmland annotations that cover multiple semantic dimensions, such as farmland shape, distribution, and sowing situation. This comprehensive information significantly improves the model's ability to understand and segment farmland features with greater accuracy. Compared to general datasets, FarmSeg-VL effectively reduces cross-domain discrepancies, allowing the model to focus on farmland features, thereby further enhancing the accuracy of farmland segmentation.

4.3 Comparing model performance trained on FarmSeg-VL in different agricultural regions

To explore the application effect of models trained on FarmSeg-VL in different agricultural regions, this section divides the test set by agricultural region, including the Northeast China Plain, the Huang-Huai-Hai Plain, the northern arid and semi-arid region, the Loess Plateau, the Yangtze River Middle and Lower Reaches Plain, South China, the Sichuan basin, and the Yungui Plateau. These regions are tested using both the vision language models (PixelLM, LaSagnA, LISA) and the deep learning models that rely solely on labels (U-Net, DeepLabV3, FCN, SegFormer). Notably, the models that rely solely on labels do not incorporate any language modality; they are trained and tested exclusively on original farmland images and ground truth.

Figures C1 to C8 in the Appendix display the testing accuracy of the model in different agricultural regions. From the overall results, both the deep learning models that rely solely on labels and the VLMs demonstrated strong testing accuracy in the agricultural regions of the Northeast China Plain and the Huang-Huai-Hai Plain. However, in the remaining six agricultural regions, the performance differences between the two model types became more pronounced. The primary reason for these differences lies in the varying complexity of the spatial structure of farmland across different agricultural regions. In the Northeast China Plain and Huang-Huai-Hai Plain, the terrain is relatively flat, and the farmland is distributed in a more regular and contiguous manner. As a result, both models exhibit strong segmentation performance in these relatively simple scenarios. In other agricultural regions, particularly in the South China areas, the farmland generally exhibits scattered and fragmented characteristics. Additionally, it shares a high degree of textural similarity with surrounding non-farmland features, such as forests and waterbodies, which makes it difficult for the model to segment farmland. By incorporating language, VLMs can effectively comprehend the spatial distribution of farmland and its surrounding environment, thereby alleviating the segmentation challenges caused by spatial differentiation and demonstrating advantages in these different agricultural regions.

Table 4. Farmland segmentation results of different methods based on FGFD. Bold numerical values indicate the best values among the evaluation metrics.

Table 5. Farmland segmentation results of different methods based on LoveDA. Bold numerical values indicate the best values among the evaluation metrics.

To visually illustrate the performance differences among the models in farmland segmentation tasks, Figs. C9 to C16 present the segmentation results for each agricultural region. It can be observed that, in agricultural regions such as the Northeast China Plain and the Huang-Huai-Hai Plain, although the overall accuracy is high, the deep learning models that rely solely on labels still exhibit certain limitations. For example, this type of model is prone to misjudgment when encountering terrain features that resemble farmland, such as ponds and grasslands, and often exhibits issues such as boundary blurring and discontinuity in the segmentation of farmland. In the South China areas, the highly fragmented nature of farmland, with its scattered or narrow distribution, further exacerbates the segmentation challenge. The deep learning models that rely solely on labels struggle to effectively identify such atypical farmland, leading to a significant decrease in segmentation accuracy. In contrast, VLMs demonstrate notable advantages in these agricultural regions. By incorporating farmland-related keywords – such as concentrated buildings and narrow strips – VLMs enhance their comprehension of both the inherent properties of farmland and the contextual information of its surrounding environment. This enriched understanding contributes to improved completeness and accuracy in farmland segmentation. In addition, this advantage is not limited to the aforementioned agricultural regions but is also consistently seen in the segmentation results of the other five regions. This further validates the generalization capability and robustness of the VLMs in diverse agricultural landscapes.

In summary, compared to the deep learning models that rely solely on labels, VLMs that incorporate captions demonstrate significant advantages in farmland segmentation across all agricultural regions. Language information effectively compensates for the limitations of the deep learning models that rely solely on labels in complex scenarios, enhancing the model's understanding of farmland morphology and the relationship between farmland and surrounding land cover, thereby significantly improving farmland segmentation accuracy.

4.4 Cross-domain performance evaluation of models trained on FarmSeg-VL

In order to evaluate the performance of models trained on the FarmSeg-VL dataset in cross-domain tasks, this section presents transfer tests using the VLMs (PixelLM, LaSagnA, LISA) and the deep learning models that rely solely on labels (U-Net, DeepLabV3, FCN, SegFormer), all trained on FarmSeg-VL, across multiple public datasets. The test datasets include DeepGlobe Land Cover (DGLC), LoveDA, and the Fine-Grained Farmland Dataset (FGFD). The DGLC dataset covers regions in Thailand, Indonesia, and India, while LoveDA includes areas in Nanjing, Changzhou, and Wuhan in China. The FGFD encompasses regions such as Heilongjiang, Hebei, Shaanxi, Guizhou, Hubei, Jiangxi, and Tibet in China. The specific details are provided in Table 1. To maintain consistency with the FarmSeg-VL test set and to ensure that the data are suitable for the models, we pre-processed DGLC and LoveDA, primarily by cropping the images to a size of 512 × 512 and merging non-farmland pixel labels, among other steps.
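The label-merging step can be illustrated as below. The farmland class index is dataset-specific, and the value shown is an assumption to be checked against each dataset's documentation.

```python
# Sketch of merging non-farmland classes into background (class index is an assumption).
import numpy as np
from PIL import Image

FARMLAND_ID = 7  # hypothetical agricultural class index; verify per dataset

def binarize_label(label_path: str) -> np.ndarray:
    label = np.array(Image.open(label_path))
    return (label == FARMLAND_ID).astype(np.uint8)  # 1 = farmland, 0 = all other classes
```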

Table 6. Farmland segmentation results of different methods based on the DGLC. Bold numerical values indicate the best values among the evaluation metrics.

Tables 4–6 present the experimental results based on FGFD, LoveDA, and DGLC, respectively. Overall, both the deep learning models that rely solely on labels and the VLMs exhibit strong cross-domain transferability. This can be attributed to the FarmSeg-VL dataset's broad geographic coverage and diverse seasonal variations, which provide a solid foundation for cross-domain feature learning. Notably, VLMs demonstrate significantly superior cross-domain transfer performance across all three datasets compared to the deep learning models that rely solely on labels. This advantage is primarily attributed to the fine-grained captions provided by FarmSeg-VL, which inject transferable semantic prior knowledge into the VLMs. For instance, when caption prompts such as “strip-shaped farmlands in spring” are provided, the models autonomously correlate farmland shape characteristics across different regions under spring conditions. This integration of semantic priors enables VLMs to overcome the representational limitations inherent in single-modality visual features, thereby maintaining enhanced discriminative capabilities in cross-domain scenarios.

Through the cross-domain experiments, this study draws two key conclusions. First, models trained on FarmSeg-VL exhibit significant cross-domain transferability, fully demonstrating the improvement in model generalization brought by FarmSeg-VL. Second, the introduction of captions breaks through the limitations of deep learning models that rely solely on labels, enabling models to decouple spatiotemporal heterogeneity interference and effectively improve segmentation accuracy in complex farming scenes.

4.5 Enhanced model transferability: comparative analysis of FarmSeg-VL and conventional farmland datasets

To verify that models trained on FarmSeg-VL outperform models trained on existing farmland datasets in terms of both segmentation accuracy and generalization, we conducted extensive comparative experiments in this section. First, to ensure the reliability of the experimental results, this study uses the latest dedicated dataset, the FGFD, as the benchmark for comparison. Since most existing farmland datasets follow the traditional “image + label” format (i.e., a paradigm that relies solely on labeled data), four commonly used deep learning models that rely solely on labels – U-Net, DeepLabV3, FCN, and SegFormer – were selected and trained on the FGFD dataset. For the proposed FarmSeg-VL dataset, three VLMs were selected for comparative experiments. Additionally, to ensure fairness, all trained models were uniformly tested on the LoveDA dataset.

Table 7. Performance of different datasets and methods based on the LoveDA dataset. Bold numerical values indicate the best values among the evaluation metrics.

The experimental results, shown in Table 7, reveal that the VLMs trained on the FarmSeg-VL dataset outperform the deep learning models that rely solely on labels trained on the FGFD dataset when tested on the LoveDA dataset. Specifically, the mIoU improved by 10 % to 40 %, and the mACC increased by 10 % to 30 %. This gap indicates that models trained on the FarmSeg-VL dataset, with its added language modality, transfer significantly better in farmland segmentation than models trained on the traditional FGFD dataset. Moreover, FarmSeg-VL reflects multiple aspects of farmland characteristics through captions, such as phenological characteristics, spatial distribution, topographic and geomorphic features, and the distribution of surrounding environments, allowing the model to learn rich and comprehensive information about farmland. With these detailed captions, models trained on FarmSeg-VL not only improve the accuracy of farmland segmentation but also enhance the model's ability to handle complex scenes. In summary, FarmSeg-VL is a large-scale, high-quality image–text dataset of farmland; it has demonstrated great potential in cross-scenario farmland segmentation and provides a strong data foundation for future research in farmland segmentation.

5 Data availability

The FarmSeg-VL dataset is accessible on the Zenodo data repository at https://doi.org/10.5281/zenodo.15860191 (Tao et al., 2025). The FarmSeg-VL dataset consists of image data, labels, and corresponding farmland text descriptions in JSON files.
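
A minimal loading sketch follows; the sub-directory and file names are assumptions for illustration, so the Zenodo record should be consulted for the actual layout.

import json
from pathlib import Path
from PIL import Image

root = Path("FarmSeg-VL")  # extracted Zenodo archive; layout assumed below

def load_sample(stem: str):
    """Load one image-label-caption triplet. The directory and file
    names here are illustrative, not the dataset's documented layout."""
    image = Image.open(root / "images" / f"{stem}.png")
    label = Image.open(root / "labels" / f"{stem}.png")
    caption = json.loads((root / "captions" / f"{stem}.json").read_text())
    return image, label, caption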

6 Conclusion

This study constructs FarmSeg-VL, a high-quality image–text dataset specifically designed for farmland segmentation, with key features including high-precision images and masks, extensive spatiotemporal coverage, and refined captions of farmland characteristics. In the dataset construction process, Google imagery with a resolution of 0.5–2 m was selected as the image data source. Through in-depth analysis of numerous farmland samples, five key attributes were summarized: inherent properties, phenological characteristics, spatial distribution, topographic and geomorphic features, and distribution of surrounding environments. These were further refined into 11 specific descriptive dimensions, covering shape; boundary patterns; season; sowing situation; geographic location; distribution; terrain; landscape features; and the distribution of waterbodies, buildings, and trees in the surrounding environment. Based on these keywords, a farmland description template was designed, and a semi-automated annotation method was used to generate binary mask labels and corresponding captions for each image. Ultimately, a dedicated dataset consisting of 22 605 image–text pairs was constructed.

To verify the advantages of FarmSeg-VL in enhancing farmland segmentation accuracy compared to general image–text datasets, this study first conducted fine-tuning experiments on three leading vision language segmentation models: LISA, PixelLM, and LaSagnA. The experimental results demonstrate that the models fine-tuned with FarmSeg-VL significantly outperform those trained with general image–text datasets in segmentation performance. Additionally, this study compared the VLMs trained on FarmSeg-VL to traditional deep learning models that rely solely on labels. The results show a 10 % to 20 % improvement in segmentation accuracy across different agricultural regions and datasets, highlighting that language guidance effectively mitigates the impact of spatiotemporal heterogeneity on farmland segmentation. Finally, the study compared traditional deep learning models that rely solely on labels, trained on the FGFD dataset, with the three VLMs trained on the FarmSeg-VL dataset; evaluation on the LoveDA dataset showed an improvement in test accuracy of approximately 15 %. These experiments show that models trained on FarmSeg-VL improve significantly in accuracy and robustness for farmland segmentation. As the first large-scale image–text dataset for farmland segmentation, FarmSeg-VL holds significant academic value and application potential. It is expected to advance research on the semantic understanding of farmland in remote sensing imagery, promote the development of more efficient and generalized segmentation models, and better serve the diverse needs of agricultural monitoring.

Appendix A: More details of farmland texture description in image–text dataset

As shown in Fig. A1, mainstream remote sensing image–text datasets, such as UCM-Captions, NWPU-Captions, RSICD, RSICap, and ChatEarthNet, generally adopt scene-level or object-level descriptions. These datasets often lack detailed characterization of farmland morphology, temporal features, and environmental context, making them insufficient for farmland segmentation tasks that require semantically and structurally rich textual information.

Figure A1. Details of farmland texture description in general remote sensing image–text datasets.

For example, UCM-Captions provides only simple and repetitive descriptions like “There is a piece of farmland”, without any specific texture or spatial information. NWPU-Captions offers slight improvements by adding color and shape descriptions, such as “Many dark-green circular fields are mixed with yellow rectangular fields”, but still includes no background context or agricultural semantics. RSICD focuses only on aggregated forms or land cover components, with descriptions like “The little farm is made up of grass and crops”, lacking both temporal cues and environmental context. RSICap provides relatively richer descriptions, for example, “In the image, there are many buildings and some farmlands located near a river”, which reflects spatial relationships between farmland and buildings or waterbodies. However, these descriptions are mostly static and fail to capture the dynamic properties of farmland over time. ChatEarthNet, designed primarily for land cover classification, presents slightly more complex descriptions such as “This image shows a balance between crop and grass areas”, but it still lacks detailed information about farmland morphology, terrain, crop types, and surrounding environmental elements. In contrast, the proposed FarmSeg-VL dataset is specifically designed for the farmland segmentation task, placing greater emphasis on fine-grained semantic information closely tied to the spatiotemporal characteristics of farmland. For each remote sensing image, the accompanying textual description includes the image capture time, geographic coordinates, and detailed references to landform, shape, boundary characteristics, topography, and surrounding features such as waterbodies, vegetation, and buildings. Additionally, the descriptions incorporate attributes such as cropping patterns and spatial layouts, providing comprehensive semantic support for accurate and context-aware farmland segmentation.
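
To make the contrast concrete, a FarmSeg-VL-style caption can be thought of as a structured record over the dimensions listed above. The example below is invented for illustration and is not a verbatim dataset entry.

# Illustrative (invented) caption record; field names and values are
# hypothetical and only mirror the descriptive dimensions in the text.
caption_record = {
    "season": "spring",
    "sowing_situation": "partially sown, bare soil visible",
    "location": "Huang-Huai-Hai Plain",
    "shape": "strip-shaped",
    "boundary": "clear, regular field boundaries",
    "distribution": "contiguous and concentrated",
    "terrain": "flat plain",
    "landscape": "large-scale mechanized farmland",
    "surroundings": {
        "waterbodies": "irrigation canal along the eastern edge",
        "buildings": "village in the northwest corner",
        "trees": "shelterbelt rows between fields",
    },
}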

Appendix B: Examples of five types of text features for farmland shapes

To provide readers with a more intuitive understanding of the farmland morphology in the FarmSeg-VL dataset, we present five additional examples of farmland shapes in Fig. B1.

Figure B1. Examples of farmland shapes.

Appendix C: Farmland segmentation results of different methods in different agricultural areas

Figure C1. Farmland segmentation results of different methods in the Northeast China Plain.

Figure C2. Farmland segmentation results of different methods in the Huang-Huai-Hai Plain.

Figure C3. Farmland segmentation results of different methods in the northern arid and semi-arid region.

Figure C4. Farmland segmentation results of different methods on the Loess Plateau.

Figure C5. Farmland segmentation results of different methods in the Yangtze River Middle and Lower Reaches Plain.

Figure C6. Farmland segmentation results of different methods in the South China areas.

Figure C7. Farmland segmentation results of different methods in the Sichuan Basin.

Figure C8. Farmland segmentation results of different methods on the Yungui Plateau.

Figure C9. Farmland segmentation results of different methods in the Northeast China Plain. (a) Original image. (b) Ground truth. (c) U-Net. (d) Deeplabv3. (e) FCN. (f) SegFormer. (g) PixelLM. (h) LaSagnA. (i) LISA.

Figure C10. Farmland segmentation results of different methods in the Huang-Huai-Hai Plain. (a) Original image. (b) Ground truth. (c) U-Net. (d) Deeplabv3. (e) FCN. (f) SegFormer. (g) PixelLM. (h) LaSagnA. (i) LISA.

Figure C11. Farmland segmentation results of different methods in the northern arid and semi-arid region. (a) Original image. (b) Ground truth. (c) U-Net. (d) Deeplabv3. (e) FCN. (f) SegFormer. (g) PixelLM. (h) LaSagnA. (i) LISA.

Figure C12. Farmland segmentation results of different methods on the Loess Plateau. (a) Original image. (b) Ground truth. (c) U-Net. (d) Deeplabv3. (e) FCN. (f) SegFormer. (g) PixelLM. (h) LaSagnA. (i) LISA.

Figure C13. Farmland segmentation results of different methods in the Yangtze River Middle and Lower Reaches Plain. (a) Original image. (b) Ground truth. (c) U-Net. (d) Deeplabv3. (e) FCN. (f) SegFormer. (g) PixelLM. (h) LaSagnA. (i) LISA.

Figure C14. Farmland segmentation results of different methods in the South China areas. (a) Original image. (b) Ground truth. (c) U-Net. (d) Deeplabv3. (e) FCN. (f) SegFormer. (g) PixelLM. (h) LaSagnA. (i) LISA.

Figure C15. Farmland segmentation results of different methods in the Sichuan Basin. (a) Original image. (b) Ground truth. (c) U-Net. (d) Deeplabv3. (e) FCN. (f) SegFormer. (g) PixelLM. (h) LaSagnA. (i) LISA.

Figure C16. Farmland segmentation results of different methods on the Yungui Plateau. (a) Original image. (b) Ground truth. (c) U-Net. (d) Deeplabv3. (e) FCN. (f) SegFormer. (g) PixelLM. (h) LaSagnA. (i) LISA.

Appendix D: Quantitative evaluation of semi-automated annotation efficiency

To quantify the annotation efficiency of the semi-automatic annotation framework proposed in this article, comparative experiments were conducted in this section. Specifically, we randomly selected four annotators, who annotated the masks and texts of 13 farmland remote sensing images using both the traditional manual-drawing method and the semi-automated annotation method, and we then compared the completion times. As shown in Fig. D1, the semi-automated method significantly reduced the average annotation time, saving approximately 2 min per image and improving overall efficiency by a factor of 1.5 (e.g., a drop from roughly 6 to 4 min per image would be consistent with both figures). This result indicates that the annotation tool developed in this article substantially improves efficiency and usability.

Figure D1. Comparison of farmland annotation efficiency.

Appendix E: Cross-regional applicability assessment of FarmSeg-VL

To verify the generalization performance of models trained on FarmSeg-VL on datasets from other countries with significantly different climates or cropping patterns, this paper selects part of Nordrhein-Westfalen, Germany, as the test benchmark. Test experiments were conducted using the LISA model. Specifically, we selected a subset of the Fiboa data covering Nordrhein-Westfalen and performed several pre-processing steps, including image downloading, vector boundary processing, and image and label cropping, to adapt it to our farmland segmentation model. The image and label overlay results for the test area are shown in Fig. E1.
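
A minimal sketch of this pre-processing is given below, assuming geopandas and rasterio with illustrative file names; the actual pipeline may differ in detail.

import geopandas as gpd
import rasterio
from rasterio.features import rasterize

# Burn Fiboa field-boundary polygons into a binary label aligned with a
# downloaded image tile, then crop image and label into fixed-size patches.
with rasterio.open("nrw_tile.tif") as src:  # hypothetical tile name
    image = src.read()                      # (bands, H, W)
    transform, crs = src.transform, src.crs
    height, width = src.height, src.width

fields = gpd.read_file("fiboa_nrw_boundaries.geojson").to_crs(crs)
label = rasterize(((geom, 1) for geom in fields.geometry),
                  out_shape=(height, width), transform=transform,
                  fill=0, dtype="uint8")

patch = 512
tiles = [(image[:, r:r + patch, c:c + patch], label[r:r + patch, c:c + patch])
         for r in range(0, height - patch + 1, patch)
         for c in range(0, width - patch + 1, patch)]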

Table E1. Farmland segmentation results of different methods based on Fiboa. Bold values indicate the best result for each evaluation metric.

Figure E1. Example of Fiboa data.

The experimental results are shown in Table E1, where we compare the cross-domain performance of the LISA model trained on the FarmSeg-VL dataset with that of the models evaluated on the public datasets in Sect. 4.4. Specifically, the FGFD and LoveDA datasets are from China, while the DGLC dataset covers regions in Thailand, Indonesia, and India. As shown in the table, the LISA model performs well in cross-domain testing, which can be attributed to the extensive geographical coverage and rich seasonal variations of the FarmSeg-VL dataset, providing a solid foundation for cross-domain feature learning. Notably, the LISA model achieves its best cross-domain results on the Fiboa dataset. This is due to the concentrated, contiguous, and well-defined farmland in the Fiboa region, which facilitates the extraction of discriminative features. Furthermore, the climatic and cropping-system differences between the Fiboa dataset and FarmSeg-VL further validate the applicability and strong generalization capability of FarmSeg-VL across the diverse agricultural contexts of different countries, highlighting its potential in global, heterogeneous farmland scenarios.

Appendix F: Model robustness verification

To evaluate the robustness of the model under different data partitioning conditions, we conducted additional experiments with the LISA model on the FarmSeg-VL dataset. Specifically, we first merged the original training, validation, and test sets and then randomly split the combined dataset into new training, validation, and test sets at a 7:2:1 ratio, as sketched below. This random splitting procedure was repeated three times to minimize the impact of stochastic variation, and the model was trained and evaluated independently for each split.
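
The splitting protocol can be sketched as follows; the seeds and sample identifiers are illustrative, not the values used in the experiments.

import random

def split_721(samples, seed):
    """Randomly split sample IDs into train/val/test at a 7:2:1 ratio."""
    rng = random.Random(seed)
    ids = list(samples)
    rng.shuffle(ids)
    n_train, n_val = int(0.7 * len(ids)), int(0.2 * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

all_ids = [f"img_{i:05d}" for i in range(22605)]  # 22 605 image-text pairs
for seed in (0, 1, 2):  # three repeated random partitions
    train, val, test = split_721(all_ids, seed)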

Table F1. Farmland segmentation results based on different test sets.

Table F1 shows the results for the four test sets (test 1 to test 4). The variation in results across the different test sets is minimal, demonstrating the robustness of the FarmSeg-VL dataset and the model. This outcome indicates that the balanced distribution and diverse geographical features of the dataset play a crucial role in the model's stability and generalization capability. Specifically, FarmSeg-VL is characterized by high-quality image and textual annotations and a broad distribution spanning different seasons and geographical conditions. This effectively reduces discrepancies between splits, thereby improving the model's robustness to variations in data partitioning.

Author contributions

The dataset was conceptualized by CT and HYW. DDZ, WLM, and ZFD carried out the dataset construction and the related experiments. DDZ prepared the initial draft of the paper, which was reviewed and revised by all of the authors.

Competing interests

The contact author has declared that none of the authors has any competing interests.

Disclaimer

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors.

Financial support

This research has been supported by the Science Fund for Distinguished Young Scholars of Hunan Province (grant no. 20221110072) and the National Natural Science Foundation of China (grant nos. 42471419 and 42171376).

Review statement

This paper was edited by Jia Yang and reviewed by two anonymous referees.

References

Brown, C. F., Brumby, S. P., Guzder-Williams, B., Birch, T., Hyde, S. B., Mazzariello, J., Czerwinski, W., Pasquarella, V. J., Haertel, R., Ilyushchenko, S., Schwehr, K., Weisse, M., Stolle, F., Hanson, C., Guinan, O., Moore, R., and Tait, A. M.: Dynamic World, Near real-time global 10 m land use land cover mapping, Sci. Data, 9, 251, https://doi.org/10.1038/s41597-022-01307-4, 2022. 

Cheng, Q., Huang, H., Xu, Y., Zhou, Y., Li, H., and Wang, Z.: NWPU-Captions Dataset and MLCA-Net for Remote Sensing Image Captioning, IEEE T. Geosci. Remote, 60, 1–19, https://doi.org/10.1109/TGRS.2022.3201474, 2022. 

Demir, I., Koperski, K., Lindenbaum, D., Pang, G., Huang, J., Basu, S., Hughes, F., Tuia, D., and Raskar, R.: DeepGlobe 2018: A Challenge to Parse the Earth through Satellite Images, in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA, 172–182, https://doi.org/10.1109/CVPRW.2018.00031, 2018. 

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, arXiv [preprint], https://doi.org/10.48550/arXiv.1810.04805, 24 May 2019. 

Duan, D., Sun, X., Liang, S., Sun, J., Fan, L., Chen, H., Xia, L., Zhao, F., Yang, W., and Yang, P.: Spatiotemporal Patterns of Cultivated Land Quality Integrated with Multi-Source Remote Sensing: A Case Study of Guangzhou, China, Remote Sens.-Basel, 14, 1250, https://doi.org/10.3390/rs14051250, 2022. 

Hou, W., Wang, Y., Su, J., Hou, Y., Zhang, M., and Shang, Y.: Multi-Scale Bilateral Spatial Direction-Aware Network for Cropland Extraction Based on Remote Sensing Images, IEEE Access, 11, 109997–110009, https://doi.org/10.1109/ACCESS.2023.3318000, 2023. 

Hu, Y., Yuan, J., Wen, C., Lu, X., and Li, X.: RSGPT: A Remote Sensing Vision Language Model and Benchmark, ISPRS J. Photogramm., 224, 272–286, https://doi.org/10.1016/j.isprsjprs.2025.03.028, 2025. 

Karra, K., Kontgis, C., Statman-Weil, Z., Mazzariello, J. C., Mathis, M., and Brumby, S. P.: Global land use / land cover with Sentinel 2 and deep learning, in: 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, Brussels, Belgium, 4704–4707, https://doi.org/10.1109/IGARSS47720.2021.9553499, 2021. 

Lai, X., Tian, Z., Chen, Y., Li, Y., Yuan, Y., Liu, S., and Jia, J.: LISA: Reasoning Segmentation via Large Language Model, arXiv [preprint], arXiv:2308.00692, 3 August 2023. 

Li, H., Lin, H., Luo, J., Wang, T., Chen, H., Xu, Q., and Zhang, X.: Fine-Grained Abandoned Cropland Mapping in Southern China Using Pixel Attention Contrastive Learning, IEEE J. Sel. Top. Appl., 17, 2283–2295, https://doi.org/10.1109/JSTARS.2023.3338454, 2024. 

Li, J., Wei, Y., Wei, T., and He, W.: A Comprehensive Deep-Learning Framework for Fine-Grained Farmland Mapping From High-Resolution Images, IEEE T. Geosci. Remote, 63, 1–15, https://doi.org/10.1109/TGRS.2024.3515157, 2025. 

Li, M., Long, J., Stein, A., and Wang, X.: Using a semantic edge-aware multi-task neural network to delineate agricultural parcels from remote sensing images, ISPRS J. Photogramm., 200, 24–40, https://doi.org/10.1016/j.isprsjprs.2023.04.019, 2023. 

Liu, H., Li, C., Wu, Q., and Lee, Y. J.: Visual Instruction Tuning, arXiv [preprint], arXiv:2304.08485, 11 December 2023. 

Lu, X., Wang, B., Zheng, X., and Li, X.: Exploring Models and Data for Remote Sensing Image Caption Generation, IEEE T. Geosci. Remote, 56, 2183–2195, https://doi.org/10.1109/TGRS.2017.2776321, 2018. 

Pan, T. and Zhang, R.: Spatiotemporal Heterogeneity Monitoring of Cropland Evolution and Its Impact on Grain Production Changes in the Southern Sanjiang Plain of Northeast China, Land, 11, 1159, https://doi.org/10.3390/land11081159, 2022. 

Phalke, A. R. and Özdoğan, M.: Large area cropland extent mapping with Landsat data and a generalized classifier, Remote Sens. Environ., 219, 180–195, https://doi.org/10.1016/j.rse.2018.09.025, 2018. 

Qu, B., Li, X., Tao, D., and Lu, X.: Deep semantic understanding of high resolution remote sensing image, in: 2016 International Conference on Computer, Information and Telecommunication Systems (CITS), Kunming, China, 1–5, https://doi.org/10.1109/CITS.2016.7546397, 2016. 

Ren, Z., Huang, Z., Wei, Y., Zhao, Y., Fu, D., Feng, J., and Jin, X.: PixelLM: Pixel Reasoning with Large Multimodal Model, arXiv [preprint], arXiv:2312.02228, 18 July 2024. 

Sishodia, R. P., Ray, R. L., and Singh, S. K.: Applications of Remote Sensing in Precision Agriculture: A Review, Remote Sens.-Basel, 12, 3136, https://doi.org/10.3390/rs12193136, 2020. 

Sumbul, G., Charfuelan, M., Demir, B., and Markl, V.: BigEarthNet: A Large-Scale Benchmark Archive For Remote Sensing Image Understanding, in: IGARSS 2019 – 2019 IEEE International Geoscience and Remote Sensing Symposium, 5901–5904, https://doi.org/10.1109/IGARSS.2019.8900532, 2019. 

Tao, C., Zhong, D., Mu, W., Du, Z., and Wu, H.: A large-scale image-text dataset benchmark for farmland segmentation, Zenodo [data set], https://doi.org/10.5281/zenodo.15860191, 2025. 

Tong, X.-Y., Xia, G.-S., Lu, Q., Shen, H., Li, S., You, S., and Zhang, L.: Land-cover classification with high-resolution remote sensing images using transferable deep models, Remote Sens. Environ., 237, 111322, https://doi.org/10.1016/j.rse.2019.111322, 2020. 

Tu, Y., Wu, S., Chen, B., Weng, Q., Bai, Y., Yang, J., Yu, L., and Xu, B.: A 30 m annual cropland dataset of China from 1986 to 2021, Earth Syst. Sci. Data, 16, 2297–2316, https://doi.org/10.5194/essd-16-2297-2024, 2024. 

Wang, J., Liu, B., and Xu, K.: Semantic segmentation of high-resolution images, Sci. China Inform. Sci., 60, 123101, https://doi.org/10.1007/s11432-017-9252-5, 2017.  

Wang, J., Zheng, Z., Ma, A., Lu, X., and Zhong, Y.: LoveDA: A Remote Sensing Land-Cover Dataset for Domain Adaptive Semantic Segmentation, arXiv [preprint], arXiv:2110.08733, 31 May 2022a. 

Wang, Y., Wu, X., and Zhu, H.: Spatio-Temporal Pattern and Spatial Disequilibrium of Cultivated Land Use Efficiency in China: An Empirical Study Based on 342 Prefecture-Level Cities, Land, 11, 1763, https://doi.org/10.3390/land11101763, 2022b. 

Wang, Z., Prabha, R., Huang, T., Wu, J., and Rajagopal, R.: SkyScript: A Large and Semantically Diverse Vision-Language Dataset for Remote Sensing, Proc. AAAI Conf. Artif. Intell., 38, 5805–5813, https://doi.org/10.1609/aaai.v38i6.28393, 2024. 

Wei, C., Tan, H., Zhong, Y., Yang, Y., and Ma, L.: LaSagnA: Language-based Segmentation Assistant for Complex Queries, arXiv [preprint], arXiv:2404.08506, https://doi.org/10.48550/arXiv.2404.08506, 12 April 2024. 

Wu, H., Mu, W., Zhong, D., Du, Z., Li, H., and Tao, C.: FarmSeg_VLM: A farmland remote sensing image segmentation method considering vision-language alignment, ISPRS J. Photogramm., 225, 423–439, https://doi.org/10.1016/j.isprsjprs.2025.05.010, 2025a. 

Wu, H., Du, Z., Zhong, D., Wang, Y., and Tao, C.: FSVLM: A Vision-Language Model for Remote Sensing Farmland Segmentation, IEEE T. Geosci. Remote, 63, 1–13, https://doi.org/10.1109/TGRS.2025.3532960, 2025b. 

Xie, D., Xu, H., Xiong, X., Liu, M., Hu, H., Xiong, M., and Liu, L.: Cropland Extraction in Southern China from Very High-Resolution Images Based on Deep Learning, Remote Sens.-Basel, 15, 2231, https://doi.org/10.3390/rs15092231, 2023. 

Yuan, Z., Xiong, Z., Mou, L., and Zhu, X. X.: ChatEarthNet: a global-scale image–text dataset empowering vision–language geo-foundation models, Earth Syst. Sci. Data, 17, 1245–1263, https://doi.org/10.5194/essd-17-1245-2025, 2025. 

Zanaga, D., Van De Kerchove, R., Daems, D., De Keersmaecker, W., Brockmann, C., Kirches, G., Wevers, J., Cartus, O., Santoro, M., Fritz, S., Lesiv, M., Herold, M., Tsendbazar, N.-E., Xu, P., Ramoino, F., and Arino, O.: ESA WorldCover 10 m 2021 v200 (v200), Zenodo [data set], https://doi.org/10.5281/ZENODO.7254221, 2022. 

Zhang, Z., Zhao, T., Guo, Y., and Yin, J.: RS5M and GeoRSCLIP: A Large Scale Vision-Language Dataset and A Large Vision-Language Model for Remote Sensing, IEEE T. Geosci. Remote, 62, 1–1, https://doi.org/10.1109/TGRS.2024.3449154, 2024. 

Zheng, Y., Long, H., and Chen, K.: Spatio-temporal patterns and driving mechanism of farmland fragmentation in the Huang-Huai-Hai Plain, J. Geogr. Sci., 32, 1020–1038, https://doi.org/10.1007/s11442-022-1983-8, 2022. 

Zhu, Z., Dai, Z., Li, S., and Feng, Y.: Spatiotemporal Evolution of Non-Grain Production of Cultivated Land and Its Underlying Factors in China, Int. J. Environ. Res. Pu., 19, 8210, https://doi.org/10.3390/ijerph19138210, 2022. 

Short summary
We construct FarmSeg-VL, the first high-quality image–text dataset for farmland segmentation. It covers eight agricultural regions across four seasons in China, offering extensive spatiotemporal coverage and fine-grained annotations. This dataset fills the gap in remote sensing image–text datasets for farmland, alleviates the challenge of spatiotemporal heterogeneity in farmland segmentation, and provides valuable data to support large-scale farmland monitoring and mapping.