This work is distributed under the Creative Commons Attribution 4.0 License.
CY-Bench: A comprehensive benchmark dataset for sub-national crop yield forecasting
Abstract. In-season, pre-harvest crop yield forecasts are essential for enhancing transparency in commodity markets and improving food security. They play a key role in increasing resilience to climate change and extreme events and thus contribute to the United Nations’ Sustainable Development Goal 2 of zero hunger. Pre-harvest crop yield forecasting is a complex task, as several interacting factors contribute to yield formation, including in-season weather variability, extreme events, long-term climate change, soil, pests, diseases and farm management decisions. Several modeling approaches have been employed to capture complex interactions among such predictors and crop yields. Prior research on in-season, pre-harvest crop yield forecasting has primarily been case-study based, which makes it difficult to compare modeling approaches and measure progress systematically. To address this gap, we introduce CY-Bench (Crop Yield Benchmark), a comprehensive dataset and benchmark to forecast maize and wheat yields at a global scale. CY-Bench was conceptualized and developed within the Machine Learning team of the Agricultural Model Intercomparison and Improvement Project (AgML) in collaboration with agronomists, climate scientists, and machine learning researchers. It features publicly available sub-national yield statistics and relevant predictors, such as weather data, soil characteristics, and remote sensing indicators, that have been pre-processed, standardized, and harmonized across spatio-temporal scales. With CY-Bench, we aim to: (i) establish a standardized framework for developing and evaluating data-driven models across diverse farming systems in more than 25 countries across six continents; (ii) enable robust and reproducible model comparisons that address real-world operational challenges; (iii) provide an openly accessible dataset to the earth system science and machine learning communities, facilitating research on time series forecasting, domain adaptation, and online learning. The dataset (https://doi.org/10.5281/zenodo.11502142; Paudel et al., 2025a) and accompanying code (https://github.com/WUR-AI/AgML-CY-Bench; Paudel et al., 2025b) are openly available to support the continuous development of advanced data-driven models for crop yield forecasting and to enhance decision-making on food security.
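To illustrate how harmonized predictors and yield statistics of this kind can be combined into a modeling dataset, the short Python sketch below joins sub-national yield labels with season-aggregated weather features. The file and column names are illustrative assumptions rather than the actual CY-Bench layout; the repository documentation describes the authoritative structure.

```python
# Hedged sketch: combining CY-Bench-style yield labels and weather predictors
# with pandas. File and column names (yield_maize_US.csv, adm_id, tmax, tmin,
# prec) are illustrative assumptions, not the authoritative CY-Bench layout;
# see https://github.com/WUR-AI/AgML-CY-Bench for the actual structure.
import pandas as pd

# Hypothetical per-country CSVs: sub-national yields and daily weather.
yields = pd.read_csv("yield_maize_US.csv")    # columns: adm_id, year, yield
weather = pd.read_csv("meteo_maize_US.csv")   # columns: adm_id, date, tmax, tmin, prec

# Reduce daily weather to one row per region-year before joining.
weather["year"] = pd.to_datetime(weather["date"]).dt.year
features = (
    weather.groupby(["adm_id", "year"])[["tmax", "tmin", "prec"]]
    .mean()
    .reset_index()
)

# Align predictors with yield labels on region and year.
dataset = yields.merge(features, on=["adm_id", "year"], how="inner")
print(dataset.head())
```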
Status: open (until 17 May 2025)
AC1: 'Correction to Author List', Michiel Kallenberg, 13 Mar 2025
Due to an oversight, three contributing authors were omitted from the author list. Their names and affiliations are as follows:
- Dainius Masiliūnas, Wageningen University & Research
- Allard de Wit, Wageningen University & Research
- Maximilian Zachow, Technical University of Munich
They will be included in the next revision of the manuscript.
Citation: https://doi.org/10.5194/essd-2025-83-AC1
RC1: 'Comment on essd-2025-83', Anonymous Referee #1, 09 Apr 2025
This study proposes a benchmark for sub-national crop yield forecasting. The authors did a good job building the dataset and the data-sharing platform. However, there are some fundamental issues with the study design and protocols that need to be addressed before the manuscript can be considered for publication:
- Novelty: The novelty of this study is not entirely clear. Based on the reviewer's previous experience reviewing/reading benchmark dataset papers, researchers typically contribute either newly collected datasets—often the result of years of fieldwork or labor-intensive data labeling—or propose novel modeling approaches that produce grid-level maps with global coverage. Importantly, such studies also provide new insights derived from their unique datasets. In contrast, the current study relies on publicly available yield statistics, processes commonly used data sources, and generates predictors following standard protocols. Both the yield data and predictors have already been widely used in previous research. While compiling them in one place is useful, it is difficult to identify what is truly novel about this work.
- Static crop masks: Crop masks play a crucial role in processing yield predictors, as they help eliminate irrelevant pixels and reduce noise in the data. However, this study applies a static crop mask across multiple years, which is problematic. This approach does not align with the standards typically expected in benchmarking studies. (A short sketch after this list illustrates how the crop mask enters predictor aggregation.)
- Quality control and uncertainty analysis: Government-reported statistics are not always accurate, and data quality can vary significantly across countries. Therefore, it is essential to perform quality control and/or assess the uncertainty associated with the data samples, an important step that appears to be missing in the current study.
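For context on the crop-mask point above, here is a minimal Python sketch of crop-mask-weighted spatial aggregation using synthetic arrays; it is a stand-in for the kind of preprocessing under discussion, not the paper's actual pipeline.

```python
# Minimal sketch of crop-mask-weighted spatial aggregation. The arrays are
# synthetic stand-ins for a gridded predictor (e.g. daily temperature) and a
# fractional crop mask on the same grid; a real pipeline would read rasters,
# e.g. with rasterio or xarray.
import numpy as np

rng = np.random.default_rng(0)
temperature = rng.normal(20.0, 3.0, size=(100, 100))     # synthetic predictor grid
crop_fraction = rng.uniform(0.0, 1.0, size=(100, 100))   # static crop-mask weights

# Weighted mean over the region: pixels with more cropland contribute more.
# With a static mask these weights are identical for every year, which is the
# limitation the comment above points out.
region_value = np.average(temperature, weights=crop_fraction)
print(f"crop-mask-weighted regional temperature: {region_value:.2f}")
```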
More specific comments are given below:
- “Crop yield forecasts are produced by both private entities and government institutes using field surveys, process-based crop models, statistical regression and machine learning (Basso and Liu, 2019; Schauberger et al., 2020; Paudel et al., 2021; Gavasso-Rita et al., 2023)”: It would be helpful if the authors clearly indicated which citations are associated with each specific method. Also, it is not clear what the difference is between statistical regression and machine learning.
- Line 35: It is biased to give only the example of the EU while ignoring other major crop production regions (China, the US, South America).
- Line 59: “Benchmark datasets must cover a wide variety of regions and countries”: The rationale behind this requirement is unclear. Why is it essential for benchmark datasets to span such a broad geographic scope? I think benchmark datasets are task-specific. From the classic computer vision dataset ImageNet to the remote sensing datasets (the UC-Merced dataset (Yang and Newsam, 2010), the WHU-RS19 dataset (Xia et al., 2010), the AID dataset (Xia et al., 2017)), they focus on specific classes/regions/tasks. For the same reason, I think a dataset focusing on the U.S. only or China only can still be regarded as a benchmark.
- Line 110: “we engaged a diverse community of researchers to weigh the benefits and limitations of data sources for each type of data necessary to produce crop yield forecasts” Could the authors be more specific on the decision and quality control processes? What specific benefits and limitations have been considered?
- Line 115: “The most relevant weather variables for crop yield forecasting are temperature, solar radiation, and precipitation” It is controversial. Some studies have claimed that VPD and ET are more informative than temperature and precipitation in terms of yield prediction.
- Line 118: Why was AgERA5 selected? I think there are a lot of alternative choices (PRISM, Gridmet, TerraClimate …) that have been used in yield prediction studies.
- Line 127: I think SSM and RSM are highly correlated.
- Line 145: It is not clear why only fPAR and NDVI were chosen. NDVI is known to saturate at high biomass, and there are quite a few alternatives (EVI, GCVI).
- Line 149: Why use the eight-day composite?
- “Predictor data and yield statistics often differ in spatial and temporal resolution, requiring further processing to align them effectively” How would the mismatch between data sources increase the uncertainty and impact the quality of the predictors?
- Line 237 “CY-Bench currently includes predictor data up to and including 2023.” What is the starting year?
- Figure 3-4: Since the CY-Bench dataset focuses on sub-national yield statistics, it is misleading to color whole countries instead of coloring only the sub-national units with yield records. For example, there is no corn/wheat in Alaska (US).
- “The crop masks and crop calendars included in CY-Bench are static, i.e. they do not reflect yearly changes.” This is a critical issue. Cropland extent changes dramatically from year to year, even at the sub-national level. By using a static crop mask, the predictors can be totally wrong in early years.
- On the shared GitHub leaderboard, there are quite a few countries in which the ML models achieved poor accuracy (negative R2; see the synthetic sketch below). Given that, how would the authors justify that the input features have been well processed, or that the uncertainty in the yield data has been well controlled? Could the authors explain whether it is safe to use a benchmark dataset whose benchmark accuracies are that low?
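For concreteness, the synthetic Python sketch below (not based on CY-Bench or leaderboard data) shows how a negative R2 arises whenever a model's squared error exceeds that of a constant mean-yield predictor.

```python
# Synthetic sketch of why R2 can be negative: R2 compares a model against the
# mean of the observations, so any model with larger squared error than the
# mean predictor scores below zero. All values here are made up.
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)
y_true = rng.normal(loc=8.0, scale=1.0, size=50)        # synthetic yields (t/ha)

# A model that is noisier than simply predicting the mean yield.
y_poor = y_true + rng.normal(loc=0.0, scale=2.0, size=50)
# The trivial baseline: always predict the mean observed yield.
y_mean = np.full_like(y_true, y_true.mean())

print("poor model R2:", r2_score(y_true, y_poor))       # typically negative
print("mean baseline R2:", r2_score(y_true, y_mean))    # exactly 0 by definition
```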
Citation: https://doi.org/10.5194/essd-2025-83-RC1
AC2: 'Authors' Initial Response to Reviewer Comments', Michiel Kallenberg, 15 Apr 2025
Dear Editor and Reviewers,
We sincerely thank you for your insightful suggestions and feedback, as well as the time you have dedicated to reviewing our manuscript. Please find attached our initial response to your comments.
Sincerely,
Authors
Data sets
CY-Bench: A comprehensive benchmark dataset for subnational crop yield forecasting D. Paudel et al. https://doi.org/10.5281/zenodo.11502142
Model code and software
CY-Bench: A comprehensive benchmark dataset for subnational crop yield forecasting D. Paudel et al. https://github.com/WUR-AI/AgML-CY-Bench/
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 323 | 115 | 7 | 445 | 7 | 8 |