the Creative Commons Attribution 4.0 License.
NortheastChinaMaizeYield10m: A 10-m Resolution Maize Yield Dataset for Northeast China (2019–2024) Generated via a Mechanistically Interpretable, Label-free Framework
Abstract. In the face of escalating global food demand and increasing climate variability, precise and granular crop yield monitoring is indispensable for maintaining regional agricultural stability. However, current deep learning approaches for yield estimation are severely constrained by their heavy reliance on massive in situ labeled data, which limits their application in data-scarce regions. Furthermore, these models often overlook the essential temporal evolution logic of yield formation and lack a systematic discussion regarding the contribution patterns of different feature dimensions, resulting in a black-box nature of the underlying model mechanisms. To address these bottlenecks, this study proposes a label-free maize yield estimation framework that couples mechanistic models with deep learning. The framework’s core strength lies in a physiologically complete simulation database, using the WOFOST model to exhaustively cover 30 years of climate variability and habitat combinations across Northeast China (1.24 × 10⁶ km²). A Gated Recurrent Unit (GRU) network was then introduced for end-to-end modeling, accurately capturing the energy accumulation trajectory from vegetative to reproductive growth. Validation against 458 independent ground points (2022–2024) demonstrated robust generalization with an R² of 0.69, an RMSE of 1.21 t/ha, and an RRMSE of 13.71 %, despite using no ground data for training. Our analysis revealed that integrating photosynthetic intensity (LAImean), duration (LAD) and peak features (LAImax) across growth stages is critical for accuracy, while omitting early-stage features significantly impairs the model's ability to capture cumulative growth effects. Furthermore, the model successfully captured the spatiotemporal yield anomalies caused by the 2023 typhoon and flooding events. Ultimately, this study generated a 10-m resolution maize yield dataset (2019–2024) for Northeast China. 
The dataset exhibits consistent interannual stability, with the Root Mean Square Error (RMSE) ranging from 7.98 % to 22.21 % and the coefficient of determination (R²) remaining above 0.44 at the county level. By deeply coupling mechanistic simulation with data mining, this dataset provides detailed support for optimizing agricultural production and guiding farming practices. The Northeast China Maize Yield 10-m dataset is openly available at https://zenodo.org/records/19547014 (Hu et al., 2026).
Status: open (until 01 Jul 2026)
- RC1: 'Comment on essd-2026-284', Anonymous Referee #1, 12 Jun 2026
- CC1: 'Comment on essd-2026-284', zhao zhang, 15 Jun 2026
I find the general idea of this study quite innovative. In particular, the use of crop-model simulations to generate training samples, combined with remote-sensing time-series data for high-resolution yield estimation, has clear methodological value. However, I believe several aspects of the manuscript need to be further strengthened to improve its rigor, interpretability, and the practical usability of the resulting data product.
1. Clarify the use of the term “Label-free”.
The manuscript currently presents “Label-free” as a central contribution, but this term may be somewhat misleading. I suggest that the authors define it more precisely, for example as “field-label-free training,” “simulation-driven training,” or “simulation-driven yield estimation.” This distinction should be clearly stated in the abstract, introduction, and methods sections to avoid the impression that the entire study is completely free of any field observations.
2. Add comparisons with existing yield products.
The manuscript would benefit from a more systematic comparison with existing public yield products, such as SPAM and the “A 10 m maize, rice and soybean yield dataset from 2016 to 2021 in Northeast China” dataset. Both visual comparisons and quantitative accuracy assessments would help demonstrate the added value of the proposed product. In addition, the reported accuracy appears to be lower than that reported in existing machine-learning-based yield estimation studies. The authors should discuss the possible reasons for this difference, such as the simulation-to-reality gap, limited field observations, simplified management assumptions, or uncertainties in remote-sensing inputs.
3. Include additional model-structure comparisons.
To better justify the choice of GRU, I suggest adding other time-series models, such as LSTM and Transformer-based models, as baseline comparisons. This would help show whether GRU is indeed the most suitable architecture for this task, rather than simply one possible option.
4. Strengthen the mechanistic interpretability analysis.
The manuscript mainly relies on feature-combination experiments to assess the effects of different input variables. I do not think this is sufficient to support a strong claim of “mechanistic interpretability”. I suggest incorporating model interpretation tools such as SHAP to quantify the dynamic contributions of different factors across growth stages to the final yield estimation. This would provide a clearer understanding of how the model uses information from different phenological periods and environmental variables.
5. Provide confidence or uncertainty layers for the yield product.
For a data product intended for practical use, a single yield estimate may not be enough. I suggest providing annual 10 m confidence or uncertainty maps in TIF format. Such layers would allow users to identify where the yield estimates are more reliable and where they should be interpreted with caution.
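One possible way to derive such a layer is sketched below. It assumes that an ensemble of yield predictions is available for every pixel, for example from models trained with different random seeds, and reports the per-pixel spread. The ensemble source, the array layout `(n_members, rows, cols)` and the function name `ensemble_uncertainty` are hypothetical and are not taken from the manuscript. Writing the result to TIF is omitted.

```python
import numpy as np

def ensemble_uncertainty(pred_stack):
    """Per-pixel mean, standard deviation and coefficient of variation (%).

    pred_stack: array of shape (n_members, rows, cols) in t/ha (assumed layout).
    NaN pixels (e.g. non-maize) stay NaN in all three outputs.
    """
    stack = np.asarray(pred_stack, dtype=float)
    mean = stack.mean(axis=0)
    std = stack.std(axis=0, ddof=1)  # sample standard deviation across members
    with np.errstate(invalid="ignore", divide="ignore"):
        cv = np.where(mean > 0, 100.0 * std / mean, np.nan)
    return mean, std, cv

# Illustrative three-member ensemble on a 1 x 2 pixel map (made-up values).
stack_demo = np.array([[[8.0, 6.0]], [[9.0, 6.0]], [[10.0, 6.0]]])
mean_demo, std_demo, cv_demo = ensemble_uncertainty(stack_demo)
```

A layer such as `std_demo` or `cv_demo` could be distributed alongside each annual yield map so that users can mask or down-weight pixels with a large spread.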
6. Report uncertainty in model-performance metrics.
Machine-learning models are affected by randomness in model training, sample splitting, and parameter initialization. Therefore, I suggest reporting uncertainty ranges or confidence intervals for the main accuracy metrics. Methods such as bootstrap resampling, repeated random splits, spatial cross-validation, or Monte Carlo experiments could be used to report the mean and 95% confidence interval of R², RMSE, MAE, and bias. This would make the performance assessment more robust and more convincing.
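As a minimal sketch of the bootstrap option, the code below resamples paired validation points with replacement and reports the mean and a percentile 95% interval for each metric. It assumes R² is defined as 1 − SS_res/SS_tot, which may differ from the definition used in the manuscript. The function names and the synthetic data are illustrative only. This procedure captures sampling uncertainty in the validation set, not randomness in model training, which would need repeated training runs.

```python
import numpy as np

def metrics(obs, pred):
    """R², RMSE, MAE and mean bias for paired observations and predictions."""
    err = pred - obs
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((obs - obs.mean()) ** 2)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": np.sqrt(np.mean(err ** 2)),
        "mae": np.mean(np.abs(err)),
        "bias": np.mean(err),
    }

def bootstrap_ci(obs, pred, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap: returns {metric: (mean, lower, upper)}."""
    obs = np.asarray(obs, dtype=float)
    pred = np.asarray(pred, dtype=float)
    rng = np.random.default_rng(seed)
    n = obs.size
    samples = {k: np.empty(n_boot) for k in ("r2", "rmse", "mae", "bias")}
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample points with replacement
        for k, v in metrics(obs[idx], pred[idx]).items():
            samples[k][b] = v
    out = {}
    for k, v in samples.items():
        lower, upper = np.quantile(v, [alpha / 2, 1 - alpha / 2])
        out[k] = (v.mean(), lower, upper)
    return out

# Synthetic stand-in for a set of validation points (values are made up).
rng_demo = np.random.default_rng(42)
obs_demo = rng_demo.normal(8.8, 2.0, size=458)
pred_demo = obs_demo + rng_demo.normal(0.0, 1.2, size=458)
ci_demo = bootstrap_ci(obs_demo, pred_demo)
```

For spatially clustered field samples, resampling whole clusters (for example counties) rather than individual points would be a more conservative choice.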
7. Explain how the model captured typhoon-related flood-induced yield losses in 2023.
The manuscript highlights the model’s ability to capture yield losses caused by typhoon-related flooding in 2023. However, if WOFOST was mainly run under a water-limited mode, the simulated training samples would primarily represent water deficit or drought stress rather than yield losses caused by excessive soil moisture or waterlogging. The authors should explain which input signals allowed the model to detect flood-related yield reductions. Was the decline in LAI the main pathway? Were precipitation anomalies, soil moisture conditions, abrupt vegetation-index changes, or post-disaster canopy recovery also involved? If waterlogging processes were not explicitly represented in the simulations, this limitation should be clearly acknowledged in the discussion.
8. Expand the range of WOFOST simulation scenarios.
Although the study generates a series of crop-model simulations, the simulation settings appear to be relatively simplified. For example, soil parameters are limited to four major loam types, and only four fixed sowing dates are considered. These assumptions may not fully capture the diversity of real-world field conditions. The authors should consider adding more combinations of sowing dates, soil parameters, water management, fertilization levels, and irrigation scenarios to improve the representativeness of the simulated training samples.
9. Add validation at a finer administrative scale.
Validation based only on prefecture- or city-level yield statistics may be too coarse and could mask local spatial biases. I suggest collecting county-level maize yield statistics for the three northeastern provinces from 2019 to 2024, if available, and using them for additional validation. This would provide a more rigorous assessment of the spatial accuracy of the proposed yield product.
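The aggregation step behind such a county-level check can be sketched as follows. The sketch assumes the yield map and a county-ID raster share the same grid, that non-maize pixels are NaN, and that pixels have equal area so a simple pixel mean is adequate. The function names, zone numbering (0 to n_zones − 1) and demo values are hypothetical.

```python
import numpy as np

def zonal_mean(yield_map, zone_ids, n_zones):
    """Mean yield per zone, ignoring NaN pixels; zones without pixels give NaN."""
    y = np.asarray(yield_map, dtype=float).ravel()
    z = np.asarray(zone_ids, dtype=np.int64).ravel()
    valid = ~np.isnan(y)
    sums = np.bincount(z[valid], weights=y[valid], minlength=n_zones)
    counts = np.bincount(z[valid], minlength=n_zones)
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(counts > 0, sums / counts, np.nan)

def rrmse_percent(obs, pred):
    """Relative RMSE (%) = 100 * RMSE / mean(obs), skipping NaN pairs."""
    obs = np.asarray(obs, dtype=float)
    pred = np.asarray(pred, dtype=float)
    ok = ~np.isnan(obs) & ~np.isnan(pred)
    rmse = np.sqrt(np.mean((pred[ok] - obs[ok]) ** 2))
    return 100.0 * rmse / obs[ok].mean()

# Tiny illustrative grid: two counties with pixels, a third with none.
yield_demo = np.array([[8.0, 9.0, np.nan, 6.0],
                       [10.0, 9.0, 7.0, 5.0]])
zones_demo = np.array([[0, 0, 1, 1],
                       [0, 0, 1, 1]])
county_pred = zonal_mean(yield_demo, zones_demo, n_zones=3)
county_stats = np.array([9.0, 6.6, 7.0])  # made-up statistical yields, t/ha
county_rrmse = rrmse_percent(county_stats, county_pred)
```

Running the same comparison at both prefecture and county level would show directly how much local bias is hidden by the coarser aggregation.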
10. Discuss the simulation-to-reality gap more explicitly.
A key assumption of this study is that a deep-learning model trained on crop-model simulations can be successfully transferred to real remote-sensing conditions. Therefore, the gap between simulated samples and real-world fields is central to the reliability of the method. I suggest adding a dedicated discussion of this simulation-to-reality gap, including uncertainties related to crop-model parameters, management practices, remote-sensing LAI retrieval, meteorological inputs, and the absence or simplification of certain real-world stress processes such as flooding.
Citation: https://doi.org/10.5194/essd-2026-284-CC1
Data sets
NortheastChinaMaizeYield10m: A 10-m Resolution Maize Yield Dataset for Northeast China (2019–2024) Generated via a Mechanistically Interpretable, Label-free Framework. Jingbo Hu. https://zenodo.org/records/19547014
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 217 | 34 | 9 | 260 | 11 | 9 |
This manuscript presents a well-structured and comprehensive framework for large-scale maize yield estimation by integrating process-based modeling, remote sensing data, and a GRU deep learning model. The overall framework is well designed and addresses the common challenge of limited field yield observations in large-scale agricultural monitoring. The generation of a 10 m resolution maize yield dataset covering Northeast China from 2019 to 2024 further demonstrates the practical value of the framework. The evaluation based on both in-situ measurements and government statistical data provides strong validation. The study is relevant to the scope of the journal and provides useful insights into the integration of process-based models and deep learning for yield estimation.
The term "label-free" is used frequently, from the title to the main text. However, the study still relies on WOFOST-simulated labels during model training. It would be helpful for the authors to clarify the exact meaning and scope of "label-free" in the introduction, particularly how it differs from conventional supervised and unsupervised learning methods.
The dataset (field measurements, meteorological/soil data, satellite imagery, crop distribution maps, and statistics) is comprehensive and representative. However, some data preprocessing steps require more detailed technical descriptions. For example, the pre-processing for Sentinel-2 imagery is only briefly introduced, and the source and acquisition of the statistical datasets are not clearly documented.
The current input features are almost exclusively centered on LAI. Some important factors affecting yield formation, such as water stress and extreme temperature conditions, are not explicitly considered. The authors should further explain the rationale for selecting LAI as the core modeling feature. In particular, they could discuss whether LAI can be regarded as an integrated proxy for crop canopy growth and photosynthetic accumulation, as well as the advantages of using LAI for large-scale yield estimation.
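To make the LAI-derived features named in the abstract (LAImean, LAD, LAImax) concrete, the following minimal sketch computes them per growth stage from an LAI time series. It assumes LAD is the trapezoidal integral of LAI over days within a stage, and that each stage contains at least one observation. The stage boundaries, function name and demo values are hypothetical and not the authors' implementation.

```python
import numpy as np

def lai_stage_features(doy, lai, stage_bounds):
    """Return a list of per-stage dicts with lai_mean, lai_max and lad.

    doy: observation days of year (ascending); lai: LAI values (m2/m2).
    stage_bounds: list of (start_doy, end_doy) pairs, inclusive (assumed).
    """
    doy = np.asarray(doy, dtype=float)
    lai = np.asarray(lai, dtype=float)
    feats = []
    for start, end in stage_bounds:
        in_stage = (doy >= start) & (doy <= end)
        t, v = doy[in_stage], lai[in_stage]
        # Leaf area duration: trapezoidal integral of LAI over time (LAI-days).
        lad = float(np.sum(0.5 * (v[1:] + v[:-1]) * np.diff(t)))
        feats.append({
            "lai_mean": float(v.mean()),  # photosynthetic intensity proxy
            "lai_max": float(v.max()),    # peak canopy size
            "lad": lad,                   # duration-weighted canopy size
        })
    return feats

# Illustrative 10-day LAI series split into two made-up stages.
doy_demo = [150, 160, 170, 180, 190, 200]
lai_demo = [1.0, 2.0, 3.0, 4.0, 4.0, 3.0]
feats_demo = lai_stage_features(doy_demo, lai_demo, [(150, 170), (180, 200)])
```

Because all three quantities summarize canopy development only, stresses that do not leave a signature in LAI within the observation window would not be visible to a model built on features of this kind.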
The discussion regarding the differences between simulated and real-world observations could be further strengthened. While Figures 4 and 5 demonstrate the overall representativeness of the simulated data, some factors in real agricultural systems are not explicitly represented in the WOFOST simulations. Additional discussion of these potential limitations would improve the rigor of the manuscript.
The discussion regarding the ability of the simulated dataset to represent extreme conditions may be somewhat overstated. Although the results from the 2023 disaster year demonstrate the potential robustness of the framework under adverse conditions, they may not be sufficient to prove comprehensive coverage of all extreme scenarios. A more cautious interpretation of these findings is recommended.