NortheastChinaMaizeYield10m: A 10-m Resolution Maize Yield Dataset for Northeast China (2019–2024) Generated via a Mechanistically Interpretable, Label-free Framework
Abstract. As global food demand rises and climate variability increases, precise, fine-grained crop yield monitoring is indispensable for regional agricultural stability. However, current deep learning approaches to yield estimation are constrained by their reliance on large volumes of in situ labeled data, which limits their use in data-scarce regions. These models also tend to overlook the temporal logic of yield formation and rarely examine how different feature dimensions contribute to the estimate, leaving the underlying model mechanisms a black box. To address these bottlenecks, this study proposes a label-free maize yield estimation framework that couples mechanistic models with deep learning. The framework's core strength is a physiologically complete simulation database, built with the WOFOST model to span 30 years of climate variability and habitat combinations across Northeast China (1.24 × 10⁶ km²). A Gated Recurrent Unit (GRU) network was then trained end to end on this database to capture the energy accumulation trajectory from vegetative to reproductive growth. Validation against 458 independent ground points (2022–2024) showed robust generalization, with an R² of 0.69, an RMSE of 1.21 t/ha, and an RRMSE of 13.71 %, even though no ground data were used for training. Our analysis showed that integrating photosynthetic intensity (LAImean), duration (LAD), and peak (LAImax) features across growth stages is critical for accuracy, whereas omitting early-stage features significantly impairs the model's ability to capture cumulative growth effects. The model also captured the spatiotemporal yield anomalies caused by the 2023 typhoon and flooding events. Ultimately, this study generated a 10-m resolution maize yield dataset (2019–2024) for Northeast China.
The dataset is stable across years at the county level, with the relative root mean square error (RRMSE) ranging from 7.98 % to 22.21 % and the coefficient of determination (R²) remaining above 0.44. Because it couples mechanistic simulation with data mining, the dataset provides fine-scale support for optimizing agricultural production and guiding farming practices. The Northeast China Maize Yield 10-m dataset is openly available at https://zenodo.org/records/19547014 (Hu et al., 2026).
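The accuracy figures quoted in the abstract (R², RMSE, RRMSE) can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so the definitions below are assumptions: R² is taken as 1 − SS_res/SS_tot, and RRMSE as RMSE divided by the mean observed yield, expressed in percent. The function name `yield_metrics` and the sample values are hypothetical and are not taken from the dataset.

```python
import math


def yield_metrics(observed, predicted):
    """Return R2, RMSE and RRMSE (%) for paired yield values in t/ha.

    Assumed definitions (not confirmed by the source):
      R2    = 1 - SS_res / SS_tot
      RMSE  = sqrt(mean squared error)
      RRMSE = 100 * RMSE / mean(observed)
    """
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("observed and predicted must be non-empty and equal in length")

    n = len(observed)
    mean_obs = sum(observed) / n

    # Residual and total sums of squares
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values are constant; R2 is undefined")

    rmse = math.sqrt(ss_res / n)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": rmse,
        "rrmse": 100.0 * rmse / mean_obs,
    }


# Hypothetical ground-truth and estimated yields (t/ha), for illustration only
metrics = yield_metrics([8.0, 9.0, 10.0], [8.5, 9.0, 9.5])
print(metrics)
```

Under the mean-normalized definition assumed here, an RMSE of 1.21 t/ha together with an RRMSE of 13.71 % would correspond to a mean observed yield of roughly 8.8 t/ha; this is an inference from the reported numbers, not a value stated in the source.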