NortheastChinaSoybeanYield20m: an annual soybean yield dataset at 20 m in Northeast China from 2019 to 2023

Xu, Jingyuan; Du, Xin; Dong, Taifeng; Li, Qiangzi; Zhang, Yuan; Wang, Hongyan; Xiao, Jing; Zhang, Jiashu; Shen, Yunqi; Dong, Yong

doi:10.5194/essd-2024-586

Preprints

https://doi.org/10.5194/essd-2024-586

08 Jan 2025

Status: a revised version of this preprint is currently under review for the journal ESSD.

Abstract. Accurate monitoring of crop yield is important for ensuring food security. However, existing yield datasets with a coarse spatial resolution are inadequate for capturing small-scale spatial heterogeneity. Current yield estimation methods, such as machine learning models or the assimilation of remotely sensed biophysical variables into crop growth models, depend heavily on ground observations and involve significant computational costs. To address these problems, a hybrid framework coupling the World Food Studies Simulation Model (WOFOST) and the Gated Recurrent Unit (GRU) model was proposed to generate a 20 m soybean yield dataset in Northeast China from 2019 to 2023 (NortheastChinaSoybeanYield20m). A soybean growth dataset was first generated with WOFOST by simulating various production scenarios (climates, crop varieties, soil types and agro-management practices). The GRU model was then trained to characterize the relationship between model-simulated leaf area index (LAI) and soybean yield. The trained model was then applied to estimate soybean yield in Northeast China using time series of LAI at different growth stages derived from Sentinel-2. The accuracy of the dataset was evaluated against in-situ measurements and statistical data. The overall root mean squared error (RMSE) was 287.44 kg ha^-1 at the field scale and 272.36 kg ha^-1 at the regional scale. Results were stable across years, with a mean relative error (MRE) averaging 11.46 % at the municipal scale and 7.94 % at the provincial scale. The results demonstrate that NortheastChinaSoybeanYield20m captures the spatio-temporal variation of soybean yield and can be applied to optimizing soybean production distribution and guiding agricultural decision-making. The NortheastChinaSoybeanYield20m dataset can be downloaded from https://doi.org/10.5281/zenodo.14263103 (Xu et al., 2024).
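To make the LAI-to-yield step in the abstract more concrete, the following is a minimal NumPy sketch of a single GRU layer with a linear read-out that maps an LAI time series to one yield value. It is not the authors' implementation: the class name `TinyGRURegressor`, the hidden size, the random (untrained) weights and the example LAI values are all hypothetical, and the page does not state the actual architecture or training setup.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class TinyGRURegressor:
    """Single GRU layer plus linear read-out (illustrative, untrained)."""

    def __init__(self, n_features, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        scale = 0.1  # hypothetical initialisation scale
        # One parameter set per gate: update (z), reset (r), candidate (h).
        self.W = {g: rng.normal(0, scale, (n_hidden, n_features)) for g in "zrh"}
        self.U = {g: rng.normal(0, scale, (n_hidden, n_hidden)) for g in "zrh"}
        self.b = {g: np.zeros(n_hidden) for g in "zrh"}
        self.w_out = rng.normal(0, scale, n_hidden)
        self.b_out = 0.0

    def step(self, x_t, h_prev):
        """Advance the hidden state by one time step."""
        z = sigmoid(self.W["z"] @ x_t + self.U["z"] @ h_prev + self.b["z"])
        r = sigmoid(self.W["r"] @ x_t + self.U["r"] @ h_prev + self.b["r"])
        h_cand = np.tanh(
            self.W["h"] @ x_t + self.U["h"] @ (r * h_prev) + self.b["h"]
        )
        # Convention used here: z weights the candidate state.
        # Some formulations swap the roles of z and (1 - z).
        return (1.0 - z) * h_prev + z * h_cand

    def predict(self, lai_series):
        """Run the sequence through the GRU and read out one scalar."""
        x = np.asarray(lai_series, dtype=float)
        if x.ndim == 1:
            x = x[:, None]  # treat a 1-D series as one feature per step
        h = np.zeros(self.b["z"].shape[0])
        for x_t in x:
            h = self.step(x_t, h)
        return float(self.w_out @ h + self.b_out)


# Hypothetical LAI values over a growing season (not data from the study).
lai = [0.5, 1.8, 3.9, 4.6, 3.2, 1.1]
model = TinyGRURegressor(n_features=1, n_hidden=8)
print(model.predict(lai))  # arbitrary number: the weights are untrained
```

In the framework described above, the weights would be fitted on WOFOST-simulated pairs of LAI series and yield before the model is applied to Sentinel-2 LAI; that training step is omitted here.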

Received: 10 Dec 2024 – Discussion started: 08 Jan 2025

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Status: final response (author comments only)

RC1:
'Comment on essd-2024-586', Anonymous Referee #1, 23 Jan 2025
I am very familiar with the WOFOST model and the dataset used by the author. It is not a good simulation project, not only because the simulation accuracy did not meet industry standards, but also because the author withheld many critical details and settings of the WOFOST in the manuscript, which makes it difficult for me to assess the rationality and scientific validity of the simulation. Earth System Science Data, as the name suggests, focuses on the application of datasets, but the author's professionalism in describing and processing the dataset is not good. Moreover, the description of GRU is severely inadequate. After reading the entire manuscript, I still do not understand the role of the GRU used by the author in this study.
The study spanned from 2019 to 2023, but the sampling data was only from 2022 and 2023 (Fig.1). The author should explain this issue in the text.

The soil data should be described in more detail, for example, which soil parameters were used in this study.

The author used statistical data from 1980 to 2022, but the study's time scale is from 2019 to 2023. This is confusing for the readers. Please provide an explanation.

The technology roadmap needs improvement. 1) The author mentioned agro-management data in Figure 2, but it is not mentioned in Section 2.2 Data collections. 2) The sampling data mentioned in the Data collections section is not reflected in the figures, nor are the meteorological data from the National Meteorological Information Center. 3) The method of combining remote sensing data and model output through GRU is described too simplistically. 4) The author allocates a large proportion of the figures to how WOFOST conducts simulations, but this is not the focus of this study. The focus of this study should be on how to couple models and remote sensing for yield estimation, just as the author introduces in the research objective: "Designing a hybrid model coupling crop growth model and deep learning model for soybean yield estimation." The technology roadmap should display the research focus in more detail.

Which sub-model of PCSE did the author use? LINTUL3 or Wofost72_PP?

As far as I know, VAP is not included in the ERA5 dataset. How did the author obtain the VAP data?

Line 209-215: The description of the calculation process for soil parameters is too simplistic; a detailed calculation process should be provided. For example, which parameters from the Chinese soil database were used in the study, and what theories/formulas were utilized to calculate the SMW, SMFCF, SM0, and K0 required by the WOFOST model? Is Table 2 a lookup table? Where did it come from?

Line 215: The description of Table 3 is redundant. It suffices to directly list the values and sources of the WOFOST crop parameters. Table 4 should list all crop parameters in WOFOST, not just the main crop parameters.

Line 235: What’s the setting of the fertilizer application rate and timing in the WOFOST?

Line 244: After reading Section 3.2 Development of the Gated Recurrent Unit model (GRU), I am still unclear about the role of GRU in this study. The author's explanation of the principles of GRU is unclear. It does not directly describe how GRU combines the output of the WOFOST model with remote sensing data, as shown in the technical roadmap. Figure 3 lacks self-explanatory power, leaving it unclear what exactly the inputs and outputs of the GRU are.

Why is MODIS data mentioned again in Line 315? MODIS data was not mentioned in the data collection section.

As shown in Figures 6, 7, and A2, the model simulation accuracy is below industry standards.

By the way, at Line 240 ("3.1.2 Multi-scenarios crop simulations"), the author said: "The four different types of model parameters were arranged and combined to generate various simulation scenarios". Where can I read the scenario settings and the results of this part in the manuscript?
Citation: https://doi.org/10.5194/essd-2024-586-RC1
- AC1: 'Reply on RC1', Jingyuan Xu, 01 Mar 2025
  
  We would like to sincerely thank you for your valuable comments and insightful suggestions. The feedback has greatly helped improve the quality of our manuscript. We have carefully addressed each comment and have provided detailed responses in the attached file.
  
  Citation: https://doi.org/10.5194/essd-2024-586-AC1
- AC2: 'Reply on RC1', Jingyuan Xu, 25 Apr 2025
  
  Dear Referee #1,
  
  Thank you again for your great efforts on our manuscript. We recently received comments from other reviewers and have updated our responses accordingly. Please refer to the supplement for our updated responses.
  
  Sincerely yours,
  
  Jingyuan Xu, on behalf of the co-authors
  
  Citation: https://doi.org/10.5194/essd-2024-586-AC2
RC2:
'Comment on essd-2024-586', Anonymous Referee #2, 17 Mar 2025

The overall structure of the article is clear and logically organized. The research demonstrates innovation by integrating crop growth models with deep learning algorithms for soybean yield estimation, representing a promising direction in agricultural remote sensing. The research objectives are well-defined, aiming to address existing limitations in soybean yield data (insufficient spatial resolution and reliance on ground observations), thereby supporting optimized soybean production distribution and agricultural decision-making.
Specific Comments:
1 Introduction: The section comprehensively highlights soybean's global food security significance and limitations of current yield estimation methods, establishing a solid research rationale. However, the comparative discussion of data-driven and knowledge-driven methods could be more concise to better emphasize core issues and proposed solutions. Additionally, enhancing explanations of environmental factors' mechanisms (e.g., how climatic conditions affect growth cycles and photosynthesis, or how soil properties constrain nutrient uptake and water retention) would provide a more systematic understanding of key yield determinants and their interactions.
2 Data Collection: The dataset (field measurements, meteorological/soil data, satellite imagery, crop distribution maps, and statistics) is comprehensive and representative. However, data processing steps (e.g., meteorological data interpolation, satellite image preprocessing) require more detailed technical descriptions to improve reproducibility. Furthermore, explicit clarification is needed regarding spatial alignment and scale conversion methods employed for integrating multi-resolution datasets.
3 Results: Results are effectively visualized through figures/tables demonstrating WOFOST model simulations, multi-scale estimation accuracy, and spatial yield patterns. The analysis appropriately discusses model accuracy, stability, and spatiotemporal pattern recognition capabilities. However, deeper interpretation of anomalies (e.g., regional/yearly estimation errors) is needed. Notably, the systematic overestimation in field-scale validation suggests potential model biases (e.g., systematic errors or overfitting), warranting further investigation.
4 Discussion: When discussing MODIS-Sentinel-2 complementarity, quantitative comparisons of their performance under varying conditions (weather/vegetation coverage) would strengthen data selection guidance. Future research directions could be expanded by aligning with emerging trends (e.g., integration with IoT/blockchain technologies, precision agriculture applications), thereby enhancing both theoretical depth and practical relevance for agricultural challenges.

Citation: https://doi.org/10.5194/essd-2024-586-RC2
- AC3: 'Reply on RC2', Jingyuan Xu, 25 Apr 2025
  
  Dear Referee #2,
  Thank you very much for your great efforts on our manuscript. Inspired by your valuable comments, we have made a major revision to our manuscript. Please refer to the supplement for our point-by-point responses to your comments.
  Sincerely yours,
  Jingyuan Xu, on behalf of the co-authors
  
  Citation: https://doi.org/10.5194/essd-2024-586-AC3
RC3:
'Comment on essd-2024-586', Anonymous Referee #3, 23 Mar 2025

This study presents a well-structured and logically organized framework for high-resolution soybean yield estimation. The combination of process-based modeling with deep learning offers a novel perspective for enhancing agricultural monitoring capabilities. The objectives are clearly articulated, with a strong focus on improving soybean yield data accuracy to support agricultural decision-making and production optimization. The methodological approach is rigorous, leveraging diverse production scenarios to train the GRU model and applying time-series Sentinel-2 data for large-scale yield estimation. The evaluation using in-situ measurements and government statistical data provides strong validation, and the reported accuracy metrics indicate reliable model performance across spatial and temporal scales. There are some suggestions as follows, which can be considered for further improvement of the manuscript.
The research is well-founded and presents significant innovations. However, the abstract and introduction sections could benefit from more professional and polished language to enhance readability and better highlight the study’s contributions. Refining the writing style would improve clarity, strengthen the articulation of the research objectives, and more effectively emphasize the novelty of the proposed hybrid framework.
Figure 1: where is the soybean classification map from? What is the accuracy?
Figure 5 appears blurry, which affects the clarity and readability of the presented data. I suggest organizing box plots and histograms as subfigures.
The discussion on the advancements of the proposed method is embedded within the “Limitations and future developments” section. To better highlight the strengths of this study, I recommend extracting this content into a standalone subsection. This would allow for a clearer and more structured presentation of the method’s advantages, making it easier for readers to appreciate its contributions in comparison to existing approaches.
The conclusion effectively summarizes the study but could be further refined to better highlight the innovation in dataset construction and its practical applications in agricultural management.

Citation: https://doi.org/10.5194/essd-2024-586-RC3
- AC4: 'Reply on RC3', Jingyuan Xu, 25 Apr 2025
  
  Dear Referee #3,
  
  Thank you very much for your great efforts on our manuscript. Inspired by your valuable comments, we have made a major revision to our manuscript. Please refer to the supplement for our point-by-point responses to your comments.
  
  Sincerely yours,
  
  Jingyuan Xu, on behalf of the co-authors
  
  Citation: https://doi.org/10.5194/essd-2024-586-AC4
EC1:
'Comment on essd-2024-586', Peng Zhu, 12 Jul 2025

The authors developed a deep learning model using a GRU architecture to predict crop yield, utilizing only two predictors: LAImean1 and LAImean2. Given the simplicity of these two predictors, it raises questions about how they can achieve high prediction accuracy. The authors should provide a more detailed explanation of the underlying reasons or mechanisms that enable such effective performance with just these two variables.

Citation: https://doi.org/10.5194/essd-2024-586-EC1
- AC5: 'Reply on EC1', Jingyuan Xu, 21 Aug 2025
  
  Dear Editor,
  Thank you very much for your great efforts on our manuscript. Inspired by your valuable comments, we have made a major revision to our manuscript. Please refer to the supplement for our response to your comments.
  Sincerely yours,
  Jingyuan Xu, on behalf of the co-authors
  
  Citation: https://doi.org/10.5194/essd-2024-586-AC5

Data sets

NortheastChinaSoybeanYield20m: an annual soybean yield dataset at 20 m in Northeast China from 2019 to 2023 Jingyuan Xu, Xin Du, Taifeng Dong, Qiangzi Li, Yuan Zhang, Hongyan Wang, Jing Xiao, Jiashu Zhang, Yunqi Shen, and Yong Dong https://doi.org/10.5281/zenodo.14263103

Viewed

Total article views: 1,702 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	BibTeX	EndNote
1,186	470	46	1,702	42	83

Views and downloads (calculated since 08 Jan 2025)

Month	HTML	PDF	XML	Total
Jan 2025	147	17	5	169
Feb 2025	68	13	0	81
Mar 2025	112	25	3	140
Apr 2025	88	43	5	136
May 2025	51	23	2	76
Jun 2025	46	16	6	68
Jul 2025	63	19	7	89
Aug 2025	86	29	4	119
Sep 2025	288	11	1	300
Oct 2025	38	14	0	52
Nov 2025	54	47	4	105
Dec 2025	33	76	3	112
Jan 2026	73	66	4	143
Feb 2026	38	61	2	101
Mar 2026	1	10	0	11

Viewed (geographical distribution)

Total article views: 1,673 (including HTML, PDF, and XML) Thereof 1,673 with geography defined and 0 with unknown origin.

Latest update: 03 Mar 2026

Short summary

This study produced a 20 m soybean yield dataset for Northeast China (NortheastChinaSoybeanYield20m) from 2019 to 2023 using a hybrid framework coupling a crop growth model with a deep learning algorithm. Stable results were achieved across years. The overall root mean squared error of the dataset was 287.44 kg ha^-1 at the field scale and 272.36 kg ha^-1 at the regional scale. The study addresses the urgent demand for precise crop yield information.
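For readers who want to reproduce this kind of accuracy figure against their own reference yields, the sketch below computes the root mean squared error quoted above, together with the mean relative error (MRE) that the abstract reports at municipal and provincial scale. It assumes the common definition of MRE as the mean of |estimated - observed| / observed, expressed in percent; the page does not give the authors' exact formula, and the sample yield values are hypothetical.

```python
import numpy as np


def rmse(observed, estimated):
    """Root mean squared error, in the same units as the inputs."""
    observed = np.asarray(observed, dtype=float)
    estimated = np.asarray(estimated, dtype=float)
    return float(np.sqrt(np.mean((estimated - observed) ** 2)))


def mre(observed, estimated):
    """Mean relative error in percent (assumed definition, see lead-in)."""
    observed = np.asarray(observed, dtype=float)
    estimated = np.asarray(estimated, dtype=float)
    return float(np.mean(np.abs(estimated - observed) / observed) * 100.0)


# Hypothetical yields in kg ha^-1 (not values from the dataset).
reference = [2000.0, 2500.0]
estimate = [2100.0, 2400.0]
print(rmse(reference, estimate), mre(reference, estimate))
```

RMSE keeps the units of yield and penalises large errors more strongly, while MRE is unitless and therefore easier to compare across regions with different yield levels.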

