GlobalWheatYield4km: a global wheat yield dataset at 4-km resolution during 1982–2020 based on deep learning approach

Luo, Yuchuan; Zhang, Zhao; Cao, Juan; Zhang, Liangliang; Zhang, Jing; Han, Jichong; Zhuang, Huimin; Cheng, Fei; Xu, Jialu; Tao, Fulu

doi:https://doi.org/10.5194/essd-2022-423

13 Dec 2022

Status: this discussion paper is a preprint. It has been under review for the journal Earth System Science Data (ESSD). The manuscript was not accepted for further review after discussion.

Abstract. Accurate and spatially explicit information on global crop yield is paramount for guiding policy-making and ensuring food security. However, most public datasets are at coarse resolution in both space and time. Here, we used data-driven models to develop a 4-km dataset of global wheat yield (GlobalWheatYield4km) from 1982 to 2020. First, we proposed a phenology-based approach to map the spatial distributions of spring and winter wheat. Then we determined the optimal grid-scale yield estimation model by comparing the performance of two data-driven models (i.e., Random Forest (RF) and Long Short-Term Memory (LSTM)), with publicly available data (i.e., satellite and climatic data from the Google Earth Engine (GEE) platform, soil properties, and subnational-level census data covering ~11,000 political units). The results showed that GlobalWheatYield4km captured 82 % of yield variations with an RMSE of 619.8 kg/ha across all subnational regions and years. In addition, our dataset had a higher accuracy (R² ~ 0.71) than the Spatial Production Allocation Model (SPAM) (R² ~ 0.49) across all subnational regions and three years. The GlobalWheatYield4km dataset might play an important role in modelling crop systems and assessing climate impacts over larger areas (DOI of the referenced dataset: https://doi.org/10.6084/m9.figshare.10025006; Luo et al., 2022b).
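The accuracy figures quoted in the abstract (share of yield variation captured, RMSE in kg/ha) and the relative RMSE (rRMSE) raised in the discussion can be illustrated with a minimal sketch. This is not the authors' code. It assumes that the "captured variation" is the coefficient of determination computed as 1 - SS_res/SS_tot and that rRMSE is RMSE divided by the mean observed yield; the preprint page does not state either definition. The yield values below are made-up numbers used only for illustration.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error, in the units of the inputs (e.g. kg/ha)."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def relative_rmse(observed, predicted):
    """RMSE as a percentage of the mean observed value (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    return 100.0 * rmse(observed, predicted) / mean_obs


# Hypothetical census yields and model estimates for four regions (kg/ha).
observed = [2000.0, 3000.0, 4000.0, 5000.0]
predicted = [2200.0, 2900.0, 4300.0, 4800.0]

print(f"RMSE  = {rmse(observed, predicted):.1f} kg/ha")
print(f"R^2   = {r_squared(observed, predicted):.3f}")
print(f"rRMSE = {relative_rmse(observed, predicted):.1f} %")
```

Because RMSE carries the units of yield while R² and rRMSE are dimensionless, the three numbers answer different questions, which is why the referees below ask which reference data and which aggregation (regions, years) each figure was computed against.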

Received: 05 Dec 2022 – Discussion started: 13 Dec 2022

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Download & links

Preprint (PDF, 3396 KB)

Supplement (1402 KB)

Status: closed

RC1: 'Comment on essd-2022-423', Anonymous Referee #1, 07 Mar 2023
This paper proposed using ML and DL methods to generate a global wheat yield dataset at 4-km resolution during 1982-2020. The generated dataset has a range of potential applications in the agricultural sciences. However, several sections of the paper lack clarity and may require further elaboration. It is recommended that the authors address these points before proceeding to the next stage.
Major comments
The paper incorporates a range of input factors, such as NDVI, LAI, climate, and soil properties, to estimate wheat yield. However, it is unclear which factors hold greater importance than others. To address this, I suggest that the authors consider conducting a feature importance analysis to provide readers with a better understanding of the relative significance of these input factors.

I am curious as to why the FAO annual country-level wheat yields were not utilized for training or validation in the study. It would be helpful if the authors could provide an explanation for this decision.

Minor comments
Abstract - “The results showed that GlobalWheatYield4km captured 82% of yield variations with RMSE of 619.8 kg/ha.” It is unclear which data's yield variations the authors refer to. SPAM data?
Introduction - While the authors briefly mention the use of a phenology-based method to derive the global spatial distribution of wheat at the end of the Introduction section, it may be beneficial to introduce this method earlier in the paper. It would be helpful to explain the advantages and strengths of using a phenology-based approach over other methods.
Fig 1 - Add data source.
Section 2.2.3 - Please fix the typo in LINE 102 “maximum temperature (Tmin), minimum temperatures (Tmax)”.
Section 2.3 - please add a description of the Global Wheat Production Mapping System (GWPMS), though it was cited as Luo et al. (2022a).
Section 2.3.1 - “we compared the cropland map derived from the GFSAD1KCM with statistics”. It is unclear which statistics are used here.
Fig 4 - Add country name as figure title or legend.
Discussion - (i) “In future studies, we will attempt to map the spatial distribution of wheat using remote sensing images with finer spatial resolutions.” Please explain in more detail which finer-resolution datasets you plan to use. (ii) I am also interested in whether the methodology employed in this study can be extended to other crop species. It may be worthwhile to include a paragraph in the paper that discusses the potential applications or challenges of using similar methods for other crops, providing readers with a broader perspective on the topic.
Citation: https://doi.org/10.5194/essd-2022-423-RC1
RC2: 'Comment on essd-2022-423', Anonymous Referee #2, 31 Mar 2023

This manuscript developed a machine-learning method to estimate wheat production globally based on remote sensing and climate datasets. The topic is important, and the resulting dataset could be useful for global crop monitoring and food safety. However, I found several important shortcomings that make it hard for me to accept the manuscript at this stage. First, the distribution map of wheat is very important for estimating crop production, but this study failed to generate a robust global distribution map of wheat. In particular, the authors used AVHRR and GLASS datasets to map the global distribution of wheat. I do not think the spatial resolution of these two datasets is adequate for this purpose. The spatial resolution of AVHRR is 0.5 degree, which is much coarser than typical crop fields, including wheat fields, so a large proportion of pixels are mixed. The coarse resolution will underestimate planting areas in some regions and overestimate them in others. In addition, it is quite difficult to distinguish spring wheat from other summer crops, and the authors did not provide any figures to demonstrate this capability. Therefore, I do not think the distribution generated by this study identifies the planting area of wheat very well. The validation provided by the authors (Fig. 3) shows large errors in the global map. For example, rRMSE is larger than 30% over almost all regions, and in several regions it reaches 45%. I do not believe this map adequately identifies the real distribution. The authors also did not provide detailed validation results. Can the authors provide the accuracy over both temporal and spatial scales? In addition, what does the subnational scale mean in the manuscript? Is it the province, city, or county level? If most of the units are province- or city-level statistical areas, the current accuracy is very poor.

Second, the authors used machine learning methods to estimate wheat production from 1982 to 2020, a period of 39 years. Over the past decades, changes in crop varieties have substantially increased yield and production. However, this study used only NDVI, climate, and soil data to drive the models, and these three variables cannot effectively capture changes in crop variety. The authors did not report how well the machine learning models reproduce temporal changes in crop yield, and I suspect that they fail to do so. Therefore, the validation does not look too bad, but mostly because of spatial variations.

Third, this study is very rough in its treatment of several important points. Detailed, long-term national planting area and production data are available. I would like to see whether the identified wheat area matches national or global statistical areas. The authors could also compare the long-term trend of the identified areas with statistical trends. The same comparison could be conducted for yield and production at the national scale. However, the authors made no effort to do this, and I am not sure the dataset reported in this manuscript is robust. Moreover, the manuscript did not describe the subnational statistics. Is a subnational unit a province, city, or county? What proportion of subnational regions is included for a given country and globally? Much important information is missing, and I really did not enjoy reading the manuscript.

Fourth, the writing is very poor, and many passages confused me. I am not even sure what period this dataset covers. In the title and in many places, the authors say the dataset covers 1982-2020. However, Line 142 says that this study identified the spatial distribution of wheat only from 2006 to 2014. So how did the authors estimate yield and production for the other years, or did they not estimate them at all? I felt that reading the manuscript was a waste of time because the authors did not carefully introduce the basic information. Another example of unclear writing concerns spatial and temporal transfer. Fig. 2 shows that these two transfers are used in this study based on the optimal model; however, it is really hard to understand what these two transfers mean and how they are conducted. Last but not least, the authors focused only on the results and the dataset and hardly mentioned the uncertainties or implications of this dataset, which makes the manuscript less interesting to read.

Citation: https://doi.org/10.5194/essd-2022-423-RC2
RC3: 'Comment on essd-2022-423', Anonymous Referee #3, 28 Apr 2023
Page 1, Lines 20-21: was the dataset used to train the GlobalWheatYield4km models also used for the comparison here? Why did the R2 for SPAM not change when you updated the method for comparing SPAM with the survey data? (For example, in the previous version you compared SPAM2010 with the survey yield data of 2010, but in the revised version you compared SPAM2010 with the averaged survey yield data from 2009 to 2011, so the R2 should differ somewhat.)

Figure 9(a): the R2 should be larger than 0, so please update the range of the y-axis here.
Citation: https://doi.org/10.5194/essd-2022-423-RC3

Supplement

https://doi.org/10.5194/essd-2022-423-supplement

Data sets

GlobalWheatYield4km: a global wheat yield dataset at 4-km resolution during 1982-2020 based on deep learning approaches. Yuchuan Luo, Zhao Zhang, Juan Cao, Liangliang Zhang, Jing Zhang, Jichong Han, Huimin Zhuang, Fei Cheng, Jialu Xu, and Fulu Tao. https://doi.org/10.6084/m9.figshare.10025006

Viewed

Total article views: 2,330 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	Supplement	BibTeX	EndNote
1,712	516	102	2,330	198	97	117

Views and downloads (calculated since 13 Dec 2022)

Month	HTML	PDF	XML	Total
Dec 2022	198	58	6	262
Jan 2023	60	17	1	78
Feb 2023	56	25	0	81
Mar 2023	59	26	2	87
Apr 2023	75	26	4	105
May 2023	37	12	1	50
Jun 2023	23	10	3	36
Jul 2023	35	12	0	47
Aug 2023	34	14	0	48
Sep 2023	42	11	2	55
Oct 2023	43	20	8	71
Nov 2023	21	3	1	25
Dec 2023	24	18	2	44
Jan 2024	32	14	0	46
Feb 2024	40	10	4	54
Mar 2024	25	13	6	44
Apr 2024	24	4	4	32
May 2024	31	11	9	51
Jun 2024	49	6	4	59
Jul 2024	26	9	9	44
Aug 2024	19	7	11	37
Sep 2024	24	13	3	40
Oct 2024	23	8	0	31
Nov 2024	19	12	0	31
Dec 2024	21	14	0	35
Jan 2025	26	8	2	36
Feb 2025	30	10	0	40
Mar 2025	39	7	4	50
Apr 2025	39	23	2	64
May 2025	32	14	2	48
Jun 2025	44	21	0	65
Jul 2025	54	21	4	79
Aug 2025	61	9	0	70
Sep 2025	322	6	3	331
Oct 2025	25	24	5	54

Viewed (geographical distribution)

Total article views: 2,211 (including HTML, PDF, and XML), of which 2,211 have a defined geography and 0 are of unknown origin.

Cited

Latest update: 29 Oct 2025

Short summary

We generated a 4-km dataset of global wheat yield (GlobalWheatYield4km) from 1982 to 2020 using a deep learning approach. The dataset was highly consistent with observed yields, capturing 82 % of yield variations with an RMSE of 619.8 kg/ha across all subnational regions and years. GlobalWheatYield4km can be applied for many purposes, including large-scale agricultural system modeling and climate change impact assessments.


Cited: 1 citation as recorded by Crossref.