the Creative Commons Attribution 4.0 License.
GlobalWheatYield4km: a global wheat yield dataset at 4-km resolution during 1982–2020 based on deep learning approach
Abstract. Accurate and spatially explicit information on global crop yield is paramount for guiding policy-making and ensuring food security. However, most public datasets are at coarse resolution in both space and time. Here, we used data-driven models to develop a 4-km dataset of global wheat yield (GlobalWheatYield4km) from 1982 to 2020. First, we proposed a phenology-based approach to map spatial distributions of spring and winter wheat. Then we determined the optimal grid-scale yield estimation model by comparing the performance of two data-driven models (i.e., Random Forest (RF) and Long Short-Term Memory (LSTM)), with publicly available data (i.e., satellite and climatic data from the Google Earth Engine (GEE) platform, soil properties, and subnational-level census data covering ~11000 political units). The results showed that GlobalWheatYield4km captured 82 % of yield variations with RMSE of 619.8 kg/ha across all subnational regions and years. In addition, our dataset had a higher accuracy (R2 ~0.71) as compared with Spatial Production Allocation Model (SPAM) (R2 ~ 0.49) across all subnational regions and three years. The GlobalWheatYield4km dataset might play important roles in modelling crop system and assessing climate impact over larger areas (DOI of the referenced dataset: https://doi.org/10.6084/m9.figshare.10025006; Luo et al., 2022b).
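As a minimal sketch of how the headline accuracy figures in the abstract (share of yield variation captured, RMSE in kg/ha) could be computed, the following assumes R2 is the coefficient of determination; the abstract does not say whether a squared correlation was used instead. The function names and the sample values are illustrative assumptions, not the authors' code or data.

```python
import math

def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def rmse(observed, predicted):
    """Root-mean-square error, in the units of the inputs (here kg/ha)."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def relative_rmse(observed, predicted):
    """RMSE divided by the mean observed value (dimensionless)."""
    return rmse(observed, predicted) / (sum(observed) / len(observed))

# Hypothetical census yields and gridded estimates aggregated to the
# same subnational units (kg/ha); these numbers are made up.
observed = [2000.0, 3000.0, 4000.0, 5000.0]
predicted = [2200.0, 2900.0, 4100.0, 4800.0]

print(round(r_squared(observed, predicted), 3))
print(round(rmse(observed, predicted), 1))
print(round(relative_rmse(observed, predicted), 4))
```

The relative RMSE is included because the referee comments on this page discuss rRMSE for the wheat distribution map.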
Status: closed
RC1: 'Comment on essd-2022-423', Anonymous Referee #1, 07 Mar 2023
This paper proposes using ML and DL methods to generate a global wheat yield dataset at 4-km resolution for 1982-2020. The generated dataset has a range of potential applications in the agricultural sciences. However, several sections of the paper lack clarity and may require further elaboration. I recommend that the authors address these points before proceeding to the next stage.
Major comments
- The paper incorporates a range of input factors, such as NDVI, LAI, climate, and soil properties, to estimate wheat yield. However, it is unclear which factors hold greater importance than others. To address this, I suggest that the authors consider conducting a feature importance analysis to provide readers with a better understanding of the relative significance of these input factors.
- I am curious as to why the FAO annual country-level wheat yields were not utilized for training or validation in the study. It would be helpful if the authors could provide an explanation for this decision.
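One common way to carry out the feature importance analysis suggested in the first major comment is permutation importance: shuffle one input column at a time and record how much the model error grows. The sketch below is illustrative only. The stand-in model, the feature names (NDVI, Tmax, clay) and the synthetic data are assumptions, and it does not reproduce the RF or LSTM models of the manuscript.

```python
import random

def mse(model, rows, targets):
    """Mean squared error of model predictions against targets."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

def permutation_importance(model, rows, targets, seed=0):
    """Increase in MSE after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = mse(model, rows, targets)
    importances = []
    for j in range(len(rows[0])):
        column = [r[j] for r in rows]
        rng.shuffle(column)  # break the link between feature j and the target
        shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        importances.append(mse(model, shuffled, targets) - baseline)
    return importances

def toy_model(row):
    """Hypothetical stand-in for a trained yield model; it ignores clay."""
    ndvi, tmax, clay = row
    return 5000.0 * ndvi - 20.0 * tmax

# Synthetic inputs: NDVI (unitless), Tmax (deg C), clay content (%).
data_rng = random.Random(42)
rows = [[data_rng.uniform(0.2, 0.9),
         data_rng.uniform(15.0, 35.0),
         data_rng.uniform(5.0, 40.0)] for _ in range(50)]
targets = [toy_model(r) for r in rows]

importances = permutation_importance(toy_model, rows, targets)
for name, value in zip(["NDVI", "Tmax", "clay"], importances):
    print(name, round(value, 1))
```

A feature the model never uses, such as clay in this toy case, gets an importance of exactly zero, while features that drive the prediction show a clear increase in error.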
Minor comments
Abstract - “The results showed that GlobalWheatYield4km captured 82% of yield variations with RMSE of 619.8 kg/ha.” It is unclear which data's yield variations the authors refer to. SPAM data?
Introduction - While the authors briefly mention the use of a phenology-based method to derive the global spatial distribution of wheat at the end of the Introduction section, it may be beneficial to introduce this method earlier in the paper. It would be helpful to explain the advantages and strengths of using a phenology-based approach over other methods.
Fig 1 - Add data source.
Section 2.2.3 - Please fix the typo in LINE 102 “maximum temperature (Tmin), minimum temperatures (Tmax)”.
Section 2.3 - Please add a description of the Global Wheat Production Mapping System (GWPMS), even though it is cited as Luo et al. (2022a).
Section 2.3.1 - “we compared the cropland map derived from the GFSAD1KCM with statistics”. It is unclear which statistics are used here.
Fig 4 - Add country name as figure title or legend.
Discussion - (i) “In future studies, we will attempt to map the spatial distribution of wheat using remote sensing images with finer spatial resolutions.” Please explain which finer-resolution datasets you plan to use. (ii) I am also interested in whether the methodology employed in this study can be extended to other crop species. It may be worthwhile to include a paragraph that discusses the potential applications or challenges of using similar methods for other crops, giving readers a broader perspective on the topic.
Citation: https://doi.org/10.5194/essd-2022-423-RC1
RC2: 'Comment on essd-2022-423', Anonymous Referee #2, 31 Mar 2023
This manuscript develops a machine-learning method to estimate global wheat production from remote sensing and climate datasets. The topic is important, and the resulting dataset would be useful for global crop monitoring and food security. However, I found several important shortcomings that make it hard for me to accept the manuscript in its current form.
First, the distribution map of wheat is very important for estimating crop production, but this study failed to generate a robust global distribution map of wheat. In particular, the authors used AVHRR and GLASS datasets to map the global distribution of wheat, and I do not think the spatial resolution of these two datasets is adequate for this purpose. The spatial resolution of AVHRR is 0.5 degree, which is much coarser than typical cropland and wheat fields, so a large proportion of pixels are mixed. The coarse resolution will underestimate planting areas in some regions and overestimate them in others. In addition, it is quite difficult to distinguish spring wheat from other summer crops, and the authors did not provide any figures to demonstrate this capability. Therefore, I do not think the distribution generated by this study identifies the planting area of wheat very well. The validation the authors provided (Fig. 3) shows large errors in the global map: rRMSE is larger than 30% in almost all regions and reaches 45% in several. I do not believe this map adequately identifies the real distribution. The authors also did not provide detailed validation results. Can the authors provide the accuracy at both temporal and spatial scales? In addition, what does the subnational scale mean in the manuscript? Is it the province, city, or county level? If most units are province- or city-level statistical areas, the current accuracy is very poor.
Second, the authors used machine learning methods to estimate wheat production from 1982 to 2020, a period of 39 years. Over the past decades, changes in crop varieties have substantially increased yield and production. However, this study used only NDVI, climate, and soil data to drive the models, and these three variables cannot effectively capture changes in crop variety. The authors did not report how well the machine learning models reproduce temporal changes in crop yield, and I suspect that they fail to do so. The validation therefore does not look too bad, but mostly because of spatial variation.
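The concern in this paragraph, that a pooled accuracy score can be dominated by spatial differences between regions, can be checked by recomputing R2 after removing each region's own multi-year mean from both the observed and the predicted yields. The sketch below is a hedged illustration of that check. The record layout, the function names, and the yield values are all hypothetical and do not come from the manuscript.

```python
from collections import defaultdict

def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def within_region_anomalies(records):
    """Subtract each region's temporal mean from its observed and predicted yields."""
    by_region = defaultdict(list)
    for region, obs, pred in records:
        by_region[region].append((obs, pred))
    obs_anom, pred_anom = [], []
    for pairs in by_region.values():
        mean_obs = sum(o for o, _ in pairs) / len(pairs)
        mean_pred = sum(p for _, p in pairs) / len(pairs)
        obs_anom += [o - mean_obs for o, _ in pairs]
        pred_anom += [p - mean_pred for _, p in pairs]
    return obs_anom, pred_anom

# Hypothetical (region, observed, predicted) yields in kg/ha over three years.
# The predictions get each region's level right but flatten its trend.
records = [
    ("low_yield", 2000.0, 2090.0), ("low_yield", 2100.0, 2100.0), ("low_yield", 2200.0, 2110.0),
    ("high_yield", 5000.0, 5090.0), ("high_yield", 5100.0, 5100.0), ("high_yield", 5200.0, 5110.0),
]

pooled_r2 = r_squared([o for _, o, _ in records], [p for _, _, p in records])
temporal_r2 = r_squared(*within_region_anomalies(records))
print(round(pooled_r2, 3), round(temporal_r2, 3))
```

In this toy case the pooled score is close to 1 while the within-region score is low, which is the pattern the referee suspects; whether it holds for the actual dataset can only be established with the authors' data.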
Third, the study is very rough in its treatment of several important points. Detailed, long-term national planting area and production data are available. I would like to see whether the identified wheat area matches national or global statistical areas. The authors could also compare the long-term trend of the identified areas with statistical trends. The same comparison could be conducted for yield and production at the national scale. However, the authors made no effort to do this, and I am not sure the dataset reported in this manuscript is robust. Moreover, the manuscript did not describe the subnational statistics. Are the subnational units provinces, cities, or counties? What proportion of subnational regions has been included for a given country and globally? Much important information is missing, and I did not enjoy reading the manuscript.
Fourth, the writing is very poor, and many passages confused me. I am not even sure what period this dataset covers. In the title and in many places, the authors say the dataset covers 1982-2020. However, Line 142 says this study identified the spatial distribution of wheat only from 2006 to 2014. So how did the authors estimate yield and production for the other years, or did they not estimate them at all? Reading the manuscript felt like a waste of time because the authors did not carefully introduce the basic information. Another example of unclear writing concerns spatial and temporal transfer. Fig. 2 shows that these two transfers are used in this study based on the optimal model, but it is really hard to understand what they mean and how they are conducted. Last but not least, the authors focused only on the results and the dataset and said little about the uncertainties or implications of this dataset, which makes the manuscript less interesting to read.
Citation: https://doi.org/10.5194/essd-2022-423-RC2
RC3: 'Comment on essd-2022-423', Anonymous Referee #3, 28 Apr 2023
- Page 1, Lines 20-21: was the dataset used to train the GlobalWheatYield4km models also used for the comparison here? Why did the R2 for SPAM not change when you updated the method for comparing SPAM with the survey data? For example, in the previous version you compared SPAM2010 with the survey yield data of 2010, but in the revised version you compared SPAM2010 with the survey yield data averaged over 2009 to 2011, so the R2 should differ.
- Figure 9(a): the R2 should be larger than 0, so please update the range of the y-axis.
Citation: https://doi.org/10.5194/essd-2022-423-RC3
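The three-year averaging mentioned in the first point of RC3 (survey yields from 2009 to 2011 compared against SPAM2010) could be sketched as follows. This is an illustration under stated assumptions: the dictionary layout, the function name `window_mean_yield`, and the sample values are hypothetical, and how the authors handled regions with missing years is not stated on this page.

```python
def window_mean_yield(survey, center_year, half_width=1):
    """Average survey yields per region over center_year +/- half_width.

    survey maps (region, year) -> yield in kg/ha. Regions are averaged over
    whichever years inside the window are available (an assumption).
    """
    years = range(center_year - half_width, center_year + half_width + 1)
    values_by_region = {}
    for (region, year), value in survey.items():
        if year in years:
            values_by_region.setdefault(region, []).append(value)
    return {region: sum(vals) / len(vals) for region, vals in values_by_region.items()}

# Hypothetical survey records; region "B" has only one year inside the window.
survey = {
    ("A", 2009): 3000.0, ("A", 2010): 3300.0, ("A", 2011): 3600.0,
    ("B", 2010): 2500.0, ("B", 2012): 9999.0,
}

averaged = window_mean_yield(survey, center_year=2010)
print(averaged)
```

Because the averaged reference differs from the single-year reference whenever yields vary between years, an accuracy score computed against it would generally be expected to change, which is the point the referee raises.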
Data sets
GlobalWheatYield4km: a global wheat yield dataset at 4-km resolution during 1982-2020 based on deep learning approaches
Yuchuan Luo, Zhao Zhang, Juan Cao, Liangliang Zhang, Jing Zhang, Jichong Han, Huimin Zhuang, Fei Cheng, Jialu Xu, and Fulu Tao
https://doi.org/10.6084/m9.figshare.10025006