Tracking spatiotemporal dynamics of crop-specific areas through machine learning and statistics disaggregating

Li, Xiyu; Yu, Le; Du, Zhenrong; Liu, Xiaoxuan

doi:https://doi.org/10.5194/essd-2024-233

Preprints


18 Jul 2024


Status: this discussion paper is a preprint. It has been under review for the journal Earth System Science Data (ESSD). The manuscript was not accepted for further review after discussion.


Abstract. Mapping the spatiotemporal dynamics of crop-specific areas is of great significance in addressing the challenges faced by agricultural systems. However, comparable multi-year crop maps have not yet been developed for most regions of the globe. In this study, we developed a framework for updating annual crop-specific area maps at 10 km resolution based on crop statistics disaggregation, multi-source data integration, and machine learning, taking into account factors related to crop distribution in different regions and complex agricultural systems. Experiments were conducted in three study areas (Africa, China, and the USA), corresponding respectively to three levels of information coverage of crop distribution (low, medium, and high). In our framework, we collected relevant spatial indicators used in previous studies and trained random forest regression models on them to predict the spatiotemporal dynamics of crop-specific areas. Annual crop statistics were then disaggregated based on the resulting probabilistic layers and harmonized under multiple constraints. Our framework attempts to integrate two strategies (top-down and bottom-up), creating more possibilities for crop mapping to combine statistics with remote sensing. Our results include maps of crop-specific areas covering 42 crop types from 1961–2022 in Africa, 14 crop types from 1980–2022 in China, and 15 crop types from 2008–2022 in the USA. Results show that our products have relatively good consistency with independent reference maps or statistics. Our products provide approximate estimates of the spatiotemporal dynamics of crop-specific areas in multiple regions over several decades, which could be used as a data basis for food security and environmental impact assessments.
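The abstract describes disaggregating annual crop statistics using a probabilistic layer under constraints. The following is only a minimal sketch of that general idea, not the authors' implementation: it assumes a simple proportional allocation in which a regional total is spread over grid cells according to a probability weight and capped by a per-cell capacity (for example, cropland area). The function name `disaggregate` and the parameters `prob`, `capacity`, `max_iter`, and `tol` are hypothetical.

```python
import numpy as np

def disaggregate(total_area, prob, capacity, max_iter=100, tol=1e-9):
    """Spread a regional crop-area total over grid cells.

    Assumption (not from the paper): allocation is proportional to `prob`
    and no cell may receive more than its `capacity`. Area that does not
    fit in a full cell is redistributed among cells that still have room.
    """
    prob = np.asarray(prob, dtype=float)
    capacity = np.asarray(capacity, dtype=float)
    alloc = np.zeros_like(prob)
    # The total cannot exceed what all cells together can hold.
    remaining = min(float(total_area), float(capacity.sum()))
    for _ in range(max_iter):
        room = capacity - alloc
        # Only cells with free capacity keep their weight.
        weights = np.where(room > tol, prob, 0.0)
        if remaining <= tol or weights.sum() <= 0:
            break
        share = remaining * weights / weights.sum()
        step = np.minimum(share, room)  # enforce the per-cell cap
        alloc += step
        remaining -= step.sum()
    return alloc

# Three cells; the first one can hold at most 4 units of area.
cells = disaggregate(10.0, prob=[0.5, 0.3, 0.2], capacity=[4.0, 10.0, 10.0])
print(cells, cells.sum())
```

In this toy case the first cell is filled to its cap and the overflow is shared between the other two cells in proportion to their weights, so the allocated areas still sum to the regional statistic. A real workflow would add further constraints, such as the harmonization across crops mentioned in the abstract.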

Received: 14 Jun 2024 – Discussion started: 18 Jul 2024

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Download & links

Preprint (PDF, 90979 KB)

Supplement (16759 KB)



Status: closed

RC1:
'Comment on essd-2024-233', Qiangyi Yu, 13 Aug 2024
Dear Authors,
Mapping crop distribution is challenging, and it becomes even more difficult to obtain crop distribution maps at multiple time stages across a larger geographical scale. The hard work devoted to this effort should be highly acknowledged. However, after reading the manuscript, I have a few major concerns that merit discussion.
ESSD is a data journal that aims to publish original and high-quality datasets. It is primarily focused on the advancement of data, although innovation in methodology is also very important. Phrases such as “experiments were conducted in three study areas” seem to emphasize the methodological aspect while overshadowing the quality of the datasets. This may lead to doubts from users who seek these datasets for analysis.

Since ESSD emphasizes the reuse of high-quality data for earth system science, it is suggested to present datasets in a more coherent manner, particularly regarding spatial and temporal coverage. The currently submitted datasets appear to be more interest-oriented rather than comprehensive.

When publishing global continent- (or country-) specific datasets, it is advisable to involve researchers from those geographical areas in the data production process. This could significantly enhance the reliability, quality, and applicability of the data.

The introduction is lengthy but lacks clear logic. While you list the efforts made by various researchers in the crop area mapping community, you do not provide a review of the strengths and limitations of these works, especially concerning the methods used. For instance, even though most datasets do not offer extended temporal coverage, Jackson et al. (2019) did. What is the difference in performance between your current method and theirs? Is it possible to apply their method to update the map to 2022?

Machine learning has been applied, making the choice of basemaps critically important. Have you tested the different model performances when using various basemaps for training?

In the workflow, Part 1 (i.e., using machine learning to produce the probability map) appears objective, while Part 2 seems more subjective. It would be beneficial to briefly explain why this process is effective.

It appears that you did not conduct data validation; instead, you performed a few comparative analyses. Although this is an alternative approach, it suggests that there are several existing datasets, which diminishes the importance of your data.

The issues related to cropland area changes, multiple cropping, and climate change should be emphasized more prominently.

I did not find Backer-Reshef et al. (2023) in the reference list. Although the overall writing is good, more attention should be paid to further improve the manuscript.
Citation: https://doi.org/10.5194/essd-2024-233-RC1
RC2:
'Comment on essd-2024-233', Anonymous Referee #2, 24 Oct 2024
The authors aim to develop a framework for updating annual crop-specific area estimates and provide three case examples with varying levels of data availability. However, the proposed approach and resulting cropland products fall short of achieving the authors’ goal, as detailed below:
Major concerns:
Innovation in methodology: Several vague descriptions in the methodology are confusing and hinder my understanding of the overall approach. Please refer to the specific comments below. From my perspective, the methodology lacks significant innovation, diminishing this study’s uniqueness and value. Addressing the temporal dynamics of spatial crop-specific area distribution would be highly valuable, which is also the focus of this study. Bridging the indicators and available land use maps to predict distribution for years without data is a potentially promising means. However, the predictive power weakens as the temporal distance from the available map grows. Advances in agricultural systems such as new varieties and improved management can further erode the reliability of such predictions. Instead of extrapolation, a more conservative strategy would be to interpolate the maps between two available years. As such, the 1960-2022 maps for Africa are less reliable because they are based on only one year of information. More importantly, this method seems to assume a fixed crop distribution. The model learns from areas with existing base maps and predicts the probability using the indicators for other years. This restricts possible distributions to the same locations as the base maps, and again, extrapolation may introduce great bias. The data sources for the USA and China are at very fine resolution; however, this approach produced coarser-resolution maps due to the limitation in indicators. The model also performed unsatisfactorily for a significant number of crop types, further reducing the overall value of this approach.

Input data quality: SPAM2017 includes updates for the south Sahara. Additionally, a recently published database provides subnational crop-specific areas for SSA (https://datadryad.org/stash/dataset/doi:10.5061/dryad.vq83bk42w). USDA-quickstat offers county-level crop-specific areas, and China also provides county-level crop areas. Why not use these finer resolution datasets? Constraining the model with higher-resolution administrative data would significantly improve the accuracy of estimated distributions. Furthermore, SPAM has released the SPAM2020 dataset, which covers 46 crops, and this could be used as a more current base map.

Specific comments:
Paper structure: The introduction reviews the development of global cropland maps, but the transition to the cases of Africa, China, and the USA feels abrupt. I recommend adding some background information for a smoother transition. The USA is frequently mentioned along with Africa and China, but only Africa and China are analyzed in the results section. Clarify whether the USA is used solely for validation, or if results for the USA should be presented as well. Given that the study proposes a framework to develop annual crop-specific area maps, particularly for cropland distribution, the performance of the framework in terms of spatial distribution has not been thoroughly explored. The 1:1 comparison provides a high-level overview, but more detailed spatial comparisons and discussions of key patterns would add depth. For instance, the changes in the sugarcane intensity in central China—why did the southeast section of the spot disappear and then reappear? What indicators drove these patterns?
Lines 115 and 117: The link for the US is the home page of quickstat; please specify which categories were chosen. The link for China is a dead link.
Table 2: While more detailed information is available in the supplementary material, some critical information should be presented in the main paper for clarity. I recommend including the base map’s time period for China in Table 2.
Line 135: The description of how the base map for crops and regions not covered by China was reconstructed is unclear. Since SPAM2010 only provides data for a single year, is the reconstructed base map a time series, or is it limited to 2010?
Section 2.3: The role of cropland extent in allocating crop-specific areas is unclear. Does it constrain the distribution of each crop at a broader scale, or is it a time series reference for each crop? Since cropland extent includes all crops and may cover a much larger area than any individual crop, how does it interact with the base map? If there is any discrepancy between the cropland extent and SPAM, how is this handled?
Section 3: The process of allocating crop area using the Random Forest model is not well-explained. After the relationship between indicators and the base maps is learned, does the predicted distribution only appear in areas where the crop-specific area already exists in the base maps? It seems unlikely that the model allows crop areas to expand into the full cropland extent, or else every crop could be distributed everywhere. If the assumption is that crop-specific areas are fixed by the base map, this reinforces my earlier concern regarding the limitations of this approach.
Citation: https://doi.org/10.5194/essd-2024-233-RC2
RC3:
'Comment on essd-2024-233', Anonymous Referee #3, 30 Oct 2024
Long-term crop-specific area information is critical for both the agricultural and climate change communities. This paper integrates multiple data sources and employs a machine learning algorithm, alongside a statistical disaggregation method, to generate spatiotemporal dynamics of crop areas. However, the manuscript requires significant improvements and more detailed clarification in several key areas: strengthening validation using reliable and independent data sources, extending the dataset to a global scale or focusing on one continent (e.g., Africa), addressing the biased assumption that crop area is solely determined by biophysical and climate variables, and clarifying the necessity of mapping for the U.S.
Major concerns
The proposed method assumes that crop area or cultivation practices are only influenced by climate and biophysical variables such as climate, agro-systems, terrain, soil, suitability, and potential productivity. However, socio-economic factors like crop prices, market information, and agricultural policies also significantly affect cropping practices. This assumption introduces bias and does not reflect the real-world scenario.

As this is a data description paper, validation is crucial for users to assess the dataset's reliability. Although the paper conducts several cross-comparison experiments, the validation is insufficient in terms of data quality assessment utilizing independent, external data sources.

The selection of study areas—Africa, China, and the U.S.—seems inconsistent and may distract readers. While these regions are indeed significant in agricultural research, combining them in one paper without a clear rationale may confuse the focus. Additionally, it’s unclear if there is a need to generate low-resolution annual maps for the U.S., given that the Cropland Data Layer (CDL) already provides annual updates for more than 50 crop types at 30 m spatial resolution.

The process for selecting the base year (or, more specifically, how the most appropriate year for model calibration is chosen) is not clearly explained. The selection of a different base year will affect the model calibration process and, in consequence, the qualification of spatial indicators and probability layers.

While the paper discusses model uncertainty, it would benefit from a more in-depth discussion of the dataset's limitations or uncertainties from the perspective of data users. This would provide a clearer understanding of potential challenges and constraints when using the dataset.
Citation: https://doi.org/10.5194/essd-2024-233-RC3



Supplement

https://doi.org/10.5194/essd-2024-233-supplement

Data sets

Tracking spatiotemporal dynamics of crop-specific areas through machine learning and statistics disaggregating, Xiyu Li, Le Yu, Zhenrong Du, and Xiaoxuan Liu, https://doi.org/10.6084/m9.figshare.26028769


Viewed

Total article views: 1,578 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	Supplement	BibTeX	EndNote
964	175	439	1,578	119	46	61


Views and downloads (calculated since 18 Jul 2024)

Month	HTML	PDF	XML	Total
Jul 2024	131	33	11	175
Aug 2024	94	23	11	128
Sep 2024	50	9	2	61
Oct 2024	48	12	3	63
Nov 2024	44	4	0	48
Dec 2024	35	3	0	38
Jan 2025	43	2	1	46
Feb 2025	32	3	70	105
Mar 2025	17	5	112	134
Apr 2025	13	4	89	106
May 2025	14	4	94	112
Jun 2025	31	18	40	89
Jul 2025	49	36	3	88
Aug 2025	49	3	0	52
Sep 2025	301	10	3	314
Oct 2025	13	6	0	19


Viewed (geographical distribution)

Total article views: 1,525 (including HTML, PDF, and XML) Thereof 1,525 with geography defined and 0 with unknown origin.


Latest update: 19 Oct 2025


Short summary

We developed a new method to update detailed maps showing where different crops are grown over time, focusing on Africa, China, and the USA. Using various data sources and machine learning, we produced accurate maps at a 10 km resolution covering up to 42 crop types from 1961 to 2022. Our work bridges statistical data and satellite imagery, helping researchers and policymakers to address global agricultural challenges in food security and environmental impacts.

