GRAIN – A Global Registry of Agricultural Irrigation Networks
Abstract. Despite supporting roughly 40 % of the world’s agricultural output, irrigation canal networks remain a critical data gap in global geospatial archives. Currently, there is no consistent geospatial database that documents the extent of surface-water irrigation canals. The GRAIN (Global Registry of Agricultural Irrigation Networks) dataset fills this gap by leveraging the potential of volunteered geographic information (VGI) from OpenStreetMap (OSM) and applying a machine learning based classification pipeline to distinguish canals from rivers and streams. A Random Forest classifier was trained on 20,000 samples of quality-controlled canal and river data using 5 engineered geometric and topographical features. The model achieved over 98 % training accuracy, translating to a ~93.6 % median recall on independent validation datasets for primary canals, with a mean positional offset of ~98 m. The GRAIN dataset includes land cover maps and OSM tags to assign canal use cases, identifying over 3.8 million km of agricultural irrigation canals in 95 countries. There is marked regional concentration of agricultural canals with hotspots identified in Europe, South and Southeast Asia, and North America. Agricultural canal distribution also varied widely by climatic zones with over 65 % in temperate and cold zones and approximately 22 % in arid regions. A canal density analysis normalized by cropland area highlighted smaller countries such as Finland, the Netherlands, New Zealand, and Egypt having the densest irrigation canals. While the global correlation between canal density and national cereal yields was found to be modest (r = 0.31), the GRAIN dataset suggests that the presence of well-developed surface-water infrastructure may positively influence agricultural productivity as seen in the Netherlands and New Zealand. GRAIN is now publicly available at https://doi.org/10.5281/zenodo.16786488 (Suresh and Hossain, 2025) under a CC-BY-4.0 licence. 
Designed as a community-driven resource, GRAIN data bridges a long-standing gap, and opens new possibilities for evaluating irrigation efficiency, supporting climate adaptation, guiding infrastructure investments, and extending the value of new satellite remote sensing missions on surface water such as the Surface Water and Ocean Topography (SWOT).
General Comments
This manuscript presents GRAIN, the first global-scale, open-access vector dataset of irrigation canal networks, developed by integrating OpenStreetMap volunteered geographic information (VGI) with a machine learning classification workflow. The study is technically rigorous, well-structured, and makes a significant contribution to global water and agricultural data infrastructure.
The methodological design, particularly the feature engineering and validation framework, is carefully implemented and clearly described. However, a few methodological clarifications would strengthen the paper:
The authors should address scale dependency in geometric features (straightness ratio, curvature, mean turning angle), since path segmentation and vertex density can strongly affect these metrics. Validation could include precision and F1-score, in addition to recall, to better characterize model reliability. The interpretation of canal density versus yield should acknowledge potential confounding factors. Some minor details on resampling, path length normalization, and feature computation should be made explicit for full reproducibility.
Specific Comments
Abstract: Consider including the minimum mapping unit or spatial resolution of the GRAIN dataset to provide immediate context for readers.
L90–105: It would be helpful to specify how GRAIN differentiates agricultural irrigation canals from other canal types (urban, navigation, drainage) early on, as this distinction defines the dataset’s uniqueness.
Table 1: The data sources are clearly summarized. Consider adding a separate column for spatial resolution and, where applicable, temporal resolution or reference year (e.g., ESA CCI 300 m, FAO 5′ grid, SRTM 30 m), instead of combining data type and spatial resolution in a single column.
L146–153: Clarify whether OSM features were filtered for tagging quality or contributor density, as OSM completeness varies greatly by region.
L225–260: The geometric and topographic feature design is excellent. However, straightness ratio (SR), mean turning angle, and curvature index are inherently scale-dependent — their values can vary with path length or sampling density.
Please clarify the typical or average polyline length (Dpath) used in training and validation.
If canal and river polylines vary greatly in length (e.g., 100 m vs 10 km), SR values can be biased since shorter paths naturally appear straighter even in meandering systems.
Reporting the distribution (range, mean, median) of Dpath would help assess whether SR differences reflect true geometry or feature segmentation.
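To illustrate the concern, here is a minimal sketch on a synthetic sinusoidal channel. The amplitude, wavelength, and sampling step are assumptions chosen for illustration and are not taken from the manuscript; SR is taken here as endpoint distance divided by path length, which may differ in detail from the authors' definition.

```python
import math

def straightness_ratio(points):
    """Endpoint-to-endpoint distance divided by along-path length (1.0 = straight)."""
    path = sum(math.dist(a, b) for a, b in zip(points, points[1:]))
    return math.dist(points[0], points[-1]) / path if path > 0 else 1.0

def meander(length_m, step_m=5.0, amplitude_m=200.0, wavelength_m=1000.0):
    """Synthetic sinusoidal channel sampled every step_m along x (hypothetical geometry)."""
    n = int(length_m / step_m)
    return [(i * step_m, amplitude_m * math.sin(2 * math.pi * i * step_m / wavelength_m))
            for i in range(n + 1)]

river = meander(10_000)      # 10 km reach covering ten full meander wavelengths
short = river[:21]           # the first 100 m (in x) of the same reach

sr_full = straightness_ratio(river)
sr_short = straightness_ratio(short)
print(f"SR, 10 km reach: {sr_full:.3f}")
print(f"SR, 100 m piece: {sr_short:.3f}")
```

The 100 m piece comes out close to SR = 1 even though it is cut from a strongly meandering reach, so a classifier trained on short OSM ways could mistake segmentation for straightness.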
L230–250: Since both mean turning angle and curvature index depend on vertex spacing, please confirm whether all polylines were resampled to a uniform vertex interval before computing these metrics. Suggested insertion: “All polylines were resampled at uniform vertex spacing before computing geometric metrics to ensure that sinuosity differences reflect genuine geometry rather than digitization density or line length.”
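A minimal sketch of why this matters, assuming a simple linear-interpolation resampler and a mean absolute heading change as the turning-angle metric (both are illustrative stand-ins, not the authors' implementation). The same quarter-circle bend is digitized with 10 and with 100 vertices, then both versions are resampled at a hypothetical 50 m spacing measured along the path.

```python
import math

def resample(points, spacing):
    """Place points every `spacing` units along the polyline.
    The trailing remainder shorter than `spacing` is dropped."""
    out = [points[0]]
    carry = 0.0  # path distance travelled since the last emitted point
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        seg = math.hypot(x1 - x0, y1 - y0)
        if seg == 0:
            continue
        d = spacing - carry
        while d <= seg:
            t = d / seg
            out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
            d += spacing
        carry = seg - (d - spacing)
    return out

def mean_turning_angle(points):
    """Mean absolute change in heading (degrees) at interior vertices."""
    angles = []
    for a, b, c in zip(points, points[1:], points[2:]):
        h1 = math.atan2(b[1] - a[1], b[0] - a[0])
        h2 = math.atan2(c[1] - b[1], c[0] - b[0])
        turn = abs(h2 - h1)
        angles.append(math.degrees(min(turn, 2 * math.pi - turn)))
    return sum(angles) / len(angles) if angles else 0.0

def arc(n_vertices, radius=1000.0):
    """Quarter circle of the given radius digitized with n_vertices points."""
    step = (math.pi / 2) / (n_vertices - 1)
    return [(radius * math.cos(i * step), radius * math.sin(i * step))
            for i in range(n_vertices)]

coarse, dense = arc(10), arc(100)
raw_coarse, raw_dense = mean_turning_angle(coarse), mean_turning_angle(dense)
rs_coarse = mean_turning_angle(resample(coarse, 50.0))
rs_dense = mean_turning_angle(resample(dense, 50.0))
print(f"raw:       coarse {raw_coarse:.2f} deg, dense {raw_dense:.2f} deg")
print(f"resampled: coarse {rs_coarse:.2f} deg, dense {rs_dense:.2f} deg")
```

Before resampling, the two digitizations of the identical bend differ by roughly an order of magnitude in mean turning angle; after resampling they are close. If the authors did not resample, the metric partly encodes mapper digitizing habits.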
L265–275: Consider moving feature importance rankings (e.g., straightness ratio > slope > elevation) to the main text instead of the supplementary dataset. State explicitly whether hyperparameters (number of trees, max depth, etc.) were tuned or left at their defaults. A comparison against alternative classifiers, together with the rationale for selecting Random Forest, would also strengthen this section.
L295–310: Please clarify how mixed land-cover pixels (e.g., cropland + urban) within the 1 km buffer were handled, for example around large cities such as Guangzhou and Shanghai. Could seasonal cropland variability affect classification? If so, consider acknowledging this as a source of uncertainty. When propagating connectivity from cropland-linked canals, specify whether a distance threshold was applied to avoid overextension into non-irrigated zones.
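As one possible way to bound the propagation step, a distance-capped network traversal could be used. The sketch below is only an illustration of the idea: the graph layout, node names, segment lengths, and the 1500 m cutoff are all hypothetical, and nothing here is meant to describe the authors' actual procedure.

```python
import heapq

def propagate_with_cutoff(graph, seeds, max_dist_m):
    """Return {node: network distance} for nodes reachable from any
    cropland-linked seed within max_dist_m (Dijkstra with a distance cap)."""
    best = {s: 0.0 for s in seeds}
    heap = [(0.0, s) for s in seeds]
    heapq.heapify(heap)
    while heap:
        dist, node = heapq.heappop(heap)
        if dist > best.get(node, float("inf")):
            continue  # stale heap entry
        for nbr, length_m in graph.get(node, []):
            nd = dist + length_m
            if nd <= max_dist_m and nd < best.get(nbr, float("inf")):
                best[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return best

# Hypothetical canal chain: A is cropland-linked, D lies far downstream.
graph = {
    "A": [("B", 500.0)],
    "B": [("A", 500.0), ("C", 800.0)],
    "C": [("B", 800.0), ("D", 2000.0)],
    "D": [("C", 2000.0)],
}
reached = propagate_with_cutoff(graph, seeds={"A"}, max_dist_m=1500.0)
print(sorted(reached.items()))
```

With the cap in place, segment D is not labeled agricultural merely because it is connected to A; without a cap, the label would spread along the entire connected network.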
L325–345: The use of recall and mean offset distance (MOD) is appropriate. However, for a full picture of model reliability, please report precision and F1-score as well.
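For concreteness, the three metrics can be derived from the same length-based matching the authors already perform. The sketch below uses invented lengths purely for illustration; they are not results from the manuscript.

```python
def precision_recall_f1(tp_km, fp_km, fn_km):
    """Length-weighted precision, recall, and F1.
    tp_km: reference canal length matched by predicted canals
    fp_km: predicted canal length with no reference canal (e.g., rivers)
    fn_km: reference canal length missed by the prediction"""
    precision = tp_km / (tp_km + fp_km) if tp_km + fp_km else 0.0
    recall = tp_km / (tp_km + fn_km) if tp_km + fn_km else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Invented example: recall is high, yet a quarter of predicted length is wrong.
precision, recall, f1 = precision_recall_f1(tp_km=936.0, fp_km=300.0, fn_km=64.0)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

The point of the example is that a high recall says nothing about how much river or stream length is mislabeled as canal; only precision (and hence F1) exposes that.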
L340–345: Provide sample sizes (km of validation canals) for each region (e.g., Nile Delta, Indira Gandhi Canal) and note the proportion of primary vs. secondary canals.
L345–350: Consider adding at least one tropical validation region (e.g., Southeast Asia) to test robustness under different geomorphological and land-cover contexts.
L355–385: Figures 7–9 effectively illustrate global canal distribution. Consider adding a continent-level summary table (total canal length and density) to complement global maps.
L385–395: The correlation between canal density and cereal yield (r = 0.31) is interesting but weak. Please emphasize that this is indicative, not causal, and may be influenced by other variables (e.g., irrigation efficiency, water management, input use). A partial correlation or multivariate regression would strengthen this analysis.
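A first-order partial correlation is one simple option. The sketch below uses a tiny synthetic dataset in which a hypothetical confounder (labeled fertilizer input here) drives both variables; the variable names and values are assumptions for illustration only and are not drawn from the manuscript.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Standardized toy values: both series equal the confounder plus independent noise.
fertilizer_input = [-1.0, -1.0, 1.0, 1.0]
canal_density = [0.0, -2.0, 2.0, 0.0]
cereal_yield = [0.0, -2.0, 0.0, 2.0]

r_raw = pearson(canal_density, cereal_yield)
r_partial = partial_corr(canal_density, cereal_yield, fertilizer_input)
print(f"raw r = {r_raw:.2f}, partial r = {r_partial:.2f}")
```

In this constructed case the raw correlation of 0.5 disappears entirely once the confounder is controlled for, which is the scenario the authors should rule out before linking canal density to yield.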
L410–430: The discussion nicely links GRAIN to SWOT and remote-sensing applications. Consider adding a brief quantitative uncertainty estimate (e.g., expected completeness by region or OSM coverage density).
L470–475: To enhance the reproducibility of the dataset generation process, it would be helpful if the authors could reorganize the code repository so that data acquisition is more automated and user-friendly. Specifically, the authors are encouraged to either include the necessary reference data (or sample subsets) within the repository’s assets folder, or update the code to automatically request and download required datasets from their original sources. In addition, please update the README file to provide clear, step-by-step guidance for users to reproduce the workflow, from data retrieval to model execution and dataset generation.