01 Feb 2023
Status: this preprint is currently under review for the journal ESSD.

CLIM4OMICS: a geospatially comprehensive climate and multi-OMICS database for Maize phenotype predictability in the U.S. and Canada

Parisa Sarzaeim, Francisco Munoz-Arriola, Diego Jarquin, Hasnat Aslam, and Natalia De Leon Gatti

Abstract. The performance of numerical, statistical, and data-driven diagnostic and predictive crop production models relies heavily on the quality of the data used for input and for calibration and validation. This study presents a comprehensive database, and the analytics used to consolidate it, as a homogeneous, consistent, and multi-dimensional genotypic, phenotypic, and environmental database for maize phenotype modeling, diagnostics, and prediction. The data are obtained from the Genomes to Fields (G2F) initiative, which provides multi-year genomic (G), environmental (E), and phenotypic (P) datasets that can be used to train and test crop growth models and to understand the genotype-by-environment (GxE) interaction. A particular advantage of the G2F database is its diverse set of maize genotype DNA sequences (G2F-G), phenotypic measurements (G2F-P), station-based environmental time series (mainly climatic observations) collected during the maize growing season (G2F-E), and metadata for each field trial (G2F-M) across the U.S. and the province of Ontario in Canada. The construction of this comprehensive climate and genomic database incorporates analytics for data quality control (QC) and consistency control (CC) to consolidate the digital representation of geospatially distributed environmental and genomic data required for phenotype predictive analytics and for modeling the GxE interaction. The two-phase QC-CC pre-processing algorithm also includes a module to estimate environmental uncertainties. In general, the data pipeline collects raw files, checks their formats, corrects data structures, and identifies and cures or imputes missing data. It uses machine learning techniques to fill gaps in the environmental time series and quantifies the uncertainty introduced by using other data sources for gap imputation in G2F-E, discards missing values in G2F-P, and removes rare variants in G2F-G. Finally, an integrated and enhanced multi-dimensional database is generated. The analytics for improving the G2F database and the improved database, called "CLIM4OMICS", follow the FAIR principles, and all the digital resources are available at Sarzaeim et al. (2023).
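The three cleaning steps named in the abstract (discarding missing values in G2F-P, removing rare variants in G2F-G, and filling gaps in G2F-E) can be illustrated with a minimal Python sketch. This is not the CLIM4OMICS code. The field name `yield`, the 0.05 minor allele frequency threshold, and all function names are hypothetical, and linear interpolation is used here only as a simple stand-in for the machine learning gap filling that the abstract describes.

```python
def drop_missing_phenotypes(records, field="yield"):
    """Keep only phenotype records whose measured value is present.

    `field` is a hypothetical column name, not taken from the G2F files.
    """
    return [r for r in records if r.get(field) is not None]


def filter_rare_variants(genotypes, maf_threshold=0.05):
    """Keep markers whose minor allele frequency (MAF) meets the threshold.

    `genotypes` maps a marker name to a list of allele dosages (0, 1, or 2),
    one per maize line. The 0.05 default is an assumed, illustrative value.
    """
    kept = {}
    for marker, dosages in genotypes.items():
        freq = sum(dosages) / (2 * len(dosages))  # frequency of the counted allele
        maf = min(freq, 1 - freq)
        if maf >= maf_threshold:
            kept[marker] = dosages
    return kept


def fill_gaps_linear(series):
    """Fill None gaps in a time series by linear interpolation.

    Gaps at the edges copy the nearest observed value. This is a stand-in
    for the machine learning imputation mentioned in the abstract.
    """
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    if not known:
        return filled  # nothing observed, nothing to interpolate from
    for i, value in enumerate(series):
        if value is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = series[right]
        elif right is None:
            filled[i] = series[left]
        else:
            weight = (i - left) / (right - left)
            filled[i] = series[left] + weight * (series[right] - series[left])
    return filled
```

For example, `fill_gaps_linear([10.0, None, None, 16.0])` fills the two missing steps with evenly spaced values between 10.0 and 16.0, and `filter_rare_variants` drops a marker on which every line carries the same allele. Quantifying the uncertainty introduced by imputation, which the abstract also mentions, is outside the scope of this sketch.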

Parisa Sarzaeim et al.

Status: open (until 05 May 2023)

Comment types: AC – author | RC – referee | CC – community | EC – editor | CEC – chief editor
  • RC1: 'Comment on essd-2023-11', Anonymous Referee #1, 29 Mar 2023

Data sets

CLImate for Maize OMICS: CLIM4OMICS Analytics and Database. Parisa Sarzaeim, Hasnat Aslam, and Francisco Munoz-Arriola

Model code and software

CLImate for Maize OMICS: CLIM4OMICS Analytics and Database. Parisa Sarzaeim, Hasnat Aslam, and Francisco Munoz-Arriola

Short summary
A genomic, phenotypic, and climate database for maize phenotype predictability in the U.S. and Canada is introduced. The database encompasses data from 2014–2017 and an algorithm for quality and consistency control of the input data. Earth system modelers and breeders can use CLIM4OMICS because it interconnects the climate and biological system sciences. CLIM4OMICS is designed to foster phenotype predictability.