the Creative Commons Attribution 4.0 License.
CLIM4OMICS: a geospatially comprehensive climate and multi-OMICS database for Maize phenotype predictability in the U.S. and Canada
Parisa Sarzaeim
Diego Jarquin
Hasnat Aslam
Natalia Leon Gatti
Abstract. The performance of numerical, statistical, and data-driven diagnostic and predictive crop production modeling relies heavily on the quality of the data used for input and for calibration/validation. This study presents a comprehensive database and the analytics used to consolidate it as a homogeneous, consistent, and multi-dimensional genotypic, phenotypic, and environmental database for maize phenotype modeling, diagnostics, and prediction. The data are obtained from the Genomes to Fields (G2F) initiative, which provides multi-year genomic (G), environmental (E), and phenotypic (P) datasets that can be used to train and test crop growth models and to understand genotype-by-environment (GxE) interactions. A particular advantage of the G2F database is its diverse set of maize genotype DNA sequences (G2F-G), phenotypic measurements (G2F-P), station-based environmental time series (mainly climatic observations) collected during the maize growing season (G2F-E), and metadata for each field trial (G2F-M) across the U.S. and the province of Ontario in Canada. The construction of this comprehensive climate and genomic database incorporates analytics for data quality control (QC) and consistency control (CC) to consolidate the digital representation of the geospatially distributed environmental and genomic data required for phenotype predictive analytics and for modeling the GxE interaction. The two-phase QC-CC pre-processing algorithm also includes a module to estimate environmental uncertainties. In general, this data pipeline collects raw files, checks their formats, corrects data structures, and identifies and cures or imputes missing data. The pipeline uses machine learning techniques to fill gaps in the environmental time series and quantifies the uncertainty introduced by using other data sources for gap imputation in G2F-E, discards the missing values in G2F-P, and removes rare variants in G2F-G.
Finally, an integrated and enhanced multi-dimensional database is generated. The analytics for improving the G2F database and the improved database, called "CLIM4OMICS", follow the FAIR principles, and all the digital resources are available at http://doi.org/10.5281/zenodo.7490246 (Sarzaeim et al., 2023).
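The three cleaning steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the 0/1/2 genotype coding, the 0.05 minor-allele-frequency threshold, and the toy values are all assumptions made for illustration. In particular, the gap filling here simply substitutes values from a second data source and reports the mean absolute difference on overlapping days as a crude uncertainty proxy, whereas the abstract describes a machine-learning approach.

```python
from statistics import mean


def drop_missing_phenotypes(records, trait="yield"):
    """Discard phenotype records whose trait value is missing (None)."""
    return [r for r in records if r.get(trait) is not None]


def minor_allele_frequency(genotypes):
    """MAF of one marker coded as 0/1/2 copies of the alternate allele."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq)


def remove_rare_variants(markers, maf_threshold=0.05):
    """Keep markers whose MAF reaches the threshold (0.05 is an assumed value)."""
    return {
        name: calls
        for name, calls in markers.items()
        if minor_allele_frequency(calls) >= maf_threshold
    }


def fill_gaps_from_reference(station, reference):
    """Fill gaps (None) in a station series from a reference series.

    Returns the filled series and the mean absolute difference over the
    time steps where both sources report a value. A gap stays None when
    the reference is also missing at that step.
    """
    overlap = [
        abs(s - r)
        for s, r in zip(station, reference)
        if s is not None and r is not None
    ]
    uncertainty = mean(overlap) if overlap else None
    filled = [r if s is None else s for s, r in zip(station, reference)]
    return filled, uncertainty


# Toy inputs (hypothetical values, not taken from G2F).
phenotypes = [{"hybrid": "A", "yield": 10.2}, {"hybrid": "B", "yield": None}]
markers = {"snp1": [0, 1, 2, 1], "snp2": [0, 0, 0, 0]}
station_temp = [21.0, None, 23.0]
reference_temp = [20.5, 22.0, 23.5]

clean_phenotypes = drop_missing_phenotypes(phenotypes)
common_markers = remove_rare_variants(markers)
filled_temp, temp_uncertainty = fill_gaps_from_reference(station_temp, reference_temp)
```

In this toy run the record with the missing yield is dropped, the monomorphic marker `snp2` is removed, and the single temperature gap is taken from the reference series.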
Parisa Sarzaeim et al.
Status: open (until 05 May 2023)
RC1: 'Comment on essd-2023-11', Anonymous Referee #1, 29 Mar 2023
This study performed quality control (QC) and consistency control (CC) to identify and cure missing data in an existing dataset from the Genomes to Fields (G2F) initiative. The manuscript provides detailed information on the procedures for applying QC-CC to pre-process four sub-datasets: genomic (G), environmental (E), phenotypic (P), and metadata for each field trial (M). The program developed in this study can be useful for controlling input data quality in GxE model implementation. However, it is not clear to me that this is a significant enough advancement in data, science, methods, or outcomes to warrant publication in this journal.
Major concerns:
- The manuscript reads more like a manual than a scientific paper. There is plenty of information on how to apply QC step by step; however, there are few results from the developed data. Even in Section 4. Results and discussion, I can only find some examples of how the dataset looks in a table. Although it is geospatial data, there are no maps presenting the spatial pattern of data values. I cannot evaluate the data without figures presenting the spatial and temporal changes in the data.
- The introduction mainly provides the background of QC and CC methods, but says much less about maize phenotypes. What is the importance of developing maize data? What progress does this dataset make compared with other datasets?
- Quality control is the key point in this study, which means the major effort is improving data quality by deleting or curing missing data in the original G2F dataset. No new dataset is developed in this study. The QC and CC used in this study are also commonly used methods, and I don’t think there is any improvement in the method.
- Using screenshots as figures (Figs 3-6) is not a good way to introduce data in a scientific paper. I suggest using a table that explains only the header names in each sub-dataset.
- “Maize phenotype predictability” is mentioned in the title, but I cannot find any work related to predictability. Building relationships between maize phenotypes and other environmental and genomic factors may improve this study.
Minor:
- Section 2. How do the different dimensions of the data connect? Which header is the key field in the database?
- Line 211. Figs.
- Caption in Figure 4. Repeated “shows”.
- 3. Uncertainty. What about the errors in other sub-datasets, aside from climate?
- Table 1. What is the meaning of the “DEH1” under the “location” field? Do you think it is a good way to provide location information?
Citation: https://doi.org/10.5194/essd-2023-11-RC1
Data sets
CLImate for Maize OMICS: CLIM4OMICS Analytics and Database. Parisa Sarzaeim, Hasnat Aslam, Francisco Munoz-Arriola. http://doi.org/10.5281/zenodo.7490246
Model code and software
CLImate for Maize OMICS: CLIM4OMICS Analytics and Database. Parisa Sarzaeim, Hasnat Aslam, Francisco Munoz-Arriola. http://doi.org/10.5281/zenodo.7490246
Viewed
HTML | PDF | XML | Total | BibTeX | EndNote |
---|---|---|---|---|---|
235 | 39 | 4 | 278 | 4 | 5 |