the Creative Commons Attribution 4.0 License.
AIGD-PFT: The first AI-driven Global Daily gap-free 4 km Phytoplankton Functional Type products from 1998 to 2023
Abstract. Long time series of spatiotemporally continuous phytoplankton functional type (PFT) products are essential for understanding marine ecosystems, global biogeochemical cycles, and effective marine management. In this study, by integrating artificial intelligence (AI) technology with multi-source marine big data, we have developed a Spatial–Temporal–Ecological Ensemble model based on Deep Learning (STEE-DL), and then generated the first AI-driven Global Daily gap-free 4 km PFTs product from 1998 to 2023 (AIGD-PFT), significantly enhancing the accuracy and spatiotemporal coverage of quantifying eight major PFTs (i.e., Diatoms, Dinoflagellates, Haptophytes, Pelagophytes, Cryptophytes, Green Algae, Prokaryotes, and Prochlorococcus). The input data encompass physical oceanographic, biogeochemical, spatiotemporal information, and ocean color data (OC-CCI v6.0) that have been gap-filled using a Discrete Cosine Transform with a Penalized Least Square (DCT-PLS) approach. The STEE-DL model utilizes an ensemble strategy with 100 ResNet models, applying Monte Carlo and bootstrapping methods to estimate optimal PFT values and assess model uncertainty through ensemble means and standard deviations. The model's performance was validated using multiple cross-validation strategies—random, spatial-block, and temporal-block—combined with in-situ data, demonstrating STEE-DL's robustness and generalization capability. The daily updates and seamless nature of the AIGD-PFT product capture the complex dynamics of coastal regions effectively. Finally, through a comparative analysis using a triple-collocation (TC) approach, the competitive advantages of the AIGD-PFT product over existing products were validated. 
The AIGD-PFT product not only provides the foundation for detailed analyses of PFT trends, interannual variability, and the impacts of climate change on phytoplankton composition across various temporal and spatial scales, but also has the potential to facilitate precise quantification of marine carbon flux and to enhance the accuracy of biogeochemical models. A video demonstration is available at https://doi.org/10.5446/67366 (Zhang and Shen, 2024a). The complete product dataset (1998–2023) can be freely downloaded at https://doi.org/10.11888/RemoteSen.tpdc.301164 (Zhang and Shen, 2024b).
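The ensemble strategy described in the abstract (many models trained on bootstrap resamples, with the ensemble mean as the estimate and the ensemble standard deviation as the uncertainty) can be sketched as follows. This is a minimal illustration only: a least-squares line stands in for each ResNet member, and the synthetic data, `fit_line`, `bootstrap_ensemble`, and `predict_with_uncertainty` are hypothetical names and values, not part of STEE-DL.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares fit y = a*x + b; a simple stand-in for one ensemble member."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0.0:  # degenerate resample: fall back to a constant predictor
        return 0.0, my
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def bootstrap_ensemble(xs, ys, n_members=100, seed=0):
    """Train each member on a bootstrap resample (sampling with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    members = []
    for _ in range(n_members):
        idx = [rng.randrange(n) for _ in range(n)]
        members.append(fit_line([xs[i] for i in idx], [ys[i] for i in idx]))
    return members


def predict_with_uncertainty(members, x):
    """Ensemble mean as the estimate, ensemble standard deviation as uncertainty."""
    preds = [a * x + b for a, b in members]
    return statistics.fmean(preds), statistics.stdev(preds)


# Synthetic training data (assumption): y = 2x + 1 plus Gaussian noise.
noise = random.Random(42)
xs = [i / 10 for i in range(30)]
ys = [2.0 * x + 1.0 + noise.gauss(0.0, 0.2) for x in xs]

members = bootstrap_ensemble(xs, ys, n_members=100)
mean_pred, std_pred = predict_with_uncertainty(members, 1.5)
print(f"estimate={mean_pred:.3f}, uncertainty={std_pred:.3f}")
```

The spread across members reflects only how sensitive the fit is to the training sample; it does not capture errors shared by all members, such as biases in the input data.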
Status: closed
RC1: 'Comment on essd-2024-122', Anonymous Referee #1, 19 Jun 2024
General comments:
The paper by Zhang et al. presents the first AI-driven product for Phytoplankton Functional Types (PFTs) for the global ocean (AIGD-PFT). The AIGD-PFT consists of an L4 gap-free product including 8 PFTs at daily and 4-km resolution for the period 1998-2023. AIGD-PFT is generated using an extended ensemble modelling approach (STEE-DL), which is based on machine and deep learning technologies and includes 100 models. Each model is built on statistical relationships between the physical environment and the phytoplankton community and incorporates in situ HPLC data, ocean colour satellite observations whose missing data have been reconstructed through a cost-efficient DCT-PLS method, physical data from reanalysis, and biogeochemical inputs from hindcast simulations.
Overall, the study falls within the scope of ESSD, the methods are robust, and the manuscript is well written and detailed. Moreover, I believe that the AIGD-PFT product will be a very useful tool for all scientists interested in detecting climate-induced changes in the phytoplankton community. Therefore, I recommend this paper for publication, although I feel that some clarifications should be addressed to strengthen the way it is presented.
Specific comments:
- The authors present the AIGD-PFT as the product with the longest time span, covering 26 years (i.e., 1998-2023). However, I double-checked the data sets used to create it and found some discrepancies that need to be clarified. In particular, except for the ESA-OC-CCI data set, which covers the whole period, I found that SST data from https://doi.org/10.48670/moi-00169 and biogeochemical variables from https://doi.org/10.48670/moi-00019 are available until October 2022 and December 2022, respectively, while SSS from https://doi.org/10.48670/moi-00016 is available from January 2022 to June 2024. So, I am not sure how the authors created a 26-year product using data sets that do not cover the same period.
- As reported in Sect. 2.2.3, all physical and biogeochemical data have been resampled to a 4 km resolution, and I believe that this was done to match the high spatial resolution of the ESA-OC-CCI product. However, whenever data are resampled to a higher resolution, a greater but false accuracy is introduced, because all new pixels are assumed to have the same value when this may be true for only one pixel. This is why, as far as I know, the remapping direction is typically from high to low resolution. I would therefore ask the authors to discuss this choice and, if possible, include a reference to previous works applying the same strategy. An interesting paper that may help the discussion can be found at https://journals.ametsoc.org/view/journals/apme/60/11/JAMC-D-20-0259.1.xml.
- Page 8, line 151: The sentence needs to be reworded because, as reported in the Product Guide (https://docs.pml.space/share/s/fzNSPb4aQaSDvO7xBNOCIw), the latest ESA-OC-CCI product (v6.0) also merges observations from OLCI-3A and OLCI-3B.
- I found the method used by the authors to fill OC data gaps well described in Sect. 2.2.2. However, I think that specifying the number of available data before and after the filling procedure would be interesting and would emphasize the effort the authors have made. This information could also be presented by replacing Figure 3 with two Hovmöller diagrams showing the number of observations before and after the filling as a function of time and latitude.
- The choice to include the 8 PFTs as listed in the manuscript should be justified. I think that adding reference(s) should be enough to do that.
- The definition of ResNet models (i.e., residual neural networks) is given in Sect. 2.3.1, but I think it should be provided earlier as they are mentioned before Sect. 2.3.1.
- I suggest that the authors go through the manuscript and split some long sentences to make the text more readable. For example, the second sentence in the abstract, which starts on line 2 and ends on line 14, can be split into at least three sentences.
- I found some errors in the reference list (e.g., Zhang and Shen, 2024a,b,c). Please check them carefully against the references as cited in the abstract and main text.
To conclude, I would like to mention that, as stated by the authors, model interpretability is beyond the scope of this manuscript and will be a focus of future work. I look forward to that. So, keep up the good progress!
Citation: https://doi.org/10.5194/essd-2024-122-RC1
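Referee #1's concern about resampling coarse fields to a finer grid can be illustrated with a small sketch. It assumes nearest-neighbour replication, which matches the referee's description ("all new pixels have the same value") but is not confirmed here as the authors' resampling method; the grid values, the factor of 3, and the function name `upsample_nearest` are hypothetical.

```python
def upsample_nearest(grid, factor):
    """Replicate each coarse cell into a factor x factor block of fine cells."""
    fine = []
    for row in grid:
        fine_row = [value for value in row for _ in range(factor)]
        # Repeat the widened row `factor` times, copying so rows stay independent.
        fine.extend(list(fine_row) for _ in range(factor))
    return fine


# Hypothetical 2 x 2 coarse field (e.g. sea surface temperature in deg C).
coarse = [[10.0, 12.0],
          [14.0, 18.0]]
fine = upsample_nearest(coarse, 3)

n_fine_cells = sum(len(row) for row in fine)
distinct_values = {value for row in fine for value in row}
print(n_fine_cells, "fine cells,", len(distinct_values), "distinct values")
```

The finer grid has nine times as many cells but carries exactly the same four values, so the nominal resolution increases while the information content does not. Interpolating schemes smooth the block edges but likewise add no independent observations.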
AC1: 'Reply on RC1', Yuan Zhang, 19 Jul 2024
Dear Reviewer,
Thank you for your positive and constructive comments, which surely encourage us to further enhance our research quality. We carefully revised our manuscript and provided a point-by-point response in the supplement.
Thank you again for your review and valuable comments.
RC2: 'Comment on essd-2024-122', Anonymous Referee #2, 05 Jul 2024
The manuscript and datasets submitted by Zhang et al. propose a thorough scheme, AIGD-PFT, that uses deep learning techniques to seamlessly retrieve the chlorophyll a concentrations of eight phytoplankton functional types (PFTs) on the global scale. The AIGD-PFT is built on an extensive global in situ pigment data set and CMEMS products, including satellite ocean color, physical, and biogeochemical data sets based on model simulations covering the years 1998 to 2023. All CMEMS data were preprocessed to have the same spatial resolution. Before the deep learning ensemble was applied for PFT retrievals, the gap-filling technique DCT-PLS was first applied to all the global CMEMS products to generate seamless data on the global scale. The STEE-DL model was trained and established based on ResNet models using Monte Carlo and bootstrapping methods to finally estimate the PFT chlorophyll a concentrations with a corresponding model uncertainty assessment. The products were intercompared with other PFT data based on different methods and model simulations and showed outstanding performance.
This work thoroughly demonstrates seamless PFT products on the global scale over the last 26 years and shows the high potential of machine learning/deep learning techniques in ocean color applications, here especially for PFT information retrievals. This study delivers the first gap-free global PFT products. I find it significant, and the study is a big step forward for phytoplankton group estimation using multiple products based on big-data deep learning methods. However, I have several comments and suggestions (listed below) that the authors may consider and that will hopefully help to further improve the quality of this work.
Abstract:
‘PFT values’ here indicate PFT chlorophyll a concentrations, correct? This should be clarified at the beginning and kept consistent throughout the whole ms.
L23-25 Have the time series and the impact of climate change been reflected here? If not, such a statement is not appropriate here, although it could be given a more forward-looking tone.
Intro
L43: Put also reference for DPA, Vidussi et al. 2001
L54-55: I think there are a few more references in this regard, e.g. El Hourany et al 2024, Li et al 2023 deep learning for pigments
Sect 2.2.1 Indicate how many data were finally collected from all these sources
L161 DINEOF – I think the original studies should be cited here too.
L 175 Normalisation: the dataset is standardized by dividing by the spatial mean, for each day or all 30 days together?
L186-189: ‘high missing values’ is not the proper term; should it be ‘high missing rates’?
Seems that the authors have cut the data based on latitudes as there is a straight cutoff in the maps?
L195: Remove the ‘.’ or use comma after Table 2.
L198-199: SSS – This CMEMS product contains data from 2019 to 2024 only. I suppose you used the physical analysis hindcast too. Should be both cited.
L205: Resampling from lower to higher resolution might introduce unrealistic filled-in data
Standardisation – is this step conflicting with the normalisation step 2 of the DCT-PLS?
L210-218: any basis/ references for these transformations?
L225 Is the STEE-DL model different from that in Zhang et al. 2023? Why did the authors not use that approach, and instead develop the current STEE-DL? Any advantages?
L232-233: reads strangely. Rephrase the sentence - This setup decreases the dimensionality of features from 19 to 16, and then to 10, before a final fully connected layer maps these features to an output value for predicting the target variable.
L245- put example references for statistical methods
L253: Does this show how the matchups between the in situ data and CMEMS products were extracted? I would indicate the number of the data points too - also later in the stats
L288-L292: merge this paragraph with the one above, or use bullets to describe the three CV procedures more clearly.
L386: not sure if it is appropriate to call them ecological types.
L402-404: High missing rates in high latitudes limit the application there. Can the authors indicate the range of the latitudes for these seamless PFT products?
Fig 12: Though it is demonstrated in the video, maybe yearly mean maps here can better demonstrate the whole global ocean - a daily product cannot cover both polar regions.
Fig 13 and uncertainty: I see all data were log transformed, how were these uncertainties calculated in the original conc.?
L504-505: The model uncertainties in Fig 13 already show large uncertainties for certain PFTs in some regions, such as diatoms and cryptophytes, which have very low chla values (<0.01 mg m-3) in the gyres but uncertainties larger than 0.1 mg m-3, and Prochlorococcus in high latitudes (where it is almost absent), which has very high uncertainty.
L512 Discussion: How easy is it to apply the STEE-DL model to future datasets? I think it might be difficult, as one has to prepare and preprocess all input data and fill the gaps using the DCT-PLS. That might be an obstacle to operational use. The authors should discuss this point too.
Are the authors planning to publish the code of AIGD-PFT in the future, so that others can test it with their own prepared data sets?
Citation: https://doi.org/10.5194/essd-2024-122-RC2
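Referee #2's question about how uncertainties computed on log-transformed data relate to uncertainties in the original concentration units can be made concrete with a short sketch. It is not the authors' procedure: it only contrasts two generic options for an ensemble of log10 predictions, namely back-transforming every member before taking the standard deviation, and a first-order (delta-method) approximation sigma_c ≈ ln(10) · 10^mu · sigma_log. The ensemble values and function names are hypothetical.

```python
import math
import statistics


def linear_stats_from_log10(log_preds):
    """Back-transform each ensemble member, then take mean and std in mg m-3."""
    conc = [10.0 ** z for z in log_preds]
    return statistics.fmean(conc), statistics.stdev(conc)


def delta_method_std(log_preds):
    """First-order propagation of the log10-space std into concentration units."""
    mu = statistics.fmean(log_preds)
    sigma_log = statistics.stdev(log_preds)
    return math.log(10.0) * (10.0 ** mu) * sigma_log


# Hypothetical ensemble of log10(chlorophyll a) predictions for one pixel.
log_preds = [-2.10, -2.00, -1.95, -2.05, -1.90]

mean_conc, std_direct = linear_stats_from_log10(log_preds)
std_delta = delta_method_std(log_preds)
median_like = 10.0 ** statistics.fmean(log_preds)  # back-transformed log mean

print(f"direct std = {std_direct:.5f} mg m-3")
print(f"delta-method std = {std_delta:.5f} mg m-3")
```

For a narrow spread in log space the two estimates agree closely, but they diverge as the spread grows. Because the back-transform is convex, the mean of the back-transformed members also exceeds the back-transformed log mean, so a product description needs to state which convention its uncertainty layer follows.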
AC2: 'Reply on RC2', Yuan Zhang, 19 Jul 2024
Dear Reviewer,
Thank you for your positive and constructive comments, which surely encourage us to further enhance our research quality. We carefully revised our manuscript and provided a point-by-point response in the supplement.
Thank you again for your review and valuable comments.
Data sets
AIGD-PFT: The first AI-driven Global Daily gap-free 4 km Phytoplankton Functional Type products from 1998 to 2023 Yuan Zhang, Fang Shen, Renhu Li, Mengyu Li, Zhaoxin Li, Songyu Chen, and Xuerong Sun https://doi.org/10.11888/RemoteSen.tpdc.301164
Video supplement
AIGD-PFT: The first AI-driven Global Daily gap-free 4 km Phytoplankton Functional Type products from 1998 to 2023 Yuan Zhang, Fang Shen, Renhu Li, Mengyu Li, Zhaoxin Li, Songyu Chen, and Xuerong Sun https://doi.org/10.5446/67366
Video abstract
AIGD-PFT: The first AI-driven Global Daily gap-free 4 km Phytoplankton Functional Type products from 1998 to 2023 Yuan Zhang, Fang Shen, Renhu Li, Mengyu Li, Zhaoxin Li, Songyu Chen, and Xuerong Sun https://doi.org/10.5446/67366
Viewed
HTML | PDF | XML | Total | BibTeX | EndNote |
---|---|---|---|---|---|
804 | 129 | 46 | 979 | 36 | 34 |