This work is distributed under the Creative Commons Attribution 4.0 License.
Benchmark of plankton images classification: emphasizing features extraction over classifier complexity
Abstract. Plankton imaging devices produce vast datasets, the processing of which can be greatly accelerated through machine learning. This is a challenging task due to the diversity of plankton, the prevalence of non-biological classes, and the rarity of many classes. Most existing studies rely on small, unpublished datasets that often lack realism in size, class diversity, and class proportions. We therefore also lack a systematic, realistic benchmark of plankton image classification approaches. To address this gap, we leverage both existing and newly published, large, realistic plankton imaging datasets from widely used instruments. We evaluate several classification approaches: a classical Random Forest classifier applied to handcrafted features, various Convolutional Neural Networks (CNNs), and a combination of both. This work aims to provide reference datasets, baseline results, and insights to guide future work in plankton image classification. Overall, CNNs outperformed the classical approach, but only significantly so for uncommon classes. Larger CNNs, which should provide richer features, did not perform better than small ones, and the features of small ones could even be further compressed without affecting classification performance. Finally, we highlight that the nature of the classifier matters little compared to the content of the features. Our findings suggest that small CNNs are sufficient to extract the information needed to classify small grayscale plankton images. This has consequences for operational classification models, which can afford to be small and fast. It also opens the possibility of further developing the imaging systems to provide larger and richer images.
Competing interests: Emma Amblard was employed by Fotonower. Guillaume Boniface-Chang was employed by Google Research, London. Gabriel Dulac-Arnold was employed by Google Research, Paris. Ben Woodward was employed by CVision AI.
Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.
Status: open (until 06 Oct 2025)
- RC1: 'Comment on essd-2025-309', Kaisa Kraft, 03 Sep 2025
- RC2: 'Comment on essd-2025-309', Jeffrey Ellen, 08 Sep 2025
While I do not think the paper is original, because the concept is simple, it is well overdue and very clearly presented. The data quality is excellent in terms of the diversity of instruments represented, but only good in terms of dataset size (although that is limited by the available public datasets). Most of my comments are based on opinions and setup, not the actual methodology of the manuscript itself.
Line 26 - I am not following the logic of how the finding of small CNNs performing sufficiently well at classification of small grayscale plankton images enables imaging systems to provide larger images. The quality of the images is dependent upon the physics/optics and electrical engineering, not the ML algorithm.
Line 50 - I disagree that the software pipelines have not progressed as fast as the hardware. I think the case made in Malde et al. is that the analysis is not keeping up with either the software or hardware. For example, if a trawled plankton camera classifies ~1 million images in an hour at high accuracy, what is the new analysis that is now possible? I think the argument is that these measures are distilled/averaged/summarized to fit old analysis techniques rather than giving rise to new ones.
With respect to software pipelines in general, real-time image classification is already happening in other applications, and even in situ, where a plankton camera system needs to be more ruggedized, the state of the art is real-time image classification.
As a counterexample, the 2022 paper from Bi et al "Temporal characteristics of plankton indicators in coastal waters: High-frequency data from PlanktonScope"(https://doi.org/10.1016/j.seares.2022.102283)
"The system was deployed to process in situ images collected by PlanktonScope and to provide near real time plankton density, specifically mysid shrimp data because mysid swarms could clog cooling water intake." The difference between real-time data and real-time analysis is addressed briefly in discussion section 5.2Line 66 - this is contradictory to line 24, where CNNs of small size are sufficient. I think it is a more useful argument to mention that many CNNs take a fixed region pixel size, and plankton are highly diverse as to their relative scales as opposed to say, classifiying images of jungle mammals, where there is less orders of magnitude of size difference.
Line 80 - "It also means that no universal set of features can be produced to identify all plankton traits across instruments" - I disagree; I think the importance of feature selection is overstated: it reduces computational overhead but costs accuracy. Most pre-CNN algorithms, when properly trained, will learn to weight the redundant or noisy features close to zero. Also, I think whether or not a universal set of hand-crafted features can exist is moot because, as put in Irisson et al. 2022 (https://doi.org/10.1146/annurev-marine-041921-013023): "This really is the main progress that CNNs bring: One can forgo the considerable domain expertise required to craft appropriate features, use a pretrained feature extractor, and get results that are equally good, if not better."
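To make the pretrained-feature-extractor workflow concrete, here is a minimal sketch; the backbone, preprocessing, and classifier settings are arbitrary illustrations, not those used in the manuscript:

```python
# Illustrative sketch: use a pretrained CNN as a generic feature extractor
# and train a classical classifier on the extracted features. The model,
# preprocessing, and hyperparameters are assumptions for the example only.
import torch
import torchvision.models as models
import torchvision.transforms as T
from sklearn.ensemble import RandomForestClassifier

# Pretrained backbone with its classification head removed.
backbone = models.mobilenet_v3_small(weights="IMAGENET1K_V1")
backbone.classifier = torch.nn.Identity()
backbone.eval()

preprocess = T.Compose([
    T.Grayscale(num_output_channels=3),  # plankton images are typically grayscale
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_features(pil_images):
    """Return one deep-feature vector per input PIL image."""
    batch = torch.stack([preprocess(img) for img in pil_images])
    with torch.no_grad():
        return backbone(batch).numpy()

# Hypothetical usage, given an annotated set of images and labels:
# features = extract_features(train_images)
# clf = RandomForestClassifier(n_estimators=300).fit(features, train_labels)
```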
Line 255 - Ellen et al. 2019 specifically uses a narrow distribution of values for a non-homogeneous background.
Section 2 - very strong.
Line 279 - Any indication that 10/5 epochs were sufficient?
Fig 1 - Beautiful figure.
Table 4 - The color scale makes it hard for me to discriminate between values <5% apart.
Line 324 - Why choose a random classifier as the baseline? Shouldn't the baseline be the classifier that always chooses the dominant class? In the provided example, 90% seems like a more appropriate benchmark to beat than 81%.
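To make the two baselines concrete, a small sketch; the class proportions are illustrative assumptions, not the manuscript's exact numbers:

```python
# Expected accuracy of two naive baselines on an imbalanced two-class problem.
# The class proportions are illustrative, not taken from the paper.
proportions = {"dominant": 0.90, "rare": 0.10}

# Random classifier that draws predictions from the class distribution:
# expected accuracy is the sum of squared class proportions.
random_acc = sum(p ** 2 for p in proportions.values())   # 0.90^2 + 0.10^2 = 0.82

# Classifier that always predicts the dominant class:
# expected accuracy equals the largest class proportion.
majority_acc = max(proportions.values())                  # 0.90

print(f"random baseline:   {random_acc:.2f}")
print(f"majority baseline: {majority_acc:.2f}")
```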
Line 361 - +15.1% seems better than "only slightly"
Line 500 - Another way to improve accuracy is more training data, and there is a result showing that using plankton images from different instruments in conjunction can help:
"In summary, our recommendations for training a CNN to classify plankton images begin with assembling as many annotated plankton images as possible, even if images are from seemingly disparate sources."
(Ellen and Ohman 2024 - doi: 10.1002/lom3.10648)
Line 529 - I do not see that conclusion about geometric metadata features in Ellen et al. 2019: "Combining each individually with the geometric metadata provides a boost in performance. One possible explanation for this result is that the geometric metadata includes information about the original region of interest size, which on its own did not prove valuable, but size given a depth or temperature may have discriminative value."
Line 557 - Space after period is missing. "set.Since"
Line 578 - I thought the wording of the 4 questions in lines 160-164 was particularly well done. I was disappointed to not find a corresponding set of concisely worded conclusions.
Citation: https://doi.org/10.5194/essd-2025-309-RC2
Data sets
ISIISNet: plankton images captured with the ISIIS (In-situ Ichthyoplankton Imaging System) Thelma Panaïotis et al. https://doi.org/10.17882/101950
FlowCAMNet: plankton images captured with the FlowCAM Laetitia Jalabert et al. https://doi.org/10.17882/101961
UVP6Net: plankton images captured with the UVP6 Marc Picheral et al. https://doi.org/10.17882/101948
ZooCAMNet: plankton images captured with the ZooCAM Jean-Baptiste Romagnan et al. https://doi.org/10.17882/101928
ZooScanNet: plankton images captured with the ZooScan Amanda Elineau et al. https://doi.org/10.17882/55741
Model code and software
ThelmaPana/plankton_classif Thelma Panaïotis and Emma Amblard https://doi.org/10.5281/zenodo.15406618
RC1: 'Comment on essd-2025-309', Kaisa Kraft, 03 Sep 2025
A huge number of studies have been conducted on the topic of plankton classification. The authors of the manuscript are trying to provide a baseline comparison dataset that different methods could be compared against, and a dataset that could be used to enable better comparison of results between the numerous studies. The aims of the study are well grounded, giving justification for the study. The authors present an interesting case with a well-designed manuscript. There are two minor areas for improvement: the authors emphasize and acknowledge only large datasets, and in my opinion smaller efforts to provide public datasets should also be given some acknowledgement in the introduction; another minor issue is the way the authors present alternative methods, with, for example, no mention of open-set classification methods or, e.g., autoencoders, which have been demonstrated as rather promising methods for different tasks.
Specific comments with numbers referring to lines:
53-55: Could add some references to demonstrate the imbalance
57-58: True, but also, if a class has a very distinguishable morphology, the number of training images required for the class to perform well will be much smaller. See e.g. Kraft et al. 2022 https://doi.org/10.3389/fmars.2022.867695
83-84: There is a mention of plankton traits, but the topic of traits has not been touched on previously in the introduction and would require a description earlier.
120-122: Two recent review articles are covered, but you are missing a third, even more recent review on plankton classification methods, Eerola et al. 2024 https://doi.org/10.1007/s10462-024-10745-y , in particular fig 3.
135-136: This is especially true for studies that concentrate on machine learning methods. For studies that concentrate more on the implications of the results in ecological, taxonomical, or operational contexts, class-specific metrics are published more often. How did you come up with the list in Table S1? The 10 most cited are based on Irisson et al. 2022, right? Is the rest the total of citations in your manuscript (sorry, I was too lazy to count them)? If so, are you sure they are representative of the entire literature on the topic, as you are not covering all recent publications on plankton recognition using CNNs? I would not draw this type of conclusion unless you have tried to ensure you cover all recent publications. A number of studies have been published since Irisson et al. 2022.
297-299: Why is this? It goes against the gut feeling, so in addition to references, could you also mention why?
335: Change the term "pure" into something else, e.g., how free of false positives the predicted plankton classes are.
Table 4: It would be good to add the class-specific n for each class, as it most often will or can (though not exclusively) affect or explain the class-specific performance; the same goes for the tables for the other datasets. What does "Plankton" mean? Do the colors mean something? Please add more information to the table caption. Yes, after scrolling down, there is also "Non plankton". Could the word "classes" be added after those, i.e., "Plankton classes" and "Non plankton classes"? I didn't find the corresponding tables for the other datasets; where are they?
Header 3.3: Please change this so that it does not reveal the results, to something like "Model performance on small classes".
Figure 2: Why did you choose to show accuracy and not the F1 score in the first panel (the same comment goes for the subsequent figures as well)? What is the Random classifier? It is mentioned in the paragraph starting at line 350, but it would require a better explanation.
375-385: Wouldn’t it be important to find a harmonic mean between precision and recall rather than emphasize the importance of precision and detection of rare classes over recall?
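For reference, a minimal numeric sketch of what the harmonic mean does; the precision and recall values below are made-up examples:

```python
# F1 is the harmonic mean of precision and recall, so it penalizes a large
# gap between the two; the values used here are illustrative only.
def f1(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

print(f1(0.90, 0.30))  # ~0.45 -- high precision cannot compensate for low recall
print(f1(0.60, 0.60))  # 0.60 -- balanced precision and recall
```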
Header 3.4: I would rephrase this one as well, e.g., to "Model performance of a small CNN in plankton image classification".
Header 3.5: Importance of features and classifier
420-421: You mean recall and precision? You did not show F1 in the figures you are referring to.
421-423: This is in line with the results from Kraft et al. 2022 where there was almost no confusion between different taxonomical groups.
467-469: A lightweight CNN has proven to reach very good performance in classifying plankton also previously, e.g., Kraft et al. 2022.
470-475: Yes, I partly agree with this; however, the MATLAB code available for IFCB data processing and classification, for example, is still easier for new groups to adopt, as it does not actually require much programming knowledge. That is why so many groups with an IFCB still actively use the MATLAB-based RF implementation https://github.com/hsosik/ifcb-analysis
510-511: The comment comes a bit out of the blue and without context. If you want to add this information, I suggest rephrasing and tying it better to the content.
545-547: Do you have statistics or a figure to support this? If it is stated like this, the results should be added as a supplement; otherwise, this phrase should be removed.
548-550: The concept of dataset shift is indeed a problem. However, the nature of plankton makes it very hard to have a representative distribution when classifying real datasets. The training data is ideally constructed from data covering multiple occasions, seasons, and several years. Still, the samples to be classified are from a specific time point, e.g., a spring sample, which will not have the same distribution as the training data, as the class composition and the share of very heterogeneous images vary. So an interesting question is: how much does the class distribution actually matter in the case of plankton?
555-561: Open-set classification methods are also a very promising approach to classifying plankton, as plankton data includes many difficult-to-classify images that normally end up in very heterogeneous classes with few common features to describe them, and which often show very poor performance. Still, those classes, at least for phytoplankton instruments triggered by chlorophyll a, often contain phytoplankton, i.e., are of interest, but are difficult or impossible to identify taxonomically (e.g., a small flagellate).
570-572: I do not see this as a new and interesting finding, but rather as an already well-known fact. I also did not see the point of showing accuracy in the figures, as it is already known to be a poor metric.