Planktonzilla: Multimodal dataset and models for understanding plankton ecosystems
For marine ecologists and climate scientists, this work provides a large-scale benchmark and shows that current biological foundation models perform poorly on plankton, highlighting the need for domain-specific data.
The authors created Planktonzilla-17M, a unified dataset of 17.4 million plankton images from 13 imaging systems, and found that supervised classifiers trained with taxonomic lineage text match or exceed CLIP-style training, while BioCLIP and BioCLIP2 perform poorly on plankton.
Marine plankton underpin aquatic food webs and play a key role in global CO2 sequestration, making reliable species identification critical for understanding ocean health and climate feedbacks. Existing classification models perform well on individual collections but fail to generalize across instruments and environments due to isolated training datasets and inconsistent labels. To address this, we introduce Planktonzilla-17M, a unified dataset consolidating publicly available plankton image collections spanning thirteen imaging systems. It comprises 17.4 million images with standardized taxonomy and geo-environmental metadata, including 3.74 million plankton images spanning over 602 taxonomic classes, of which 201 are identified at the species level, making it the largest and most comprehensive plankton image dataset to date. Using this large-scale dataset, we perform a controlled comparison between supervised and CLIP-style image-text training on a shared ViT backbone. We find that a supervised classifier matches or exceeds CLIP-style training when trained using taxonomic lineage as text. We further observe that BioCLIP and BioCLIP2 perform poorly on plankton in zero-shot and few-shot settings. Leveraging Planktonzilla-17M improves plankton classification performance, highlighting the limitations of current biological foundation models in marine imaging domains.
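To make the two training objectives in that comparison concrete, the following is a minimal sketch, not the authors' implementation. It assumes unit-normalized embeddings already produced by some shared image backbone, and it assumes one plausible reading of "taxonomic lineage as text": the ranks of a label are joined into a single caption string. The function names, the caption format, the example lineage, and the temperature value of 0.07 are all illustrative assumptions.

```python
import math


def lineage_to_text(lineage):
    """Join taxonomic ranks, broadest first, into one caption string.

    The space-separated format is an assumption for illustration only.
    """
    return " ".join(lineage)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax_cross_entropy(logits, target):
    """Cross-entropy of one row of logits against an integer target index."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def supervised_loss(image_embs, class_weights, labels):
    """Mean cross-entropy against a fixed set of class weight vectors."""
    losses = []
    for emb, label in zip(image_embs, labels):
        logits = [dot(emb, w) for w in class_weights]
        losses.append(softmax_cross_entropy(logits, label))
    return sum(losses) / len(losses)


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss: the i-th image is paired with the i-th text.

    The temperature is a hypothetical default, not a value from the paper.
    """
    n = len(image_embs)
    sim = [[dot(i, t) / temperature for t in text_embs] for i in image_embs]
    sim_t = [list(col) for col in zip(*sim)]  # text-to-image similarities
    img_to_txt = sum(softmax_cross_entropy(sim[k], k) for k in range(n)) / n
    txt_to_img = sum(softmax_cross_entropy(sim_t[k], k) for k in range(n)) / n
    return 0.5 * (img_to_txt + txt_to_img)


# Illustrative lineage for a copepod; not taken from the dataset.
caption = lineage_to_text(
    ["Animalia", "Arthropoda", "Copepoda", "Calanoida", "Calanus finmarchicus"]
)
```

The supervised objective scores each image against a fixed list of classes, whereas the CLIP-style objective scores each image against the text embeddings present in the batch, so the second depends on how the captions are written and encoded.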