CVFeb 3Code
Vision Transformers for Zero-Shot Clustering of Animal Images: A Comparative Benchmarking StudyHugo Markoff, Stefan Hein Bengtson, Michael Ørsted
Manual labeling of animal images remains a significant bottleneck in ecological research, limiting the scale and efficiency of biodiversity monitoring efforts. This study investigates whether state-of-the-art Vision Transformer (ViT) foundation models can reduce thousands of unlabeled animal images directly to species-level clusters. We present a comprehensive benchmarking framework evaluating five ViT models combined with five dimensionality reduction techniques and four clustering algorithms, two supervised and two unsupervised, across 60 species (30 mammals and 30 birds), with each test using a random subset of 200 validated images per species. We investigate when clustering succeeds at species-level, where it fails, and whether clustering within the species-level reveals ecologically meaningful patterns such as sex, age, or phenotypic variation. Our results demonstrate near-perfect species-level clustering (V-measure: 0.958) using DINOv3 embeddings with t-SNE and supervised hierarchical clustering methods. Unsupervised approaches achieve competitive performance (0.943) while requiring no prior species knowledge, rejecting only 1.14% of images as outliers requiring expert review. We further demonstrate robustness to realistic long-tailed distributions of species and show that intentional over-clustering can reliably extract intra-specific variation including age classes, sexual dimorphism, and pelage differences. We introduce an open-source benchmarking toolkit and provide recommendations for ecologists to select appropriate methods for sorting their specific taxonomic groups and data.
CVJun 15, 2020Code
3D-ZeF: A 3D Zebrafish Tracking Benchmark DatasetMalte Pedersen, Joakim Bruslund Haurum, Stefan Hein Bengtson et al.
In this work we present a novel publicly available stereo based 3D RGB dataset for multi-object zebrafish tracking, called 3D-ZeF. Zebrafish is an increasingly popular model organism used for studying neurological disorders, drug addiction, and more. Behavioral analysis is often a critical part of such research. However, visual similarity, occlusion, and erratic movement of the zebrafish makes robust 3D tracking a challenging and unsolved problem. The proposed dataset consists of eight sequences with a duration between 15-120 seconds and 1-10 free moving zebrafish. The videos have been annotated with a total of 86,400 points and bounding boxes. Furthermore, we present a complexity score and a novel open-source modular baseline system for 3D tracking of zebrafish. The performance of the system is measured with respect to two detectors: a naive approach and a Faster R-CNN based fish head detector. The system reaches a MOTA of up to 77.6%. Links to the code and dataset is available at the project page https://vap.aau.dk/3d-zef
CVJan 7, 2025
AutoFish: Dataset and Benchmark for Fine-grained Analysis of FishStefan Hein Bengtson, Daniel Lehotský, Vasiliki Ismiroglou et al.
Automated fish documentation processes are in the near future expected to play an essential role in sustainable fisheries management and for addressing challenges of overfishing. In this paper, we present a novel and publicly available dataset named AutoFish designed for fine-grained fish analysis. The dataset comprises 1,500 images of 454 specimens of visually similar fish placed in various constellations on a white conveyor belt and annotated with instance segmentation masks, IDs, and length measurements. The data was collected in a controlled environment using an RGB camera. The annotation procedure involved manual point annotations, initial segmentation masks proposed by the Segment Anything Model (SAM), and subsequent manual correction of the masks. We establish baseline instance segmentation results using two variations of the Mask2Former architecture, with the best performing model reaching an mAP of 89.15%. Additionally, we present two baseline length estimation methods, the best performing being a custom MobileNetV2-based regression model reaching an MAE of 0.62cm in images with no occlusion and 1.38cm in images with occlusion. Link to project page: https://vap.aau.dk/autofish/.