Fine-Grained Classification: Connecting Metadata via Cross-Contrastive Pre-Training
This work addresses the problem of distinguishing highly similar object categories in computer vision. It appears incremental, since it builds on existing multimodal methods.
The paper tackles fine-grained visual classification by integrating image, text, and metadata through cross-contrastive pre-training. It reports 84.44% top-1 accuracy on NABirds, a 7.83% improvement over the baseline.
Fine-grained visual classification aims to recognize objects belonging to many subordinate categories of a supercategory, where appearance alone often fails to distinguish highly similar classes. We propose a unified framework that integrates image, text, and metadata via cross-contrastive pre-training. We first align the three modality encoders in a shared embedding space and then fine-tune the image and metadata encoders for classification. On NABirds, our approach improves over the baseline by 7.83% and achieves 84.44% top-1 accuracy, outperforming strong multimodal methods.
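The alignment step described above can be illustrated with a minimal sketch. The abstract does not specify the loss, so this assumes a CLIP-style symmetric InfoNCE objective summed over all three modality pairs (image-text, image-metadata, text-metadata). The function names, the temperature of 0.07, and the toy embeddings are all hypothetical and not taken from the paper.

```python
import math
from itertools import combinations


def l2_normalize(v):
    """Scale a vector to unit length (a zero vector is left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(a, b, temperature=0.07):
    """One-directional InfoNCE: row i of `a` should match row i of `b`.

    `temperature` is an assumed value, not one reported in the abstract.
    """
    a = [l2_normalize(v) for v in a]
    b = [l2_normalize(v) for v in b]
    total = 0.0
    for i, va in enumerate(a):
        # Cosine similarity of sample i against every candidate in b.
        logits = [sum(x * y for x, y in zip(va, vb)) / temperature for vb in b]
        # Numerically stable log-sum-exp.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching index i as the target.
        total += log_sum_exp - logits[i]
    return total / len(a)


def symmetric_contrastive(a, b, temperature=0.07):
    """Average the a->b and b->a directions."""
    return 0.5 * (info_nce(a, b, temperature) + info_nce(b, a, temperature))


def cross_contrastive_loss(embeddings, temperature=0.07):
    """Sum the symmetric loss over every pair of modalities (assumed pairing)."""
    pairs = combinations(sorted(embeddings), 2)
    return sum(
        symmetric_contrastive(embeddings[p], embeddings[q], temperature)
        for p, q in pairs
    )


# Toy batch of two samples with 2-D embeddings per modality.
aligned = {
    "image": [[1.0, 0.0], [0.0, 1.0]],
    "text": [[1.0, 0.0], [0.0, 1.0]],
    "metadata": [[1.0, 0.0], [0.0, 1.0]],
}
# Same batch, but the text embeddings are swapped between the two samples.
misaligned = dict(aligned, text=[[0.0, 1.0], [1.0, 0.0]])

loss_aligned = cross_contrastive_loss(aligned)
loss_misaligned = cross_contrastive_loss(misaligned)
```

In this sketch the aligned batch yields a much smaller loss than the misaligned one, which is the behavior pre-training would exploit to pull matching image, text, and metadata embeddings together in the shared space. The subsequent fine-tuning of the image and metadata encoders for classification is not shown.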