CV, LG · Apr 29, 2025

Fine-Grained Classification: Connecting Metadata via Cross-Contrastive Pre-Training

arXiv:2504.20322v2 · h-index: 2 · 2025 12th International Conference on Soft Computing & Machine Intelligence (ISCMI)

Originality: Incremental advance

AI Analysis

The paper addresses the problem of distinguishing highly similar object categories in computer vision. The contribution appears incremental, since it builds on existing multimodal methods.

The paper tackles fine-grained visual classification by integrating image, text, and metadata through cross-contrastive pre-training. It reports 84.44% top-1 accuracy on NABirds, a 7.83% improvement over the baseline.

Fine-grained visual classification aims to recognize objects belonging to many subordinate categories of a supercategory, where appearance alone often fails to distinguish highly similar classes. We propose a unified framework that integrates image, text, and metadata via cross-contrastive pre-training. We first align the three modality encoders in a shared embedding space and then fine-tune the image and metadata encoders for classification. On NABirds, our approach improves over the baseline by 7.83% and achieves 84.44% top-1 accuracy, outperforming strong multimodal methods.
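
The alignment step described in the abstract can be illustrated with a small sketch. The abstract does not spell out the exact objective, so this example assumes a CLIP-style symmetric InfoNCE loss applied to each pair of modalities (image-text, image-metadata, text-metadata) and averaged. The function names, the temperature of 0.07, and the equal weighting of pairs are illustrative assumptions, not details taken from the paper. Inputs are plain NumPy arrays standing in for encoder outputs.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of embeddings.

    Row i of `a` and row i of `b` are treated as a positive pair;
    every other row in the batch acts as a negative.
    The temperature value is an assumed default, not from the paper.
    """
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(len(a))
    loss_ab = -log_softmax(logits, axis=1)[idx, idx].mean()  # a -> b direction
    loss_ba = -log_softmax(logits, axis=0)[idx, idx].mean()  # b -> a direction
    return 0.5 * (loss_ab + loss_ba)


def cross_contrastive_loss(img, txt, meta, temperature=0.07):
    """Average the pairwise losses over the three modality pairs (assumed weighting)."""
    pairs = [(img, txt), (img, meta), (txt, meta)]
    return sum(symmetric_info_nce(a, b, temperature) for a, b in pairs) / len(pairs)
```

In a real pipeline, `img`, `txt`, and `meta` would be the outputs of the three encoders for the same batch of samples, and the loss would be minimized with an autodiff framework. After this stage, the abstract indicates that only the image and metadata encoders are fine-tuned for classification.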
