CV | Mar 15, 2025

TLAC: Two-stage LMM Augmented CLIP for Zero-Shot Classification

Ans Munir, Faisal Z. Qureshi, Muhammad Haris Khan, Mohsen Ali

arXiv:2503.12206v2 | 1 citation | h-index: 2 | Has Code | 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)

Originality: Highly original

AI Analysis

This work addresses the high computational cost and time required to adapt CLIP to new datasets, offering a training-free solution for zero-shot classification across diverse domains.

The paper tackles the reliance of state-of-the-art CLIP methods on fine-tuning by introducing two training-free methods, SLAC and TLAC. Both prompt a Large Multimodal Model (LMM) to identify the object in an image, then use the CLIP text encoder to map that prediction to the closest dataset class. TLAC achieves an overall accuracy of 83.44%, outperforming the previous state-of-the-art few-shot methods by 6.75%.

Contrastive Language-Image Pretraining (CLIP) has shown impressive zero-shot performance on image classification. However, state-of-the-art methods often rely on fine-tuning techniques such as prompt learning and adapter-based tuning to optimize CLIP's performance. The need for fine-tuning significantly limits CLIP's adaptability to novel datasets and domains, because each new dataset demands substantial time and computational resources. To overcome this limitation, we introduce simple yet effective training-free approaches, Single-stage LMM Augmented CLIP (SLAC) and Two-stage LMM Augmented CLIP (TLAC), that leverage powerful Large Multimodal Models (LMMs), such as Gemini, for image classification. The proposed methods build on the capabilities of pre-trained LMMs, allowing seamless adaptation to diverse datasets and domains without additional training. Our approaches prompt the LMM to identify objects within an image. The CLIP text encoder then determines the image class by identifying the dataset class with the highest semantic similarity to the LMM-predicted object. Our models achieved superior accuracy on 9 of 11 base-to-novel datasets, including ImageNet, SUN397, and Caltech101, while maintaining a strictly training-free paradigm. Our TLAC model achieved an overall accuracy of 83.44%, surpassing the previous state-of-the-art few-shot methods by a margin of 6.75%. Compared to other training-free approaches, our TLAC method achieved 83.6% average accuracy across 13 datasets, a 9.7% improvement over previous methods. Our code is available at https://github.com/ans92/TLAC
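The class-matching step described in the abstract (pick the dataset class most similar to the LMM's predicted object) can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_embed` is a hypothetical stand-in that uses character trigrams in place of the CLIP text encoder, `match_class` and `cosine` are illustrative names, and the LMM prompting step is left out because the abstract does not specify the prompt or API. The predicted string is assumed to have already been returned by the LMM.

```python
import math
from collections import Counter
from typing import Callable, Dict, List

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """Hypothetical stand-in for a CLIP text encoder: a bag of character trigrams.

    Assumption for illustration only; a real pipeline would return the
    CLIP text embedding of `text` here instead.
    """
    padded = f"  {text.lower().strip()} "
    return dict(Counter(padded[i:i + 3] for i in range(len(padded) - 2)))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_class(
    predicted: str,
    class_names: List[str],
    embed: Callable[[str], Vector] = toy_embed,
) -> str:
    """Return the dataset class whose embedding is closest to the predicted object."""
    query = embed(predicted)
    return max(class_names, key=lambda name: cosine(query, embed(name)))


# Usage: the free-form LMM answer is mapped onto the fixed label set.
classes = ["golden retriever", "tabby cat", "airliner"]
print(match_class("a golden retriever dog", classes))
```

Passing a different `embed` callable is the intended extension point: replacing `toy_embed` with a function that wraps an actual text encoder would leave `match_class` unchanged, provided `cosine` is adapted to the embedding type that encoder returns.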
