LG · CV · Jul 23, 2024

Improved Few-Shot Image Classification Through Multiple-Choice Questions

arXiv:2407.16145v1 · 2 citations · h-index: 9
Originality: Incremental advance
AI Analysis

This work adapts VQA models to few-shot image classification, a setting that matters for applications needing flexible visual understanding with limited labeled data. The contribution is incremental: it builds on existing VQA and few-shot learning approaches.

The paper tackles the poor zero-shot performance of visual question answering (VQA) models on image classification. It proposes a training-free few-shot method that uses multiple-choice questions to extract prompt-specific latent representations. These representations are combined into an overall image embedding, which is decoded against class prototypes built from the labeled examples. The method outperforms pure visual encoders and zero-shot VQA baselines on MiniImageNet, Caltech-UCSD Birds, and CIFAR-100, and is strongest in settings with many diverse visual attributes, such as clothing.

Through a simple multiple-choice language prompt, a VQA model can operate as a zero-shot image classifier, producing a classification label. Compared to typical image encoders, VQA models offer an advantage: VQA-produced image embeddings can be infused with the most relevant visual information through tailored language prompts. Nevertheless, for most tasks, zero-shot VQA performance is lacking, either because of unfamiliar category names or because the pre-training and test data distributions differ. We propose a simple method to boost VQA performance for image classification using only a handful of labeled examples and a multiple-choice question. This few-shot method is training-free and maintains the dynamic and flexible advantages of the VQA model. Rather than relying on the final language output, our approach uses multiple-choice questions to extract prompt-specific latent representations, which are enriched with relevant visual information. These representations are combined to create a final overall image embedding, which is decoded via reference to latent class prototypes constructed from the few labeled examples. We demonstrate that this method outperforms both pure visual encoders and zero-shot VQA baselines, achieving strong performance on common few-shot tasks including MiniImageNet, Caltech-UCSD Birds, and CIFAR-100. Finally, we show that our approach does particularly well in settings with numerous diverse visual attributes, such as the fabric, article style, texture, and view of different articles of clothing, where other few-shot approaches struggle, because we can tailor our image representations to only the semantic features of interest.
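The decoding step described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation, and it rests on several assumptions: the per-prompt latent representations are taken as given arrays (random stand-ins here, since the abstract does not say which VQA model or layer is used), they are combined by L2-normalizing and concatenating, prototypes are per-class means, and decoding uses cosine similarity. The function names are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length, guarding against division by zero."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def combine_prompt_embeddings(per_prompt):
    """Combine one latent array per prompt into a single image embedding.

    per_prompt: list of arrays, each of shape (n_images, dim_k).
    Assumption: normalize each prompt's latents, then concatenate.
    The abstract only says the representations are "combined".
    """
    return np.concatenate([l2_normalize(e) for e in per_prompt], axis=-1)


def build_prototypes(support_emb, support_labels):
    """Average the few labeled embeddings of each class into a prototype."""
    classes = np.unique(support_labels)
    protos = np.stack(
        [support_emb[support_labels == c].mean(axis=0) for c in classes]
    )
    return classes, l2_normalize(protos)


def classify(query_emb, classes, protos):
    """Assign each query to the class of its most similar prototype (cosine)."""
    sims = l2_normalize(query_emb) @ protos.T
    return classes[np.argmax(sims, axis=1)]


if __name__ == "__main__":
    # Synthetic stand-in for VQA latents: a 3-way, 5-shot task with two prompts.
    rng = np.random.default_rng(0)
    labels = np.repeat(np.arange(3), 5)
    per_prompt = [
        rng.normal(size=(3, d))[labels] + 0.1 * rng.normal(size=(15, d))
        for d in (8, 16)
    ]
    support = combine_prompt_embeddings(per_prompt)
    classes, protos = build_prototypes(support, labels)
    predictions = classify(support, classes, protos)
    print("support-set accuracy:", (predictions == labels).mean())
```

No gradient step appears anywhere in the sketch, which matches the training-free claim: the only "fitting" is averaging the labeled embeddings. Choosing which prompts feed `combine_prompt_embeddings` is where the tailoring to attributes of interest would happen.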
