Improved Few-Shot Image Classification Through Multiple-Choice Questions
This work addresses the challenge of adapting VQA models for few-shot image classification, which matters for applications that need flexible visual understanding from limited labeled data. The contribution is incremental, as it builds on existing VQA and few-shot learning approaches.
The paper tackles the poor zero-shot performance of visual question answering (VQA) models on image classification. It proposes a training-free few-shot method that uses multiple-choice questions to extract prompt-specific latent representations, combines them into an overall image embedding, and decodes that embedding via class prototypes built from labeled examples. The method outperforms pure visual encoders and zero-shot VQA baselines on datasets such as MiniImageNet, Caltech-UCSD Birds, and CIFAR-100, and is particularly strong in settings with diverse visual attributes, such as clothing.
Through a simple multiple-choice language prompt, a VQA model can operate as a zero-shot image classifier, producing a classification label. Compared to typical image encoders, VQA models offer an advantage: VQA-produced image embeddings can be infused with the most relevant visual information through tailored language prompts. Nevertheless, for most tasks, zero-shot VQA performance is lacking, either because the category names are unfamiliar or because the pre-training and test data distributions differ. We propose a simple method to boost VQA performance for image classification using only a handful of labeled examples and a multiple-choice question. This few-shot method is training-free and maintains the dynamic and flexible advantages of the VQA model. Rather than relying on the final language output, our approach uses multiple-choice questions to extract prompt-specific latent representations, which are enriched with relevant visual information. These representations are combined to create a final overall image embedding, which is decoded by reference to latent class prototypes constructed from the few labeled examples. We demonstrate that this method outperforms both pure visual encoders and zero-shot VQA baselines, achieving impressive performance on common few-shot tasks including MiniImageNet, Caltech-UCSD Birds, and CIFAR-100. Finally, we show that our approach does particularly well in settings with numerous diverse visual attributes, such as the fabric, article style, texture, and view of different articles of clothing, where other few-shot approaches struggle, because we can tailor our image representations to only the semantic features of interest.
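The prototype-based decoding step can be illustrated with a minimal sketch. It does not reproduce the paper's implementation: the abstract does not state how prompt-specific representations are combined or which similarity measure is used, so the sketch assumes L2-normalized concatenation and cosine similarity. All function names are hypothetical, and the random arrays in the usage section stand in for latent representations that would in practice be extracted from a VQA model.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors along `axis` to unit length (eps guards against zero norm)."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def combine_prompt_embeddings(prompt_embeddings):
    """Merge per-prompt latents into one embedding per image.

    prompt_embeddings: array of shape (n_images, n_prompts, dim).
    Assumption: each prompt's latent is L2-normalized, then all are
    concatenated. The paper may combine them differently.
    """
    normed = l2_normalize(prompt_embeddings, axis=-1)
    return normed.reshape(normed.shape[0], -1)


def build_prototypes(support_embeddings, support_labels):
    """Average the labeled (support) embeddings of each class into a prototype."""
    classes = np.unique(support_labels)
    prototypes = np.stack(
        [support_embeddings[support_labels == c].mean(axis=0) for c in classes]
    )
    return classes, l2_normalize(prototypes)


def classify(query_embeddings, classes, prototypes):
    """Assign each query to the class whose prototype is most cosine-similar."""
    similarities = l2_normalize(query_embeddings) @ prototypes.T
    return classes[np.argmax(similarities, axis=1)]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_way, k_shot, n_prompts, dim = 5, 3, 4, 16

    # Placeholder latents: in practice these would come from the VQA model,
    # one vector per (image, multiple-choice prompt) pair.
    support = rng.normal(size=(n_way * k_shot, n_prompts, dim))
    labels = np.repeat(np.arange(n_way), k_shot)
    queries = rng.normal(size=(10, n_prompts, dim))

    classes, prototypes = build_prototypes(
        combine_prompt_embeddings(support), labels
    )
    predictions = classify(combine_prompt_embeddings(queries), classes, prototypes)
    print(predictions)
```

Because no parameters are fitted, adding a class or changing the prompts only requires recomputing the embeddings and prototypes, which is consistent with the training-free property described above.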