CV AI CLNov 28, 2024

Enhancing Few-Shot Vision-Language Classification with Large Multimodal Model Features

Chancharik Mitra, Brandon Huang, Tianning Chai, Zhiqiu Lin, Assaf Arbelle, Rogerio Feris, Leonid Karlinsky, Trevor Darrell, Deva Ramanan, Roei Herzig

arXiv:2412.00142v311.37 citationsh-index: 20

Originality Incremental advance

AI Analysis

This addresses a bottleneck in applying LMMs to classification tasks like image classification and VQA, offering an incremental improvement for researchers and practitioners in vision-language AI.

The paper tackled the challenge of extracting useful features from generative Large Multimodal Models (LMMs) for vision-language classification tasks, proposing Sparse Attention Vectors (SAVs) that achieve state-of-the-art performance with few-shot examples.

Generative Large Multimodal Models (LMMs) like LLaVA and Qwen-VL excel at a wide variety of vision-language (VL) tasks. Despite strong performance, LMMs' generative outputs are not specialized for vision-language classification tasks (i.e., tasks with vision-language inputs and discrete labels) such as image classification and multiple-choice VQA. One key challenge in utilizing LMMs for these tasks is the extraction of useful features from generative LMMs. To overcome this, we propose an approach that leverages multimodal feature extraction from the LMM's latent space. Toward this end, we present Sparse Attention Vectors (SAVs) -- a finetuning-free method that leverages sparse attention head activations (fewer than 5% of the heads) in LMMs as strong feature representations. With only few-shot examples, SAVs demonstrate state-of-the-art performance compared to a variety of few-shot and finetuned baselines on a collection of vision-language classification tasks. Our experiments also imply that SAVs can scale in performance with additional examples and generalize to similar tasks, establishing SAVs as both effective and robust multimodal feature representations.

View on arXiv PDF

Similar