CL CV

Mar 17, 2024

Few-Shot VQA with Frozen LLMs: A Tale of Two Approaches

Igor Sterner, Weizhe Lin, Jinghong Chen, Bill Byrne

arXiv:2403.11317v1

3.44 citations

h-index: 12

Originality: Incremental advance

AI Analysis

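This work examines how images should be passed to large language models for few-shot visual question answering. It shows that a common assumption, that mapping visual embeddings directly into the LLM outperforms captioning, does not always hold. The advance is incremental because it rests on a single controlled comparison.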

The paper compares two approaches for integrating images into large language models for few-shot visual question answering: describing the image with a caption versus mapping visual embeddings directly into the LLM embedding space. For Flan-T5 XL, a 3B-parameter model, direct mapping does not consistently outperform captions: captions are better in the zero-shot regime, and in the few-shot regimes the better approach depends on how the in-context examples are selected.

Two approaches have emerged to input images into large language models (LLMs). The first is to caption images into natural language. The second is to map image feature embeddings into the domain of the LLM and pass the mapped embeddings directly to the LLM. The majority of recent few-shot multimodal work reports performance using architectures that employ variations of one of these two approaches. But they overlook an important comparison between them. We design a controlled and focused experiment to compare these two approaches to few-shot visual question answering (VQA) with LLMs. Our findings indicate that for Flan-T5 XL, a 3B parameter LLM, connecting visual embeddings directly to the LLM embedding space does not guarantee improved performance over using image captions. In the zero-shot regime, we find using textual image captions is better. In the few-shot regimes, how the in-context examples are selected determines which is better.
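
The two input routes described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the prompt template, the function names, and the use of a single linear projection as the mapping are all assumptions made for illustration, and the paper's actual mapping network and prompt format may differ.

```python
def caption_prompt(caption, question, shots=()):
    """Approach 1 (hypothetical template): the image enters the LLM only as caption text.

    shots is a sequence of (caption, question, answer) in-context examples.
    """
    parts = [f"Context: {c} Question: {q} Answer: {a}" for c, q, a in shots]
    parts.append(f"Context: {caption} Question: {question} Answer:")
    return "\n".join(parts)


def map_image_features(features, weights):
    """Approach 2 (assumed linear mapping): project image features into the LLM space.

    features: list of vectors of size d_img
    weights:  d_img x d_llm matrix given as a list of rows
    """
    columns = list(zip(*weights))
    return [[sum(f * w for f, w in zip(vec, col)) for col in columns]
            for vec in features]


def embedding_input(features, weights, question_embeddings):
    """Prepend the mapped image embeddings to the question's token embeddings."""
    return map_image_features(features, weights) + question_embeddings


# Zero-shot caption prompt: no in-context examples are supplied.
prompt = caption_prompt("a dog on a beach", "What animal is shown?")

# Toy projection from a 2-dimensional image feature to a 3-dimensional LLM space.
mapped = map_image_features([[1.0, 2.0]], [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
```

In the caption route the frozen LLM sees only text, whereas in the mapping route it receives a sequence of vectors that never pass through its tokenizer. That is why the two routes can behave differently even when they start from the same image.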
