CL | AI | CV | Jun 24, 2024

Losing Visual Needles in Image Haystacks: Vision Language Models are Easily Distracted in Short and Long Contexts

arXiv:2406.16851v3 | 30 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses a critical limitation of VLMs in applications that require long-context processing: unlike language models, they struggle to filter out distracting content. The contribution is incremental, since it benchmarks existing models without proposing a new solution.

The authors evaluate long-context extractive reasoning in vision language models (VLMs) by introducing LoCoVQA, a dynamic benchmark generator that adds distractor images to test examples for tasks such as mathematical reasoning and VQA. They find that VLMs rapidly lose performance as the visual context grows, often following a logarithmic decay trend, which indicates that the models struggle to ignore irrelevant information.

We present LoCoVQA, a dynamic benchmark generator for evaluating long-context extractive reasoning in vision language models (VLMs). LoCoVQA augments test examples for mathematical reasoning, VQA, and character recognition tasks with increasingly long visual contexts composed of both in-distribution and out-of-distribution distractor images. Across these tasks, a diverse set of VLMs rapidly lose performance as the visual context length grows, often exhibiting a striking logarithmic decay trend. This test assesses how well VLMs can ignore irrelevant information when answering queries -- a task that is quite easy for language models (LMs) in the text domain -- demonstrating that current state-of-the-art VLMs lack this essential capability for many long-context applications.
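
The idea in the abstract can be illustrated with a minimal sketch; this is not the authors' implementation. It assumes images are represented by file names, and the names `build_context` and `fit_log_decay` are hypothetical helpers introduced here. The first hides one target image among distractors at a random position; the second fits accuracy against the natural log of the context length, which is one simple way to check for a logarithmic decay trend.

```python
import math
import random
from typing import List, Sequence, Tuple


def build_context(
    target: str,
    distractor_pool: Sequence[str],
    context_length: int,
    rng: random.Random,
) -> Tuple[List[str], int]:
    """Hide one target among (context_length - 1) distractors.

    Returns the ordered context and the index of the target in it.
    Hypothetical helper: items are file names, not real images.
    """
    if context_length < 1:
        raise ValueError("context_length must be at least 1")
    n_distractors = context_length - 1
    if n_distractors > len(distractor_pool):
        raise ValueError("not enough distractors in the pool")
    # Sample distractors without replacement, then insert the target
    # at a uniformly random position.
    items = rng.sample(list(distractor_pool), n_distractors)
    position = rng.randrange(context_length)
    items.insert(position, target)
    return items, position


def fit_log_decay(
    lengths: Sequence[int], accuracies: Sequence[float]
) -> Tuple[float, float]:
    """Least-squares fit of accuracy ~ intercept + slope * ln(length).

    A negative slope indicates accuracy falling with context length.
    """
    if len(lengths) != len(accuracies) or len(lengths) < 2:
        raise ValueError("need at least two (length, accuracy) pairs")
    xs = [math.log(n) for n in lengths]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(accuracies) / len(accuracies)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("lengths must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracies))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope
```

In use, one would call `build_context` for each test example at several context lengths, score the model's answers, and pass the per-length accuracies to `fit_log_decay`. Any accuracy values fed to it would have to come from an actual evaluation run; none are supplied here.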

