CV · CL · Nov 21, 2024

VAGUE: Visual Contexts Clarify Ambiguous Expressions

arXiv:2411.14137v3 · 4 citations · h-index: 7 · Has Code
Originality: Synthesis-oriented
AI Analysis

This paper addresses a critical gap in multimodal reasoning for AI systems, though its contribution is incremental: it focuses on benchmarking rather than proposing a new method.

The paper tackles the difficulty multimodal AI systems have in using visual context to resolve ambiguous expressions. It introduces the VAGUE benchmark, which pairs 1.6K ambiguous textual expressions with images, and finds that existing models perform far below human accuracy and fail to reason effectively with visual cues.

Human communication often relies on visual cues to resolve ambiguity. While humans can intuitively integrate these cues, AI systems often find it challenging to engage in sophisticated multimodal reasoning. We introduce VAGUE, a benchmark evaluating multimodal AI systems' ability to integrate visual context for intent disambiguation. VAGUE consists of 1.6K ambiguous textual expressions, each paired with an image and multiple-choice interpretations, where the correct answer is only apparent with visual context. The dataset spans both staged, complex (Visual Commonsense Reasoning) and natural, personal (Ego4D) scenes, ensuring diversity. Our experiments reveal that existing multimodal AI models struggle to infer the speaker's true intent. While performance consistently improves from the introduction of more visual cues, the overall accuracy remains far below human performance, highlighting a critical gap in multimodal reasoning. Analysis of failure cases demonstrates that current models fail to distinguish true intent from superficial correlations in the visual scene, indicating that they perceive images but do not effectively reason with them. We release our code and data at https://hazel-heejeong-nam.github.io/vague/.
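The multiple-choice setup described in the abstract can be illustrated with a minimal sketch. The record layout (`VagueItem` and its field names), the `accuracy` helper, and the sample items below are all hypothetical and made up for illustration; the released code and data may be organized quite differently.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class VagueItem:
    """Hypothetical record layout; not the released dataset's schema."""
    expression: str            # ambiguous textual expression
    image_path: str            # path to the paired image
    choices: Tuple[str, ...]   # candidate interpretations
    answer_index: int          # index of the correct interpretation


def accuracy(items: Sequence[VagueItem],
             predict: Callable[[VagueItem], int]) -> float:
    """Fraction of items where the predicted choice index is correct."""
    if not items:
        raise ValueError("no items to score")
    correct = sum(predict(item) == item.answer_index for item in items)
    return correct / len(items)


# Made-up items for illustration only; none come from the benchmark.
demo_items = [
    VagueItem("placeholder expression 1", "img_001.jpg", ("A", "B", "C", "D"), 0),
    VagueItem("placeholder expression 2", "img_002.jpg", ("A", "B", "C", "D"), 1),
    VagueItem("placeholder expression 3", "img_003.jpg", ("A", "B", "C", "D"), 0),
]


def always_first(item: VagueItem) -> int:
    """Trivial baseline that ignores both the text and the image."""
    return 0


print(accuracy(demo_items, always_first))
```

A real evaluation would replace `always_first` with a call to a multimodal model that reads both the expression and the image; the trivial baseline here only shows how the score is computed.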

Code Implementations: 1 repo
