V-FAT: Benchmarking Visual Fidelity Against Text-bias

arXiv:2601.04897v13 citations, h-index: 1
Originality: Incremental advance
AI Analysis

The paper addresses the reliability of visual grounding in multimodal AI systems, an issue that matters to both researchers and developers. It appears incremental, however, because it offers a diagnostic tool rather than a solution.

The paper tackles the tendency of multimodal large language models to rely on linguistic shortcuts rather than genuine visual understanding. It introduces the V-FAT benchmark, with 4,026 VQA instances across six domains, to measure this text bias systematically. It finds that 12 frontier models experience significant visual collapse under high linguistic dominance.

Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated impressive performance on standard visual reasoning benchmarks. However, there is growing concern that these models rely excessively on linguistic shortcuts rather than genuine visual grounding, a phenomenon we term Text Bias. In this paper, we investigate the fundamental tension between visual perception and linguistic priors. We decouple the sources of this bias into two dimensions: Internal Corpus Bias, stemming from statistical correlations in pretraining, and External Instruction Bias, arising from the alignment-induced tendency toward sycophancy. To quantify this effect, we introduce V-FAT (Visual Fidelity Against Text-bias), a diagnostic benchmark comprising 4,026 VQA instances across six semantic domains. V-FAT employs a Three-Level Evaluation Framework that systematically increases the conflict between visual evidence and textual information: (L1) internal bias from atypical images, (L2) external bias from misleading instructions, and (L3) synergistic bias where both coincide. We introduce the Visual Robustness Score (VRS), a metric designed to penalize "lucky" linguistic guesses and reward true visual fidelity. Our evaluation of 12 frontier MLLMs reveals that while models excel in existing benchmarks, they experience significant visual collapse under high linguistic dominance.
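The abstract does not give the formula for VRS, so the following is only a minimal sketch of how results might be grouped by the three evaluation levels. It assumes multiple-choice questions and uses a standard correction for guessing as a stand-in for the idea of penalizing "lucky" answers. The `Record` fields, the number of answer options, and the toy data are all hypothetical, and `chance_corrected` is not the paper's VRS.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Record:
    level: str        # "L1", "L2", or "L3" (the three conflict levels)
    num_choices: int  # hypothetical: number of answer options per question
    correct: bool     # whether the model's answer matched the visual evidence


def accuracy_by_level(records):
    """Return plain accuracy for each evaluation level."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in records:
        totals[r.level] += 1
        hits[r.level] += int(r.correct)
    return {level: hits[level] / totals[level] for level in totals}


def chance_corrected(acc, num_choices):
    """Standard correction for guessing: 0 at chance level, 1 at perfect.

    This is an illustrative stand-in, NOT the paper's VRS definition.
    """
    chance = 1.0 / num_choices
    return (acc - chance) / (1.0 - chance)


# Toy data (fabricated for illustration only; not results from the paper).
records = [
    Record("L1", 4, True), Record("L1", 4, True),
    Record("L1", 4, True), Record("L1", 4, False),
    Record("L2", 4, True), Record("L2", 4, True),
    Record("L2", 4, False), Record("L2", 4, False),
    Record("L3", 4, True), Record("L3", 4, False),
    Record("L3", 4, False), Record("L3", 4, False),
]

acc = accuracy_by_level(records)
for level in sorted(acc):
    print(level, acc[level], round(chance_corrected(acc[level], 4), 3))
```

In this toy setup, an L3 accuracy equal to the chance rate maps to a corrected score of zero, which is the kind of behaviour a guess-penalizing metric is meant to expose. The drop from the L1 score to the L3 score gives a rough view of how performance degrades as the conflict between visual evidence and text increases.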

