CV, AI, CL, LG | Nov 26, 2025

Do Reasoning Vision-Language Models Inversely Scale in Test-Time Compute? A Distractor-centric Empirical Analysis

arXiv:2511.21397v1 | 1 citation | h-index: 2
Originality: Incremental advance
AI Analysis

This work examines how distractors affect multimodal reasoning. Its contribution is incremental: it extends the inverse scaling effect, previously reported for language models, to the visual domain.

The study investigated how visual distractors affect test-time scaling in vision-language models, finding that adding visual distractors reduces accuracy without increasing reasoning length, and proposed a prompting strategy to mitigate bias-driven predictions.

How does irrelevant information (i.e., distractors) affect test-time scaling in vision-language models (VLMs)? Prior studies on language models have reported an inverse scaling effect, where textual distractors lead to longer but less effective reasoning. To investigate whether similar phenomena occur in multimodal settings, we introduce Idis (Images with distractors), a visual question-answering dataset that systematically varies distractors along semantic, numerical, and spatial dimensions. Our analyses reveal that visual distractors differ fundamentally from textual ones: although inverse scaling persists, adding visual distractors reduces accuracy without increasing reasoning length. We further show that tracking attribute counts within reasoning traces provides key insights into how distractors, reasoning length, and accuracy interact. Finally, we demonstrate that these trends extend to established visual bias benchmarks such as Waterbirds, and we propose a simple prompting strategy to mitigate bias-driven predictions in reasoning models.
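As a rough illustration of what "tracking attribute counts within reasoning traces" could look like, the sketch below counts whole-word mentions of a fixed attribute vocabulary in a single trace. It is not the authors' implementation: the function name `count_attribute_mentions`, the example trace, and the attribute list are all hypothetical, and the abstract does not say how attributes are actually extracted.

```python
import re
from collections import Counter


def count_attribute_mentions(trace: str, attributes: list[str]) -> Counter:
    """Count whole-word, case-insensitive mentions of each attribute in a trace.

    Hypothetical helper: the paper's actual attribute-extraction procedure
    is not described in the abstract.
    """
    counts = Counter()
    for attr in attributes:
        # \b anchors keep "water" from matching inside "Waterbirds".
        pattern = r"\b" + re.escape(attr) + r"\b"
        counts[attr] = len(re.findall(pattern, trace, flags=re.IGNORECASE))
    return counts


# Made-up reasoning trace, for illustration only.
trace = (
    "The bird has a red beak. The background shows water, "
    "but the red beak suggests a landbird."
)
counts = count_attribute_mentions(trace, ["red", "water", "beak"])
trace_length = len(trace.split())  # crude word-count proxy for reasoning length
```

Per-trace counts like these could then be compared against reasoning length and answer correctness across distractor conditions, which is the kind of interaction the abstract describes.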

