CV | Oct 23, 2025

Diagnosing Visual Reasoning: Challenges, Insights, and a Path Forward

Jing Bi, Guangyu Sun, Ali Vosoughi, Chen Chen, Chenliang Xu

arXiv:2510.20696v1 | 10.23 citations | h-index: 12

Originality: Incremental advance

AI Analysis

This paper addresses visual reasoning failures in multimodal AI models, offering an incremental improvement for multimodal model development.

The paper tackles visual hallucinations and over-reliance on textual priors in multimodal large language models. It proposes an agent-based architecture that integrates LLM reasoning with lightweight visual modules and reports gains of +10.3 on MMMU and +6.0 on MathVista over a 7B baseline.

Multimodal large language models (MLLMs) that integrate visual and textual reasoning leverage chain-of-thought (CoT) prompting to tackle complex visual tasks, yet continue to exhibit visual hallucinations and an over-reliance on textual priors. We present a systematic diagnosis of state-of-the-art vision-language models using a three-stage evaluation framework, uncovering key failure modes. To address these, we propose an agent-based architecture that combines LLM reasoning with lightweight visual modules, enabling fine-grained analysis and iterative refinement of reasoning chains. Our results highlight that future visual reasoning models should focus on integrating a broader set of specialized tools for analyzing visual content. Our system achieves significant gains (+10.3 on MMMU, +6.0 on MathVista over a 7B baseline), matching or surpassing much larger models. We will release our framework and evaluation suite to facilitate future research.
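
The abstract does not spell out how the agent and its visual modules interact, so the following is only an illustrative sketch of the general idea of checking a reasoning chain against lightweight visual tools and revising the steps that disagree. Everything in it is an assumption made for illustration: the `Claim` structure, the `count_objects` and `object_color` tools, the `refine_chain` loop, and the toy image representation (a list of labeled objects instead of pixels) are hypothetical and are not taken from the paper or its framework.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Toy stand-in for an image: a list of already-detected objects.
# A real system would operate on pixels; this is an assumption for the sketch.
Image = List[dict]


def count_objects(image: Image, label: str) -> int:
    """Hypothetical visual module: count objects carrying the given label."""
    return sum(1 for obj in image if obj["label"] == label)


def object_color(image: Image, label: str) -> str:
    """Hypothetical visual module: color of the first object with the label."""
    for obj in image:
        if obj["label"] == label:
            return obj["color"]
    return "unknown"


TOOLS: Dict[str, Callable[[Image, str], object]] = {
    "count": count_objects,
    "color": object_color,
}


@dataclass
class Claim:
    """One step of a reasoning chain that makes a checkable visual claim."""
    tool: str      # which visual module can verify this step
    arg: str       # argument passed to that module
    value: object  # what the reasoning chain currently asserts


def refine_chain(image: Image, chain: List[Claim],
                 tools: Dict[str, Callable[[Image, str], object]],
                 max_rounds: int = 3) -> List[Claim]:
    """Verify each claim with its tool and overwrite values that disagree.

    In a real agent the LLM would regenerate the affected steps; here the
    observed value simply replaces the asserted one.
    """
    for _ in range(max_rounds):
        revised = False
        for claim in chain:
            observed = tools[claim.tool](image, claim.arg)
            if observed != claim.value:
                claim.value = observed
                revised = True
        if not revised:  # chain is consistent with the visual evidence
            break
    return chain
```

As a usage example under the same assumptions, a chain that asserts two apples for an image containing three would have that step corrected to three, while a step that already matches the tool output is left unchanged.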

