CV · AI · Sep 30, 2025

More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models

arXiv:2509.25848v2 · 18 citations · h-index: 5
Originality: Incremental advance
AI Analysis

This paper addresses visual forgetting in multimodal reasoning for VLMs, an incremental advance in model reliability on visual tasks.

The study finds that multimodal reasoning in Vision-Language Models improves logical inference and performance on complex tasks but can impair perceptual grounding, causing recognition failures on basic visual questions. The authors attribute this to visual forgetting. They propose Vision-Anchored Policy Optimization (VAPO), which steers reasoning toward visually grounded trajectories; their model, VAPO-Thinker-7B, achieves new state-of-the-art results on multiple benchmarks.

Reasoning has emerged as a pivotal capability in Large Language Models (LLMs). Through Reinforcement Learning (RL), typically Group Relative Policy Optimization (GRPO), these models are able to solve complex tasks such as mathematics and code generation. Building on these advances, recent research has sought to extend reasoning to Vision-Language Models (VLMs), yielding promising results across diverse visual tasks. Despite this progress, our study uncovers the dual nature of multimodal reasoning: while it substantially enhances logical inference and facilitates performance on challenging problems, it may gradually impair perceptual grounding, leading to recognition failures on otherwise basic visual questions. Through further analysis, we attribute this phenomenon to visual forgetting, wherein prolonged reasoning causes the model to increasingly disregard visual input. To address this, we propose Vision-Anchored Policy Optimization (VAPO), a simple yet effective method that explicitly steers the reasoning process toward visually grounded trajectories. Our resulting model, VAPO-Thinker-7B, significantly strengthens the model's reliance on visual information and achieves new state-of-the-art results on a wide range of established benchmarks. Project page: https://xytian1008.github.io/VAPO/
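
For readers unfamiliar with GRPO, the group-relative advantage it is named for can be sketched as below. This is a minimal illustration of the general GRPO idea, not the VAPO objective, whose details are not given in the abstract. The function name `group_relative_advantages`, the `eps` stabilizer, the use of the population standard deviation, and the binary rewards are all assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled response relative to its own group.

    Each reward is centered on the group mean and scaled by the
    group's standard deviation, so no learned value function is needed.
    The population standard deviation and eps are illustrative choices.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for one prompt
# (1.0 = correct answer, 0.0 = incorrect answer).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that beat the group average receive positive advantages and the rest receive negative ones. If every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.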

