CV · CL · Sep 15, 2025

Look Again, Think Slowly: Enhancing Visual Reflection in Vision-Language Models

arXiv:2509.12132v1 · 19 citations · h-index: 6 · EMNLP
Originality: Incremental advance
AI Analysis

This work addresses a critical bottleneck in training visual reasoning models for applications requiring detailed visual analysis, though it is incremental as it builds on existing slow-thinking and VLM frameworks.

The paper tackles limited visual reflection in vision-language models (VLMs) used for visual reasoning, where attention to visual information diminishes as responses grow longer. It proposes Reflection-V, which enhances visual reflection through reasoning data construction and reinforcement learning rewards and improves performance across multiple benchmarks.

Recent advances in text-only "slow-thinking" reasoning have prompted efforts to transfer this capability to vision-language models (VLMs), for training visual reasoning models (VRMs). However, such transfer faces critical challenges: effective "slow thinking" in VRMs requires visual reflection, the ability to check the reasoning process based on visual information. Through quantitative analysis, we observe that current VRMs exhibit limited visual reflection, as their attention to visual information diminishes rapidly with longer generated responses. To address this challenge, we propose a new VRM, Reflection-V, which enhances visual reflection based on reasoning data construction for cold-start and reward design for reinforcement learning (RL). Firstly, we construct vision-centered reasoning data by leveraging an agent that interacts between VLMs and reasoning LLMs, enabling cold-start learning of visual reflection patterns. Secondly, a visual-attention-based reward model is employed during RL to encourage reasoning based on visual information. As a result, Reflection-V demonstrates significant improvements across multiple visual reasoning benchmarks. Furthermore, Reflection-V maintains a stronger and more consistent reliance on visual information during visual reasoning, indicating effective enhancement in visual reflection capabilities.
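The abstract's observation that attention to visual information fades as the response grows can be illustrated with a small sketch. This is not the paper's metric or reward: the function names `visual_attention_ratio` and `late_visual_reward`, the head-averaged attention input, and the choice to average over the second half of the response are all assumptions made for illustration.

```python
from typing import List, Sequence, Set


def visual_attention_ratio(
    attn_rows: Sequence[Sequence[float]],
    visual_positions: Set[int],
) -> List[float]:
    """Share of each generated token's attention mass that lands on visual tokens.

    attn_rows[t][j] is assumed to be the head-averaged attention weight that
    generated token t assigns to context position j. Rows may differ in
    length, since later tokens attend over a longer context.
    """
    ratios = []
    for row in attn_rows:
        total = sum(row)
        visual = sum(w for j, w in enumerate(row) if j in visual_positions)
        ratios.append(visual / total if total > 0 else 0.0)
    return ratios


def late_visual_reward(ratios: Sequence[float]) -> float:
    """Hypothetical reward: mean visual-attention share over the second half
    of the response (an illustrative choice, not the paper's formula)."""
    if not ratios:
        return 0.0
    late = ratios[len(ratios) // 2:]
    return sum(late) / len(late)


# Toy example: context positions 0 and 1 are image tokens, the rest are text.
attn = [
    [0.4, 0.3, 0.3],
    [0.2, 0.2, 0.3, 0.3],
    [0.1, 0.1, 0.3, 0.3, 0.2],
    [0.05, 0.05, 0.3, 0.3, 0.2, 0.1],
]
ratios = visual_attention_ratio(attn, {0, 1})
reward = late_visual_reward(ratios)
```

In this toy input the per-token visual share falls steadily across the four generated tokens, which mirrors the decay the abstract describes. A reward of this shape would score responses higher when later tokens keep attending to the image.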
