CV | CL | Jan 1

From Sight to Insight: Improving Visual Reasoning Capabilities of Multimodal Models via Reinforcement Learning

arXiv:2601.00215v1 | h-index: 17 | Has Code
Originality: Incremental advance
AI Analysis

This paper addresses the visual-perception bottleneck in MLLMs on tasks such as visual puzzles, offering an incremental improvement through reward-driven reinforcement learning.

The paper tackles the problem that multimodal large language models (MLLMs) generate reasoning that poorly integrates visual information, which limits their ability to solve visual puzzles. It shows that converting images into textual descriptions improves performance by 26.7% for Claude 3.5 and 23.6% for Claude 3.7, and that its reinforcement learning approach yields a 5.56% improvement over the base model on Qwen-2.5-VL-7B.

Reinforcement learning (RL) has emerged as a promising approach for eliciting reasoning chains before generating final answers. However, multimodal large language models (MLLMs) generate reasoning that lacks integration of visual information. This limits their ability to solve problems that demand accurate visual perception, such as visual puzzles. We show that visual perception is the key bottleneck in such tasks: converting images into textual descriptions significantly improves performance, yielding gains of 26.7% for Claude 3.5 and 23.6% for Claude 3.7. To address this, we investigate reward-driven RL as a mechanism to unlock long visual reasoning in open-source MLLMs without requiring costly supervision. We design and evaluate six reward functions targeting different reasoning aspects, including image understanding, thinking steps, and answer accuracy. Using group relative policy optimization (GRPO), our approach explicitly incentivizes longer, structured reasoning and mitigates bypassing of visual information. Experiments on Qwen-2.5-VL-7B achieve 5.56% improvements over the base model, with consistent gains across both in-domain and out-of-domain settings.
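To make the GRPO step mentioned in the abstract more concrete, here is a minimal sketch of its group-relative advantage: several completions are sampled for the same prompt, each is scored by a reward function, and each reward is normalized by the mean and standard deviation of its group. The reward function below is a hypothetical toy and is not one of the paper's six rewards. The `<think>`/`<answer>` tag format, the weights, and the names `toy_reward` and `group_relative_advantages` are all assumptions made for illustration.

```python
import re
from statistics import mean, pstdev


def toy_reward(completion, gold, w_format=0.2, w_answer=0.8):
    """Hypothetical reward: a format term plus an answer-accuracy term.

    Assumes (for illustration only) that the model wraps its reasoning in
    <think>...</think> and its final answer in <answer>...</answer>.
    """
    has_think = re.search(r"<think>.+?</think>", completion, re.DOTALL) is not None
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    correct = match is not None and match.group(1).strip() == gold.strip()
    return w_format * float(has_think) + w_answer * float(correct)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group, as in GRPO."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps avoids division by zero when every completion gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, four sampled completions (made-up strings, not real model output)
group = [
    "<think>count the triangles row by row</think><answer>B</answer>",
    "<answer>C</answer>",
    "<think>the pattern rotates clockwise</think><answer>C</answer>",
    "<answer>B</answer>",
]
rewards = [toy_reward(c, gold="B") for c in group]
advantages = group_relative_advantages(rewards)
```

Completions that score above their group's mean receive positive advantages and are reinforced, while those below the mean are penalized. Because the baseline comes from the group itself, no separate value model is needed.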

