CV · AI · Sep 29, 2025

VTPerception-R1: Enhancing Multimodal Reasoning via Explicit Visual and Textual Perceptual Grounding

arXiv:2509.24776v1 · 3 citations · h-index: 3 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the challenge of perception-grounded reasoning in multimodal AI, offering a scalable and auditable solution, though it appears incremental as it builds on existing MLLM methods.

The paper addresses the tendency of multimodal large language models to reason without grounding in perceptual evidence. It proposes VTPerception-R1, a two-stage framework that decouples perception from reasoning, and reports significant improvements in reasoning accuracy and robustness across diverse tasks.

Multimodal large language models (MLLMs) often struggle to ground reasoning in perceptual evidence. We present a systematic study of perception strategies (explicit, implicit, visual, and textual) across four multimodal benchmarks and two MLLMs. Our findings show that explicit perception, especially when paired with textual cues, consistently yields the best improvements, particularly for smaller models. Based on this insight, we propose VTPerception-R1, a unified two-stage framework that decouples perception from reasoning. Stage 1 introduces perception-augmented fine-tuning, and Stage 2 applies perception-aware reinforcement learning with novel visual, textual, and consistency rewards. Experiments demonstrate that VTPerception-R1 significantly improves reasoning accuracy and robustness across diverse tasks, offering a scalable and auditable solution for perception-grounded multimodal reasoning. Our code is available at: https://github.com/yizhuoDi/VTPerceprion-R1.
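
The abstract names three perception-aware reward terms for Stage 2 (visual, textual, and consistency) but does not define them. The sketch below is one hypothetical way such terms could be combined with an answer-correctness reward. Everything in it is an assumption made for illustration and is not taken from the paper or its repository: the substring-coverage scoring, the weighted-sum combination, the weight values, and the names `RewardWeights`, `coverage`, and `combined_reward`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights; the abstract gives no values.
    answer: float = 1.0
    visual: float = 0.5
    textual: float = 0.5
    consistency: float = 0.5


def coverage(reference_items: set[str], text: str) -> float:
    """Fraction of reference items that appear (case-insensitively) in text."""
    if not reference_items:
        return 0.0
    lowered = text.lower()
    hits = sum(1 for item in reference_items if item.lower() in lowered)
    return hits / len(reference_items)


def combined_reward(
    visual_perception: str,
    textual_perception: str,
    reasoning: str,
    answer: str,
    visual_refs: set[str],
    textual_refs: set[str],
    gold_answer: str,
    weights: RewardWeights = RewardWeights(),
) -> float:
    """Weighted sum of answer, visual, textual, and consistency terms (assumed form)."""
    # Visual term: how many annotated visual facts the model's perception mentions.
    r_visual = coverage(visual_refs, visual_perception)
    # Textual term: how many annotated textual cues the model's perception mentions.
    r_textual = coverage(textual_refs, textual_perception)

    # Consistency term: of the facts the model actually perceived,
    # how many does its reasoning go on to use.
    all_perception = f"{visual_perception} {textual_perception}".lower()
    perceived = {
        item for item in (visual_refs | textual_refs)
        if item.lower() in all_perception
    }
    r_consistency = coverage(perceived, reasoning)

    # Answer term: exact match after trimming whitespace.
    r_answer = 1.0 if answer.strip() == gold_answer.strip() else 0.0

    return (
        weights.answer * r_answer
        + weights.visual * r_visual
        + weights.textual * r_textual
        + weights.consistency * r_consistency
    )


if __name__ == "__main__":
    score = combined_reward(
        visual_perception="A triangle with a right angle.",
        textual_perception="The question asks for the hypotenuse.",
        reasoning="Since the triangle has a right angle, the hypotenuse is 5.",
        answer="5",
        visual_refs={"triangle", "right angle"},
        textual_refs={"hypotenuse"},
        gold_answer="5",
    )
    print(score)
```

Keeping the perception terms separate from the answer term reflects the decoupling the abstract describes: a response with the correct answer but an empty perception trace would earn only the answer reward under this assumed scheme.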

