CV | AI | Dec 27, 2025

Self-Rewarded Multimodal Coherent Reasoning Across Diverse Visual Domains

arXiv:2512.22545v1 | h-index: 12 | Has Code
Originality: Highly original
AI Analysis

The paper addresses weak step-to-step coherence and insufficient visual grounding in multimodal reasoning. It represents a strong, specific gain rather than a broad paradigm shift.

The paper tackles unreliable reasoning in multimodal LLMs by introducing SR-MCR, a lightweight, label-free framework that aligns reasoning using five self-referential cues. Among open-source models of comparable size, SR-MCR-7B achieves state-of-the-art performance, with an average accuracy of 81.4% across visual benchmarks.

Multimodal LLMs often produce fluent yet unreliable reasoning, exhibiting weak step-to-step coherence and insufficient visual grounding, largely because existing alignment approaches supervise only the final answer while ignoring the reliability of the intermediate reasoning process. We introduce SR-MCR, a lightweight and label-free framework that aligns reasoning by exploiting intrinsic process signals derived directly from model outputs. Five self-referential cues -- semantic alignment, lexical fidelity, non-redundancy, visual grounding, and step consistency -- are integrated into a normalized, reliability-weighted reward that provides fine-grained process-level guidance. A critic-free GRPO objective, enhanced with a confidence-aware cooling mechanism, further stabilizes training and suppresses trivial or overly confident generations. Built on Qwen2.5-VL, SR-MCR improves both answer accuracy and reasoning coherence across a broad set of visual benchmarks; among open-source models of comparable size, SR-MCR-7B achieves state-of-the-art performance with an average accuracy of 81.4%. Ablation studies confirm the independent contributions of each reward term and the cooling module.
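The abstract does not give the reward formula, so the following is only a minimal sketch of how a reliability-weighted process reward and a critic-free, group-relative advantage might fit together. The function names, the assumption that each cue score lies in [0, 1], the weighted-average form, and especially the `cool` rule with its `threshold` and `strength` parameters are all hypothetical illustrations, not the authors' implementation. The only part taken from standard GRPO practice is standardizing rewards within a group of sampled responses instead of using a learned critic.

```python
from statistics import mean, pstdev

# The five self-referential cues named in the abstract.
CUES = (
    "semantic_alignment",
    "lexical_fidelity",
    "non_redundancy",
    "visual_grounding",
    "step_consistency",
)


def combined_reward(cue_scores, reliability):
    """Reliability-weighted average of cue scores.

    Assumption: every cue score is already normalized to [0, 1] and every
    reliability weight is non-negative with a positive total.
    """
    total_weight = sum(reliability[c] for c in CUES)
    return sum(reliability[c] * cue_scores[c] for c in CUES) / total_weight


def cool(reward, confidence, threshold=0.9, strength=0.5):
    """Hypothetical cooling rule: shrink the reward of over-confident outputs.

    `confidence` stands for something like mean token probability in [0, 1].
    The linear form and both parameters are assumptions for illustration.
    """
    if confidence <= threshold:
        return reward
    excess = (confidence - threshold) / (1.0 - threshold)
    return reward * (1.0 - strength * excess)


def group_relative_advantages(rewards, eps=1e-8):
    """Critic-free advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; a sample std would also work
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    weights = {c: 1.0 for c in CUES}  # equal reliability, purely illustrative
    group = [
        ({c: 0.9 for c in CUES}, 0.95),  # strong cues, very confident
        ({c: 0.6 for c in CUES}, 0.70),  # moderate cues
        ({c: 0.3 for c in CUES}, 0.50),  # weak cues
    ]
    rewards = [cool(combined_reward(s, weights), conf) for s, conf in group]
    print(group_relative_advantages(rewards))
```

In this sketch the advantages within a group sum to roughly zero, so a response is reinforced only to the extent that its process reward beats that of its siblings sampled for the same prompt.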

