CV | Dec 8, 2025

MMRPT: MultiModal Reinforcement Pre-Training via Masked Vision-Dependent Reasoning

arXiv:2512.07203v1 | h-index: 3
Originality: Highly original
AI Analysis

This paper addresses descriptive bias in multimodal models, offering AI researchers a novel pre-training approach that enhances visual grounding.

The paper argues that multimodal pre-training is biased toward surface linguistic cues and introduces MMRPT, a masked multimodal reinforcement pre-training framework that strengthens visual reasoning in MLLMs. The authors report consistent zero-shot gains across diverse benchmarks and improved robustness under supervised fine-tuning.

Multimodal pre-training remains constrained by the descriptive bias of image-caption pairs, leading models to favor surface linguistic cues over grounded visual understanding. We introduce MMRPT, a masked multimodal reinforcement pre-training framework that strengthens visual reasoning in MLLMs. We are the first to incorporate reinforcement learning directly into the pre-training of large vision-language models, enabling learning signals that reward visual grounding rather than caption imitation. MMRPT constructs masked multimodal data by estimating sentence-level visual dependency via attention over visual tokens and masking highly vision-dependent segments; the model reconstructs these spans through vision-grounded reasoning guided by a semantic-visual reward. Experiments show consistent zero-shot gains across diverse benchmarks and substantially improved robustness under supervised fine-tuning, demonstrating that reinforcement-driven masked reasoning provides a more reliable and generalizable pre-training objective for multimodal models.
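The masking step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how attention is aggregated or how "highly vision-dependent" is thresholded, so everything here is an assumption made for illustration: the score is the mean attention mass that a sentence's tokens place on visual tokens, the visual tokens are taken to be the first `num_visual` keys, and the top `ratio` fraction of sentences is masked. The function names, the `ratio` parameter, and the `[MASK]` placeholder are hypothetical and not taken from the paper.

```python
from typing import List, Sequence, Tuple


def sentence_visual_dependency(
    attn: Sequence[Sequence[float]],
    sentence_spans: Sequence[Tuple[int, int]],
    num_visual: int,
) -> List[float]:
    """Score each sentence by its mean attention mass on visual tokens.

    attn[t][k] is the attention weight from text token t to key k.
    Assumption: the first `num_visual` keys are the visual tokens.
    sentence_spans holds half-open (start, end) ranges of text-token rows.
    """
    scores = []
    for start, end in sentence_spans:
        rows = attn[start:end]
        # Attention mass each token in the sentence puts on visual keys.
        visual_mass = [sum(row[:num_visual]) for row in rows]
        scores.append(sum(visual_mass) / len(visual_mass))
    return scores


def mask_vision_dependent(
    sentences: Sequence[str],
    scores: Sequence[float],
    ratio: float = 0.34,
    mask_token: str = "[MASK]",
) -> List[str]:
    """Replace the most vision-dependent sentences with a mask token.

    Assumption: "highly vision-dependent" means the top `ratio` fraction
    of sentences by score, with at least one sentence always masked.
    """
    k = max(1, int(len(sentences) * ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    to_mask = set(ranked[:k])
    return [mask_token if i in to_mask else s for i, s in enumerate(sentences)]


if __name__ == "__main__":
    # Toy attention: 4 text tokens, 2 visual keys followed by 2 text keys.
    attn = [
        [0.6, 0.3, 0.05, 0.05],  # sentence 0
        [0.5, 0.3, 0.10, 0.10],  # sentence 0
        [0.1, 0.1, 0.40, 0.40],  # sentence 1
        [0.3, 0.3, 0.20, 0.20],  # sentence 2
    ]
    spans = [(0, 2), (2, 3), (3, 4)]
    sentences = ["A red kite flies.", "It is a nice day.", "Two kids watch it."]
    scores = sentence_visual_dependency(attn, spans, num_visual=2)
    print(mask_vision_dependent(sentences, scores))
```

In a real pipeline the attention weights would come from the model itself, and the masked spans would then be reconstructed under the semantic-visual reward the abstract mentions; the form of that reward is not given in this excerpt, so it is left out of the sketch.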

