CL · CV · May 28

On Asymmetric Optimization of Reasoning and Perception in Vision-Language Model Post-Training

arXiv:2605.29496 · 31.0 · h-index: 20
Predicted impact top 39% in CL · last 90 days · Originality: Incremental advance
AI Analysis

For practitioners training vision-language models, this work diagnoses a key bottleneck and offers simple, effective fixes to balance perception and reasoning improvements.

The paper identifies a perception-reasoning asymmetry in vision-language model post-training, where reasoning improves more than perception. It proposes loss reweighting for SFT (up to 18.2% gain) and perception-aware rewards for RL (up to 6.0% gain) to mitigate this imbalance.

Post-training has greatly improved reasoning in frontier vision-language models, yet its gains for perception remain comparatively limited, creating a bottleneck for end-to-end visual reasoning. To investigate this gap, we introduce a controlled diagnostic framework with two synthetic tasks that disentangle perception from reasoning. Our analysis reveals a consistent perception-reasoning asymmetry: post-training improves reasoning more substantially than perception, though the underlying mechanism differs by training paradigm. For supervised fine-tuning (SFT), this asymmetry stems from token imbalance in chain-of-thought supervision, where perception occupies fewer tokens and thus receives a weaker training signal. Dynamically reweighting the loss mitigates this imbalance and boosts end-to-end performance by up to 18.2. For reinforcement learning (RL), the asymmetry instead arises from reward coupling: outcome rewards correlate more strongly with reasoning than with perception, weakening the signal for perception learning. Adding a perception-aware reward alleviates the imbalance and improves end-to-end accuracy by up to 6.0; even without ground-truth perception rewards, a reliable surrogate reward provides a useful signal, yielding gains of 3.2 points. Together, our results comprehensively diagnose asymmetric optimization and suggest concrete interventions to balance perception and reasoning.
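To make the two interventions concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not give the reweighting rule or the reward formula, so the equal-weight segment averaging, the additive reward, and the names `reweighted_sft_loss`, `perception_aware_reward`, and `perception_weight` are all hypothetical. The paper also describes its reweighting as dynamic, whereas this sketch uses a fixed rule.

```python
from typing import Sequence


def reweighted_sft_loss(token_losses: Sequence[float],
                        is_perception: Sequence[bool]) -> float:
    """Average per-token losses per segment, then average the segments.

    Assumption: each chain-of-thought token is tagged as perception or
    reasoning. A plain token mean lets the longer reasoning segment
    dominate; averaging within each segment first gives both segments
    equal weight regardless of their token counts.
    """
    perception = [l for l, p in zip(token_losses, is_perception) if p]
    reasoning = [l for l, p in zip(token_losses, is_perception) if not p]
    segments = [s for s in (perception, reasoning) if s]
    if not segments:
        return 0.0
    return sum(sum(s) / len(s) for s in segments) / len(segments)


def perception_aware_reward(answer_correct: bool,
                            perception_correct: bool,
                            perception_weight: float = 0.5) -> float:
    """Outcome reward plus a separate bonus for correct perception.

    Assumption: a perception check (ground truth or a surrogate) is
    available; the additive form and default weight are illustrative.
    """
    return float(answer_correct) + perception_weight * float(perception_correct)


# One perception token with high loss, four reasoning tokens with low loss.
losses = [2.0, 1.0, 1.0, 1.0, 1.0]
tags = [True, False, False, False, False]
plain_mean = sum(losses) / len(losses)         # perception token is diluted
balanced = reweighted_sft_loss(losses, tags)   # perception counts as half
```

In this toy case the single perception token contributes one fifth of the plain token mean but half of the balanced loss, which is the kind of stronger perception signal the abstract attributes to reweighting. The separate reward term likewise credits correct perception even when the final answer is wrong, weakening the coupling between the reward and reasoning alone.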
