Reinforced Attention Learning

Bangzheng Li, Jianmo Ni, Chen Qu, Ian Miao, Liu Yang, Xingyu Fu, Muhao Chen, Derek Zhiyuan Cheng

arXiv:2602.04884v11.11 citations, h-index: 7

Originality: Highly original

AI Analysis

This work addresses the challenge of improving perception and grounding in MLLMs, offering a novel approach for multimodal post-training that could benefit AI systems handling complex multimodal inputs.

The paper addresses the limited gains from Reinforcement Learning (RL) post-training in Multimodal Large Language Models (MLLMs). It proposes Reinforced Attention Learning (RAL), a policy-gradient framework that optimizes internal attention distributions rather than output token sequences, and reports consistent gains over baselines such as GRPO across diverse image and video benchmarks.

Post-training with Reinforcement Learning (RL) has substantially improved reasoning in Large Language Models (LLMs) via test-time scaling. However, extending this paradigm to Multimodal LLMs (MLLMs) through verbose rationales yields limited gains for perception and can even degrade performance. We propose Reinforced Attention Learning (RAL), a policy-gradient framework that directly optimizes internal attention distributions rather than output token sequences. By shifting optimization from what to generate to where to attend, RAL promotes effective information allocation and improved grounding in complex multimodal inputs. Experiments across diverse image and video benchmarks show consistent gains over GRPO and other baselines. We further introduce On-Policy Attention Distillation, demonstrating that transferring latent attention behaviors yields stronger cross-modal alignment than standard knowledge distillation. Our results position attention policies as a principled and general alternative for multimodal post-training.
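The abstract gives no formulas, so the following is only a toy illustration of the general idea of applying a policy gradient to an attention distribution. It is not the authors' RAL objective. It assumes a single categorical attention distribution over four hypothetical input regions, a binary reward for attending to the relevant region, and GRPO-style group-normalized advantages. All function names, the reward, and the hyperparameters are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array of logits."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: rewards standardized within one rollout group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def attention_policy_gradient(logits, attended, advantages):
    """REINFORCE estimate of the gradient w.r.t. attention logits.

    For p = softmax(logits), the gradient of log p[a] is onehot(a) - p.
    """
    p = softmax(logits)
    grad = np.zeros_like(logits)
    for a, adv in zip(attended, advantages):
        onehot = np.zeros_like(logits)
        onehot[a] = 1.0
        grad += adv * (onehot - p)
    return grad / len(attended)

def train_toy(num_regions=4, relevant=2, group_size=8, steps=200, lr=0.5, seed=0):
    """Toy loop (assumed setup): reward 1 when the sampled region is the relevant one."""
    rng = np.random.default_rng(seed)
    logits = np.zeros(num_regions)  # start from uniform attention
    for _ in range(steps):
        p = softmax(logits)
        attended = rng.choice(num_regions, size=group_size, p=p)
        rewards = (attended == relevant).astype(float)
        adv = group_relative_advantages(rewards)
        logits += lr * attention_policy_gradient(logits, attended, adv)  # gradient ascent
    return softmax(logits)

if __name__ == "__main__":
    print(train_toy())
```

In this sketch the update changes where the model attends rather than what it generates, which is the shift in optimization target the abstract describes. A real MLLM has many attention heads and layers, and how the paper defines the policy and its gradient over them is not stated here.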
