AI · CL · Jan 29

Self-Compression of Chain-of-Thought via Multi-Agent Reinforcement Learning

arXiv:2601.21919v1 · 6 citations · h-index: 5
Originality: Highly original
AI Analysis

The paper targets the inference overhead faced by users of Large Reasoning Models (LRMs), offering a novel method that shortens reasoning chains while also improving accuracy.

The paper tackles redundant reasoning in Large Reasoning Models, which creates inference overhead and bottlenecks deployment. It proposes a multi-agent reinforcement learning framework that selectively penalizes redundant chunks while preserving essential logic. Reported results across model scales are a 11.1% to 39.0% reduction in response length and an accuracy gain of 4.33% to 10.02%.

The inference overhead induced by redundant reasoning undermines the interactive experience and severely bottlenecks the deployment of Large Reasoning Models. Existing reinforcement learning (RL)-based solutions tackle this problem by coupling a length penalty with outcome-based rewards. This simplistic reward weighting struggles to reconcile brevity with accuracy, as enforcing brevity may compromise critical reasoning logic. In this work, we address this limitation by proposing a multi-agent RL framework that selectively penalizes redundant chunks, while preserving essential reasoning logic. Our framework, Self-Compression via MARL (SCMA), instantiates redundancy detection and evaluation through two specialized agents: a Segmentation Agent for decomposing the reasoning process into logical chunks, and a Scoring Agent for quantifying the significance of each chunk. The Segmentation and Scoring agents collaboratively define an importance-weighted length penalty during training, incentivizing a Reasoning Agent to prioritize essential logic without introducing inference overhead during deployment. Empirical evaluations across model scales demonstrate that SCMA reduces response length by 11.1% to 39.0% while boosting accuracy by 4.33% to 10.02%. Furthermore, ablation studies and qualitative analysis validate that the synergistic optimization within the MARL framework fosters emergent behaviors, yielding more powerful LRMs compared to vanilla RL paradigms.
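
To make the contrast between a plain length penalty and an importance-weighted one concrete, here is a minimal Python sketch. It is not the paper's formulation: the abstract does not give the reward equation, so the `(1 - importance)` weighting, the whitespace token count, the coefficient `lam`, and all names (`Chunk`, `weighted_length_penalty`, `reward`, `uniform_reward`) are assumptions made purely for illustration. In SCMA the chunks and scores would come from the Segmentation and Scoring agents; here they are hard-coded.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    text: str          # one logical chunk (hypothetical Segmentation Agent output)
    importance: float  # score in [0, 1] (hypothetical Scoring Agent output)


def chunk_length(chunk: Chunk) -> int:
    """Length in whitespace-separated tokens (an assumption, not a real tokenizer)."""
    return len(chunk.text.split())


def weighted_length_penalty(chunks: List[Chunk]) -> float:
    """Sum of chunk lengths, each discounted by its importance.

    A chunk with importance 1.0 contributes nothing; a chunk with
    importance 0.0 contributes its full length.
    """
    return sum((1.0 - c.importance) * chunk_length(c) for c in chunks)


def reward(chunks: List[Chunk], correct: bool, lam: float = 0.01) -> float:
    """Outcome reward minus the scaled importance-weighted length penalty."""
    outcome = 1.0 if correct else 0.0
    return outcome - lam * weighted_length_penalty(chunks)


def uniform_reward(chunks: List[Chunk], correct: bool, lam: float = 0.01) -> float:
    """Baseline for comparison: a length penalty that ignores importance."""
    outcome = 1.0 if correct else 0.0
    return outcome - lam * sum(chunk_length(c) for c in chunks)


if __name__ == "__main__":
    trace = [
        Chunk("compute 12 times 7 equals 84", importance=1.0),
        Chunk("wait let me double check that again just in case", importance=0.0),
    ]
    print("importance-weighted:", reward(trace, correct=True))
    print("uniform:", uniform_reward(trace, correct=True))
```

Under these assumptions the uniform baseline charges the essential calculation and the redundant re-check alike, whereas the weighted version charges only the chunk scored as unimportant. Both penalties are training-time signals, which is consistent with the abstract's claim that the extra agents add no inference overhead at deployment.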
