CL AI LG
Dec 19, 2025

Reinforcement Learning for Chain of Thought Compression with One-Domain-to-All Generalization

arXiv:2601.06052v2
3 citations
h-index: 6
Originality: Highly original
AI Analysis

This paper addresses the computational cost and latency of chain-of-thought reasoning for users of large language models, offering a novel compression approach that generalizes across domains.

The paper tackles the problem of inefficiently long chain-of-thought reasoning in large language models, which increases cost and latency without reliable accuracy gains, by proposing a reinforcement learning compression method that reduces response length by 20-40% while maintaining or improving accuracy and generalizing across domains.

Chain-of-thought reasoning in large language models can trigger an "overthinking trap": longer rollouts raise cost and latency yet often yield unreliable accuracy gains. Existing methods use global, static controls that may suppress needed reasoning. We propose mastery-gated, sample-level, soft reinforcement learning compression that penalizes long rollouts only when the model already solves the problem and has produced a shorter rollout. Across benchmarks, it cuts response length by 20-40% with comparable or higher accuracy and generalizes across domains: a model trained on math spontaneously shortens unseen tasks (code, instruction following, general-knowledge QA) without hurting accuracy. We further show two-way transfer between non-agent CoT and tool-use agents: non-agent training reduces SWE-Bench Verified rounds by 13%, while compressing a thinking agent cuts SWE trajectories by 67% tokens and 52% rounds and shortens non-agent outputs by up to 44%. Compression is thus not cosmetic brevity, but an inherent computation policy -- what to keep, and what to forget.
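The gating idea in the abstract can be illustrated with a small sketch. The abstract does not give the reward formula, so everything specific here is an assumption made for illustration: the function name `gated_length_rewards`, the group solve rate used as the mastery signal, the `mastery_threshold` and `alpha` parameters, and the relative-excess penalty. It only shows the stated condition, that a rollout is penalized for length only when the model already solves the problem and a shorter correct rollout exists in the same group.

```python
from typing import List


def gated_length_rewards(
    correct: List[bool],
    lengths: List[int],
    mastery_threshold: float = 0.75,  # hypothetical: solve rate that counts as "mastered"
    alpha: float = 0.2,               # hypothetical: maximum size of the soft length penalty
) -> List[float]:
    """Per-rollout rewards for one prompt's group of sampled rollouts.

    Illustrative only; not the paper's actual reward.
    """
    if len(correct) != len(lengths):
        raise ValueError("correct and lengths must have the same length")
    if not correct:
        return []

    # Base reward: 1 for a correct rollout, 0 otherwise.
    rewards = [1.0 if c else 0.0 for c in correct]

    # Mastery gate: leave rewards untouched unless the prompt is already solved reliably.
    solve_rate = sum(correct) / len(correct)
    if solve_rate < mastery_threshold or not any(correct):
        return rewards

    # Sample-level soft penalty: only correct rollouts that are longer than the
    # shortest correct rollout lose reward, in proportion to their relative excess.
    shortest = min(n for c, n in zip(correct, lengths) if c)
    for i, (c, n) in enumerate(zip(correct, lengths)):
        if c and n > shortest:
            rewards[i] -= alpha * (n - shortest) / n
    return rewards
```

With this sketch, a group where three of four rollouts are correct gets a length penalty on the two longer correct rollouts, while a group with a single correct rollout out of four is left unpenalized. That matches the abstract's point that static, global length controls may suppress reasoning the model still needs.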

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
