Concise Reasoning via Reinforcement Learning
This work addresses the computational cost and response latency that reasoning models impose on their users, though it appears incremental because it builds on existing RL methods.
The paper tackles excessive token usage in large language models. It shows that RL-based training inherently encourages lengthy responses, and it demonstrates that a secondary RL phase can significantly reduce response length while maintaining or improving accuracy.
Despite significant advancements in large language models (LLMs), a major drawback of reasoning models is their enormous token usage, which increases computational cost, resource requirements, and response time. In this work, we revisit the core principles of reinforcement learning (RL) and, through mathematical analysis, demonstrate that the tendency to generate lengthy responses arises inherently from RL-based optimization during training. This finding questions the prevailing assumption that longer responses inherently improve reasoning accuracy. Instead, we uncover a natural correlation between conciseness and accuracy that has been largely overlooked. We show that introducing a secondary phase of RL training, using a very small set of problems, can significantly shorten chains of thought while maintaining or even enhancing accuracy. Additionally, we demonstrate that, while GRPO shares some interesting properties with PPO, it suffers from collapse modes, which limit its reliability for concise reasoning. Finally, we validate our conclusions through extensive experimental results.
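As background for the GRPO remark, the following minimal sketch shows the group-normalized advantage that GRPO is commonly described as using: each sampled response's reward is standardized against the other responses to the same prompt. The function name, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration, not details taken from the paper. The sketch also shows one well-known degenerate case, in which every response in a group receives the same reward and all advantages become zero, so the group contributes no learning signal. Whether this is the same collapse mode the abstract refers to is not stated here.

```python
from statistics import mean, pstdev


def group_normalized_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt.

    Hypothetical helper for illustration: eps and the population standard
    deviation are assumed choices, not the paper's specification.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group with mixed outcomes: correct responses get positive advantages,
# incorrect ones get negative advantages.
mixed = group_normalized_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every response is wrong (or every response is right):
# all advantages are zero, so no response is preferred over another.
uniform = group_normalized_advantages([0.0, 0.0, 0.0, 0.0])
```

In the mixed group the advantages sum to approximately zero by construction, which is why the signal depends entirely on reward differences within the group rather than on the absolute reward.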