LACONIC: Length-Aware Constrained Reinforcement Learning for LLM
LACONIC addresses inefficient inference for LLM users and developers by providing robust length control, though it is an incremental improvement over prior heuristic approaches.
The paper tackles excessive response lengths in reinforcement learning for large language models, which increase inference latency and computational overhead. It proposes LACONIC, a method that enforces a target token budget during training, reducing output length by over 50% while preserving or improving task performance.
Reinforcement learning (RL) has enhanced the capabilities of large language models (LLMs) through reward-driven training. Nevertheless, this process can produce excessively long responses, inflating inference latency and computational overhead. Prior length-control approaches typically rely on fixed heuristic reward shaping, which can misalign with the task objective and require brittle tuning. In this work, we propose LACONIC, a reinforcement learning method that enforces a target token budget during training. Specifically, we update policy models using an augmented objective that combines the task reward with a length-based cost. To balance brevity and task performance, the cost scale is adaptively adjusted throughout training. This yields robust length control while preserving task reward. We provide a theoretical guarantee that supports the method. Across mathematical reasoning models and datasets, LACONIC preserves or improves pass@1 while reducing output length by over 50%. It maintains out-of-domain performance on general knowledge and multilingual benchmarks with 44% fewer tokens. Moreover, LACONIC integrates into standard RL-tuning with no inference changes and minimal deployment overhead.
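To make the augmented objective concrete, the following is a minimal sketch of one way a task reward could be combined with an adaptively scaled length cost. The abstract does not specify the cost function or the adaptation rule, so everything here is an assumption for illustration: the class name `LengthCostController`, the over-budget cost normalised by the budget, the dual-ascent style update, and the `step_size` hyperparameter are hypothetical and may differ from what LACONIC actually uses.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LengthCostController:
    """Hypothetical controller for the scale of a length-based cost."""

    budget: int              # target token budget (assumed hyperparameter)
    step_size: float = 0.01  # assumed step size for adapting the scale
    scale: float = 0.0       # current cost scale, kept non-negative

    def length_cost(self, num_tokens: int) -> float:
        # Assumed cost: tokens over budget, normalised by the budget.
        return max(0, num_tokens - self.budget) / self.budget

    def augmented_reward(self, task_reward: float, num_tokens: int) -> float:
        # Task reward minus the scaled length cost.
        return task_reward - self.scale * self.length_cost(num_tokens)

    def update(self, batch_lengths: List[int]) -> float:
        # Assumed adaptation rule: raise the scale when the mean response
        # length exceeds the budget, lower it otherwise, and clip at zero.
        mean_len = sum(batch_lengths) / len(batch_lengths)
        violation = (mean_len - self.budget) / self.budget
        self.scale = max(0.0, self.scale + self.step_size * violation)
        return self.scale
```

In a training loop under these assumptions, `augmented_reward` would replace the raw task reward passed to the policy update, and `update` would be called once per batch with the sampled response lengths. Because the scale falls back toward zero whenever responses fit the budget, the penalty only acts while the length target is violated, which is one plausible reading of how an adaptive scale avoids the brittle tuning of fixed reward shaping.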