LG, AI, CL · Oct 16, 2025

DLER: Doing Length pEnalty Right - Incentivizing More Intelligence per Token via Reinforcement Learning

UW
arXiv:2510.15110v1 · 26 citations · h-index: 45
Originality: Incremental advance
AI Analysis

This work addresses the efficiency of reasoning language models, offering a method that reduces token usage without sacrificing accuracy. It is rated incremental because it builds on existing RL and length-penalty techniques.

The paper tackles the tendency of reasoning language models to generate unnecessarily long outputs. It proposes DLER, a reinforcement learning training recipe built around a simple truncation length penalty, which cuts output length by over 70% while surpassing the accuracy of previous baselines. It also improves test-time scaling: compared to DeepSeek-R1-7B, DLER-7B reaches 28% higher accuracy at lower latency by generating multiple concise responses in parallel.

Reasoning language models such as OpenAI-o1, DeepSeek-R1, and Qwen achieve strong performance via extended chains of thought but often generate unnecessarily long outputs. Maximizing intelligence per token, that is, accuracy relative to response length, remains an open problem. We revisit reinforcement learning (RL) with the simplest length penalty, truncation, and show that accuracy degradation arises not from the lack of sophisticated penalties but from inadequate RL optimization. We identify three key challenges: (i) large bias in advantage estimation, (ii) entropy collapse, and (iii) sparse reward signal. We address them with Doing Length pEnalty Right (DLER), a training recipe combining batch-wise reward normalization, higher clipping, dynamic sampling, and a simple truncation length penalty. DLER achieves state-of-the-art accuracy-efficiency trade-offs, cutting output length by over 70 percent while surpassing the accuracy of all previous baselines. It also improves test-time scaling: compared to DeepSeek-R1-7B, DLER-7B generates multiple concise responses in parallel with 28 percent higher accuracy and lower latency. We further introduce Difficulty-Aware DLER, which adaptively tightens truncation on easier questions for additional efficiency gains. We also propose an update-selective merging method that preserves baseline accuracy while retaining the concise reasoning ability of the DLER model, which is useful for scenarios where RL training data is scarce.
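
The four ingredients named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract gives no formulas, so every function name, the binary reward, the per-prompt centering with batch-wide scaling, the rule for dropping uninformative prompts, and the clipping bounds (0.2 and 0.28) are assumptions chosen to show the general shape of each idea.

```python
from statistics import mean, pstdev


def truncation_reward(is_correct, num_tokens, max_tokens):
    """Truncation penalty (assumed form): a response over the token budget
    is treated as cut off and earns zero reward, even if it was correct."""
    return 1.0 if (is_correct and num_tokens <= max_tokens) else 0.0


def dynamic_sampling(groups):
    """Assumed filtering rule: drop prompts whose rollouts all received the
    same reward, since they carry no learning signal after centering."""
    return [g for g in groups if max(g) != min(g)]


def batchwise_advantages(groups, eps=1e-8):
    """Assumed normalization: center each reward by its prompt's mean,
    then scale by the standard deviation over the whole batch."""
    centered = [[r - mean(g) for r in g] for g in groups]
    flat = [a for g in centered for a in g]
    scale = pstdev(flat) + eps
    return [[a / scale for a in g] for g in centered]


def clipped_term(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style clipped surrogate with a larger upper bound ("higher
    clipping"). The two epsilon values are illustrative, not from the paper."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)


if __name__ == "__main__":
    # Hypothetical rollouts: (is_correct, num_tokens) for two prompts.
    rollouts = [
        [(True, 900), (True, 2500), (False, 700)],
        [(True, 300), (True, 400), (True, 350)],
    ]
    max_tokens = 2000  # hypothetical truncation budget
    groups = [
        [truncation_reward(c, n, max_tokens) for c, n in prompt]
        for prompt in rollouts
    ]
    kept = dynamic_sampling(groups)  # the all-correct prompt is dropped
    print(batchwise_advantages(kept))
```

In this toy batch the correct but over-budget response is scored the same as a wrong one, which is what pushes the policy toward shorter answers under a truncation penalty.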

