CL · May 4, 2025

Exploring the Potential of Offline RL for Reasoning in LLMs: A Preliminary Study

arXiv:2505.02142v1
h-index: 11
Originality: Incremental advance
AI Analysis

This paper addresses the need for more cost-effective methods to improve reasoning in LLMs. The contribution is incremental because it builds on existing offline RL techniques.

The study tackles the high computational cost of online reinforcement learning for improving LLM reasoning by exploring simpler offline RL methods, namely DPO and LD-DPO. It reports an average performance improvement of 3.3% across benchmarks and a 10.1% increase on Arena-Hard.

Despite significant advances in long-context reasoning by large language models (LLMs), primarily through Online Reinforcement Learning (RL) methods, these approaches incur substantial computational costs and complexity. In contrast, simpler and more economical Offline RL methods remain underexplored. To address this gap, we investigate the effectiveness of Offline RL methods, specifically Direct Preference Optimization (DPO) and its length-desensitized variant LD-DPO, in enhancing the reasoning capabilities of LLMs. Extensive experiments across multiple reasoning benchmarks demonstrate that these simpler Offline RL methods substantially improve model performance, achieving an average enhancement of 3.3%, with a particularly notable increase of 10.1% on the challenging Arena-Hard benchmark. Furthermore, we analyze DPO's sensitivity to output length, emphasizing that increasing reasoning length should align with semantic richness, as indiscriminate lengthening may adversely affect model performance. We provide comprehensive descriptions of our data processing and training methodologies, offering empirical evidence and practical insights for developing more cost-effective Offline RL approaches.
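
To make the offline objective concrete, here is a minimal sketch of the standard DPO loss for a single preference pair. It is not the paper's training code. It assumes per-token log-probabilities for the chosen and rejected responses are already available from both the policy and a frozen reference model. The function names and the default beta value are illustrative assumptions. The length-desensitized variant (LD-DPO) is not implemented here.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is a list of per-token log-probabilities (hypothetical
    inputs; in practice they come from forward passes of the policy and
    the frozen reference model). beta=0.1 is an assumed default.
    """
    # Sequence log-probability is the sum of token log-probabilities.
    # Because this is a sum, longer responses contribute more terms,
    # which is one source of the length sensitivity the abstract mentions.
    chosen_logratio = sum(policy_chosen) - sum(ref_chosen)
    rejected_logratio = sum(policy_rejected) - sum(ref_rejected)

    # Scaled difference of the two implicit rewards.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)


if __name__ == "__main__":
    # Toy numbers, invented purely for illustration.
    ref_w, ref_l = [-1.0, -1.0], [-1.0, -1.0, -1.0]
    pol_w, pol_l = [-0.5, -0.5], [-1.5, -1.5, -1.5]
    print(dpo_loss(pol_w, pol_l, ref_w, ref_l))
```

When the policy matches the reference, the margin is zero and the loss equals log 2. The loss falls below that value as the policy shifts probability toward the chosen response relative to the reference.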

