LG | CL | May 24, 2025

Enhancing Efficiency and Exploration in Reinforcement Learning for LLMs

arXiv:2505.18573v2 | 19 citations | h-index: 4 | Has Code | EMNLP
Originality: Incremental advance
AI Analysis

This work makes incremental improvements to training efficiency and exploration in reinforcement learning for large language models.

The paper tackles inefficiency in reinforcement learning for large language models by dynamically allocating rollouts based on question difficulty and using adaptive temperature to maintain exploration, resulting in improved response precision while preserving exploratory ability.

Reasoning large language models (LLMs) excel in complex tasks, which has drawn significant attention to reinforcement learning (RL) for LLMs. However, existing approaches allocate an equal number of rollouts to all questions during the RL process, which is inefficient. This inefficiency stems from the fact that training on simple questions yields limited gains, whereas more rollouts are needed for challenging questions to sample correct answers. Furthermore, while RL improves response precision, it limits the model's exploration ability, potentially resulting in a performance cap below that of the base model prior to RL. To address these issues, we propose a mechanism for dynamically allocating rollout budgets based on the difficulty of the problems, enabling more efficient RL training. Additionally, we introduce an adaptive dynamic temperature adjustment strategy to maintain the entropy at a stable level, thereby encouraging sufficient exploration. This enables LLMs to improve response precision while preserving their exploratory ability to uncover potential correct pathways. The code and data are available at: https://github.com/LiaoMengqi/E3-RL4LLMs
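
The two ideas in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of an estimated pass rate as the difficulty signal, the largest-remainder rounding, and the proportional temperature update with its step size and bounds are all assumptions chosen for illustration.

```python
import math


def allocate_rollouts(pass_rates, total_budget, min_rollouts=2):
    """Split a fixed rollout budget across questions, favoring harder ones.

    pass_rates: estimated fraction of correct samples per question, in [0, 1].
    Using pass rate as the difficulty signal is an assumption of this sketch.
    """
    n = len(pass_rates)
    if total_budget < n * min_rollouts:
        raise ValueError("budget too small to give every question min_rollouts")

    # Harder questions (lower pass rate) receive a larger weight.
    weights = [1.0 - p for p in pass_rates]
    total_weight = sum(weights)
    if total_weight == 0:
        # Every question looks solved: fall back to a uniform split.
        weights = [1.0] * n
        total_weight = float(n)

    spare = total_budget - n * min_rollouts
    raw = [spare * w / total_weight for w in weights]
    extra = [math.floor(r) for r in raw]

    # Hand out rollouts lost to flooring, largest fractional part first.
    leftover = spare - sum(extra)
    by_remainder = sorted(range(n), key=lambda i: raw[i] - extra[i], reverse=True)
    for i in by_remainder[:leftover]:
        extra[i] += 1

    return [min_rollouts + e for e in extra]


def adjust_temperature(temperature, entropy, target_entropy,
                       step=0.05, t_min=0.5, t_max=1.5):
    """Nudge the sampling temperature so policy entropy tracks a target.

    A simple proportional update (hypothetical): raise the temperature when
    entropy falls below the target, lower it when entropy is above.
    """
    new_temperature = temperature + step * (target_entropy - entropy)
    return max(t_min, min(t_max, new_temperature))
```

For example, `allocate_rollouts([0.9, 0.5, 0.1], total_budget=24)` gives the hardest question the largest share while keeping the total at 24, and `adjust_temperature(1.0, entropy=0.2, target_entropy=0.5)` returns a temperature slightly above 1.0 to push entropy back up.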

Code Implementations: 1 repo
