LG | May 6

Data-dependent Exploration for Online Reinforcement Learning from Human Feedback

arXiv:2605.04477 | 73.6 | h-index: 2
AI Analysis

For practitioners training LLMs with human feedback, DEPO offers a scalable method to improve sample efficiency, though it is an incremental improvement over existing exploration strategies.

The paper tackles the exploration challenge in online RLHF for LLMs, proposing DEPO, which uses historical preference data to construct uncertainty bonuses for under-explored regions. The authors report improved sample efficiency, with DEPO consistently outperforming strong baselines across benchmarks.

Online reinforcement learning from human feedback (RLHF) has emerged as a promising paradigm for aligning large language models (LLMs) by continuously collecting new preference feedback during training. A foundational challenge in this setting is exploration, which requires algorithms that enable LLMs to generate informative comparisons that improve sample efficiency in online RLHF. Existing exploration strategies often derive bonuses via on-policy expectations, which are difficult to estimate reliably from the limited historical preference data available during training; as a result, the policy can prematurely down-weight under-explored regions that may contain high-value behaviors. In this paper, we propose data-dependent exploration for preference optimization (DEPO), a simple and scalable method that leverages historical data to construct an extra uncertainty bonus for high-uncertainty regions, encouraging exploration toward potentially high-value data. Theoretically, we provide a data-dependent regret bound for the proposed algorithm, showing that it adapts to the hardness of the learning task itself and can be tighter than worst-case bounds in practice. Empirically, the proposed method consistently outperforms strong baselines across benchmarks, demonstrating improved sample efficiency.
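
The abstract does not spell out the form of DEPO's bonus, so the following is only a rough illustration of the general idea of an uncertainty bonus computed from historical data, not the paper's construction. It assumes each past comparison is summarized by a feature-difference vector z (a hypothetical representation) and uses the standard elliptical bonus from linear bandits, beta * sqrt(z^T A^{-1} z), where A accumulates past vectors. The class name `HistoryBonus` and the parameters `reg` and `beta` are illustrative names, not taken from the paper.

```python
import math


def matvec(m, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


class HistoryBonus:
    """Hypothetical helper: elliptical uncertainty bonus built from past comparisons.

    Assumption: a comparison is represented by a feature-difference vector z.
    This is a generic linear-bandit-style bonus, not DEPO's actual bonus.
    """

    def __init__(self, dim, reg=1.0, beta=1.0):
        # Inverse of the regularized covariance A = reg * I before any data.
        self.cov_inv = [
            [1.0 / reg if i == j else 0.0 for j in range(dim)] for i in range(dim)
        ]
        self.beta = beta

    def update(self, z):
        """Add one historical comparison: A <- A + z z^T (Sherman-Morrison)."""
        az = matvec(self.cov_inv, z)
        denom = 1.0 + dot(z, az)
        for i in range(len(z)):
            for j in range(len(z)):
                self.cov_inv[i][j] -= az[i] * az[j] / denom

    def bonus(self, z):
        """Return beta * sqrt(z^T A^{-1} z); larger where history is sparse."""
        quad = dot(z, matvec(self.cov_inv, z))
        return self.beta * math.sqrt(max(quad, 0.0))


if __name__ == "__main__":
    tracker = HistoryBonus(dim=2)
    for _ in range(3):
        tracker.update([1.0, 0.0])       # history only covers the first direction
    print(tracker.bonus([1.0, 0.0]))     # well-covered direction: smaller bonus
    print(tracker.bonus([0.0, 1.0]))     # unseen direction: larger bonus
```

In this toy setup the bonus shrinks along directions that the historical data already covers and stays large along unseen ones, which is the qualitative behavior the abstract attributes to a bonus for high-uncertainty regions. How such a bonus enters the preference-optimization objective is left open here, since the abstract does not say.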
