CL · LG · Jan 4, 2025

REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization

arXiv:2501.03262v993 citations · h-index: 5
Originality: Highly original
AI Analysis

The paper addresses the computational and memory overhead of RLHF when aligning large language models, offering a more efficient and stable alternative to existing critic-based and critic-free methods.

The paper tackles inaccurate advantage estimation and instability in critic-free reinforcement learning from human feedback (RLHF) algorithms by introducing REINFORCE++ with global advantage normalization. The authors report better stability and performance than existing methods, including PPO, in complex agentic settings.

Reinforcement Learning from Human Feedback (RLHF) plays a crucial role in aligning Large Language Models (LLMs). The dominant algorithm, Proximal Policy Optimization (PPO), employs a critic network to estimate advantages, which introduces significant computational and memory overhead. To address this, a family of critic-free algorithms (e.g., GRPO, RLOO) has emerged. However, these methods typically rely on prompt-level (local) advantage normalization, which yields inaccurate advantage estimates, tends to overfit, and, as we show, is a theoretically biased estimator. To solve these challenges, we introduce REINFORCE++, a critic-free framework centered on Global Advantage Normalization. By normalizing advantages across the entire global batch rather than small, prompt-specific groups, our method provides a more stable and theoretically sound estimate that is effectively unbiased (its bias vanishes as the batch size increases). We introduce two variants: REINFORCE++, a highly efficient and general algorithm (k ≥ 1) for general-domain RLHF, and REINFORCE++ w/ baseline, a robust group-sampling variant (k > 1) for complex reasoning tasks. Our empirical evaluation demonstrates that each variant shows superior stability and performance in its respective domain, outperforming existing methods and even PPO in complex agentic settings.
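The contrast between prompt-level and global normalization can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes one scalar reward per response, arranged as an array of shape (prompts, k) where k is taken to be the number of sampled responses per prompt, and it ignores KL penalties and token-level returns. The function names, the `eps` constant, and the group-mean baseline in the third function are illustrative assumptions, not details stated in the abstract.

```python
import numpy as np

EPS = 1e-8  # hypothetical stabilizer to avoid division by zero


def local_normalize(rewards: np.ndarray, eps: float = EPS) -> np.ndarray:
    """Prompt-level (local) normalization: standardize each row on its own.

    rewards has shape (num_prompts, k); every prompt uses only its own
    k samples to compute the mean and standard deviation.
    """
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)


def global_normalize(rewards: np.ndarray, eps: float = EPS) -> np.ndarray:
    """Global normalization: one mean and one std over the whole batch.

    Works for any k >= 1 because no per-prompt statistics are needed.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def group_baseline_global_normalize(rewards: np.ndarray, eps: float = EPS) -> np.ndarray:
    """Assumed form of a group-sampling variant (needs k > 1).

    Subtract the per-prompt mean as a baseline, then standardize the
    centered values with statistics taken over the whole batch.
    """
    centered = rewards - rewards.mean(axis=1, keepdims=True)
    return (centered - centered.mean()) / (centered.std() + eps)


# Two prompts, k = 2 samples each. The first prompt's rewards barely differ.
rewards = np.array([[0.50, 0.51],
                    [0.00, 1.00]])

print("local :", local_normalize(rewards))
print("global:", global_normalize(rewards))
print("group baseline + global:", group_baseline_global_normalize(rewards))
```

In this toy batch, local normalization stretches the tiny 0.01 reward gap of the first prompt to roughly the same magnitude as the full 0-to-1 gap of the second, whereas the batch-wide standard deviation keeps the first prompt's advantages small. This is one way to read the abstract's claim that small, prompt-specific groups give unstable estimates.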

Code Implementations: 5 repos
