LG · Sep 30, 2025

OPPO: Accelerating PPO-based RLHF via Pipeline Overlap

Kaizhuo Yan, Yingjie Yu, Yifan Yu, Haizhong Zheng, Fan Lai

arXiv:2509.25762v1 · 2 citations · h-index: 3

Originality: Highly original

AI Analysis

This work targets training bottlenecks faced by researchers and practitioners who use RLHF to align LLMs. Its model-agnostic solution is incremental in scope but delivers substantial efficiency gains.

The paper tackles inefficiencies in PPO-based RLHF training for aligning large language models. It introduces OPPO, a framework that accelerates training by overlapping pipeline execution, achieving speedups of 1.8× to 2.8× and GPU utilization improvements of 1.4× to 2.1× without compromising convergence.

Proximal Policy Optimization (PPO)-based reinforcement learning from human feedback (RLHF) is a widely adopted paradigm for aligning large language models (LLMs) with human preferences. However, its training pipeline suffers from substantial inefficiencies due to sequential multi-model dependencies (e.g., the reward model depends on actor outputs) and long-tail response lengths, where a few long responses delay stage completion. We present OPPO, a novel, lightweight, and model-agnostic PPO-based RLHF framework that improves training efficiency by overlapping pipeline execution. OPPO introduces two novel techniques: (1) Intra-step overlap, which streams upstream model outputs (e.g., from the actor model) in right-sized chunks, enabling the downstream model (e.g., the reward model) to begin prefill while the upstream model continues decoding; and (2) Inter-step overlap, which adaptively overcommits a few prompts and defers long generations to future steps, mitigating tail latency without discarding partial work. OPPO integrates easily with existing PPO implementations, requiring only a few lines of code changes. Extensive evaluations show that OPPO accelerates PPO-based RLHF training by $1.8\times$ to $2.8\times$ and improves GPU utilization by $1.4\times$ to $2.1\times$ without compromising training convergence.
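
The two overlap ideas in the abstract can be illustrated with a toy timing model. The sketch below is not the authors' implementation: the function names, the abstract "tick" time unit, and the fixed (non-adaptive) overcommit are all assumptions made for illustration. It only shows why streaming chunks to a downstream stage shortens a step, and why ending a step once enough generations have finished removes the long tail while keeping partial work.

```python
# Toy model only. All names and numbers here are hypothetical, not from OPPO's code.

def sequential_time(decode_times, prefill_times):
    """Baseline: the downstream model starts only after the upstream
    model has finished decoding every chunk."""
    return sum(decode_times) + sum(prefill_times)


def overlapped_time(decode_times, prefill_times):
    """Intra-step overlap: the downstream model prefills chunk i while
    the upstream model is already decoding chunk i + 1."""
    upstream_done = 0.0
    downstream_done = 0.0
    for decode, prefill in zip(decode_times, prefill_times):
        upstream_done += decode
        # Downstream can start a chunk only once that chunk exists and
        # the previous chunk's prefill has finished.
        downstream_done = max(downstream_done, upstream_done) + prefill
    return downstream_done


def inter_step(active, batch_size):
    """Inter-step overlap with a fixed overcommit (assumption: the paper
    adapts it). `active` maps prompt id -> tokens still to generate and
    holds more than `batch_size` prompts. All prompts decode one token
    per tick; the step ends when `batch_size` of them have finished.
    Unfinished prompts are deferred with their progress preserved."""
    ordered = sorted(active.items(), key=lambda item: item[1])
    ticks = ordered[batch_size - 1][1]
    finished = [pid for pid, _ in ordered[:batch_size]]
    deferred = {pid: rem - ticks for pid, rem in ordered[batch_size:]}
    return finished, deferred, ticks


if __name__ == "__main__":
    decode = [4, 4, 4]    # ticks to decode each chunk (made-up values)
    prefill = [1, 1, 1]   # ticks to prefill each chunk (made-up values)
    print(sequential_time(decode, prefill), overlapped_time(decode, prefill))

    # Five prompts launched for a batch of four; "d" is the straggler.
    active = {"a": 10, "b": 12, "c": 11, "d": 200, "e": 9}
    print(inter_step(active, batch_size=4))
```

In this toy setting the step with one overcommitted prompt ends when the fourth-fastest generation completes, and the straggler carries its remaining token count into the next step instead of being restarted.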
