LG AI CL · Feb 16, 2025

Simplify RLHF as Reward-Weighted SFT: A Variational Method

Yuhao Du, Zhuo Li, Pengyu Cheng, Zhihong Chen, Yuejiao Xie, Xiang Wan, Anningzhe Gao

arXiv:2502.11026v2 · 23.913 citations · h-index: 19

Originality: Incremental advance

AI Analysis

This work addresses the high implementation and computational cost of RLHF, a concern for AI researchers and practitioners, though it appears incremental because it builds on existing simplifications such as DPO and A-LoL.

The paper tackles the complexity and instability of Reinforcement Learning from Human Feedback (RLHF) for aligning Large Language Models by proposing VAR, a variational method that simplifies RLHF into reward-weighted supervised fine-tuning, achieving competitive performance on alignment benchmarks.

Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning Large Language Models (LLMs) with human values. However, RLHF has long been challenged by its implementation complexity and high computational cost. Even with recent simplifications, such as Direct Preference Optimization (DPO) and Advantage Leftover Lunch (A-LoL), over-fitting and training instability still keep the alignment process from reaching the expected optimal performance. To address these challenges, we propose a novel simplification of RLHF from the perspective of variational inference, called $\textbf{V}$ariational $\textbf{A}$lignment with $\textbf{R}$e-weighting ($\textbf{VAR}$). More specifically, by directly minimizing the distribution gap between the learning LLM policy and the optimal solution of RLHF, we transform the alignment objective into a reward-driven re-weighted supervised fine-tuning (SFT) form, which requires only a minor adjustment to the SFT loss to obtain a noticeable improvement in training stability and effectiveness. On comprehensive alignment and generation benchmarks, our VAR method achieves competitive performance in the helpfulness and harmlessness of aligned LLMs.
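
The re-weighted SFT form described in the abstract can be illustrated with a minimal sketch. It assumes the standard closed form of the KL-regularized RLHF optimum, $\pi^*(y|x) \propto \pi_{\text{ref}}(y|x)\exp(r(x,y)/\beta)$, and responses sampled from the reference policy, so that minimizing the gap to $\pi^*$ reduces to a weighted negative log-likelihood. The function names, the temperature `beta`, and the per-prompt softmax normalization are illustrative assumptions and may differ from the exact loss used in the paper.

```python
import math


def softmax_weights(rewards, beta=1.0):
    """Return normalized weights w_i proportional to exp(r_i / beta).

    Hypothetical helper: the normalization over responses to one prompt
    is an assumption of this sketch, not taken from the paper.
    """
    scaled = [r / beta for r in rewards]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def reward_weighted_sft_loss(seq_logprobs, rewards, beta=1.0):
    """Weighted negative log-likelihood over responses to one prompt.

    seq_logprobs[i]: log pi_theta(y_i | x), summed over the tokens of y_i
    rewards[i]:      scalar reward r(x, y_i) from a reward model
    """
    weights = softmax_weights(rewards, beta)
    return -sum(w * lp for w, lp in zip(weights, seq_logprobs))
```

With equal rewards the weights are uniform and the loss reduces to the ordinary mean SFT negative log-likelihood; as `beta` shrinks, the weight concentrates on the highest-reward response.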
