CL, AI | Sep 18, 2023

Stabilizing RLHF through Advantage Model and Selective Rehearsal

arXiv:2309.10202v1 | 22 citations | h-index: 43
Originality: Incremental advance
AI Analysis

This paper addresses training instability in RLHF for aligning LLMs with human preferences; it is an incremental improvement over existing methods.

The paper tackles instability in RLHF training for LLMs, such as reward hacking and catastrophic forgetting, by introducing an Advantage Model and Selective Rehearsal. The reported results are increased training stability, higher reward scores, and improved win rates.

Large Language Models (LLMs) have revolutionized natural language processing, yet aligning these models with human values and preferences using RLHF remains a significant challenge. This challenge is characterized by various instabilities, such as reward hacking and catastrophic forgetting. In this technical report, we propose two innovations to stabilize RLHF training: 1) Advantage Model, which directly models the advantage score, i.e., the extra reward relative to the expected reward, and regulates score distributions across tasks to prevent reward hacking. 2) Selective Rehearsal, which mitigates catastrophic forgetting by strategically selecting data for PPO training and knowledge rehearsal. Our experimental analysis on public and proprietary datasets reveals that the proposed methods not only increase stability in RLHF training but also achieve higher reward scores and win rates.
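To make the definition of an advantage score concrete, here is a minimal Python sketch. It is only an illustration of "extra reward compared to the expected reward", not the paper's method: the abstract describes a learned Advantage Model, whereas this sketch assumes raw reward scores are already available and approximates the expected reward for a prompt by the sample mean over several responses. The function name `advantage_scores` and the example task names and numbers are hypothetical.

```python
from statistics import mean


def advantage_scores(rewards_by_prompt):
    """Return per-response advantages: reward minus the prompt's mean reward.

    Assumption: the mean over sampled responses stands in for the
    expected reward; the paper's learned Advantage Model is not reproduced.
    """
    advantages = {}
    for prompt, rewards in rewards_by_prompt.items():
        baseline = mean(rewards)  # empirical stand-in for the expected reward
        advantages[prompt] = [r - baseline for r in rewards]
    return advantages


# Hypothetical raw rewards: the two tasks sit on very different scales.
rewards = {
    "summarize": [2.0, 4.0, 6.0],
    "code": [10.0, 12.0, 14.0],
}
adv = advantage_scores(rewards)
print(adv)
```

After subtracting the per-prompt baseline, both tasks yield scores centered at zero, which is one simple way to see why advantage-style scores are easier to compare across tasks than raw rewards.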
