LG · AI · Jun 24, 2024

WARP: On the Benefits of Weight Averaged Rewarded Policies

arXiv:2406.16768v1 · 44 citations · Has Code
Originality: Incremental advance
AI Analysis

The paper addresses alignment challenges in large language models, which matter for both AI safety and performance. It is an incremental advance because it builds on existing RLHF methods.

The paper tackles the trade-off between reward optimization and knowledge retention in RLHF by introducing WARP, a weight averaging strategy that iteratively merges policies to refine the KL-reward Pareto front, achieving superior rewards at fixed KL in experiments with GEMMA policies.

Reinforcement learning from human feedback (RLHF) aligns large language models (LLMs) by encouraging their generations to have high rewards, using a reward model trained on human preferences. To prevent the forgetting of pre-trained knowledge, RLHF usually incorporates a KL regularization; this forces the policy to remain close to its supervised fine-tuned initialization, though it hinders the reward optimization. To tackle the trade-off between KL and reward, in this paper we introduce a novel alignment strategy named Weight Averaged Rewarded Policies (WARP). WARP merges policies in the weight space at three distinct stages. First, it uses the exponential moving average of the policy as a dynamic anchor in the KL regularization. Second, it applies spherical interpolation to merge independently fine-tuned policies into a new enhanced one. Third, it linearly interpolates between this merged model and the initialization, to recover features from pre-training. This procedure is then applied iteratively, with each iteration's final model used as an advanced initialization for the next, progressively refining the KL-reward Pareto front, achieving superior rewards at fixed KL. Experiments with GEMMA policies validate that WARP improves their quality and alignment, outperforming other open-source LLMs.
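The three weight-space operations named in the abstract can be illustrated on toy weight vectors. The following is a minimal sketch, not the authors' implementation: the function names, the default values of `mu`, `lam`, and `eta`, and the choice to apply spherical interpolation to the differences from a shared initialization are all assumptions made for illustration, and the RL fine-tuning that would produce the policies is omitted entirely.

```python
import numpy as np

def ema_update(anchor, policy, mu=0.01):
    """Stage 1 (sketch): move the KL anchor a small step toward the current policy.
    `mu` is a hypothetical update rate, not a value taken from the paper."""
    return (1.0 - mu) * anchor + mu * policy

def slerp(init, theta_a, theta_b, lam=0.5, eps=1e-8):
    """Stage 2 (sketch): spherical interpolation of two fine-tuned policies.
    Assumption: the interpolation acts on the differences from a shared init."""
    delta_a = theta_a - init
    delta_b = theta_b - init
    # Angle between the two update directions.
    cos_omega = np.dot(delta_a, delta_b) / (
        np.linalg.norm(delta_a) * np.linalg.norm(delta_b) + eps
    )
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    sin_omega = np.sin(omega)
    if sin_omega < eps:
        # Nearly parallel updates: fall back to plain linear interpolation.
        return init + (1.0 - lam) * delta_a + lam * delta_b
    coef_a = np.sin((1.0 - lam) * omega) / sin_omega
    coef_b = np.sin(lam * omega) / sin_omega
    return init + coef_a * delta_a + coef_b * delta_b

def interpolate_toward_init(init, merged, eta=0.5):
    """Stage 3 (sketch): linear interpolation between the merged model and the init.
    eta=1 keeps the merged model; eta=0 returns the initialization."""
    return (1.0 - eta) * init + eta * merged

# Toy usage with 2-dimensional "weights" standing in for full parameter vectors.
init = np.zeros(2)
policy_a = np.array([1.0, 0.0])  # stand-in for one fine-tuned policy
policy_b = np.array([0.0, 1.0])  # stand-in for an independently fine-tuned policy
merged = slerp(init, policy_a, policy_b, lam=0.5)
next_init = interpolate_toward_init(init, merged, eta=0.5)
```

In this toy case the two updates are orthogonal and of equal length, so the spherically interpolated update keeps that length, whereas a plain average would shrink it. Under the iterative procedure described in the abstract, `next_init` would serve as the initialization for the following round.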

Code Implementations: 1 repo
