LG · CL · Apr 15, 2024

Learn Your Reference Model for Real Good Alignment

arXiv:2404.09656v4 · 55 citations · h-index: 7 · ICLR
AI Analysis

This paper addresses overoptimization in LLM alignment, a key problem for AI safety and performance, though the contribution appears incremental because it builds on existing offline alignment methods.

The paper tackles overoptimization in offline alignment of Large Language Models by proposing Trust Region methods that dynamically update the reference policy during training. Results show these methods effectively mitigate overoptimization, maintaining strong performance in tasks like dialogue and summarization, with significant improvements on benchmarks like AlpacaEval 2 and Arena-Hard using Llama3.

Despite the fact that offline methods for Large Language Models (LLMs) alignment do not require a direct reward model, they remain susceptible to overoptimization. This issue arises when the trained model deviates excessively from the reference policy, leading to a decrease in sample quality. We propose a new paradigm of offline alignment methods, called Trust Region (including variants TR-DPO, TR-IPO, TR-KTO), which dynamically updates the reference policy throughout the training process. Our results show that TR alignment methods effectively mitigate overoptimization, enabling models to maintain strong performance even when substantially deviating from the initial reference policy. We demonstrate the efficacy of these approaches not only through toy examples that exhibit reduced overoptimization, but also through direct, side-by-side comparisons in specific tasks such as helpful and harmless dialogue, as well as summarization, where they surpass conventional methods. Additionally, we report significant improvements in general-purpose assistant setups with the Llama3 model on the AlpacaEval 2 and Arena-Hard benchmarks, highlighting the advantages of Trust Region methods over classical approaches.
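The abstract says the reference policy is updated dynamically during training but does not give the update rule. The sketch below illustrates two plausible forms, assuming a "soft" update that blends the current policy into the reference with a weight alpha and a "hard" update that copies the policy into the reference every tau steps. The function names, the parameters alpha and tau, and the use of plain lists of floats in place of model weights are all assumptions made for illustration, not details confirmed by this page. A standard DPO-style loss is included only to show where the reference log-probabilities enter.

```python
import math


def soft_update(ref_params, policy_params, alpha):
    """Assumed soft update: ref <- alpha * policy + (1 - alpha) * ref."""
    return [alpha * p + (1.0 - alpha) * r
            for p, r in zip(policy_params, ref_params)]


def hard_update(ref_params, policy_params, step, tau):
    """Assumed hard update: copy the policy into the reference every tau steps."""
    if step % tau == 0:
        return list(policy_params)
    return list(ref_params)


def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta):
    """DPO-style loss for one preference pair (w = chosen, l = rejected).

    The reference log-probabilities come from whichever reference policy
    is current, so updating the reference changes this loss.
    """
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


# Toy usage with hypothetical two-parameter "models".
ref = [0.0, 1.0]
policy = [1.0, 1.0]
ref_soft = soft_update(ref, policy, alpha=0.5)
ref_hard = hard_update(ref, policy, step=4, tau=2)
```

With alpha = 0 or a very large tau, both rules leave the reference fixed, which corresponds to the conventional setup that the abstract compares against.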

