CL · AI · Jun 14, 2024

Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs

arXiv:2406.10216v2 · 134 citations
Originality: Incremental advance
AI Analysis

This paper addresses the limited generalization of reward models for LLMs, a problem that matters for reliable human alignment, but its contribution is an incremental improvement over existing methods.

The paper tackles the limited generalization of RLHF reward models to unseen prompts, which causes reward over-optimization and a decline in actual performance. Its results show that regularizing the hidden states improves reward model accuracy on out-of-distribution tasks and alleviates over-optimization, offering a more robust preference learning paradigm.

Reward models trained on human preference data have been proven to effectively align Large Language Models (LLMs) with human intent within the framework of reinforcement learning from human feedback (RLHF). However, current reward models have limited generalization capabilities to unseen prompts and responses, which can lead to an unexpected phenomenon known as reward over-optimization, resulting in a decline in actual performance due to excessive optimization of rewards. While previous research has advocated for constraining policy optimization, our study introduces a novel approach to enhance the reward model's generalization ability against distribution shifts by regularizing the hidden states. Specifically, we retain the base model's language model head and incorporate a suite of text-generation losses to preserve the hidden states' text-generation capabilities, while concurrently learning a reward head behind the same hidden states. Our experimental results demonstrate that the introduced regularization technique markedly improves the accuracy of learned reward models across a variety of out-of-distribution (OOD) tasks and effectively alleviates the over-optimization issue in RLHF, offering a more reliable and robust preference learning paradigm.
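The abstract describes training a reward head and the retained language model head on the same hidden states, with text-generation losses acting as a regularizer. The following is a minimal sketch of how such a combined objective could be computed for one preference pair, not the authors' implementation. It assumes a Bradley-Terry pairwise loss for the reward head, a single text-generation loss (mean negative log-likelihood of the chosen response) standing in for the "suite" of losses, and a hypothetical mixing weight `alpha`; all function names are illustrative.

```python
import math


def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise preference loss: -log(sigmoid(r_chosen - r_rejected))."""
    margin = r_chosen - r_rejected
    # Numerically stable softplus(-margin) = log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def text_generation_loss(token_log_probs: list[float]) -> float:
    """Mean negative log-likelihood of response tokens under the LM head."""
    return -sum(token_log_probs) / len(token_log_probs)


def regularized_reward_loss(
    r_chosen: float,
    r_rejected: float,
    chosen_token_log_probs: list[float],
    alpha: float = 0.01,  # hypothetical weight, not taken from the paper
) -> float:
    """Combine the reward loss with a text-generation regularizer.

    Both terms are assumed to be computed from the same hidden states:
    the reward head produces r_chosen / r_rejected, and the retained
    LM head produces the per-token log-probabilities.
    """
    reward_term = bradley_terry_loss(r_chosen, r_rejected)
    reg_term = text_generation_loss(chosen_token_log_probs)
    return (1.0 - alpha) * reward_term + alpha * reg_term


if __name__ == "__main__":
    loss = regularized_reward_loss(
        r_chosen=1.2,
        r_rejected=0.3,
        chosen_token_log_probs=[-0.5, -1.0, -0.25],
        alpha=0.01,
    )
    print(f"combined loss: {loss:.4f}")
```

Setting `alpha` to 0 in this sketch recovers a plain pairwise reward loss, which makes the regularizer's contribution easy to isolate when comparing the two setups.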

Code Implementations: 2 repos
