LG | AI | CL | DM | Feb 15, 2024

Reward Generalization in RLHF: A Topological Perspective

arXiv:2402.10184v7 | 8 citations | h-index: 13 | ACL
Originality: Incremental advance
AI Analysis

This work targets low data efficiency and unreliable generalization in RLHF for AI alignment. Its topological approach is an incremental advance, but it delivers specific, measurable gains.

The paper tackles low data efficiency and unreliable generalization in reinforcement learning from human feedback (RLHF) by introducing a theory of reward generalization from a topological perspective. It shows that reward modeling from tree-structured preference information reduces reward uncertainty by up to Θ(log n/log log n) times compared to baselines, where n is the dataset size, and that it achieves an average win rate of 65% against baselines on three NLP tasks.

Existing alignment methods share a common topology of information flow, where reward information is collected from humans, modeled with preference learning, and used to tune language models. However, this shared topology has not been systematically characterized, nor have its alternatives been thoroughly explored, leaving the problems of low data efficiency and unreliable generalization unaddressed. As a solution, we introduce a theory of reward generalization in reinforcement learning from human feedback (RLHF), focusing on the topology of information flow at both macro and micro levels. At the macro level, we portray the RLHF information flow as an autoencoding process over behavior distributions, formalizing the RLHF objective of distributional consistency between human preference and model behavior. At the micro level, we present induced Bayesian networks to model the impact of dataset topologies on reward generalization. Combining analysis on both levels, we propose reward modeling from tree-structured preference information. It is shown to reduce reward uncertainty by up to $\Theta(\log n/\log\log n)$ times compared to baselines, where $n$ is the dataset size. Validation on three NLP tasks shows that it achieves an average win rate of 65% against baselines, thus improving reward generalization for free via topology design, while reducing the amount of data requiring annotation.
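
To give a feel for the $\Theta(\log n/\log\log n)$ factor quoted in the abstract, the sketch below evaluates the bare growth term for a few dataset sizes. It is only an illustration of how slowly the term grows: the $\Theta$ notation hides constant factors, so these numbers are not the reductions reported in the paper. The function name `uncertainty_reduction_growth` and the use of natural logarithms are assumptions made here; the choice of base does not change the $\Theta$ class.

```python
import math


def uncertainty_reduction_growth(n: int) -> float:
    """Return log(n) / log(log(n)) using natural logarithms.

    Hypothetical helper for illustration: it evaluates only the growth
    term inside the Theta bound and ignores all constant factors.
    """
    if n < 3:
        # log(log(n)) is positive only when n > e, so require n >= 3.
        raise ValueError("n must be at least 3")
    return math.log(n) / math.log(math.log(n))


if __name__ == "__main__":
    # Dataset sizes chosen arbitrarily for illustration.
    for n in (10**3, 10**4, 10**5, 10**6):
        growth = uncertainty_reduction_growth(n)
        print(f"n = {n:>9,}: log n / log log n = {growth:.2f}")
```

Across these sample sizes the term grows as $n$ increases, but far more slowly than $n$ itself, which is why the bound is best read as a statement about scaling, not as a fixed multiplier.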

