LG | CL | Feb 22, 2024

Generalizing Reward Modeling for Out-of-Distribution Preference Learning

arXiv:2402.14760v2 | 6 citations | h-index: 1 | ECML/PKDD
AI Analysis

This paper addresses the problem of limited human feedback for aligning LLMs with preferences across diverse distributions. The contribution is incremental, since it builds on existing RLHF and meta-learning approaches.

The paper tackles out-of-distribution preference learning for large language models by optimizing a general reward model via meta-learning. It outperforms strong baselines on two text generation tasks across 20 held-out domains.

Preference learning (PL) with large language models (LLMs) aims to align the LLMs' generations with human preferences. Previous work on reinforcement learning from human feedback (RLHF) has demonstrated promising results in in-distribution PL. However, because human feedback is difficult to obtain, training a separate reward model for every encountered distribution is challenging. Thus, out-of-distribution (OOD) PL is practically useful for enhancing the generalization ability of LLMs with limited preference feedback. This work addresses OOD PL by optimizing a general reward model through a meta-learning approach. During meta-training, a bilevel optimization algorithm is used to learn a reward model capable of guiding policy learning to align with human preferences across various distributions. When encountering a test distribution, the meta-test procedure conducts regularized policy optimization using the learned reward model for PL. We theoretically demonstrate the convergence rate of the bilevel optimization algorithm under reasonable assumptions. Additionally, we conduct experiments on two text generation tasks across 20 held-out domains and outperform a variety of strong baselines across various evaluation metrics.
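
The meta-test step mentioned in the abstract, regularized policy optimization against a learned reward model, can be illustrated on a toy problem. The sketch below is not the paper's implementation. It assumes a KL penalty toward a reference policy and a small finite set of candidate responses, neither of which the abstract specifies. The names `kl_regularized_policy`, `objective`, `ref_probs`, `rewards`, and `beta` are hypothetical. Under those assumptions the optimal policy has the well-known closed form: each reference probability is reweighted by `exp(reward / beta)` and the result is renormalized.

```python
import math


def kl_regularized_policy(ref_probs, rewards, beta):
    """Maximizer of sum_i pi_i * r_i - beta * KL(pi || ref) over a finite set.

    Assumes every reference probability is strictly positive and beta > 0.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    # Work in log space and subtract the maximum for numerical stability.
    logits = [math.log(p) + r / beta for p, r in zip(ref_probs, rewards)]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


def objective(policy, ref_probs, rewards, beta):
    """Expected reward minus beta times KL(policy || ref)."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(policy, ref_probs) if p > 0)
    return expected_reward - beta * kl


# Toy data (hypothetical): three candidate responses.
ref_probs = [0.5, 0.3, 0.2]   # reference policy
rewards = [0.0, 1.0, 2.0]     # scores from a stand-in reward model
policy = kl_regularized_policy(ref_probs, rewards, beta=1.0)
```

In this sketch a smaller `beta` lets the reward dominate, while a larger `beta` keeps the policy close to the reference.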

Code Implementations: 1 repo
