LG · AI · Mar 8, 2024

Overcoming Reward Overoptimization via Adversarial Policy Optimization with Lightweight Uncertainty Estimation

arXiv:2403.05171v2 · 25 citations · h-index: 19
Originality: Incremental advance
AI Analysis

This paper addresses reward overoptimization in RLHF for LLMs. It is an incremental advance in mitigating the exploitation of reward model inaccuracies.

The paper tackles reward overoptimization in Reinforcement Learning from Human Feedback (RLHF) for Large Language Models (LLMs) by introducing Adversarial Policy Optimization (AdvPO) with lightweight uncertainty estimation. It reports improved performance on the Anthropic HH and TL;DR datasets, as measured by human-assisted evaluation.

We introduce Adversarial Policy Optimization (AdvPO), a novel solution to the pervasive issue of reward over-optimization in Reinforcement Learning from Human Feedback (RLHF) for Large Language Models (LLMs). Over-optimization occurs when a reward model serves as an imperfect proxy for human preference, and RL-driven policy optimization erroneously exploits reward inaccuracies. In this paper, we begin by introducing a lightweight way to quantify uncertainties in rewards, relying solely on the last layer embeddings of the reward model, without the need for computationally expensive reward ensembles. AdvPO then addresses a distributionally robust optimization problem centred around the confidence interval of the reward model's predictions for policy improvement. Through comprehensive experiments on the Anthropic HH and TL;DR summarization datasets, we illustrate the efficacy of AdvPO in mitigating the overoptimization issue, consequently resulting in enhanced performance as evaluated through human-assisted evaluation.
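
The abstract does not spell out how uncertainty is computed from last-layer embeddings, so the following is only a minimal sketch under stated assumptions. It uses a standard linear-bandit-style confidence width, sqrt(e^T M^{-1} e) with M = ridge * I + sum of e_i e_i^T over the reward model's training embeddings, and then subtracts that width from the reward as a simple lower-confidence-bound penalty. That penalty is a simplified stand-in, not AdvPO's distributionally robust optimization. The function names, `ridge`, `beta`, and the synthetic embeddings are all hypothetical.

```python
import numpy as np

def fit_inverse_covariance(train_emb, ridge=1.0):
    """Invert M = ridge * I + sum_i e_i e_i^T over training embeddings (assumed form)."""
    d = train_emb.shape[1]
    M = ridge * np.eye(d) + train_emb.T @ train_emb
    return np.linalg.inv(M)

def uncertainty(emb, M_inv):
    """Per-row confidence width sqrt(e^T M^{-1} e)."""
    return np.sqrt(np.einsum("nd,de,ne->n", emb, M_inv, emb))

def penalized_reward(reward, emb, M_inv, beta=1.0):
    """Lower-confidence-bound reward: a simple stand-in, not the AdvPO objective."""
    return reward - beta * uncertainty(emb, M_inv)

# Synthetic stand-in for last-layer embeddings: training data only
# covers the first 4 of 8 embedding dimensions.
rng = np.random.default_rng(0)
train_emb = np.zeros((500, 8))
train_emb[:, :4] = rng.normal(size=(500, 4))

M_inv = fit_inverse_covariance(train_emb, ridge=1.0)

# Query 0 lies in a well-covered direction; query 1 lies in an unseen one.
queries = np.zeros((2, 8))
queries[0, 0] = 1.0
queries[1, 7] = 1.0

u = uncertainty(queries, M_inv)
penalized = penalized_reward(np.array([2.0, 2.0]), queries, M_inv, beta=1.0)
print("uncertainty:", u)
print("penalized reward:", penalized)
```

In this toy setup both queries receive the same raw reward, but the one pointing in a direction the training embeddings never covered gets the larger uncertainty and therefore the lower penalized reward. The estimate needs only one matrix inverse over the embedding dimension, which is what makes it cheap compared with training a reward ensemble.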

