LG AI, Mar 8, 2024

Overcoming Reward Overoptimization via Adversarial Policy Optimization with Lightweight Uncertainty Estimation

Xiaoying Zhang, Jean-Francois Ton, Wei Shen, Hongning Wang, Yang Liu

arXiv:2403.05171v221.127 citations, h-index: 19

Originality: Incremental advance

AI Analysis

This paper addresses reward overoptimization in RLHF for LLMs. It is an incremental advance in mitigating the exploitation of reward model inaccuracies.

The paper tackles reward overoptimization in Reinforcement Learning from Human Feedback (RLHF) for Large Language Models (LLMs) by introducing Adversarial Policy Optimization (AdvPO) with lightweight uncertainty estimation. On the Anthropic HH and TL;DR datasets, human-assisted evaluation shows improved performance.

We introduce Adversarial Policy Optimization (AdvPO), a novel solution to the pervasive issue of reward over-optimization in Reinforcement Learning from Human Feedback (RLHF) for Large Language Models (LLMs). Over-optimization occurs when a reward model serves as an imperfect proxy for human preference, and RL-driven policy optimization erroneously exploits reward inaccuracies. In this paper, we begin by introducing a lightweight way to quantify uncertainties in rewards, relying solely on the last layer embeddings of the reward model, without the need for computationally expensive reward ensembles. AdvPO then addresses a distributionally robust optimization problem centred around the confidence interval of the reward model's predictions for policy improvement. Through comprehensive experiments on the Anthropic HH and TL;DR summarization datasets, we illustrate the efficacy of AdvPO in mitigating the overoptimization issue, consequently resulting in enhanced performance as evaluated through human-assisted evaluation.
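The abstract says that reward uncertainty is derived only from the reward model's last-layer embeddings, but it does not state the formula. As a rough illustration, the sketch below assumes a ridge-regularised, linear-bandit-style confidence width, sqrt(e^T M^{-1} e) with M = ridge * I + sum of e_i e_i^T over the training embeddings. The function names, the `ridge` and `scale` parameters, and the synthetic embeddings are all hypothetical. The final helper simply subtracts the width from the reward; that is a simpler per-sample penalty swapped in for illustration, not the distributionally robust optimization that AdvPO solves.

```python
import numpy as np

def fit_precision(embeddings, ridge=1.0):
    """Return the inverse of M = ridge * I + sum_i e_i e_i^T.

    `embeddings` is an (n, d) array of last-layer features of the
    reward-model training examples (synthetic in this sketch).
    """
    d = embeddings.shape[1]
    M = ridge * np.eye(d) + embeddings.T @ embeddings
    return np.linalg.inv(M)

def uncertainty(e, M_inv, scale=1.0):
    """Assumed confidence width scale * sqrt(e^T M^{-1} e), one per row of e."""
    e = np.atleast_2d(e)
    quad = np.einsum("nd,de,ne->n", e, M_inv, e)
    return scale * np.sqrt(quad)

def pessimistic_reward(reward, e, M_inv, scale=1.0):
    """Lower-confidence-bound reward: point estimate minus the width."""
    return np.asarray(reward, dtype=float) - uncertainty(e, M_inv, scale)

# Synthetic training embeddings that never vary along the last direction.
rng = np.random.default_rng(0)
train = rng.normal(size=(500, 8))
train[:, -1] = 0.0
M_inv = fit_precision(train)

seen = np.zeros(8)
seen[0] = 1.0      # direction well covered by the training data
unseen = np.zeros(8)
unseen[-1] = 1.0   # direction the training data never explored

u_seen, u_unseen = uncertainty(np.stack([seen, unseen]), M_inv)
print(f"width along covered direction:   {u_seen:.3f}")
print(f"width along uncovered direction: {u_unseen:.3f}")
```

Under these assumptions, the width is small for embeddings that resemble the training data and large for embeddings in directions the reward model never saw. That is the property a policy optimizer needs in order to avoid exploiting unreliable rewards. Only one d x d matrix is stored, so no ensemble of reward models is required.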
