Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification
This work highlights a critical vulnerability in RLHF for AI safety: standard KL regularization may fail when reward misspecification is heavy-tailed. The contribution is incremental in that it builds on existing concerns about reward hacking.
The paper investigates how well KL divergence regularization works in RLHF when the reward function is misspecified. With light-tailed error, optimal policies under weaker KL penalties can achieve arbitrarily high utility. With heavy-tailed error, some policies achieve arbitrarily high reward with no utility gain over the base model, a phenomenon the authors term catastrophic Goodhart. Measurements of current reward models are consistent with light-tailed error, but the paper warns that future sources of RL reward could have heavy-tailed error, which would increase the risk of reward hacking even with KL regularization.
When applying reinforcement learning from human feedback (RLHF), the reward is learned from data and, therefore, always has some error. It is common to mitigate this by regularizing the policy with KL divergence from a base model, with the hope that balancing reward with regularization will achieve desirable outcomes despite this reward misspecification. We show that when the reward function has light-tailed error, optimal policies under less restrictive KL penalties achieve arbitrarily high utility. However, if error is heavy-tailed, some policies obtain arbitrarily high reward despite achieving no more utility than the base model, a phenomenon we call catastrophic Goodhart. We adapt a discrete optimization method to measure the tails of reward models, finding that they are consistent with light-tailed error. However, the pervasiveness of heavy-tailed distributions in many real-world applications indicates that future sources of RL reward could have heavy-tailed error, increasing the likelihood of reward hacking even with KL regularization.
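The contrast between light-tailed and heavy-tailed error can be illustrated with a small simulation. The following is a minimal sketch, not the paper's method or experiments. It assumes a toy setup in which the base model is a uniform distribution over sampled outputs, true utility is standard normal, the proxy reward is utility plus independent error, and the KL-regularized optimum is the exponential tilt with weights proportional to exp(reward / beta). The function names `tilt` and `evaluate`, the Student-t error with 2 degrees of freedom, and all parameter values are illustrative assumptions.

```python
import numpy as np

def tilt(reward, beta):
    """Weights of the policy maximizing E[reward] - beta * KL(pi || base)
    when the base is uniform over the samples: w_i proportional to exp(reward_i / beta)."""
    logits = reward / beta
    logits = logits - logits.max()  # subtract the max for numerical stability
    w = np.exp(logits)
    return w / w.sum()

def evaluate(error_sampler, beta, n=50_000, trials=20, seed=0):
    """Average proxy reward, true utility, and KL of the tilted policy over
    several independent sample sets (all settings here are assumptions)."""
    rng = np.random.default_rng(seed)
    rewards, utils, kls = [], [], []
    for _ in range(trials):
        utility = rng.normal(size=n)      # true utility of each sampled output
        error = error_sampler(rng, n)     # reward-model error, independent of utility
        reward = utility + error          # proxy reward seen by the optimizer
        w = tilt(reward, beta)
        nz = w > 0                        # skip weights that underflowed to zero
        kl = np.sum(w[nz] * np.log(w[nz] * n))  # KL(pi || uniform base)
        rewards.append(w @ reward)
        utils.append(w @ utility)
        kls.append(kl)
    return float(np.mean(rewards)), float(np.mean(utils)), float(np.mean(kls))

# Light-tailed error: standard normal. Heavy-tailed error: Student-t, 2 degrees of freedom.
light = evaluate(lambda rng, n: rng.normal(size=n), beta=1.0)
heavy = evaluate(lambda rng, n: rng.standard_t(df=2, size=n), beta=1.0)

for name, (r, u, kl) in [("light", light), ("heavy", heavy)]:
    print(f"{name}: reward={r:.2f} utility={u:.2f} KL={kl:.2f}")
```

In this toy setup, the light-tailed case should show reward and utility rising together, while in the heavy-tailed case the tilted policy concentrates on whichever sample has the largest error, so its reward is very high but its utility behaves like a random draw from the base model. Because the sample is finite, KL here is capped at log n, so the sketch only illustrates the qualitative effect and says nothing about the KL levels in the paper's results.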