Repairing Reward Functions with Human Feedback to Mitigate Reward Hacking
This work addresses reward misalignment in RL, a problem relevant to AI safety, and offers a more efficient alternative to learning rewards from scratch. The contribution is incremental, however, since it builds on existing preference-based methods.
The paper tackles reward hacking in reinforcement learning by proposing Preference-Based Reward Repair (PBRR), an automated framework that repairs human-specified proxy reward functions using additive corrections learned from human preferences. The results show that PBRR outperforms baselines on reward-hacking benchmarks while requiring substantially fewer preferences to learn high-performing policies.
Human-designed reward functions for reinforcement learning (RL) agents are frequently misaligned with the human's true, unobservable objectives, and thus act only as proxies. Optimizing for a misspecified proxy reward function often induces reward hacking, resulting in a policy misaligned with the human's true objectives. An alternative is to perform RL from human feedback, which involves learning a reward function from scratch by collecting human preferences over pairs of trajectories. However, building such datasets is costly. To address the limitations of both approaches, we propose Preference-Based Reward Repair (PBRR): an automated iterative framework that repairs a human-specified proxy reward function by learning an additive, transition-dependent correction term from preferences. A manually specified reward function can yield policies that are highly suboptimal under the ground-truth objective, yet corrections on only a few transitions may suffice to recover optimal performance. To identify and correct those transitions, PBRR uses a targeted exploration strategy and a new preference-learning objective. We prove that, in tabular domains, PBRR has a cumulative regret that matches, up to constants, that of prior preference-based RL methods. In addition, on a suite of reward-hacking benchmarks, PBRR consistently outperforms baselines that either learn a reward function from scratch from preferences or modify the proxy reward function using other approaches, requiring substantially fewer preferences to learn high-performing policies.
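The idea of an additive, transition-dependent correction can be illustrated with a minimal tabular sketch. It is not the paper's method: the abstract does not specify PBRR's preference-learning objective or its exploration strategy, so the sketch substitutes a standard Bradley-Terry preference likelihood with an L2 penalty that keeps the correction near zero. The names `trajectory_return` and `fit_correction`, the integer transition indices, and all hyperparameters are hypothetical choices made for illustration.

```python
import numpy as np


def trajectory_return(traj, proxy, correction):
    """Return of a trajectory under the repaired reward: proxy + correction."""
    idx = np.asarray(traj, dtype=int)
    return float(np.sum(proxy[idx] + correction[idx]))


def fit_correction(proxy, prefs, n_steps=500, lr=0.1, l2=0.01):
    """Fit a per-transition additive correction from pairwise preferences.

    proxy: 1-D array, proxy reward of each (hypothetical) transition index.
    prefs: list of (preferred_traj, rejected_traj), each a list of indices.
    Assumption: Bradley-Terry likelihood stands in for the paper's objective.
    """
    proxy = np.asarray(proxy, dtype=float)
    correction = np.zeros_like(proxy)
    n = len(proxy)
    for _ in range(n_steps):
        # Gradient of the L2 penalty 0.5 * l2 * ||correction||^2.
        grad = l2 * correction
        for preferred, rejected in prefs:
            diff = (trajectory_return(preferred, proxy, correction)
                    - trajectory_return(rejected, proxy, correction))
            p_preferred = 1.0 / (1.0 + np.exp(-diff))
            # How many more times each transition appears in the preferred trajectory.
            visit_gap = (np.bincount(preferred, minlength=n)
                         - np.bincount(rejected, minlength=n))
            # Gradient of -log p_preferred with respect to the correction.
            grad -= (1.0 - p_preferred) * visit_gap
        correction -= lr * grad
    return correction


# Transition 0 is over-rewarded by the proxy; the human prefers the other path.
proxy = np.array([1.0, 0.0, 0.0, 0.5])
hacking, intended = [0, 0], [1, 2]
correction = fit_correction(proxy, [(intended, hacking)])
print(np.round(correction, 2))
```

In this toy setup a single preference is enough to push the correction on transition 0 below zero, so the intended trajectory outranks the hacking one under the repaired reward, while the transition that appears in no compared trajectory keeps a correction of exactly zero. That locality is the intuition behind repairing a proxy rather than learning a reward from scratch.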