AI · LG · Oct 14, 2025

Repairing Reward Functions with Human Feedback to Mitigate Reward Hacking

arXiv:2510.13036v1 · h-index: 54
Originality: Incremental advance
AI Analysis

This paper addresses reward misalignment in RL, an AI safety concern, and offers a more efficient alternative to learning rewards from scratch. The advance is incremental because it builds on existing preference-based methods.

The paper tackles reward hacking in reinforcement learning by proposing Preference-Based Reward Repair (PBRR), an automated framework that repairs human-specified proxy reward functions using additive corrections learned from human preferences. On reward-hacking benchmarks, PBRR outperforms baselines while requiring substantially fewer preferences to learn high-performing policies.

Human-designed reward functions for reinforcement learning (RL) agents are frequently misaligned with the human's true, unobservable objectives, and thus act only as proxies. Optimizing for a misspecified proxy reward function often induces reward hacking, resulting in a policy misaligned with the human's true objectives. An alternative is to perform RL from human feedback, which involves learning a reward function from scratch by collecting human preferences over pairs of trajectories. However, building such datasets is costly. To address the limitations of both approaches, we propose Preference-Based Reward Repair (PBRR): an automated iterative framework that repairs a human-specified proxy reward function by learning an additive, transition-dependent correction term from preferences. A manually specified reward function can yield policies that are highly suboptimal under the ground-truth objective, yet corrections on only a few transitions may suffice to recover optimal performance. To identify and correct for those transitions, PBRR uses a targeted exploration strategy and a new preference-learning objective. We prove that, in tabular domains, PBRR has a cumulative regret that matches, up to constants, that of prior preference-based RL methods. In addition, on a suite of reward-hacking benchmarks, PBRR consistently outperforms baselines that either learn a reward function from scratch from preferences or modify the proxy reward function using other approaches, requiring substantially fewer preferences to learn high-performing policies.
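The idea of an additive correction learned from preferences can be illustrated with a small tabular sketch. This is not the paper's method: the abstract does not spell out PBRR's new preference-learning objective or its targeted exploration strategy, so the code below substitutes the standard Bradley-Terry preference likelihood and a fixed, hand-written preference. The names `visit_counts` and `fit_correction`, the toy rewards, and the simplification of indexing the correction by (state, action) rather than by full transition are all assumptions made for illustration.

```python
import numpy as np


def visit_counts(trajectory, n_states, n_actions):
    """Count how often each (state, action) pair occurs in a trajectory."""
    counts = np.zeros((n_states, n_actions))
    for s, a in trajectory:
        counts[s, a] += 1.0
    return counts


def fit_correction(proxy, preferences, lr=0.5, steps=500):
    """Fit an additive correction g so that proxy + g explains the preferences.

    proxy:       array of shape (n_states, n_actions), the hand-written reward.
    preferences: list of (preferred_trajectory, rejected_trajectory) pairs,
                 each trajectory a list of (state, action) tuples.
    Uses plain gradient ascent on the Bradley-Terry log-likelihood
    (a stand-in objective, not the one proposed in the paper).
    """
    n_states, n_actions = proxy.shape
    g = np.zeros_like(proxy, dtype=float)  # start from "no repair"
    # Difference in visit counts: preferred minus rejected trajectory.
    diffs = [
        visit_counts(win, n_states, n_actions)
        - visit_counts(lose, n_states, n_actions)
        for win, lose in preferences
    ]
    for _ in range(steps):
        grad = np.zeros_like(g)
        for d in diffs:
            # Return gap between the two trajectories under the repaired reward.
            margin = np.sum((proxy + g) * d)
            p_win = 1.0 / (1.0 + np.exp(-margin))
            # Gradient of log(sigmoid(margin)) with respect to g.
            grad += (1.0 - p_win) * d
        g += lr * grad / len(diffs)
    return g


# Toy problem: the proxy over-rewards action 1 in state 0 (the "hack").
proxy = np.array([[0.0, 1.0],
                  [0.0, 0.0]])
# The human prefers the trajectory that avoids the hacked transition.
preferred = [(0, 0), (1, 0)]
rejected = [(0, 1), (1, 0)]

g = fit_correction(proxy, [(preferred, rejected)])
repaired = proxy + g
```

In this toy setting the learned correction is nonzero only on the two (state, action) pairs where the compared trajectories differ, which mirrors the abstract's observation that corrections on a few transitions may be enough. The shared pair `(1, 0)` receives no gradient, so the proxy reward there is left untouched.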

