LG, May 19, 2025

Action-Dependent Optimality-Preserving Reward Shaping

arXiv:2505.12611v1, 1 citation, h-index: 6, AAMAS
Originality: Incremental advance
AI Analysis

The paper addresses a specific limitation of reward shaping for RL agents in complex, sparse-reward environments such as Montezuma's Revenge. It is an incremental improvement over existing potential-based methods.

The paper tackles reward hacking in reinforcement learning when intrinsic motivation is used for exploration in sparse-reward environments. It introduces Action-Dependent Optimality Preserving Shaping (ADOPS), which preserves optimal policies while allowing action-dependent intrinsic rewards, and demonstrates its effectiveness in Montezuma's Revenge, where existing methods struggle.

Recent RL research has utilized reward shaping, particularly complex shaping rewards such as intrinsic motivation (IM), to encourage agent exploration in sparse-reward environments. While often effective, "reward hacking" can lead to the shaping reward being optimized at the expense of the extrinsic reward, resulting in a suboptimal policy. Potential-Based Reward Shaping (PBRS) techniques such as Generalized Reward Matching (GRM) and Policy-Invariant Explicit Shaping (PIES) have mitigated this. These methods allow for implementing IM without altering optimal policies. In this work we show that they are effectively unsuitable for complex, exploration-heavy environments with long-duration episodes. To remedy this, we introduce Action-Dependent Optimality Preserving Shaping (ADOPS), a method of converting intrinsic rewards to an optimality-preserving form that allows agents to utilize IM more effectively in the extremely sparse environment of Montezuma's Revenge. We also prove that ADOPS accommodates reward shaping functions that cannot be written in a potential-based form: while PBRS-based methods require the cumulative discounted intrinsic return to be independent of actions, ADOPS allows intrinsic cumulative returns to depend on agents' actions while still preserving the optimal policy set. We show how action-dependence enables ADOPS to preserve optimality while learning in complex, sparse-reward environments where other methods struggle.
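The action-independence property that the abstract attributes to PBRS-based methods can be illustrated with a small sketch. It uses the classic potential-based shaping form F(s, s') = γΦ(s') − Φ(s), not the ADOPS conversion itself, which is not specified here. The toy states, the potential table `PHI`, and the helper names are all hypothetical, and the terminal state is assumed to have potential 0. Under those assumptions the discounted shaping return telescopes to −Φ(s₀), whichever path the agent takes.

```python
GAMMA = 0.99

# Hypothetical potential function over a toy state space (illustration only).
# The terminal state "goal" is assumed to have potential 0.
PHI = {"start": 1.0, "a": 2.5, "b": -0.7, "c": 4.0, "goal": 0.0}


def pbrs_reward(s, s_next, gamma=GAMMA):
    """Classic potential-based shaping reward: gamma * Phi(s') - Phi(s)."""
    return gamma * PHI[s_next] - PHI[s]


def discounted_shaping_return(states, gamma=GAMMA):
    """Discounted sum of shaping rewards along a sequence of visited states."""
    transitions = zip(states, states[1:])
    return sum(
        gamma**t * pbrs_reward(s, s_next, gamma)
        for t, (s, s_next) in enumerate(transitions)
    )


# Two different action sequences, hence two different paths to the same goal.
path_1 = ["start", "a", "goal"]
path_2 = ["start", "b", "c", "a", "goal"]

ret_1 = discounted_shaping_return(path_1)
ret_2 = discounted_shaping_return(path_2)

# Both returns telescope to -PHI["start"], independent of the path taken.
print(ret_1, ret_2, -PHI["start"])
```

Because the shaping return is the same for every path, it cannot change which policy is optimal. It also cannot reward one action sequence over another, which is the restriction the abstract says ADOPS relaxes.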

