AI LG May 15

Imperfect World Models are Exploitable

Logan Mondal Bhamidipaty, Esmeralda S. Whitammer, David Abel, Mykel J. Kochenderfer, Subramanian Ramamoorthy

arXiv:2605.15960

AI Analysis

This work provides a theoretical foundation for understanding and mitigating risks in model-based RL, relevant to researchers and practitioners concerned with safety and reliability.

The paper defines model exploitation in reinforcement learning as a situation where a world model ranks one policy strictly above another while the environment's true transition model ranks them in the reverse order. It proves that exploitation is essentially unavoidable on large policy sets and establishes a formal connection to reward hacking, while also identifying a safe horizon within which a relaxed notion of exploitation can be avoided.

We propose a novel definition of model exploitation in reinforcement learning. Informally, a world model is exploitable if it implies that one policy should be strictly preferred over another while the environment's true transition model implies the reverse. We analogize our definition with a prior characterization of reward hacking but show that the associated proof of inevitability does not transfer to exploitation. To overcome this obstruction, we develop a general theory of reward hacking and model exploitation that proves that exploitation is essentially unavoidable on large policy sets and yields the corresponding claim for hacking as a special case. Unfortunately, we also find that the conditions that guarantee unhackability in finite policy sets have no counterpart that precludes exploitation. Consequently, we introduce a relaxed notion of exploitation and derive a safe horizon within which it can be avoided. Taken together, our results establish a formal bridge between reward hacking and model exploitation and elucidate the limits of safe planning in world models.
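
The informal definition above can be illustrated with a minimal sketch. It is not the paper's formalism: the two-state MDP, the discount factor, the state-based reward, the restriction to deterministic policies, and the helper names `policy_value` and `exploited_pairs` are all assumptions made for illustration. The sketch evaluates every policy under a learned transition model and under the true one, then lists the pairs where the model strictly prefers one policy while the true dynamics strictly prefer the other.

```python
from itertools import product

def policy_value(P, R, policy, gamma=0.9, start=0, sweeps=500):
    """Discounted return of a deterministic policy from `start`.

    P[s][a] is a list of next-state probabilities, R[s] is the reward
    for being in state s, and policy[s] is the action taken in state s.
    Uses plain iterative policy evaluation (hypothetical helper).
    """
    n = len(R)
    V = [0.0] * n
    for _ in range(sweeps):
        V = [
            R[s] + gamma * sum(p * V[t] for t, p in enumerate(P[s][policy[s]]))
            for s in range(n)
        ]
    return V[start]

def exploited_pairs(P_true, P_model, R, gamma=0.9, tol=1e-9):
    """Return (pi, pi2) pairs where the model strictly prefers pi to pi2
    while the true dynamics strictly prefer pi2 to pi."""
    n_states, n_actions = len(R), len(P_true[0])
    policies = list(product(range(n_actions), repeat=n_states))
    # Evaluate each policy once under both transition models.
    v_model = {pi: policy_value(P_model, R, pi, gamma) for pi in policies}
    v_true = {pi: policy_value(P_true, R, pi, gamma) for pi in policies}
    return [
        (pi, pi2)
        for pi, pi2 in product(policies, repeat=2)
        if v_model[pi] > v_model[pi2] + tol and v_true[pi] + tol < v_true[pi2]
    ]

# Assumed toy problem: state 0 is the start, state 1 is an absorbing goal
# that pays reward 1 per step. Action 0 is "safe", action 1 is "risky".
R = [0.0, 1.0]
P_true = [
    [[0.5, 0.5], [0.8, 0.2]],  # from state 0: risky rarely reaches the goal
    [[0.0, 1.0], [0.0, 1.0]],  # state 1 is absorbing under both actions
]
P_model = [
    [[0.5, 0.5], [0.1, 0.9]],  # the model overestimates the risky action
    [[0.0, 1.0], [0.0, 1.0]],
]

pairs = exploited_pairs(P_true, P_model, R)
print(pairs)
```

In this toy case the only modeling error is an overestimated success probability for the risky action, and that alone is enough to reverse the ranking of the policies that differ in state 0. Passing the true model as `P_model` yields no such pairs, since a model cannot disagree with itself.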
