AI | LG | May 15

Imperfect World Models are Exploitable

arXiv:2605.15960
AI Analysis

This work provides a theoretical foundation for understanding and mitigating risks in model-based RL, relevant to researchers and practitioners concerned with safety and reliability.

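The paper defines model exploitation in reinforcement learning as a situation where a world model strictly prefers one policy over another while the true environment dynamics imply the opposite ordering. It proves that exploitation is essentially unavoidable on large policy sets, recovers the corresponding result for reward hacking as a special case, and shows that the conditions guaranteeing unhackability on finite policy sets have no counterpart for exploitation. It then introduces a relaxed notion of exploitation and derives a safe horizon within which that relaxed form can be avoided.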

We propose a novel definition of model exploitation in reinforcement learning. Informally, a world model is exploitable if it implies that one policy should be strictly preferred over another while the environment's true transition model implies the reverse. We analogize our definition with a prior characterization of reward hacking but show that the associated proof of inevitability does not transfer to exploitation. To overcome this obstruction, we develop a general theory of reward hacking and model exploitation that proves that exploitation is essentially unavoidable on large policy sets and yields the corresponding claim for hacking as a special case. Unfortunately, we also find that the conditions that guarantee unhackability in finite policy sets have no counterpart that precludes exploitation. Consequently, we introduce a relaxed notion of exploitation and derive a safe horizon within which it can be avoided. Taken together, our results establish a formal bridge between reward hacking and model exploitation and elucidate the limits of safe planning in world models.
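
The informal definition above can be illustrated on a toy tabular problem. The following is a minimal sketch, not the paper's formalization: it assumes that "preference" means a strictly higher expected discounted return from a fixed start state, that the world model and the environment share the same reward function and differ only in transitions, and that policies are deterministic. The names `policy_value` and `exploited_pairs`, the two-state MDP, and the discount factor are all hypothetical choices made for illustration.

```python
import itertools
import numpy as np

def policy_value(P, R, policy, gamma, start):
    """Expected discounted return of a deterministic policy from `start`.

    P: (S, A, S) transition probabilities, R: (S, A) rewards,
    policy: length-S sequence giving the action taken in each state.
    """
    n = P.shape[0]
    idx = np.arange(n)
    policy = np.asarray(policy)
    P_pi = P[idx, policy]  # (S, S) transitions under the policy
    r_pi = R[idx, policy]  # (S,) rewards under the policy
    # Solve the Bellman equation v = r_pi + gamma * P_pi v exactly.
    v = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
    return float(v[start])

def exploited_pairs(P_true, P_model, R, policies, gamma=0.9, start=0, tol=1e-9):
    """Return index pairs (i, j) where the model strictly prefers policy i
    over policy j while the true dynamics strictly prefer j over i."""
    pairs = []
    for i, j in itertools.permutations(range(len(policies)), 2):
        model_i = policy_value(P_model, R, policies[i], gamma, start)
        model_j = policy_value(P_model, R, policies[j], gamma, start)
        true_i = policy_value(P_true, R, policies[i], gamma, start)
        true_j = policy_value(P_true, R, policies[j], gamma, start)
        if model_i > model_j + tol and true_i < true_j - tol:
            pairs.append((i, j))
    return pairs

# Hypothetical 2-state, 2-action example. State 1 is absorbing and rewarding.
# In state 0, action 0 is a safe loop; action 1 is a gamble whose success
# probability the model overestimates (0.9 instead of the true 0.1).
R = np.array([[0.5, 0.0],
              [1.0, 1.0]])
P_true = np.array([[[1.0, 0.0], [0.9, 0.1]],
                   [[0.0, 1.0], [0.0, 1.0]]])
P_model = np.array([[[1.0, 0.0], [0.1, 0.9]],
                    [[0.0, 1.0], [0.0, 1.0]]])
policies = [[0, 0], [1, 0]]  # policy 0 plays safe, policy 1 gambles

pairs = exploited_pairs(P_true, P_model, R, policies)
print(pairs)
```

Under these assumptions the model ranks the gambling policy above the safe one while the true dynamics rank them the other way, so the pair `(1, 0)` is reported. An exhaustive pairwise check like this only scales to small finite policy sets; it is meant to make the definition concrete, not to reflect how the paper's results are obtained.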

