AI | May 12, 2025

Explainable Reinforcement Learning Agents Using World Models

Georgia Tech
arXiv:2505.08073v2 | 2 citations | h-index: 5
AI Analysis

This paper addresses the problem of explaining reinforcement learning agents to non-AI experts, which could help such users understand and potentially control agent behavior. The contribution is incremental, since it builds on existing Model-Based RL and World Models.

The paper tackles the challenge of explaining reinforcement learning agents' decisions. It introduces a method that uses World Models to generate counterfactual trajectories and a Reverse World Model to show users what the world should have been like for the agent to prefer a different action. In the authors' evaluation, these explanations significantly increased users' understanding of the agent policy.

Explainable AI (XAI) systems have been proposed to help people understand how AI systems produce outputs and behaviors. Explainable Reinforcement Learning (XRL) has an added complexity due to the temporal nature of sequential decision-making. Further, non-AI experts do not necessarily have the ability to alter an agent or its policy. We introduce a technique for using World Models to generate explanations for Model-Based Deep RL agents. World Models predict how the world will change when actions are performed, allowing for the generation of counterfactual trajectories. However, identifying what a user wanted the agent to do is not enough to understand why the agent did something else. We augment Model-Based RL agents with a Reverse World Model, which predicts what the state of the world should have been for the agent to prefer a given counterfactual action. We show that explanations that show users what the world should have been like significantly increase their understanding of the agent policy. We hypothesize that our explanations can help users learn how to control the agent's execution by manipulating the environment.
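
To make the two ideas in the abstract concrete, here is a minimal toy sketch in Python. It is not the authors' implementation: the corridor environment, the hand-written `policy`, and the names `State`, `world_model`, `counterfactual_trajectory`, and `reverse_world_model` are all assumptions made for illustration. In particular, the paper describes a learned Reverse World Model, whereas this sketch substitutes a brute-force search over candidate states for it.

```python
from typing import NamedTuple, Optional

class State(NamedTuple):
    agent: int  # agent position in a 1-D corridor (hypothetical toy environment)
    goal: int   # goal position in the same corridor

SIZE = 7  # corridor cells are 0..6
MOVES = {"left": -1, "stay": 0, "right": 1}

def world_model(state: State, action: str) -> State:
    """Toy forward model: predict the next state after taking `action`."""
    pos = min(max(state.agent + MOVES[action], 0), SIZE - 1)
    return State(pos, state.goal)

def policy(state: State) -> str:
    """Hand-written stand-in for the agent's policy: walk toward the goal."""
    if state.goal > state.agent:
        return "right"
    if state.goal < state.agent:
        return "left"
    return "stay"

def counterfactual_trajectory(state: State, action: str, horizon: int) -> list:
    """Take the user's counterfactual `action` once, then follow the policy,
    rolling everything out inside the world model."""
    trajectory = [state]
    state = world_model(state, action)
    trajectory.append(state)
    for _ in range(horizon - 1):
        state = world_model(state, policy(state))
        trajectory.append(state)
    return trajectory

def reverse_world_model(state: State, desired_action: str) -> Optional[State]:
    """Assumption: brute-force stand-in for a learned Reverse World Model.
    Returns the state closest to `state` (here, only the goal may move)
    in which the policy would prefer `desired_action`, or None if none exists."""
    candidates = [State(state.agent, g) for g in range(SIZE)]
    matching = [s for s in candidates if policy(s) == desired_action]
    if not matching:
        return None
    return min(matching, key=lambda s: abs(s.goal - state.goal))

start = State(agent=3, goal=5)
factual_action = policy(start)                         # what the agent actually does
rollout = counterfactual_trajectory(start, "left", 3)  # what if it had gone left?
explanation = reverse_world_model(start, "left")       # world in which "left" is preferred
```

In this toy setting the counterfactual rollout shows the agent immediately turning back toward the goal, which tells the user what would happen but not why the agent chose otherwise. The reverse query adds the missing piece: the agent would have preferred "left" only if the goal had been to its left, which also hints at how a user could steer the agent by changing the environment.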

