AI | Mar 19, 2025

Behaviour Discovery and Attribution for Explainable Reinforcement Learning

Rishav Rishav, Somjit Nath, Vincent Michalski, Samira Ebrahimi Kahou

MILA

arXiv:2503.14973v2 | 9.63 citations | h-index: 26 | Has Code | Trans. Mach. Learn. Res.

Originality: Incremental advance

AI Analysis

This work aims to build trust in RL agents for high-stakes applications like robotics, healthcare, and finance by providing fine-grained, behavior-centric explanations. The advance is incremental, as it builds on existing explainability methods.

The paper tackles the problem of understanding reinforcement learning (RL) agent decisions. Existing explainability methods miss recurring strategies that span multiple decisions, and the paper addresses this gap with a fully offline, reward-free framework for behavior discovery and segmentation. The results show that the approach discovers meaningful behaviors and outperforms trajectory-level baselines in fidelity, human preference, and cluster coherence across four diverse offline RL environments.

Building trust in reinforcement learning (RL) agents requires understanding why they make certain decisions, especially in high-stakes applications like robotics, healthcare, and finance. Existing explainability methods often focus on single states or entire trajectories, either providing only local, step-wise insights or attributing decisions to coarse, episode-level summaries. Both approaches miss the recurring strategies and temporally extended patterns that actually drive agent behavior across multiple decisions. We address this gap by proposing a fully offline, reward-free framework for behavior discovery and segmentation, enabling the attribution of actions to meaningful and interpretable behavior segments that capture recurring patterns appearing across multiple trajectories. Our method identifies coherent behavior clusters from state-action sequences and attributes individual actions to these clusters for fine-grained, behavior-centric explanations. Evaluations on four diverse offline RL environments show that our approach discovers meaningful behaviors and outperforms trajectory-level baselines in fidelity, human preference, and cluster coherence. Our code is publicly available.
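
The general idea of clustering state-action segments and then attributing a single action to one of the resulting clusters can be illustrated with a small toy sketch. This is not the authors' implementation: the abstract does not specify the segmentation or clustering procedure, so the fixed-length windows, the plain k-means, the toy data, and all function names (`make_segments`, `kmeans`, `nearest`, `attribute`) are assumptions chosen only to keep the example short and self-contained.

```python
import math
import random


def make_segments(trajectory, window):
    """Split a trajectory [(state, action), ...] into non-overlapping,
    fixed-length windows and flatten each window into one feature vector.
    (Fixed-length windowing is an assumption made for this sketch.)"""
    segments = []
    for start in range(0, len(trajectory) - window + 1, window):
        vec = []
        for state, action in trajectory[start:start + window]:
            vec.extend(state)
            vec.append(float(action))
        segments.append(vec)
    return segments


def _dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda i: _dist(point, centroids[i]))


def kmeans(points, k, iters=20, seed=0):
    """Plain k-means, used here as a stand-in for behavior clustering."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        for i, group in enumerate(groups):
            if group:  # keep the old centroid if a cluster is empty
                centroids[i] = [sum(col) / len(group) for col in zip(*group)]
    return centroids


def attribute(trajectory, t, window, centroids):
    """Attribute the action at step t to the cluster of the window containing it."""
    segments = make_segments(trajectory, window)
    idx = t // window
    if idx >= len(segments):
        raise ValueError("step t falls in an incomplete trailing window")
    return nearest(segments[idx], centroids)


# Toy data (hypothetical): two "behaviors" that recur across two trajectories.
behavior_x = [([0.0, 0.0], 0)] * 4
behavior_y = [([10.0, 10.0], 1)] * 4
traj_a = behavior_x + behavior_y
traj_b = behavior_y + behavior_x

WINDOW = 4
all_segments = make_segments(traj_a, WINDOW) + make_segments(traj_b, WINDOW)
centroids = kmeans(all_segments, k=2)

cluster_a1 = attribute(traj_a, 1, WINDOW, centroids)  # step inside behavior_x
cluster_b6 = attribute(traj_b, 6, WINDOW, centroids)  # also inside behavior_x
cluster_a5 = attribute(traj_a, 5, WINDOW, centroids)  # step inside behavior_y
print(cluster_a1 == cluster_b6, cluster_a1 != cluster_a5)
```

In this toy setting, steps taken during the same recurring pattern land in the same cluster even when they come from different trajectories, which is the kind of cross-trajectory attribution the abstract describes. A real system would presumably need learned segment boundaries and representations rather than raw flattened windows.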
