Reconstructing Actions To Explain Deep Reinforcement Learning
This work addresses the problem of explainability in deep reinforcement learning for researchers and practitioners, offering a novel method for behavior-level attributions, though it is incremental in adapting existing attribution techniques to RL.
The paper tackles the challenge of explaining deep reinforcement learning actions by proposing action reconstruction functions to mimic network behavior, enabling more complex explainability questions and quantitative evaluation via an agreement metric. Experiments on Atari games show perturbation-based attribution methods are more suitable for reconstructing actions and have greater agreement than attention-based methods, with demonstrations on Pac-Man.
Feature attribution has been a foundational building block for explaining input feature importance in supervised learning with Deep Neural Networks (DNNs), but it faces new challenges when applied to deep Reinforcement Learning (RL). We propose a new approach to explaining deep RL actions by defining a class of \emph{action reconstruction} functions that mimic the behavior of a network in deep RL. This approach allows us to answer more complex explainability questions than direct application of DNN attribution methods, which we adapt to \emph{behavior-level attributions} in building our action reconstructions. It also allows us to define \emph{agreement}, a metric for quantitatively evaluating the explainability of our methods. Our experiments on a variety of Atari games suggest that perturbation-based attribution methods are significantly more suitable for reconstructing actions to explain the deep RL agent than alternative attribution methods, and that they show greater \emph{agreement} than existing explainability work utilizing attention. We further show that action reconstruction allows us to demonstrate how a deep agent learns to play the game Pac-Man.
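To make the ideas of action reconstruction and agreement more concrete, the following is a minimal sketch and not the paper's implementation. It makes several assumptions the abstract does not state: a toy linear policy stands in for the deep network, attribution is done by occluding one feature at a time, the reconstruction keeps only the top-k attributed features, and agreement is taken to be the fraction of states where the reconstructed action equals the agent's action. All function names are hypothetical.

```python
def policy_action(weights, state):
    """Greedy action of a toy linear policy (a stand-in for the deep RL agent)."""
    scores = [sum(w * s for w, s in zip(row, state)) for row in weights]
    return max(range(len(scores)), key=scores.__getitem__)


def occlusion_attribution(weights, state, action, baseline=0.0):
    """Perturbation-based attribution (assumed form): the drop in the chosen
    action's score when a single feature is replaced by the baseline value."""
    row = weights[action]
    full = sum(w * s for w, s in zip(row, state))
    attributions = []
    for i in range(len(state)):
        perturbed = list(state)
        perturbed[i] = baseline
        score = sum(w * s for w, s in zip(row, perturbed))
        attributions.append(full - score)
    return attributions


def reconstruct_action(weights, state, k, baseline=0.0):
    """Assumed reconstruction: re-run the policy on a state that keeps only
    the k features with the largest absolute attribution."""
    action = policy_action(weights, state)
    attr = occlusion_attribution(weights, state, action, baseline)
    ranked = sorted(range(len(state)), key=lambda i: abs(attr[i]), reverse=True)
    keep = set(ranked[:k])
    masked = [s if i in keep else baseline for i, s in enumerate(state)]
    return policy_action(weights, masked)


def agreement(weights, states, k):
    """Assumed metric: fraction of states where the reconstructed action
    matches the action the agent actually takes."""
    matches = sum(
        reconstruct_action(weights, s, k) == policy_action(weights, s)
        for s in states
    )
    return matches / len(states)


if __name__ == "__main__":
    # Two actions, three input features; values are illustrative only.
    toy_weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    toy_states = [[3.0, 1.0, 0.0], [0.5, 2.0, 1.0]]
    print(agreement(toy_weights, toy_states, k=1))
```

Under these assumptions, a higher agreement value means the features an attribution method singles out are enough to reproduce the agent's choices, which is the sense in which one attribution method could be called more suitable than another.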