LG Nov 25, 2025

Attention Trajectories as a Diagnostic Axis for Deep Reinforcement Learning

Charlotte Beylier, Hannah Selder, Arthur Fleig, Simon M. Hofmann, Nico Scherf

arXiv:2511.20591v24.1

Originality: Incremental advance

AI Analysis

The methodology gives researchers a diagnostic tool for understanding and improving agent behavior beyond performance metrics, though the advance is incremental: it applies existing saliency analysis to reinforcement learning.

The paper tackles the problem of interpreting deep reinforcement learning agents' decision processes by introducing a methodology to analyze attention trajectories during training, revealing algorithm-specific biases and unintended strategies in Atari and custom environments.

While deep reinforcement learning agents demonstrate high performance across domains, their internal decision processes remain difficult to interpret when evaluated only through performance metrics. In particular, it is poorly understood which input features agents rely on, how these dependencies evolve during training, and how they relate to behavior. We introduce a scientific methodology for analyzing the learning process through quantitative analysis of saliency. This approach aggregates saliency information at the object and modality level into hierarchical attention profiles, quantifying how agents allocate attention over time, thereby forming attention trajectories throughout training. Applied to Atari benchmarks, custom Pong environments, and muscle-actuated biomechanical user simulations in visuomotor interactive tasks, this methodology uncovers algorithm-specific attention biases, reveals unintended reward-driven strategies, and diagnoses overfitting to redundant sensory channels. These patterns correspond to measurable behavioral differences, demonstrating empirical links between attention profiles, learning dynamics, and agent behavior. To assess robustness of the attention profiles, we validate our findings across multiple saliency methods and environments. The results establish attention trajectories as a promising diagnostic axis for tracing how feature reliance develops during training and for identifying biases and vulnerabilities invisible to performance metrics alone.
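
The aggregation step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `attention_profile` and `attention_trajectory`, the use of fixed boolean object masks, and the normalization by total absolute saliency are all assumptions made for illustration. The idea is that each saliency map is reduced to the share of saliency falling on each object, and these shares are stacked across training checkpoints to form a trajectory.

```python
import numpy as np


def attention_profile(saliency, object_masks):
    """Share of total absolute saliency inside each object mask (hypothetical helper)."""
    saliency = np.abs(np.asarray(saliency, dtype=float))
    total = saliency.sum()
    if total == 0.0:
        # No saliency anywhere: report zero attention for every object.
        return {name: 0.0 for name in object_masks}
    return {
        name: float(saliency[np.asarray(mask, dtype=bool)].sum() / total)
        for name, mask in object_masks.items()
    }


def attention_trajectory(saliency_per_checkpoint, object_masks):
    """Stack profiles over checkpoints: rows are checkpoints, columns are objects."""
    names = list(object_masks)
    rows = []
    for saliency in saliency_per_checkpoint:
        profile = attention_profile(saliency, object_masks)
        rows.append([profile[name] for name in names])
    return names, np.array(rows)


# Toy 4x4 "frame" with two assumed objects (masks are illustrative only).
ball = np.zeros((4, 4), dtype=bool)
ball[0, 0] = True
paddle = np.zeros((4, 4), dtype=bool)
paddle[3, :2] = True
masks = {"ball": ball, "paddle": paddle}

# Early checkpoint: saliency spread uniformly over the frame.
early = np.ones((4, 4))
# Later checkpoint: saliency concentrated on the ball and part of the paddle.
late = np.zeros((4, 4))
late[0, 0] = 3.0
late[3, 0] = 1.0

names, trajectory = attention_trajectory([early, late], masks)
for step, row in enumerate(trajectory):
    print(step, dict(zip(names, row.round(3))))
```

In a real setting the masks would change per frame as objects move, and profiles would be averaged over many frames per checkpoint; the sketch keeps the masks fixed to stay short.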
