LG AI · Jun 19, 2025

VRAIL: Vectorized Reward-based Attribution for Interpretable Learning

arXiv:2506.16014v44.1 · h-index: 1

Originality: Incremental advance

AI Analysis

The paper addresses the need for more interpretable and stable reinforcement learning methods, particularly in value-based RL. The contribution appears incremental because it builds on existing reward shaping techniques.

The paper aims to improve training stability and interpretability in reinforcement learning by proposing VRAIL, a bi-level framework that uses vectorized reward-based attribution. Results on the Taxi-v3 environment show improved convergence compared to standard DQN, without environment modifications.

We propose VRAIL (Vectorized Reward-based Attribution for Interpretable Learning), a bi-level framework for value-based reinforcement learning (RL) that learns interpretable weight representations from state features. VRAIL consists of two stages: a deep learning (DL) stage that fits an estimated value function using state features, and an RL stage that uses this to shape learning via potential-based reward transformations. The estimator is modeled in either linear or quadratic form, allowing attribution of importance to individual features and their interactions. Empirical results on the Taxi-v3 environment demonstrate that VRAIL improves training stability and convergence compared to standard DQN, without requiring environment modifications. Further analysis shows that VRAIL uncovers semantically meaningful subgoals, such as passenger possession, highlighting its ability to produce human-interpretable behavior. Our findings suggest that VRAIL serves as a general, model-agnostic framework for reward shaping that enhances both learning and interpretability.
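
To make the shaping step described in the abstract concrete, the following is a minimal sketch of a linear and a quadratic potential over state features, combined with the standard potential-based transformation r' = r + γΦ(s') − Φ(s). It is an illustration under stated assumptions, not the authors' implementation: the function names, the two example features (`has_passenger`, `at_destination`), the weight values, and the choice to zero the potential at terminal states are all hypothetical and do not come from the paper.

```python
# Hypothetical sketch of potential-based reward shaping with an
# interpretable potential. All names and numbers are illustrative assumptions.

def linear_potential(features, weights):
    """Phi(s) = sum_i w_i * x_i; each w_i attributes importance to one feature."""
    return sum(w * x for w, x in zip(weights, features))


def quadratic_potential(features, matrix):
    """Phi(s) = x^T W x; matrix[i][j] weights the interaction of features i and j."""
    n = len(features)
    return sum(
        matrix[i][j] * features[i] * features[j]
        for i in range(n)
        for j in range(n)
    )


def shaped_reward(reward, phi_s, phi_next, gamma=0.99, done=False):
    """Return r + gamma * Phi(s') - Phi(s).

    Assumption: the potential of a terminal next state is treated as 0.
    """
    next_term = 0.0 if done else gamma * phi_next
    return reward + next_term - phi_s


# Assumed feature layout: [has_passenger, at_destination].
weights = [2.0, 1.0]          # assumed learned weights
state = [0.0, 0.0]            # passenger not yet picked up
next_state = [1.0, 0.0]       # passenger picked up

phi_s = linear_potential(state, weights)
phi_next = linear_potential(next_state, weights)
bonus_step = shaped_reward(-1.0, phi_s, phi_next)  # step penalty offset by the potential gain
```

In this sketch, a large weight on `has_passenger` makes the pickup transition more rewarding than an ordinary step, which is the sense in which a feature weight can be read as attributing importance to a subgoal.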

