LG · Mar 10

Overcoming Valid Action Suppression in Unmasked Policy Gradient Algorithms

arXiv:2603.09090v167.41 citations · h-index: 16
Predicted impact: top 31% in LG · last 90 days · Originality: Highly original
AI Analysis

This paper addresses a failure mode in reinforcement learning for environments with state-dependent action validity and offers a way to deploy algorithms without requiring oracle masks.

The paper identifies that unmasked policy gradient algorithms systematically suppress valid actions at unvisited states, because gradients from visited states where those actions are invalid propagate to them. It proves an exponential decay bound for softmax policies and shows that entropy regularization trades off protection of valid actions against sample efficiency. Experiments confirm the suppression and show that feasibility classification enables deployment without oracle masks.

In reinforcement learning environments with state-dependent action validity, action masking consistently outperforms penalty-based handling of invalid actions, yet existing theory only shows that masking preserves the policy gradient theorem. We identify a distinct failure mode of unmasked training: it systematically suppresses valid actions at states the agent has not yet visited. This occurs because gradients pushing down invalid actions at visited states propagate through shared network parameters to unvisited states where those actions are valid. We prove that for softmax policies with shared features, when an action is invalid at visited states but valid at an unvisited state $s^*$, the probability $\pi(a \mid s^*)$ is bounded by exponential decay due to parameter sharing and the zero-sum identity of softmax logits. This bound reveals that entropy regularization trades off between protecting valid actions and sample efficiency, a tradeoff that masking eliminates. We validate empirically that deep networks exhibit the feature alignment condition required for suppression, and experiments on Craftax, Craftax-Classic, and MiniHack confirm the predicted exponential suppression and demonstrate that feasibility classification enables deployment without oracle masks.
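The suppression mechanism described in the abstract can be illustrated with a toy example. The sketch below is not the paper's experimental setup or its bound. It assumes a linear softmax policy with logits $z_b(s) = \theta_b \cdot \phi(s)$, a single visited state where one action is invalid and penalized, and an unvisited state $s^*$ whose features are positively aligned with the visited state's. The function names, feature vectors, learning rate, and penalty are all hypothetical choices made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy(theta, phi):
    """Linear softmax policy: logit for action b is theta[b] . phi."""
    logits = [sum(w * x for w, x in zip(row, phi)) for row in theta]
    return softmax(logits)


def train_unmasked(phi_visited, phi_unvisited, bad_action,
                   n_actions=3, steps=200, lr=0.5, penalty=1.0):
    """Exact-gradient ascent on J = -penalty * pi(bad_action | visited).

    Only the visited state is ever trained on. Returns the probability of
    bad_action at the UNVISITED state before each update, so the caller can
    see how training elsewhere changes it.
    """
    theta = [[0.0] * len(phi_visited) for _ in range(n_actions)]
    history = []
    for _ in range(steps):
        history.append(policy(theta, phi_unvisited)[bad_action])
        probs = policy(theta, phi_visited)
        for b in range(n_actions):
            indicator = 1.0 if b == bad_action else 0.0
            # dJ/dz_b = -penalty * pi_bad * (1[b == bad] - pi_b).
            # These terms sum to zero over b: the logit of the penalized
            # action goes down and the other logits go up.
            dJ_dz = -penalty * probs[bad_action] * (indicator - probs[b])
            # Shared parameters: the update direction is phi_visited, which
            # also moves logits at any state with aligned features.
            for k, x in enumerate(phi_visited):
                theta[b][k] += lr * dJ_dz * x
    return history


if __name__ == "__main__":
    # Hypothetical features with positive inner product (0.96).
    hist = train_unmasked([1.0, 0.2], [1.0, -0.2], bad_action=0)
    print("pi(a | s*) before training:", hist[0])
    print("pi(a | s*) after training: ", hist[-1])
```

In this toy, the action is never penalized at $s^*$, yet its probability there falls because every update to $\theta$ points along the visited state's features. If the invalid action were instead masked at the visited state, it would receive no gradient there, and with no other reward signal in this sketch $\theta$ would stay at its initial value.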

