LG · May 29

Zero Collapse: A Failure Mode of Policy Gradient Methods in Discontinuous Reward Environments

Nishant Kumar, Enrique Areyan Viqueira, Amy Greenwald

arXiv:2605.30896 · 45.3 · h-index: 2

AI Analysis

This paper identifies a fundamental failure mode of policy gradient methods in environments with discontinuous rewards. Such rewards are common in real-world applications like digital advertising auctions.

Policy gradient methods, including actor-critic variants, can fail in repeated auctions through "zero collapse": agents overshoot the optimal bid into a zero-reward region and become trapped there because the gradient carries no information. Policy stochasticity and biased value estimates make the problem worse.

Bidding in repeated auctions is a central challenge for reinforcement learning (RL), combining continuous control with the strategic complexities of digital advertising. While policy gradient and value-based methods seem well-suited for these settings, they often struggle with the discontinuous, "cliff-like" nature of auction reward landscapes. In a first-price auction, for example, a bidder receives zero reward until they cross a specific threshold, after which the reward decreases as the bid increases. This creates a landscape of flat, zero-reward regions separated by sharp boundaries. We identify a fundamental failure mode in this setting termed "zero collapse." We show that stochastic exploration and gradient-based updates can cause policies to overshoot optimal high-reward regions and enter flat, zero-reward regimes. Once there, the lack of an informative gradient signal makes recovery extremely sample-inefficient, effectively trapping the agent. We find that actor-critic methods are particularly susceptible, as biased value estimates can accelerate this movement toward unstable regions. Our contributions include: (1) a mechanistic explanation of how discontinuous rewards lead to vanishing signals and zero collapse; (2) an analysis of the interaction between policy stochasticity and step size; and (3) an empirical demonstration of this phenomenon across REINFORCE and actor-critic variants. We propose practical mitigation strategies involving initialization and architectural choices to improve stability. Finally, we introduce a formal RL framework for auction environments highlighting their unique structural properties.
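The vanishing-signal mechanism described above can be illustrated with a minimal sketch. It is not the paper's experimental setup: the private value of 1.0, the fixed competing bid of 0.5 that acts as the winning threshold, the Gaussian bid policy, and the function names are all assumptions chosen for illustration. The sketch estimates the REINFORCE gradient of expected reward with respect to the policy mean, once in the winning region and once deep in the flat zero-reward region.

```python
import numpy as np


def first_price_reward(bid, value=1.0, threshold=0.5):
    """Reward in a first-price auction against a fixed competing bid.

    Assumed toy setting: bids below `threshold` lose and earn zero;
    bids at or above it win and earn `value - bid`.
    """
    bid = np.asarray(bid, dtype=float)
    return np.where(bid >= threshold, value - bid, 0.0)


def reinforce_grad_mean(mu, sigma, n_samples=10_000, seed=0):
    """Monte Carlo REINFORCE estimate of d E[reward] / d mu.

    The policy is Gaussian, bid ~ N(mu, sigma^2), so the score function
    with respect to mu is (bid - mu) / sigma^2.
    """
    rng = np.random.default_rng(seed)
    bids = rng.normal(mu, sigma, size=n_samples)
    rewards = first_price_reward(bids)
    score = (bids - mu) / sigma**2
    return float(np.mean(rewards * score))


# Mean bid in the winning region: the estimate is negative, so a gradient
# ascent step lowers the bid and moves the policy toward the cliff.
grad_winning = reinforce_grad_mean(mu=0.7, sigma=0.05)

# Mean bid eight standard deviations below the threshold: no sampled bid
# wins, every reward is zero, and the gradient estimate is zero as well.
grad_flat = reinforce_grad_mean(mu=0.1, sigma=0.05)

print(f"gradient in winning region: {grad_winning:.3f}")
print(f"gradient in flat region:    {grad_flat:.3f}")
```

In this toy setting, a step large enough to carry the mean well below the threshold leaves the policy with no sampled reward to learn from, which is the trapping behavior the abstract calls zero collapse. Recovery then depends on a sampled bid happening to land above the threshold.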
