LGOct 21, 2025

Why Policy Gradient Algorithms Work for Undiscounted Total-Reward MDPs

arXiv:2510.18340v12 citationsh-index: 4
Originality Incremental advance
AI Analysis

This work addresses a theoretical gap for researchers and practitioners using policy-based reinforcement learning in settings like large language models, where undiscounted rewards are common, though it is incremental as it builds on classical methods.

The paper tackles the problem of analyzing policy gradient algorithms for undiscounted total-reward Markov Decision Processes (MDPs), which lack existing theoretical guarantees, and provides a new analysis based on classifying states and introducing a transient visitation measure to establish convergence.

The classical policy gradient method is the theoretical and conceptual foundation of modern policy-based reinforcement learning (RL) algorithms. Most rigorous analyses of such methods, particularly those establishing convergence guarantees, assume a discount factor $γ< 1$. In contrast, however, a recent line of work on policy-based RL for large language models uses the undiscounted total-reward setting with $γ= 1$, rendering much of the existing theory inapplicable. In this paper, we provide analyses of the policy gradient method for undiscounted expected total-reward infinite-horizon MDPs based on two key insights: (i) the classification of the MDP states into recurrent and transient states is invariant over the set of policies that assign strictly positive probability to every action (as is typical in deep RL models employing a softmax output layer) and (ii) the classical state visitation measure (which may be ill-defined when $γ= 1$) can be replaced with a new object that we call the transient visitation measure.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes