LGMLApr 23, 2025

Natural Policy Gradient for Average Reward Non-Stationary RL

arXiv:2504.16415v11 citationsh-index: 5
Originality Incremental advance
AI Analysis

This addresses the lack of theoretical understanding for policy-based methods in non-stationary RL, which is incremental but important for practical applications.

The paper tackles the problem of non-stationary reinforcement learning in the average-reward setting by proposing the first model-free policy-based algorithms, achieving a dynamic regret bound of ̃O(|S|^{1/2}|A|^{1/2}Δ_T^{1/6}T^{5/6}).

We consider the problem of non-stationary reinforcement learning (RL) in the infinite-horizon average-reward setting. We model it by a Markov Decision Process with time-varying rewards and transition probabilities, with a variation budget of $Δ_T$. Existing non-stationary RL algorithms focus on model-based and model-free value-based methods. Policy-based methods despite their flexibility in practice are not theoretically well understood in non-stationary RL. We propose and analyze the first model-free policy-based algorithm, Non-Stationary Natural Actor-Critic (NS-NAC), a policy gradient method with a restart based exploration for change and a novel interpretation of learning rates as adapting factors. Further, we present a bandit-over-RL based parameter-free algorithm BORL-NS-NAC that does not require prior knowledge of the variation budget $Δ_T$. We present a dynamic regret of $\tilde{\mathscr O}(|S|^{1/2}|A|^{1/2}Δ_T^{1/6}T^{5/6})$ for both algorithms, where $T$ is the time horizon, and $|S|$, $|A|$ are the sizes of the state and action spaces. The regret analysis leverages a novel adaptation of the Lyapunov function analysis of NAC to dynamic environments and characterizes the effects of simultaneous updates in policy, value function estimate and changes in the environment.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes