Value Mirror Descent for Reinforcement Learning

arXiv:2604.0603931.4

Predicted impact top 45% in OC · last 90 daysOriginality Highly original

AI Analysis

This provides a new method for reinforcement learning with improved sample efficiency and bounded policy divergence, addressing challenges in offline training and enabling effective online learning.

The paper tackles the problem of computing optimal value functions in reinforcement learning by introducing value mirror descent (VMD), which integrates mirror descent into value iteration, achieving near-optimal sample complexity of Õ(|S||A|(1-γ)^{-3}ε^{-2}) for general convex regularizers and Õ(|S||A|(1-γ)^{-5}ε^{-1}) under strong convexity.

Value iteration-type methods have been extensively studied for computing a nearly optimal value function in reinforcement learning (RL). Under a generative sampling model, these methods can achieve sharper sample complexity than policy optimization approaches, particularly in their dependence on the discount factor. In practice, they are often employed for offline training or in simulated environments. In this paper, we consider discounted Markov decision processes with state space S, action space A, discount factor $Î³\in(0,1)$ and costs in $[0,1]$. We introduce a novel value optimization method, termed value mirror descent (VMD), which integrates mirror descent from convex optimization into the classical value iteration framework. In the deterministic setting with known transition kernels, we show that VMD converges linearly. For the stochastic setting with a generative model, we develop a stochastic variant, SVMD, which incorporates variance reduction commonly used in stochastic value iteration-type methods. For RL problems with general convex regularizers, SVMD attains a near-optimal sample complexity of $\tilde{O}(|S||A|(1-Î³)^{-3}Îµ^{-2})$. Moreover, we establish that the Bregman divergence between the generated and optimal policies remains bounded throughout the iterations. This property is absent in existing stochastic value iteration-type methods but is important for enabling effective online (continual) learning following offline training. Under a strongly convex regularizer, SVMD achieves sample complexity of $\tilde{O}(|S||A|(1-Î³)^{-5}Îµ^{-1})$, improving performance in the high-accuracy regime. Furthermore, we prove convergence of the generated policy to the optimal policy. Overall, the proposed method, its analysis, and the resulting guarantees, constitute new contributions to the RL and optimization literature.

View on arXiv PDF

Similar