OCLGSep 25, 2024

Landscape of Policy Optimization for Finite Horizon MDPs with General State and Action

arXiv:2409.17138v14 citationsh-index: 4
Originality Incremental advance
AI Analysis

This work addresses a fundamental problem in reinforcement learning for researchers and practitioners by providing theoretical guarantees for policy optimization in complex MDPs, with incremental improvements in sample complexity for specific domains like inventory systems.

The paper tackles the challenge of global convergence in policy gradient methods for finite-horizon MDPs with general state and action spaces by developing a framework that ensures the Kurdyka-Lojasiewicz condition, leading to convergence to globally optimal policies with a non-asymptotic rate. It demonstrates applications in control and operations models, achieving an ε-optimal policy with a sample complexity of Õ(ε⁻¹) and polynomial in the planning horizon.

Policy gradient methods are widely used in reinforcement learning. Yet, the nonconvexity of policy optimization imposes significant challenges in understanding the global convergence of policy gradient methods. For a class of finite-horizon Markov Decision Processes (MDPs) with general state and action spaces, we develop a framework that provides a set of easily verifiable assumptions to ensure the Kurdyka-Lojasiewicz (KL) condition of the policy optimization. Leveraging the KL condition, policy gradient methods converge to the globally optimal policy with a non-asymptomatic rate despite nonconvexity. Our results find applications in various control and operations models, including entropy-regularized tabular MDPs, Linear Quadratic Regulator (LQR) problems, stochastic inventory models, and stochastic cash balance problems, for which we show an $ε$-optimal policy can be obtained using a sample size in $\tilde{\mathcal{O}}(ε^{-1})$ and polynomial in terms of the planning horizon by stochastic policy gradient methods. Our result establishes the first sample complexity for multi-period inventory systems with Markov-modulated demands and stochastic cash balance problems in the literature.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes