LG AIOct 18, 2023

Improved Sample Complexity Analysis of Natural Policy Gradient Algorithm with General Parameterization for Infinite Horizon Discounted Reward Markov Decision Processes

arXiv:2310.11677v219.224 citationsh-index: 9

Originality Incremental advance

AI Analysis

This work provides a more efficient algorithm for reinforcement learning practitioners dealing with complex, high-dimensional decision-making problems, though it is incremental as it builds on existing natural policy gradient methods.

The paper tackles the problem of sample-efficient learning for infinite-horizon discounted reward Markov Decision Processes by proposing the Accelerated Natural Policy Gradient (ANPG) algorithm, which achieves O(ε^{-2}) sample complexity and O(ε^{-1}) iteration complexity, improving the state-of-the-art sample complexity by a log(1/ε) factor.

We consider the problem of designing sample efficient learning algorithms for infinite horizon discounted reward Markov Decision Process. Specifically, we propose the Accelerated Natural Policy Gradient (ANPG) algorithm that utilizes an accelerated stochastic gradient descent process to obtain the natural policy gradient. ANPG achieves $\mathcal{O}({ε^{-2}})$ sample complexity and $\mathcal{O}(ε^{-1})$ iteration complexity with general parameterization where $ε$ defines the optimality error. This improves the state-of-the-art sample complexity by a $\log(\frac{1}ε)$ factor. ANPG is a first-order algorithm and unlike some existing literature, does not require the unverifiable assumption that the variance of importance sampling (IS) weights is upper bounded. In the class of Hessian-free and IS-free algorithms, ANPG beats the best-known sample complexity by a factor of $\mathcal{O}(ε^{-\frac{1}{2}})$ and simultaneously matches their state-of-the-art iteration complexity.

View on arXiv PDF

Similar