LGApr 2, 2024

Order-Optimal Regret with Novel Policy Gradient Approaches in Infinite-Horizon Average Reward MDPs

Swetha Ganesh, Washim Uddin Mondal, Vaneet Aggarwal

arXiv:2404.02108v213.411 citationsh-index: 8AISTATS

Originality Highly original

AI Analysis

This work addresses the challenge of efficient reinforcement learning in continuous or large-scale environments for researchers and practitioners, representing a significant advancement rather than an incremental improvement.

The authors tackled the problem of optimizing policy gradient algorithms for infinite-horizon average reward Markov Decision Processes, achieving expected regret bounds of $ ilde{\mathcal{O}}(T^{2/3})$ and $ ilde{\mathcal{O}}(\sqrt{T})$, which improve upon the previous state-of-the-art $ ilde{\mathcal{O}}(T^{3/4})$ and match the theoretical lower bound.

We present two Policy Gradient-based algorithms with general parametrization in the context of infinite-horizon average reward Markov Decision Process (MDP). The first one employs Implicit Gradient Transport for variance reduction, ensuring an expected regret of the order $\tilde{\mathcal{O}}(T^{2/3})$. The second approach, rooted in Hessian-based techniques, ensures an expected regret of the order $\tilde{\mathcal{O}}(\sqrt{T})$. These results significantly improve the state-of-the-art $\tilde{\mathcal{O}}(T^{3/4})$ regret and achieve the theoretical lower bound. We also show that the average-reward function is approximately $L$-smooth, a result that was previously assumed in earlier works.

View on arXiv PDF

Similar