Towards Global Optimality for Practical Average Reward Reinforcement Learning without Mixing Time Oracles
This addresses a practical limitation for researchers and practitioners in RL by enabling more efficient policy optimization without requiring expensive mixing time estimates, though it is incremental as it builds on existing frameworks.
The paper tackles the problem of global convergence in average-reward reinforcement learning by eliminating the need for oracle knowledge of mixing time, which is difficult to estimate in large state spaces, and achieves a tight dependency of O(sqrt(tau_mix)) and outperforms state-of-the-art methods in experiments.
In the context of average-reward reinforcement learning, the requirement for oracle knowledge of the mixing time, a measure of the duration a Markov chain under a fixed policy needs to achieve its stationary distribution, poses a significant challenge for the global convergence of policy gradient methods. This requirement is particularly problematic due to the difficulty and expense of estimating mixing time in environments with large state spaces, leading to the necessity of impractically long trajectories for effective gradient estimation in practical applications. To address this limitation, we consider the Multi-level Actor-Critic (MAC) framework, which incorporates a Multi-level Monte-Carlo (MLMC) gradient estimator. With our approach, we effectively alleviate the dependency on mixing time knowledge, a first for average-reward MDPs global convergence. Furthermore, our approach exhibits the tightest available dependence of $\mathcal{O}\left( \sqrt{τ_{mix}} \right)$known from prior work. With a 2D grid world goal-reaching navigation experiment, we demonstrate that MAC outperforms the existing state-of-the-art policy gradient-based method for average reward settings.