Efficiently Solving MDPs with Stochastic Mirror Descent
This provides a unified, model-free method for reinforcement learning practitioners to compute near-optimal policies with improved or matched sample efficiency, though it is incremental in nature.
The paper tackles the problem of solving infinite-horizon Markov decision processes (MDPs) efficiently using a primal-dual stochastic mirror descent framework, achieving sample complexities of $\widetilde{O}(t_{mix}^2 A_{tot} \epsilon^{-2})$ for average-reward MDPs and $\widetilde{O}((1-\gamma)^{-4} A_{tot} \epsilon^{-2})$ for discounted MDPs, with linear runtime.
We present a unified framework based on primal-dual stochastic mirror descent for approximately solving infinite-horizon Markov decision processes (MDPs) given a generative model. When applied to an average-reward MDP with $A_{tot}$ total state-action pairs and mixing time bound $t_{mix}$ our method computes an $ε$-optimal policy with an expected $\widetilde{O}(t_{mix}^2 A_{tot} ε^{-2})$ samples from the state-transition matrix, removing the ergodicity dependence of prior art. When applied to a $γ$-discounted MDP with $A_{tot}$ total state-action pairs our method computes an $ε$-optimal policy with an expected $\widetilde{O}((1-γ)^{-4} A_{tot} ε^{-2})$ samples, matching the previous state-of-the-art up to a $(1-γ)^{-1}$ factor. Both methods are model-free, update state values and policies simultaneously, and run in time linear in the number of samples taken. We achieve these results through a more general stochastic mirror descent framework for solving bilinear saddle-point problems with simplex and box domains and we demonstrate the flexibility of this framework by providing further applications to constrained MDPs.