LGJun 28, 2023

Sharper Model-free Reinforcement Learning for Average-reward Markov Decision Processes

arXiv:2306.16394v120.436 citationsh-index: 19

Originality Highly original

AI Analysis

This work addresses efficient learning in average-reward MDPs, a core problem in RL, with theoretical guarantees that are novel for weakly communicating MDPs.

The paper tackles the problem of model-free reinforcement learning for average-reward Markov Decision Processes, achieving optimal regret bounds in the online setting and improved sample complexity in the simulator setting, with results like O~(S^5A^2sp(h*)√T) regret and O~(SAsp^2(h*)/ε^2 + S^2Asp(h*)/ε) sample complexity.

We develop several provably efficient model-free reinforcement learning (RL) algorithms for infinite-horizon average-reward Markov Decision Processes (MDPs). We consider both online setting and the setting with access to a simulator. In the online setting, we propose model-free RL algorithms based on reference-advantage decomposition. Our algorithm achieves $\widetilde{O}(S^5A^2\mathrm{sp}(h^*)\sqrt{T})$ regret after $T$ steps, where $S\times A$ is the size of state-action space, and $\mathrm{sp}(h^*)$ the span of the optimal bias function. Our results are the first to achieve optimal dependence in $T$ for weakly communicating MDPs. In the simulator setting, we propose a model-free RL algorithm that finds an $ε$-optimal policy using $\widetilde{O} \left(\frac{SA\mathrm{sp}^2(h^*)}{ε^2}+\frac{S^2A\mathrm{sp}(h^*)}ε \right)$ samples, whereas the minimax lower bound is $Ω\left(\frac{SA\mathrm{sp}(h^*)}{ε^2}\right)$. Our results are based on two new techniques that are unique in the average-reward setting: 1) better discounted approximation by value-difference estimation; 2) efficient construction of confidence region for the optimal bias function with space complexity $O(SA)$.

View on arXiv PDF

Similar