LG MLMay 25, 2019

Large Scale Markov Decision Processes with Changing Rewards

arXiv:1905.10649v15.415 citations

Originality Incremental advance

AI Analysis

This addresses the challenge of decision-making under uncertainty in dynamic environments for reinforcement learning applications, offering incremental improvements in scalability and regret bounds.

The paper tackles the problem of Markov Decision Processes with unknown and adversarially changing rewards, providing an algorithm that achieves a state-of-the-art regret bound of O(√(τ(ln|S|+ln|A|)T)ln(T)) and extends it to large-scale settings with a linear approximation for computational efficiency, achieving the first Õ(√T) regret bound in this context.

We consider Markov Decision Processes (MDPs) where the rewards are unknown and may change in an adversarial manner. We provide an algorithm that achieves state-of-the-art regret bound of $O( \sqrt{τ(\ln|S|+\ln|A|)T}\ln(T))$, where $S$ is the state space, $A$ is the action space, $τ$ is the mixing time of the MDP, and $T$ is the number of periods. The algorithm's computational complexity is polynomial in $|S|$ and $|A|$ per period. We then consider a setting often encountered in practice, where the state space of the MDP is too large to allow for exact solutions. By approximating the state-action occupancy measures with a linear architecture of dimension $d\ll|S|$, we propose a modified algorithm with computational complexity polynomial in $d$. We also prove a regret bound for this modified algorithm, which to the best of our knowledge this is the first $\tilde{O}(\sqrt{T})$ regret bound for large scale MDPs with changing rewards.

View on arXiv PDF

Similar