LG ML Dec 3, 2019

Learning Adversarial MDPs with Bandit Feedback and Unknown Transition

Chi Jin, Tiancheng Jin, Haipeng Luo, Suvrit Sra, Tiancheng Yu

arXiv:1912.01192v523.4119 citations

Originality: Incremental advance

AI Analysis

This paper addresses a challenging reinforcement learning problem with practical implications for scenarios with limited feedback, though it is incremental relative to existing methods.

The paper tackles the problem of learning in episodic finite-horizon Markov decision processes with unknown transition, bandit feedback, and adversarial losses, achieving an efficient algorithm with $\mathcal{\tilde{O}}(L|X|\sqrt{|A|T})$ regret, which matches prior work in an easier setting.

We consider the problem of learning in episodic finite-horizon Markov decision processes with an unknown transition function, bandit feedback, and adversarial losses. We propose an efficient algorithm that achieves $\mathcal{\tilde{O}}(L|X|\sqrt{|A|T})$ regret with high probability, where $L$ is the horizon, $|X|$ is the number of states, $|A|$ is the number of actions, and $T$ is the number of episodes. To the best of our knowledge, our algorithm is the first to ensure $\mathcal{\tilde{O}}(\sqrt{T})$ regret in this challenging setting; in fact it achieves the same regret bound as (Rosenberg & Mansour, 2019a) that considers an easier setting with full-information feedback. Our key technical contributions are two-fold: a tighter confidence set for the transition function, and an optimistic loss estimator that is inversely weighted by an $\textit{upper occupancy bound}$.
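The second contribution can be illustrated with a minimal sketch. The abstract does not state the estimator's formula, so the form below is an assumption: a standard importance-weighted loss estimate in which the unknown visitation probability of a state-action pair is replaced by an upper bound on it, plus a small positive bias term. The function name, the parameter names, and the `gamma` term are hypothetical and chosen only for illustration.

```python
def optimistic_loss_estimate(loss, visited, upper_occupancy, gamma):
    """Sketch of an inversely weighted loss estimate for one state-action pair.

    loss            -- loss observed for the pair this episode (used only if visited)
    visited         -- True if the learner's trajectory passed through the pair
    upper_occupancy -- assumed upper bound on the probability of visiting the pair
    gamma           -- small positive bias term (an assumption, not from the abstract)
    """
    if not visited:
        # Under bandit feedback, unvisited pairs reveal no loss.
        return 0.0
    # Dividing by an upper bound (rather than the true probability) can only
    # shrink the estimate, which is what makes it "optimistic" for losses.
    return loss / (upper_occupancy + gamma)


# Usage: a pair visited with observed loss 0.5 and upper occupancy bound 0.25.
estimate = optimistic_loss_estimate(0.5, True, 0.25, 0.0)
```

Because the weight is an upper bound on the true visitation probability, the estimate is never larger than the usual importance-weighted estimate, so it underestimates losses in expectation. How that bias is controlled depends on how tight the confidence set for the transition function is.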
