ML LG OCDec 13, 2019

Provably Efficient Reinforcement Learning with Aggregated States

Shi Dong, Benjamin Van Roy, Zhengyuan Zhou

arXiv:1912.06366v217.335 citations

Originality Highly original

AI Analysis

This provides a provably efficient method for reinforcement learning with value function approximation, addressing a foundational challenge in AI for applications like robotics and control, though it is incremental in extending existing regret bounds to aggregated states.

The paper tackles the problem of reinforcement learning with aggregated states by establishing a regret bound for an optimistic Q-learning variant, achieving $ ilde{\mathcal{O}}(\sqrt{H^5 M K} + \varepsilon H K)$ without dependence on the number of states or actions, with asymptotic per-period regret no greater than $\varepsilon$.

We establish that an optimistic variant of Q-learning applied to a fixed-horizon episodic Markov decision process with an aggregated state representation incurs regret $\tilde{\mathcal{O}}(\sqrt{H^5 M K} + εHK)$, where $H$ is the horizon, $M$ is the number of aggregate states, $K$ is the number of episodes, and $ε$ is the largest difference between any pair of optimal state-action values associated with a common aggregate state. Notably, this regret bound does not depend on the number of states or actions and indicates that asymptotic per-period regret is no greater than $ε$, independent of horizon. To our knowledge, this is the first such result that applies to reinforcement learning with nontrivial value function approximation without any restrictions on transition probabilities.

View on arXiv PDF

Similar