LG AI OC MLFeb 17, 2020

Agnostic Q-learning with Function Approximation in Deterministic Systems: Tight Bounds on Approximation Error and Sample Complexity

Simon S. Du, Jason D. Lee, Gaurav Mahajan, Ruosong Wang

arXiv:2002.07125v122.150 citations

Originality Highly original

AI Analysis

This provides tight bounds on approximation error and sample complexity for reinforcement learning algorithms, addressing a foundational issue in the field with implications for efficient learning in deterministic and stochastic settings.

The paper tackles the problem of agnostic Q-learning with function approximation in deterministic systems, proposing a novel recursion-based algorithm that achieves optimal policy learning with O(dim_E) trajectories when the approximation error δ is O(ρ/√dim_E), where ρ is the action gap and dim_E is the Eluder dimension, and it settles an open problem from prior work.

The current paper studies the problem of agnostic $Q$-learning with function approximation in deterministic systems where the optimal $Q$-function is approximable by a function in the class $\mathcal{F}$ with approximation error $δ\ge 0$. We propose a novel recursion-based algorithm and show that if $δ= O\left(ρ/\sqrt{\dim_E}\right)$, then one can find the optimal policy using $O\left(\dim_E\right)$ trajectories, where $ρ$ is the gap between the optimal $Q$-value of the best actions and that of the second-best actions and $\dim_E$ is the Eluder dimension of $\mathcal{F}$. Our result has two implications: 1) In conjunction with the lower bound in [Du et al., ICLR 2020], our upper bound suggests that the condition $δ= \widetildeΘ\left(ρ/\sqrt{\mathrm{dim}_E}\right)$ is necessary and sufficient for algorithms with polynomial sample complexity. 2) In conjunction with the lower bound in [Wen and Van Roy, NIPS 2013], our upper bound suggests that the sample complexity $\widetildeΘ\left(\mathrm{dim}_E\right)$ is tight even in the agnostic setting. Therefore, we settle the open problem on agnostic $Q$-learning proposed in [Wen and Van Roy, NIPS 2013]. We further extend our algorithm to the stochastic reward setting and obtain similar results.

View on arXiv PDF

Similar