Q-learning with Nearest Neighbors
This work addresses the sample-efficiency challenge in reinforcement learning for continuous domains, offering theoretical guarantees that incrementally improve on existing bounds.
The paper tackles the problem of model-free reinforcement learning for continuous-state Markov Decision Processes with unknown transitions, using a single sample path, by providing tight finite sample analysis for the Nearest Neighbor Q-Learning algorithm. It establishes an upper bound sample complexity of $\tilde{O}\big(1/\varepsilon^{d+3}\big)$ and a lower bound of $\tilde{\Omega}\big(1/\varepsilon^{d+2}\big)$ for well-behaved MDPs.
We consider model-free reinforcement learning for infinite-horizon discounted Markov Decision Processes (MDPs) with a continuous state space and unknown transition kernel, when only a single sample path under an arbitrary policy of the system is available. We consider the Nearest Neighbor Q-Learning (NNQL) algorithm to learn the optimal Q-function using the nearest neighbor regression method. As the main contribution, we provide a tight finite-sample analysis of the convergence rate. In particular, for MDPs with a $d$-dimensional state space and discount factor $\gamma \in (0,1)$, given an arbitrary sample path with "covering time" $L$, we establish that the algorithm is guaranteed to output an $\varepsilon$-accurate estimate of the optimal Q-function using $\tilde{O}\big(L/(\varepsilon^3(1-\gamma)^7)\big)$ samples. For instance, for a well-behaved MDP, the covering time of the sample path under the purely random policy scales as $\tilde{O}\big(1/\varepsilon^d\big)$, so the sample complexity scales as $\tilde{O}\big(1/\varepsilon^{d+3}\big)$. Indeed, we establish a lower bound showing that a dependence of $\tilde{\Omega}\big(1/\varepsilon^{d+2}\big)$ is necessary.
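To make the setting in the abstract concrete, here is a minimal Python sketch of Q-learning combined with nearest-neighbor regression on a single sample path. It is an illustration under stated assumptions, not the NNQL algorithm as analyzed in the paper: the one-dimensional state space $[0,1]$, the toy dynamics and reward in `step`, the evenly spaced `centers`, and the constant step size `alpha` are all hypothetical choices made for this example.

```python
import random


def nn_q_learning(num_centers=20, gamma=0.9, alpha=0.1, steps=50_000, seed=0):
    """Toy Q-learning with nearest-neighbor state aggregation on [0, 1].

    Assumptions for illustration only: 1-D state, two actions, made-up
    dynamics and reward, and a constant step size.
    """
    num_actions = 2
    rng = random.Random(seed)

    # Cover the state space with evenly spaced centers; a continuous state
    # is represented by the Q-values stored at its nearest center.
    centers = [(k + 0.5) / num_centers for k in range(num_centers)]
    q = [[0.0] * num_actions for _ in centers]

    def nearest(x):
        # Index of the center closest to state x (1-nearest-neighbor lookup).
        return min(range(num_centers), key=lambda k: abs(centers[k] - x))

    def step(x, a):
        # Hypothetical dynamics: action 1 drifts right, action 0 drifts left,
        # with Gaussian noise; the state is clipped to [0, 1].
        drift = 0.1 if a == 1 else -0.1
        x_next = min(1.0, max(0.0, x + drift + rng.gauss(0.0, 0.02)))
        reward = x_next  # reward in [0, 1], larger toward the right edge
        return x_next, reward

    # A single sample path generated by the purely random policy.
    x = rng.random()
    for _ in range(steps):
        a = rng.randrange(num_actions)
        x_next, r = step(x, a)
        i, j = nearest(x), nearest(x_next)
        # Standard Q-learning target, evaluated at the nearest center
        # of the next state.
        target = r + gamma * max(q[j])
        q[i][a] += alpha * (target - q[i][a])
        x = x_next
    return centers, q


if __name__ == "__main__":
    centers, q = nn_q_learning()
    for c, row in zip(centers, q):
        greedy = max(range(len(row)), key=lambda a: row[a])
        print(f"center={c:.3f}  Q={row[0]:.2f}, {row[1]:.2f}  greedy action={greedy}")
```

In this sketch the number of centers plays the role of the discretization resolution: a finer cover reduces the approximation error of the nearest-neighbor lookup, but the random path then needs longer to visit every center, which is the intuition behind the covering time $L$ in the bound above.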