MLLGSYOCAug 13, 2024

Variance-Reduced Cascade Q-learning: Algorithms and Sample Complexity

arXiv:2408.06544v2h-index: 27
AI Analysis

This provides improved sample complexity for reinforcement learning algorithms, though it appears incremental as it combines existing variance reduction techniques with a new scheme.

The paper tackles the problem of estimating optimal Q-functions in discounted Markov decision processes under synchronous sampling, introducing Variance-Reduced Cascade Q-learning (VRCQ) which achieves minimax optimal guarantees in the ℓ∞-norm and, for policy evaluation, requires the theoretically minimum number of samples with non-asymptotic instance optimality.

We study the problem of estimating the optimal Q-function of $γ$-discounted Markov decision processes (MDPs) under the synchronous setting, where independent samples for all state-action pairs are drawn from a generative model at each iteration. We introduce and analyze a novel model-free algorithm called Variance-Reduced Cascade Q-learning (VRCQ). VRCQ comprises two key building blocks: (i) the established direct variance reduction technique and (ii) our proposed variance reduction scheme, Cascade Q-learning. By leveraging these techniques, VRCQ provides superior guarantees in the $\ell_\infty$-norm compared with the existing model-free stochastic approximation-type algorithms. Specifically, we demonstrate that VRCQ is minimax optimal. Additionally, when the action set is a singleton (so that the Q-learning problem reduces to policy evaluation), it achieves non-asymptotic instance optimality while requiring the minimum number of samples theoretically possible. Our theoretical results and their practical implications are supported by numerical experiments.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes