MLLGOct 27, 2023

Model-free Posterior Sampling via Learning Rate Randomization

arXiv:2310.18186v28 citationsh-index: 43
Originality Highly original
AI Analysis

This addresses the problem of efficient exploration in reinforcement learning for researchers and practitioners, offering a novel method that is incremental in its algorithmic design.

The paper tackles regret minimization in episodic Markov Decision Processes by introducing Randomized Q-learning (RandQL), a model-free posterior sampling algorithm that achieves regret bounds of order $\widetilde{O}(\sqrt{H^{5}SAT})$ in tabular settings and $\widetilde{O}(H^{5/2} T^{(d_z+1)/(d_z+2)})$ in metric spaces, and empirically outperforms existing approaches.

In this paper, we introduce Randomized Q-learning (RandQL), a novel randomized model-free algorithm for regret minimization in episodic Markov Decision Processes (MDPs). To the best of our knowledge, RandQL is the first tractable model-free posterior sampling-based algorithm. We analyze the performance of RandQL in both tabular and non-tabular metric space settings. In tabular MDPs, RandQL achieves a regret bound of order $\widetilde{O}(\sqrt{H^{5}SAT})$, where $H$ is the planning horizon, $S$ is the number of states, $A$ is the number of actions, and $T$ is the number of episodes. For a metric state-action space, RandQL enjoys a regret bound of order $\widetilde{O}(H^{5/2} T^{(d_z+1)/(d_z+2)})$, where $d_z$ denotes the zooming dimension. Notably, RandQL achieves optimistic exploration without using bonuses, relying instead on a novel idea of learning rate randomization. Our empirical study shows that RandQL outperforms existing approaches on baseline exploration environments.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes