LG SYSep 6, 2024

Faster Q-Learning Algorithms for Restless Bandits

Parvish Kakarapalli, Devendra Kayande, Rahul Meshram

arXiv:2409.05908v14.65 citationsh-index: 2

Originality Synthesis-oriented

AI Analysis

This work addresses faster convergence in reinforcement learning for restless bandits, but it appears incremental as it builds on existing Q-learning methods with minor modifications.

The paper tackles the problem of learning Whittle indices for restless multi-armed bandits by proposing Q-learning variants with exploration policies, showing that PhaseQL with UCB achieves the fastest convergence rate in numerical examples.

We study the Whittle index learning algorithm for restless multi-armed bandits (RMAB). We first present Q-learning algorithm and its variants -- speedy Q-learning (SQL), generalized speedy Q-learning (GSQL) and phase Q-learning (PhaseQL). We also discuss exploration policies -- $ε$-greedy and Upper confidence bound (UCB). We extend the study of Q-learning and its variants with UCB policy. We illustrate using numerical example that Q-learning with UCB exploration policy has faster convergence and PhaseQL with UCB have fastest convergence rate. We next extend the study of Q-learning variants for index learning to RMAB. The algorithm of index learning is two-timescale variant of stochastic approximation, on slower timescale we update index learning scheme and on faster timescale we update Q-learning assuming fixed index value. We study constant stepsizes two timescale stochastic approximation algorithm. We describe the performance of our algorithms using numerical example. It illustrate that index learning with Q learning with UCB has faster convergence that $ε$ greedy. Further, PhaseQL (with UCB and $ε$ greedy) has the best convergence than other Q-learning algorithms.

View on arXiv PDF

Similar