LGFeb 7, 2025

From Restless to Contextual: A Thresholding Bandit Reformulation For Finite-horizon Performance

Jiamin Xu, Ivan Nazarov, Aditya Rastogi, África Periáñez, Kyra Gan

arXiv:2502.05145v44.1h-index: 2Has Code

Originality Highly original

AI Analysis

This work addresses the sample complexity problem in restless bandits for researchers and practitioners, offering a practical, sample-efficient solution that is incremental in improving existing methods.

The paper tackles the poor finite-horizon performance of restless bandit algorithms by reformulating them as budgeted thresholding contextual bandits, achieving faster convergence and higher cumulative reward with significant gains over state-of-the-art methods in large-scale heterogeneous environments.

This paper addresses the poor finite-horizon performance of existing online \emph{restless bandit} (RB) algorithms, which stems from the prohibitive sample complexity of learning a full \emph{Markov decision process} (MDP) for each agent. We argue that superior finite-horizon performance requires \emph{rapid convergence} to a \emph{high-quality} policy. Thus motivated, we introduce a reformulation of online RBs as a \emph{budgeted thresholding contextual bandit}, which simplifies the learning problem by encoding long-term state transitions into a scalar reward. We prove the first non-asymptotic optimality of an oracle policy for a simplified finite-horizon setting. We propose a practical learning policy under a heterogeneous-agent, multi-state setting, and show that it achieves a sublinear regret, achieving \emph{faster convergence} than existing methods. This directly translates to higher cumulative reward, as empirically validated by significant gains over state-of-the-art algorithms in large-scale heterogeneous environments. Our work provides a new pathway for achieving practical, sample-efficient learning in finite-horizon RBs.

View on arXiv PDF Code

Similar