LG MLFeb 25, 2022

Non-stationary Bandits and Meta-Learning with a Small Set of Optimal Arms

MohammadJavad Azizi, Thang Duong, Yasin Abbasi-Yadkori, András György, Claire Vernade, Mohammad Ghavamzadeh

arXiv:2202.13001v66.98 citationsHas Code

Originality Incremental advance

AI Analysis

This work addresses the challenge of efficient learning in dynamic environments for applications like online advertising or recommendation systems, but it is incremental as it builds on existing bandit and meta-learning frameworks.

The paper tackles the problem of sequential decision-making in non-stationary and meta-learning bandit settings, where the goal is to compete with the best subset of arms of size M, and shows that their algorithm achieves regret bounds like O~(NM√(Mτ) + N^{2/3}Mτ), which improves over standard baselines in regimes with many tasks and few optimal arms.

We study a sequential decision problem where the learner faces a sequence of $K$-armed bandit tasks. The task boundaries might be known (the bandit meta-learning setting), or unknown (the non-stationary bandit setting). For a given integer $M\le K$, the learner aims to compete with the best subset of arms of size $M$. We design an algorithm based on a reduction to bandit submodular maximization, and show that, for $T$ rounds comprised of $N$ tasks, in the regime of large number of tasks and small number of optimal arms $M$, its regret in both settings is smaller than the simple baseline of $\tilde{O}(\sqrt{KNT})$ that can be obtained by using standard algorithms designed for non-stationary bandit problems. For the bandit meta-learning problem with fixed task length $τ$, we show that the regret of the algorithm is bounded as $\tilde{O}(NM\sqrt{M τ}+N^{2/3}Mτ)$. Under additional assumptions on the identifiability of the optimal arms in each task, we show a bandit meta-learning algorithm with an improved $\tilde{O}(N\sqrt{M τ}+N^{1/2}\sqrt{M K τ})$ regret.

View on arXiv PDF Code

Similar