LG · ML · May 4, 2024

Leveraging (Biased) Information: Multi-armed Bandits with Offline Data

arXiv:2405.02594v1 · 12 citations · h-index: 3
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of integrating offline data into online decision-making for bandit algorithms. The advance is incremental because it builds on existing UCB methods.

The paper studies how potentially biased offline data can improve online learning in stochastic multi-armed bandits. It shows that without a non-trivial bound on the difference between the offline and online distributions, no non-anticipatory policy beats UCB; given such a bound, the proposed MIN-UCB policy outperforms UCB, with tight regret bounds.

We leverage offline data to facilitate online learning in stochastic multi-armed bandits. The probability distributions that govern the offline data and the online rewards can be different. Without any non-trivial upper bound on their difference, we show that no non-anticipatory policy can outperform the UCB policy of Auer et al. (2002), even in the presence of offline data. In complement, we propose an online policy MIN-UCB, which outperforms UCB when a non-trivial upper bound is given. MIN-UCB adaptively chooses to utilize the offline data when they are deemed informative, and to ignore them otherwise. MIN-UCB is shown to be tight in terms of both instance-independent and instance-dependent regret bounds. Finally, we corroborate the theoretical results with numerical experiments.
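The "use the offline data only when informative" idea can be illustrated with a small sketch. This is not the paper's exact policy: the index form below (the minimum of an online-only UCB and a pooled UCB inflated by the bias bound), the confidence radius, and the names `min_ucb_index`, `run_min_ucb`, and `bias_bound` are assumptions chosen for illustration, with Bernoulli rewards assumed for the simulation.

```python
import math
import random


def min_ucb_index(online_sum, online_n, offline_sum, offline_n, bias_bound, t):
    """Illustrative index: the smaller of an online-only UCB and a pooled UCB."""
    if online_n == 0:
        return math.inf  # force one online pull of every arm first
    log_term = 2.0 * math.log(max(t, 2))
    # UCB1-style index built from online samples only.
    ucb_online = online_sum / online_n + math.sqrt(log_term / online_n)
    # Pooled index: more samples shrink the radius, but the offline share of
    # the pooled mean may be off by up to bias_bound, so that is added back.
    pooled_n = online_n + offline_n
    pooled_mean = (online_sum + offline_sum) / pooled_n
    ucb_pooled = (pooled_mean + math.sqrt(log_term / pooled_n)
                  + (offline_n / pooled_n) * bias_bound)
    # A loose bias bound makes ucb_pooled large, so the offline data are ignored.
    return min(ucb_online, ucb_pooled)


def run_min_ucb(online_means, offline_samples, bias_bounds, horizon, seed=0):
    """Simulate Bernoulli arms; return the number of online pulls per arm."""
    rng = random.Random(seed)
    k = len(online_means)
    sums, pulls = [0.0] * k, [0] * k
    off_sums = [sum(s) for s in offline_samples]
    off_ns = [len(s) for s in offline_samples]
    for t in range(1, horizon + 1):
        indices = [
            min_ucb_index(sums[a], pulls[a], off_sums[a], off_ns[a],
                          bias_bounds[a], t)
            for a in range(k)
        ]
        arm = max(range(k), key=indices.__getitem__)
        reward = 1.0 if rng.random() < online_means[arm] else 0.0
        sums[arm] += reward
        pulls[arm] += 1
    return pulls


if __name__ == "__main__":
    offline = [[0.0, 0.0, 1.0, 0.0, 0.0] * 10, [1.0, 1.0, 1.0, 1.0, 0.0] * 10]
    print(run_min_ucb([0.2, 0.8], offline, [0.0, 0.0], horizon=2000))
```

With a large `bias_bound` the pooled index exceeds the online-only index, so the sketch falls back to ordinary UCB behaviour. With a small bound and plentiful offline samples, the pooled index is tighter and suboptimal arms are discarded sooner.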
