LGMLMay 31, 2022

Online Meta-Learning in Adversarial Multi-Armed Bandits

arXiv:2205.15921v23 citationsh-index: 39
Originality Incremental advance
AI Analysis

This work addresses the challenge of improving regret in sequential decision-making for adversarial environments, though it appears incremental as it builds on existing meta-learning and bandit frameworks.

The paper tackles the problem of meta-learning in adversarial multi-armed bandits by developing an algorithm that leverages non-uniformity in the distribution of best arms across episodes, achieving improved regret bounds compared to non-meta-learning approaches.

We study meta-learning for adversarial multi-armed bandits. We consider the online-within-online setup, in which a player (learner) encounters a sequence of multi-armed bandit episodes. The player's performance is measured as regret against the best arm in each episode, according to the losses generated by an adversary. The difficulty of the problem depends on the empirical distribution of the per-episode best arm chosen by the adversary. We present an algorithm that can leverage the non-uniformity in this empirical distribution, and derive problem-dependent regret bounds. This solution comprises an inner learner that plays each episode separately, and an outer learner that updates the hyper-parameters of the inner algorithm between the episodes. In the case where the best arm distribution is far from uniform, it improves upon the best bound that can be achieved by any online algorithm executed on each episode individually without meta-learning.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes