LG · GT · ML · Dec 14, 2015

Fighting Bandits with a New Kind of Smoothness

arXiv:1512.04152v1 · 88 citations
Originality Highly original
AI Analysis

This work provides improved regret bounds for bandit algorithms, which is incremental but important for online learning and decision-making applications.

The paper tackles the adversarial multi-armed bandit problem by introducing a new family of algorithms analyzed through convex smoothing. It proves that regularization via Tsallis entropy achieves the Θ(√(TN)) minimax regret, and that perturbation methods whose distributions have a bounded hazard rate achieve a near-optimal regret of O(√(TN log N)).
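For readers unfamiliar with the term, the hazard rate of a distribution can be written as below. This is the standard textbook definition and is assumed here; the symbols $f$, $F$, $h$ and $u$ are illustrative notation. The standard Gumbel distribution is worked out as one instance whose hazard rate is bounded.

```latex
% Hazard rate of a distribution with density f and CDF F (standard definition)
h(x) = \frac{f(x)}{1 - F(x)}

% Standard Gumbel: F(x) = \exp(-e^{-x}), \quad f(x) = e^{-x}\exp(-e^{-x}).
% Substituting u = e^{-x} > 0:
h(x) = \frac{u\,e^{-u}}{1 - e^{-u}} = \frac{u}{e^{u} - 1} \le 1
```

The bound follows from $e^{u} - 1 \ge u$ for all $u > 0$, so the Gumbel hazard rate never exceeds 1.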

We define a novel family of algorithms for the adversarial multi-armed bandit problem, and provide a simple analysis technique based on convex smoothing. We prove two main results. First, we show that regularization via the Tsallis entropy, which includes EXP3 as a special case, achieves the $\Theta(\sqrt{TN})$ minimax regret. Second, we show that a wide class of perturbation methods achieve a near-optimal regret as low as $O(\sqrt{TN \log N})$ if the perturbation distribution has a bounded hazard rate. For example, the Gumbel, Weibull, Fréchet, Pareto, and Gamma distributions all satisfy this key property.
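As a point of reference for the EXP3 special case mentioned in the abstract, here is a minimal sketch of the textbook loss-based EXP3 update (exponential weights over importance-weighted loss estimates). It is not the paper's Tsallis-entropy or perturbation algorithm. The function name `exp3`, the `loss_fn(t, arm)` callback, and the default learning rate are assumptions made for illustration.

```python
import math
import random


def exp3(loss_fn, n_arms, horizon, eta=None, seed=0):
    """Textbook EXP3 sketch; loss_fn(t, arm) must return a loss in [0, 1]."""
    rng = random.Random(seed)
    if eta is None:
        # Assumed default: a common textbook choice of learning rate.
        eta = math.sqrt(2.0 * math.log(n_arms) / (horizon * n_arms))
    est_loss = [0.0] * n_arms  # importance-weighted cumulative loss estimates
    pulls = [0] * n_arms
    total_loss = 0.0
    for t in range(horizon):
        # Subtract the minimum estimate before exponentiating for stability.
        base = min(est_loss)
        weights = [math.exp(-eta * (est - base)) for est in est_loss]
        z = sum(weights)
        probs = [w / z for w in weights]
        arm = rng.choices(range(n_arms), weights=probs)[0]
        loss = loss_fn(t, arm)  # bandit feedback: only the played arm's loss
        est_loss[arm] += loss / probs[arm]  # unbiased estimate of the loss
        pulls[arm] += 1
        total_loss += loss
    return pulls, total_loss


# Toy environment (hypothetical): arm 0 always has loss 0, the others loss 1.
pulls, total = exp3(lambda t, arm: 0.0 if arm == 0 else 1.0,
                    n_arms=3, horizon=2000)
```

In this toy run the sampling distribution concentrates on arm 0 as the estimated losses of the other arms grow. Only the played arm's estimate is updated each round, which is what distinguishes the bandit setting from full-information exponential weights.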
