ML LGMay 8, 2018

Profitable Bandits

Mastane Achab, Stephan Clémençon, Aurélien Garivier

arXiv:1805.02908v12.74 citations

Originality Incremental advance

AI Analysis

This work addresses a novel problem in sequential decision-making, potentially useful for applications like default risk management, but it is incremental as it adapts existing bandit strategies to a new setting.

The paper tackles the profitable bandit problem, where an agent selects subsets of actions to maximize cumulative earnings from random rewards, and proves finite-time regret bounds for kl-UCB, Bayes-UCB, and Thompson Sampling strategies, establishing asymptotic optimality with Thompson Sampling showing a slight empirical advantage.

Originally motivated by default risk management applications, this paper investigates a novel problem, referred to as the profitable bandit problem here. At each step, an agent chooses a subset of the K possible actions. For each action chosen, she then receives the sum of a random number of rewards. Her objective is to maximize her cumulated earnings. We adapt and study three well-known strategies in this purpose, that were proved to be most efficient in other settings: kl-UCB, Bayes-UCB and Thompson Sampling. For each of them, we prove a finite time regret bound which, together with a lower bound we obtain as well, establishes asymptotic optimality. Our goal is also to compare these three strategies from a theoretical and empirical perspective both at the same time. We give simple, self-contained proofs that emphasize their similarities, as well as their differences. While both Bayesian strategies are automatically adapted to the geometry of information, the numerical experiments carried out show a slight advantage for Thompson Sampling in practice.

View on arXiv PDF

Similar