LGMay 13, 2013

An efficient algorithm for learning with semi-bandit feedback

arXiv:1305.2732v184 citations

Originality Incremental advance

AI Analysis

This addresses the problem of efficient learning in combinatorial decision sets for researchers in online optimization, offering a novel method that is implementable where offline optimization is feasible, though it is incremental in improving regret bounds.

The paper tackles online combinatorial optimization under semi-bandit feedback by proposing an algorithm combining Follow-the-Perturbed-Leader with Geometric Resampling, achieving an expected regret of O(m sqrt(dT log d)) after T rounds, with an improvement to O(m^(3/2) sqrt(T log d)) for full information settings.

We consider the problem of online combinatorial optimization under semi-bandit feedback. The goal of the learner is to sequentially select its actions from a combinatorial decision set so as to minimize its cumulative loss. We propose a learning algorithm for this problem based on combining the Follow-the-Perturbed-Leader (FPL) prediction method with a novel loss estimation procedure called Geometric Resampling (GR). Contrary to previous solutions, the resulting algorithm can be efficiently implemented for any decision set where efficient offline combinatorial optimization is possible at all. Assuming that the elements of the decision set can be described with d-dimensional binary vectors with at most m non-zero entries, we show that the expected regret of our algorithm after T rounds is O(m sqrt(dT log d)). As a side result, we also improve the best known regret bounds for FPL in the full information setting to O(m^(3/2) sqrt(T log d)), gaining a factor of sqrt(d/m) over previous bounds for this algorithm.

View on arXiv PDF

Similar