LG MLJun 18, 2019

Simple Algorithms for Dueling Bandits

arXiv:1906.07611v15.46 citations

Originality Incremental advance

AI Analysis

This addresses the problem of efficient decision-making with binary feedback in comparison-based settings, such as ranking or recommendation systems, and is incremental as it builds on existing Dueling Bandits methods.

The paper tackles the Dueling Bandits problem by proposing simple algorithms with regret bounds of order O(T^rho) where 1/2 <= rho <= 3/4, independent of the preference gap Delta, and shows that in some cases, these algorithms exceed state-of-the-art performance in experiments on synthetic data.

In this paper, we present simple algorithms for Dueling Bandits. We prove that the algorithms have regret bounds for time horizon T of order O(T^rho ) with 1/2 <= rho <= 3/4, which importantly do not depend on any preference gap between actions, Delta. Dueling Bandits is an important extension of the Multi-Armed Bandit problem, in which the algorithm must select two actions at a time and only receives binary feedback for the duel outcome. This is analogous to comparisons in which the rater can only provide yes/no or better/worse type responses. We compare our simple algorithms to the current state-of-the-art for Dueling Bandits, ISS and DTS, discussing complexity and regret upper bounds, and conducting experiments on synthetic data that demonstrate their regret performance, which in some cases exceeds state-of-the-art.

View on arXiv PDF

Similar