LGMay 14, 2014

Reducing Dueling Bandits to Cardinal Bandits

arXiv:1405.3396v1155 citations
Originality Incremental advance
AI Analysis

This work addresses the problem of learning from ordinal feedback (e.g., user preferences) for researchers and practitioners in online learning and recommendation systems, representing an incremental advance by providing a generic reduction framework.

The paper tackles the Dueling Bandits problem by introducing algorithms that reduce it to the conventional Multi-Armed Bandits problem, enabling the application of existing bandit results to ordinal feedback settings, with empirical results showing that one algorithm outperforms others and previous methods.

We present algorithms for reducing the Dueling Bandits problem to the conventional (stochastic) Multi-Armed Bandits problem. The Dueling Bandits problem is an online model of learning with ordinal feedback of the form "A is preferred to B" (as opposed to cardinal feedback like "A has value 2.5"), giving it wide applicability in learning from implicit user feedback and revealed and stated preferences. In contrast to existing algorithms for the Dueling Bandits problem, our reductions -- named $\Doubler$, $\MultiSbm$ and $\DoubleSbm$ -- provide a generic schema for translating the extensive body of known results about conventional Multi-Armed Bandit algorithms to the Dueling Bandits setting. For $\Doubler$ and $\MultiSbm$ we prove regret upper bounds in both finite and infinite settings, and conjecture about the performance of $\DoubleSbm$ which empirically outperforms the other two as well as previous algorithms in our experiments. In addition, we provide the first almost optimal regret bound in terms of second order terms, such as the differences between the values of the arms.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes