LGMay 14, 2014

Reducing Dueling Bandits to Cardinal Bandits

Nir Ailon, Thorsten Joachims, Zohar Karnin

arXiv:1405.3396v1155 citations

Originality Incremental advance

AI Analysis

This work addresses the problem of learning from ordinal feedback (e.g., user preferences) for researchers and practitioners in online learning and recommendation systems, representing an incremental advance by providing a generic reduction framework.

The paper tackles the Dueling Bandits problem by introducing algorithms that reduce it to the conventional Multi-Armed Bandits problem, enabling the application of existing bandit results to ordinal feedback settings, with empirical results showing that one algorithm outperforms others and previous methods.

We present algorithms for reducing the Dueling Bandits problem to the conventional (stochastic) Multi-Armed Bandits problem. The Dueling Bandits problem is an online model of learning with ordinal feedback of the form "A is preferred to B" (as opposed to cardinal feedback like "A has value 2.5"), giving it wide applicability in learning from implicit user feedback and revealed and stated preferences. In contrast to existing algorithms for the Dueling Bandits problem, our reductions -- named $\Doubler$, $\MultiSbm$ and $\DoubleSbm$ -- provide a generic schema for translating the extensive body of known results about conventional Multi-Armed Bandit algorithms to the Dueling Bandits setting. For $\Doubler$ and $\MultiSbm$ we prove regret upper bounds in both finite and infinite settings, and conjecture about the performance of $\DoubleSbm$ which empirically outperforms the other two as well as previous algorithms in our experiments. In addition, we provide the first almost optimal regret bound in terms of second order terms, such as the differences between the values of the arms.

View on arXiv PDF

Similar