Dueling Bandits: From Two-dueling to Multi-dueling
This work addresses the problem of efficient decision-making with subjective feedback on multiple options for applications like recommendation systems, though it is incremental as it builds on existing dueling bandit frameworks.
The paper tackles the multi-dueling bandit problem by extending from two-dueling to multiple options, proposing algorithms like DoublerBAI, MultiSBM-Feedback, and MultiRUCB that achieve O(ln T) regret bounds, with MultiSBM-Feedback reducing the constant factor by almost half compared to benchmarks.
We study a general multi-dueling bandit problem, where an agent compares multiple options simultaneously and aims to minimize the regret due to selecting suboptimal arms. This setting generalizes the traditional two-dueling bandit problem and finds many real-world applications involving subjective feedback on multiple options. We start with the two-dueling bandit setting and propose two efficient algorithms, DoublerBAI and MultiSBM-Feedback. DoublerBAI provides a generic schema for translating known results on best arm identification algorithms to the dueling bandit problem, and achieves a regret bound of $O(\ln T)$. MultiSBM-Feedback not only has an optimal $O(\ln T)$ regret, but also reduces the constant factor by almost a half compared to benchmark results. Then, we consider the general multi-dueling case and develop an efficient algorithm MultiRUCB. Using a novel finite-time regret analysis for the general multi-dueling bandit problem, we show that MultiRUCB also achieves an $O(\ln T)$ regret bound and the bound tightens as the capacity of the comparison set increases. Based on both synthetic and real-world datasets, we empirically demonstrate that our algorithms outperform existing algorithms.