LG MLFeb 14, 2022

Versatile Dueling Bandits: Best-of-both-World Analyses for Online Learning from Preferences

arXiv:2202.06694v17.89 citations

Originality Highly original

AI Analysis

This solves a long-standing problem in online learning from preferences, providing a unified framework for dueling bandits that is optimal across different environments, which is significant for researchers and practitioners in machine learning and decision-making systems.

The paper tackles the problem of K-armed dueling bandits in both stochastic and adversarial environments by proposing a novel reduction to multi-armed bandits, achieving the first best-of-both-world result with optimal regret bounds, including an O(∑_{i=1}^K log T/Δ_i) bound against the Condorcet-winner benchmark and improved performance under adversarial corruptions.

We study the problem of $K$-armed dueling bandit for both stochastic and adversarial environments, where the goal of the learner is to aggregate information through relative preferences of pair of decisions points queried in an online sequential manner. We first propose a novel reduction from any (general) dueling bandits to multi-armed bandits and despite the simplicity, it allows us to improve many existing results in dueling bandits. In particular, \emph{we give the first best-of-both world result for the dueling bandits regret minimization problem} -- a unified framework that is guaranteed to perform optimally for both stochastic and adversarial preferences simultaneously. Moreover, our algorithm is also the first to achieve an optimal $O(\sum_{i = 1}^K \frac{\log T}{Δ_i})$ regret bound against the Condorcet-winner benchmark, which scales optimally both in terms of the arm-size $K$ and the instance-specific suboptimality gaps $\{Δ_i\}_{i = 1}^K$. This resolves the long-standing problem of designing an instancewise gap-dependent order optimal regret algorithm for dueling bandits (with matching lower bounds up to small constant factors). We further justify the robustness of our proposed algorithm by proving its optimal regret rate under adversarially corrupted preferences -- this outperforms the existing state-of-the-art corrupted dueling results by a large margin. In summary, we believe our reduction idea will find a broader scope in solving a diverse class of dueling bandits setting, which are otherwise studied separately from multi-armed bandits with often more complex solutions and worse guarantees. The efficacy of our proposed algorithms is empirically corroborated against the existing dueling bandit methods.

View on arXiv PDF

Similar