Adversarial Multi-dueling Bandits
This work fills a gap in bandit theory by extending adversarial preferences from dueling bandits to the multi-dueling setting. The extension is incremental, but it comes with theoretical guarantees for learning from feedback generated by a pairwise-subset choice model.
The paper studies regret minimization in adversarial multi-dueling bandits, where the learner selects multiple arms per round and observes the identity of the most preferred arm, determined by an arbitrary preference matrix. It introduces the MiDEX algorithm and proves an expected cumulative regret upper bound of O((K log K)^{1/3} T^{2/3}) together with a lower bound of Ω(K^{1/3} T^{2/3}). The two bounds differ only by a (log K)^{1/3} factor, so MiDEX is near-optimal.
We introduce the problem of regret minimization in adversarial multi-dueling bandits. While adversarial preferences have been studied in dueling bandits, they have not been explored in multi-dueling bandits. In this setting, the learner is required to select $m \geq 2$ arms at each round and observes as feedback the identity of the most preferred arm, which is based on an arbitrary preference matrix chosen obliviously. We introduce a novel algorithm, MiDEX (Multi Dueling EXP3), to learn from such preference feedback, which is assumed to be generated from a pairwise-subset choice model. We prove that the expected cumulative $T$-round regret of MiDEX compared to a Borda-winner from a set of $K$ arms is upper bounded by $O((K \log K)^{1/3} T^{2/3})$. Moreover, we prove a lower bound of $\Omega(K^{1/3} T^{2/3})$ for the expected regret in this setting, which demonstrates that our proposed algorithm is near-optimal.
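To make the comparator concrete, here is a minimal sketch of how a Borda winner could be computed from a sequence of preference matrices. It assumes the usual Borda-score definition from the dueling-bandit literature (the average probability that an arm is preferred to each other arm), which the abstract does not spell out. The function names `borda_scores` and `hindsight_borda_winner` and the example matrices are hypothetical and not taken from the paper; this is not an implementation of MiDEX.

```python
def borda_scores(P):
    """Borda score of each arm for one preference matrix.

    Assumption: P[i][j] is the probability that arm i is preferred to arm j,
    and the Borda score of arm i is the mean of P[i][j] over all j != i.
    """
    K = len(P)
    return [
        sum(P[i][j] for j in range(K) if j != i) / (K - 1)
        for i in range(K)
    ]


def hindsight_borda_winner(matrices):
    """Arm with the largest Borda score summed over all rounds.

    Assumption: in the adversarial setting the preference matrix may change
    every round, so the comparator is the best single arm in hindsight.
    """
    K = len(matrices[0])
    totals = [0.0] * K
    for P in matrices:
        for i, score in enumerate(borda_scores(P)):
            totals[i] += score
    best = max(range(K), key=lambda i: totals[i])
    return best, totals


# Hypothetical preference matrices for K = 3 arms over two rounds.
P1 = [
    [0.5, 0.6, 0.9],
    [0.4, 0.5, 0.7],
    [0.1, 0.3, 0.5],
]
P2 = [
    [0.5, 0.2, 0.4],
    [0.8, 0.5, 0.9],
    [0.6, 0.1, 0.5],
]

winner_round_1, _ = hindsight_borda_winner([P1])
winner_overall, totals = hindsight_borda_winner([P1, P2])
```

In this example arm 0 has the highest Borda score in the first round, while arm 1 has the highest cumulative score over both rounds. This shows why, under these assumptions, the comparator is a fixed arm chosen in hindsight rather than the per-round best arm.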