Regret Analysis for Continuous Dueling Bandit
This work addresses regret minimization for decision-making under limited pairwise feedback, which is incremental as it extends dueling bandits to continuous settings with theoretical guarantees.
The paper tackles the dueling bandit problem in a continuous action space by proposing a stochastic mirror descent algorithm, achieving an O(√(T log T)) regret bound under strong convexity and smoothness assumptions, and shows equivalence to convex optimization with near-optimal rates.
The dueling bandit is a learning framework wherein the feedback information in the learning process is restricted to a noisy comparison between a pair of actions. In this research, we address a dueling bandit problem based on a cost function over a continuous space. We propose a stochastic mirror descent algorithm and show that the algorithm achieves an $O(\sqrt{T\log T})$-regret bound under strong convexity and smoothness assumptions for the cost function. Subsequently, we clarify the equivalence between regret minimization in dueling bandit and convex optimization for the cost function. Moreover, when considering a lower bound in convex optimization, our algorithm is shown to achieve the optimal convergence rate in convex optimization and the optimal regret in dueling bandit except for a logarithmic factor.