LG MLJun 1

Tree-Guided Identify-Then-Exploit: A Unified Framework of Best Arm Identification and Regret Minimization for Dueling Bandits

arXiv:2606.0179923.1

AI Analysis

For researchers in online learning and bandit theory, this work provides a unified algorithm that bridges BAI and regret minimization in dueling bandits, showing the trade-off is benign.

The paper proposes TG-ITE, the first unified framework for dueling bandits that handles best-arm identification, weak regret, and strong regret simultaneously, achieving O(N) sample complexity for BAI and weak regret, and O(N log T) for strong regret, eliminating a sub-optimal O(log N) gap in prior work.

We study $N$-armed stochastic dueling bandits under the Condorcet-winner assumption, where three widely adopted objectives are considered: best-arm identification (BAI), weak regret, and strong regret. We propose Tree-Guided Identify-Then-Exploit (TG-ITE), the first unified framework to tackle all these objectives to our knowledge. Without requiring stronger assumptions, we propose a shared tree-guided identification approach to find a high-confidence incumbent within $O(N)$ comparisons. We further propose varied exploitation strategies to utilize this warm-start stage to optimize the specific objectives at hand. This methodology enables our approach to (1) achieve $O(N)$ sample complexity in BAI without commonly adopted stronger assumptions; (2) build the first winner-stays-style algorithm to achieve $O(N)$ weak regret; (3) enjoy the same $O(N \log T)$ guarantee as specialized strong-regret approaches; (4) realize the joint optimization of BAI and weak regret with $O(N)$ guarantees for both, eliminating the sub-optimal gap of $O(\log N)$ in the existing approach. Our results provide evidence that the trade-off between BAI and regret minimization is relatively benign in dueling bandits.

View on arXiv PDF

Similar