Efficient Pure Exploration for Combinatorial Bandits with Semi-Bandit Feedback
This work addresses the challenge of efficiently identifying optimal actions in structured decision-making problems, such as matroids or graph paths, which is incremental but provides practical computational improvements.
The paper tackles the pure-exploration problem in combinatorial bandits with semi-bandit feedback, developing a meta-algorithm that yields asymptotically optimal algorithms with finite-time guarantees and introducing the first computationally efficient, asymptotically optimal algorithm for best-arm identification in this setting.
Combinatorial bandits with semi-bandit feedback generalize multi-armed bandits, where the agent chooses sets of arms and observes a noisy reward for each arm contained in the chosen set. The action set satisfies a given structure such as forming a base of a matroid or a path in a graph. We focus on the pure-exploration problem of identifying the best arm with fixed confidence, as well as a more general setting, where the structure of the answer set differs from the one of the action set. Using the recently popularized game framework, we interpret this problem as a sequential zero-sum game and develop a CombGame meta-algorithm whose instances are asymptotically optimal algorithms with finite time guarantees. In addition to comparing two families of learners to instantiate our meta-algorithm, the main contribution of our work is a specific oracle efficient instance for best-arm identification with combinatorial actions. Based on a projection-free online learning algorithm for convex polytopes, it is the first computationally efficient algorithm which is asymptotically optimal and has competitive empirical performance.