Adversarial Combinatorial Semi-bandits with Graph Feedback
This work addresses a theoretical gap in online learning for combinatorial decision-making with structured feedback, offering incremental improvements by extending existing frameworks to graph-based observations.
The paper tackles the problem of combinatorial semi-bandits with graph feedback, where a learner selects arms and observes rewards of neighboring arms in a graph, establishing an optimal regret bound of $\widetilde{\Theta}(S\sqrt{T}+\sqrt{\alpha ST})$ that interpolates between full information and semi-bandit feedback regimes.
In combinatorial semi-bandits, a learner repeatedly selects from a combinatorial decision set of arms, receives the realized sum of rewards, and observes the rewards of the individual selected arms as feedback. In this paper, we extend this framework to include \emph{graph feedback}, where the learner observes the rewards of all neighboring arms of the selected arms in a feedback graph $G$. We establish that the optimal regret over a time horizon $T$ scales as $\widetilde{\Theta}(S\sqrt{T}+\sqrt{\alpha ST})$, where $S$ is the size of the combinatorial decisions and $\alpha$ is the independence number of $G$. This result interpolates between the known regrets $\widetilde{\Theta}(S\sqrt{T})$ under full information (i.e., $G$ is complete) and $\widetilde{\Theta}(\sqrt{KST})$ under the semi-bandit feedback (i.e., $G$ has only self-loops), where $K$ is the total number of arms. A key technical ingredient is to realize a convexified action using a random decision vector with negative correlations. We also show that online stochastic mirror descent (OSMD) that only realizes convexified actions in expectation is suboptimal. In addition, we describe the problem of \emph{combinatorial semi-bandits with general capacity} and apply our results to derive an improved regret upper bound, which may be of independent interest.
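The interpolation claim can be checked numerically on small graphs. The sketch below is illustrative only and not taken from the paper: it assumes an undirected feedback graph given as an edge list, computes the independence number by brute force, and evaluates the rate $S\sqrt{T}+\sqrt{\alpha ST}$ with constants and logarithmic factors dropped. The helper names `independence_number` and `regret_rate`, and the values of `K`, `S`, `T`, are hypothetical choices made for this example.

```python
import math
from itertools import combinations


def independence_number(num_arms, edges):
    """Brute-force independence number of an undirected graph on arms 0..num_arms-1.

    Self-loops are ignored, since they do not affect independence.
    Exponential in num_arms, so only suitable for tiny illustrative graphs.
    """
    adjacent = {frozenset(e) for e in edges if e[0] != e[1]}
    # Try the largest candidate sets first and return on the first independent one.
    for size in range(num_arms, 0, -1):
        for subset in combinations(range(num_arms), size):
            if all(frozenset(pair) not in adjacent
                   for pair in combinations(subset, 2)):
                return size
    return 0


def regret_rate(S, T, alpha):
    """Rate S*sqrt(T) + sqrt(alpha*S*T), ignoring constants and log factors."""
    return S * math.sqrt(T) + math.sqrt(alpha * S * T)


# Hypothetical problem sizes chosen only for illustration.
K, S, T = 6, 2, 10_000

graphs = {
    "complete (full information)": list(combinations(range(K), 2)),
    "6-cycle (intermediate)": [(i, (i + 1) % K) for i in range(K)],
    "self-loops only (semi-bandit)": [],
}

for name, edges in graphs.items():
    alpha = independence_number(K, edges)
    print(f"{name}: alpha={alpha}, rate={regret_rate(S, T, alpha):.1f}")
```

With a complete graph, $\alpha = 1$ and the second term is dominated by $S\sqrt{T}$ because $S \ge 1$. With only self-loops, $\alpha = K$ and the first term is dominated by $\sqrt{KST}$ because $S \le K$. The rate therefore matches the two known regimes up to a factor of 2.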