Multi-Armed Bandits with Metric Movement Costs
This addresses the problem of optimizing decisions under switching costs for researchers in online learning and bandit theory, offering foundational theoretical insights with some incremental generalizations.
The paper tackles the non-stochastic Multi-Armed Bandit problem with metric switching costs, providing a tight characterization of minimax regret in terms of a complexity measure based on covering numbers. It presents efficient algorithms achieving regret bounds like $\widetilde{O}(\max\{\mathcal{C}^{1/3}T^{2/3},\sqrt{kT}\})$ for finite metrics and $\widetilde{\Theta}(T^{rac{d+1}{d+2}})$ for infinite metrics, showing these are optimal.
We consider the non-stochastic Multi-Armed Bandit problem in a setting where there is a fixed and known metric on the action space that determines a cost for switching between any pair of actions. The loss of the online learner has two components: the first is the usual loss of the selected actions, and the second is an additional loss due to switching between actions. Our main contribution gives a tight characterization of the expected minimax regret in this setting, in terms of a complexity measure $\mathcal{C}$ of the underlying metric which depends on its covering numbers. In finite metric spaces with $k$ actions, we give an efficient algorithm that achieves regret of the form $\widetilde{O}(\max\{\mathcal{C}^{1/3}T^{2/3},\sqrt{kT}\})$, and show that this is the best possible. Our regret bound generalizes previous known regret bounds for some special cases: (i) the unit-switching cost regret $\widetildeΘ(\max\{k^{1/3}T^{2/3},\sqrt{kT}\})$ where $\mathcal{C}=Θ(k)$, and (ii) the interval metric with regret $\widetildeΘ(\max\{T^{2/3},\sqrt{kT}\})$ where $\mathcal{C}=Θ(1)$. For infinite metrics spaces with Lipschitz loss functions, we derive a tight regret bound of $\widetildeΘ(T^{\frac{d+1}{d+2}})$ where $d \ge 1$ is the Minkowski dimension of the space, which is known to be tight even when there are no switching costs.