Batched Stochastic Matching Bandits
This work addresses computational efficiency in matching bandits for applications like online advertising or job markets, but it is incremental as it builds on existing bandit and MNL models.
The paper tackles the problem of minimizing regret in a stochastic matching bandit framework with NP-hard combinatorial optimization, proposing batched algorithms that reduce amortized computational cost to O(1) per round while achieving a regret bound of O~(√T).
In this study, we introduce a novel bandit framework for stochastic matching based on the Multi-nomial Logit (MNL) choice model. In our setting, $N$ agents on one side are assigned to $K$ arms on the other side, where each arm stochastically selects an agent from its assigned pool according to an unknown preference and yields a corresponding reward. The objective is to minimize regret by maximizing the cumulative revenue from successful matches across all agents. This task requires solving a combinatorial optimization problem based on estimated preferences, which is NP-hard and leads a naive approach to incur a computational cost of $O(K^N)$ per round. To address this challenge, we propose batched algorithms that limit the frequency of matching updates, thereby reducing the amortized computational cost (i.e., the average cost per round) to $O(1)$ while still achieving a regret bound of $\tilde{O}(\sqrt{T})$.