LG MLFeb 2, 2024

Multi-Armed Bandits with Interference

arXiv:2402.01845v213.49 citationsh-index: 40

Originality Highly original

AI Analysis

This addresses the challenge of cumulative performance in online platforms with interference, offering a novel solution for scenarios like A/B testing with spatial dependencies.

The paper tackles the problem of experimentation with interference in multi-armed bandits, where rewards depend on treatments across all units, and shows that a cluster randomization policy achieves optimal expected regret with high probability bounds that improve with the number of units.

Experimentation with interference poses a significant challenge in contemporary online platforms. Prior research on experimentation with interference has concentrated on the final output of a policy. The cumulative performance, while equally crucial, is less well understood. To address this gap, we introduce the problem of {\em Multi-armed Bandits with Interference} (MABI), where the learner assigns an arm to each of $N$ experimental units over a time horizon of $T$ rounds. The reward of each unit in each round depends on the treatments of {\em all} units, where the influence of a unit decays in the spatial distance between units. Furthermore, we employ a general setup wherein the reward functions are chosen by an adversary and may vary arbitrarily across rounds and units. We first show that switchback policies achieve an optimal {\em expected} regret $\tilde O(\sqrt T)$ against the best fixed-arm policy. Nonetheless, the regret (as a random variable) for any switchback policy suffers a high variance, as it does not account for $N$. We propose a cluster randomization policy whose regret (i) is optimal in {\em expectation} and (ii) admits a high probability bound that vanishes in $N$.

View on arXiv PDF

Similar