Multi-Agent Multi-Armed Bandits with Limited Communication
This work addresses efficient collaboration in distributed learning, with applications such as sensor networks and robotics. The contribution is incremental: it builds on existing bandit and communication-limited frameworks.
The paper tackles the problem of multiple agents collaboratively minimizing cumulative regret in a stochastic multi-armed bandit setting with limited communication, proposing algorithms LCC-UCB and LCC-UCB-GRAPH that achieve regret bounds of $\tilde{O}\left(\sqrt{({K/N}+ N)T}\right)$ and $\tilde{O}\left(D\sqrt{(K/N+ K_G)DT}\right)$, respectively, with $O(\log T)$ communication steps and $O(\log K)$ bits per step.
We consider the problem where $N$ agents collaboratively interact with an instance of a stochastic $K$-armed bandit problem for $K \gg N$. The agents aim to simultaneously minimize three quantities: the cumulative regret over all the agents for a total of $T$ time steps, the number of communication rounds, and the number of bits in each communication round. We present Limited Communication Collaboration - Upper Confidence Bound (LCC-UCB), a doubling-epoch based algorithm in which each agent communicates only at the end of an epoch and shares the index of the best arm it knows. With LCC-UCB, each agent enjoys a regret of $\tilde{O}\left(\sqrt{({K/N}+ N)T}\right)$, communicates for $O(\log T)$ steps, and broadcasts $O(\log K)$ bits in each communication step. We extend the work to sparse graphs with maximum degree $K_G$ and diameter $D$, and propose LCC-UCB-GRAPH, which enjoys a regret bound of $\tilde{O}\left(D\sqrt{(K/N+ K_G)DT}\right)$. Finally, we empirically show that the LCC-UCB and LCC-UCB-GRAPH algorithms perform well and outperform strategies that communicate through a central node.
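The doubling-epoch idea described above can be illustrated with a small simulation. The following is a minimal sketch under several assumptions that are not stated in the abstract: arms are split round-robin so that each agent owns about $K/N$ of them, rewards are Bernoulli, each agent restarts a standard UCB1 index at the start of every epoch, and the "best arm" an agent broadcasts is the one with the highest empirical mean in that epoch. The names `ucb_epoch` and `lcc_ucb_sketch` are hypothetical, and the paper's actual LCC-UCB may differ in all of these details.

```python
import math
import random


def ucb_epoch(arms, means, length, rng):
    """Run UCB1 from scratch on `arms` for `length` pulls.

    Returns (arm with the best empirical mean, expected regret incurred).
    """
    counts = {a: 0 for a in arms}
    sums = {a: 0.0 for a in arms}
    best_mean = max(means)
    regret = 0.0
    for t in range(1, length + 1):
        untried = [a for a in arms if counts[a] == 0]
        if untried:
            arm = untried[0]  # pull every active arm once first
        else:
            arm = max(
                arms,
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0  # Bernoulli reward
        counts[arm] += 1
        sums[arm] += reward
        regret += best_mean - means[arm]
    pulled = [a for a in arms if counts[a] > 0]
    best = max(pulled, key=lambda a: sums[a] / counts[a])
    return best, regret


def lcc_ucb_sketch(means, n_agents, horizon, seed=0):
    """Simulate N agents on a fully connected network with doubling epochs.

    Returns (per-agent regret list, number of communication rounds).
    """
    rng = random.Random(seed)
    k = len(means)
    # Assumption: round-robin partition, so each agent owns about K/N arms.
    own = [list(range(i, k, n_agents)) for i in range(n_agents)]
    shared = set()  # arm indices broadcast at the end of the previous epoch
    regret = [0.0] * n_agents
    rounds = 0
    t, length = 0, 1
    while t < horizon:
        length = min(length, horizon - t)  # truncate the final epoch
        recommendations = []
        for i in range(n_agents):
            # Active set: own arms plus at most N recommended arms.
            active = sorted(set(own[i]) | shared)
            best, r = ucb_epoch(active, means, length, rng)
            regret[i] += r
            recommendations.append(best)
        # One communication round: each agent broadcasts a single arm index.
        shared = set(recommendations)
        rounds += 1
        t += length
        length *= 2  # doubling epochs give O(log T) rounds
    return regret, rounds


def bits_per_message(k):
    """Bits needed to encode one arm index out of k arms."""
    return max(1, math.ceil(math.log2(k)))
```

In this sketch each agent's active set never exceeds roughly $K/N + N$ arms, which is where the $(K/N + N)$ term in the regret bound comes from intuitively. With `horizon=1000` the epochs have lengths 1, 2, 4, ..., 256 and a truncated final epoch of 489, so `lcc_ucb_sketch` performs 10 communication rounds, and with 20 arms `bits_per_message(20)` evaluates to 5.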