Simple Opinion Dynamics for No-Regret Learning
This work addresses efficient decentralized learning for multi-agent systems, offering a novel approach with best-of-both-worlds guarantees, though it is incremental in building on existing opinion dynamics and GOSSIP models.
The paper tackles the problem of cooperative multi-agent bandit learning in a distributed GOSSIP model, introducing simple memoryless protocols that achieve constant cumulative regret scaling like O~(1/T) for stationary rewards and sublinear regret for adversarial rewards, while also reaching consensus on the best action within O~(√n) rounds.
We study a cooperative multi-agent bandit setting in the distributed GOSSIP model: in every round, each of $n$ agents chooses an action from a common set, observes the action's corresponding reward, and subsequently exchanges information with a single randomly chosen neighbor, which may inform its choice in the next round. We introduce and analyze families of memoryless and time-independent protocols for this setting, inspired by opinion dynamics that are well-studied for other algorithmic tasks in the GOSSIP model. For stationary reward settings, we prove for the first time that these simple protocols exhibit best-of-both-worlds behavior, simultaneously obtaining constant cumulative regret scaling like $R(T)/T = \widetilde O(1/T)$, and also reaching consensus on the highest-mean action within $\widetilde O(\sqrt{n})$ rounds. We obtain these results by showing a new connection between the global evolution of these decentralized protocols and a class of zero-sum multiplicative weights update} processes. Using this connection, we establish a general framework for analyzing the population-level regret and other properties of our protocols. Finally, we show our protocols are also surprisingly robust to adversarial rewards, and in this regime we obtain sublinear regret scaling like $R(T)/T = \widetilde O(1/\sqrt{T})$ as long as the number of rounds does not grow too fast as a function of $n$.