LGGTMLOct 21, 2019

Decentralized Heterogeneous Multi-Player Multi-Armed Bandits with Non-Zero Rewards on Collisions

arXiv:1910.09089v418 citations
Originality Incremental advance
AI Analysis

This addresses coordination challenges in distributed systems like wireless networks, but it is incremental as it builds on prior bandit models with specific modifications.

The paper tackles the problem of decentralized multi-player multi-armed bandits with heterogeneous rewards and non-zero rewards on collisions, presenting a policy that achieves near order-optimal expected regret of O(log^{1+δ} T) over time T.

We consider a fully decentralized multi-player stochastic multi-armed bandit setting where the players cannot communicate with each other and can observe only their own actions and rewards. The environment may appear differently to different players, $\textit{i.e.}$, the reward distributions for a given arm are heterogeneous across players. In the case of a collision (when more than one player plays the same arm), we allow for the colliding players to receive non-zero rewards. The time-horizon $T$ for which the arms are played is \emph{not} known to the players. Within this setup, where the number of players is allowed to be greater than the number of arms, we present a policy that achieves near order-optimal expected regret of order $O(\log^{1 + δ} T)$ for some $0 < δ< 1$ over a time-horizon of duration $T$. This paper is accepted at IEEE Transactions on Information Theory.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes