LGJun 19, 2025

On the optimal regret of collaborative personalized linear bandits

arXiv:2506.15943v1h-index: 50
Originality Highly original
AI Analysis

This work provides a theoretical characterization of optimal regret for multi-agent personalized linear bandits, which is significant for researchers and practitioners designing collaborative learning systems.

This paper investigates the optimal regret in collaborative personalized linear bandits, where multiple agents solve heterogeneous bandit problems. It provides an information-theoretic lower bound and proposes a two-stage collaborative algorithm that achieves this optimal regret, demonstrating improved regret bounds like \tilde{O}(d\sqrt{mn}) compared to \tilde{O}(dm\sqrt{n}) for non-collaborative agents under certain conditions.

Stochastic linear bandits are a fundamental model for sequential decision making, where an agent selects a vector-valued action and receives a noisy reward with expected value given by an unknown linear function. Although well studied in the single-agent setting, many real-world scenarios involve multiple agents solving heterogeneous bandit problems, each with a different unknown parameter. Applying single agent algorithms independently ignores cross-agent similarity and learning opportunities. This paper investigates the optimal regret achievable in collaborative personalized linear bandits. We provide an information-theoretic lower bound that characterizes how the number of agents, the interaction rounds, and the degree of heterogeneity jointly affect regret. We then propose a new two-stage collaborative algorithm that achieves the optimal regret. Our analysis models heterogeneity via a hierarchical Bayesian framework and introduces a novel information-theoretic technique for bounding regret. Our results offer a complete characterization of when and how collaboration helps with a optimal regret bound $\tilde{O}(d\sqrt{mn})$, $\tilde{O}(dm^{1-γ}\sqrt{n})$, $\tilde{O}(dm\sqrt{n})$ for the number of rounds $n$ in the range of $(0, \frac{d}{m σ^2})$, $[\frac{d}{m^{2γ} σ^2}, \frac{d}{σ^2}]$ and $(\frac{d}{σ^2}, \infty)$ respectively, where $σ$ measures the level of heterogeneity, $m$ is the number of agents, and $γ\in[0, 1/2]$ is an absolute constant. In contrast, agents without collaboration achieve a regret bound $O(dm\sqrt{n})$ at best.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes