LG MLMay 7

Bandit Learning in General Open Multi-agent Systems

arXiv:2605.0620225.6

AI Analysis

This work addresses the fundamental challenge of bandit learning in open systems without restrictive structural assumptions, providing a general framework and tight theoretical guarantees for practitioners dealing with dynamic agent populations.

The paper formulates a unified bandit learning problem for general open multi-agent systems where agents can arrive and depart over time, introducing concepts like pre-training degree and stability. It develops global-UCB algorithms with provable regret bounds that show entry uncertainty enters linearly via pre-training degree, and in stable regimes regret is governed by the time to identify a persistent optimal arm and agent patterns, with matching lower bounds.

Recent developments in digital platforms have highlighted the prevalence of open systems, where agents can arrive and depart over time. While bandit learning in open systems has recently received initial attention, existing work imposes structural assumptions that are frequently violated in practice. A learning paradigm for general open systems creates fresh challenges: newly arriving agents induce endogenous non-stationarity; agent patterns determine how quickly information accumulates; and new agents make regret scale further with the time horizon. To this end, we formulate a unified open-system bandit problem with general dynamics, including heterogeneous rewards and general agent patterns. We introduce new concepts to capture the inherent complexities: the \emph{pre-training degree} of new agents quantifies how much information an agent carries upon entry, \emph{stability} measures the impact of new agents on the system, and \emph{global dynamic regret} compares the cumulative expected reward of all active agents with that of the varying optimal arms. We develop certified global-UCB learning methodologies with provable guarantees. Our regret bounds reveal that entry uncertainty enters linearly via the pre-training degree, while in stable regimes, regret is governed by the time needed to identify a persistent optimal arm, as well as by the agent patterns. We further show that these dependencies are tight via lower bounds in hard instances.

View on arXiv PDF

Similar