ML LGFeb 11, 2018

Nearly Optimal Adaptive Procedure with Change Detection for Piecewise-Stationary Bandit

Yang Cao, Zheng Wen, Branislav Kveton, Yao Xie

arXiv:1802.03692v419.7108 citations

Originality Incremental advance

AI Analysis

This work addresses the challenge of adapting to non-stationary environments in online learning, which is crucial for applications like recommendation systems, but it is incremental as it builds on classic UCB methods with a change-detection component.

The paper tackles the problem of piecewise-stationary multi-armed bandits, where reward distributions change at unknown times, by proposing the M-UCB algorithm that integrates change detection with UCB to achieve a nearly optimal regret bound of O(√MKT log T). Numerical experiments on a Yahoo! dataset show superior performance compared to state-of-the-art algorithms.

Multi-armed bandit (MAB) is a class of online learning problems where a learning agent aims to maximize its expected cumulative reward while repeatedly selecting to pull arms with unknown reward distributions. We consider a scenario where the reward distributions may change in a piecewise-stationary fashion at unknown time steps. We show that by incorporating a simple change-detection component with classic UCB algorithms to detect and adapt to changes, our so-called M-UCB algorithm can achieve nearly optimal regret bound on the order of $O(\sqrt{MKT\log T})$, where $T$ is the number of time steps, $K$ is the number of arms, and $M$ is the number of stationary segments. Comparison with the best available lower bound shows that our M-UCB is nearly optimal in $T$ up to a logarithmic factor. We also compare M-UCB with the state-of-the-art algorithms in numerical experiments using a public Yahoo! dataset to demonstrate its superior performance.

View on arXiv PDF

Similar