Nearly Optimal Algorithms for Piecewise-Stationary Cascading Bandits
This addresses the challenge of adapting to changing user preferences in web search and online advertising, offering an incremental improvement over existing methods with fewer tuning parameters and better regret dependencies.
The paper tackles the problem of piecewise-stationary cascading bandits, where user preferences change over time, by developing two algorithms (GLRT-CascadeUCB and GLRT-CascadeKL-UCB) that achieve nearly optimal regret bounds of O(√(NLT log T)), with a lower bound of Ω(√(NLT)).
Cascading bandit (CB) is a popular model for web search and online advertising, where an agent aims to learn the $K$ most attractive items out of a ground set of size $L$ during the interaction with a user. However, the stationary CB model may be too simple to apply to real-world problems, where user preferences may change over time. Considering piecewise-stationary environments, two efficient algorithms, \texttt{GLRT-CascadeUCB} and \texttt{GLRT-CascadeKL-UCB}, are developed and shown to ensure regret upper bounds on the order of $\mathcal{O}(\sqrt{NLT\log{T}})$, where $N$ is the number of piecewise-stationary segments, and $T$ is the number of time slots. At the crux of the proposed algorithms is an almost parameter-free change-point detector, the generalized likelihood ratio test (GLRT). Comparing with existing works, the GLRT-based algorithms: i) are free of change-point-dependent information for choosing parameters; ii) have fewer tuning parameters; iii) improve at least the $L$ dependence in regret upper bounds. In addition, we show that the proposed algorithms are optimal (up to a logarithm factor) in terms of regret by deriving a minimax lower bound on the order of $Ω(\sqrt{NLT})$ for piecewise-stationary CB. The efficiency of the proposed algorithms relative to state-of-the-art approaches is validated through numerical experiments on both synthetic and real-world datasets.