Cascading Non-Stationary Bandits: Online Learning to Rank in the Non-Stationary Cascade Model
This addresses the challenge of adapting ranking systems in dynamic settings like web search, though it is incremental as it builds on existing cascading bandit models.
The paper tackles the problem of online learning to rank in non-stationary environments where user preferences change abruptly, proposing two algorithms (CascadeDUCB and CascadeSWUCB) that achieve regret bounds matching a derived lower bound up to a logarithmic factor.
Non-stationarity appears in many online applications such as web search and advertising. In this paper, we study the online learning to rank problem in a non-stationary environment where user preferences change abruptly at an unknown moment in time. We consider the problem of identifying the K most attractive items and propose cascading non-stationary bandits, an online learning variant of the cascading model, where a user browses a ranked list from top to bottom and clicks on the first attractive item. We propose two algorithms for solving this non-stationary problem: CascadeDUCB and CascadeSWUCB. We analyze their performance and derive gap-dependent upper bounds on the n-step regret of these algorithms. We also establish a lower bound on the regret for cascading non-stationary bandits and show that both algorithms match the lower bound up to a logarithmic factor. Finally, we evaluate their performance on a real-world web search click dataset.