Minimax Regret for Cascading Bandits
This work addresses the problem of optimizing ranking in online systems with click feedback for practitioners, offering incremental improvements in regret bounds and algorithm design.
The authors tackled the problem of learning to rank from click feedback in cascading bandits, proving matching upper and lower bounds for gap-free regret that strictly improve prior results, with key technical insights showing variance-aware algorithms lead to optimal regret unlike in standard bandits. They also proposed a variance-aware algorithm for linear rewards that strictly improves state-of-the-art regret.
Cascading bandits is a natural and popular model that frames the task of learning to rank from Bernoulli click feedback in a bandit setting. For the case of unstructured rewards, we prove matching upper and lower bounds for the problem-independent (i.e., gap-free) regret, both of which strictly improve the best known. A key observation is that the hard instances of this problem are those with small mean rewards, i.e., the small click-through rates that are most relevant in practice. Based on this, and the fact that small mean implies small variance for Bernoullis, our key technical result shows that variance-aware confidence sets derived from the Bernstein and Chernoff bounds lead to optimal algorithms (up to log terms), whereas Hoeffding-based algorithms suffer order-wise suboptimal regret. This sharply contrasts with the standard (non-cascading) bandit setting, where the variance-aware algorithms only improve constants. In light of this and as an additional contribution, we propose a variance-aware algorithm for the structured case of linear rewards and show its regret strictly improves the state-of-the-art.