LG · Mar 17

A Practical Algorithm for Feature-Rich, Non-Stationary Bandit Problems

Wei Min Loh, Sajib Kumer Sinha, Ankur Agarwal, Pascal Poupart

arXiv:2603.16755 · 14.4 · h-index: 2

AI Analysis

This work addresses a more realistic bandit setting for applications such as recommendation systems, though it appears incremental because it combines existing techniques.

The paper tackles feature-rich, non-stationary bandits by introducing conditionally coupled contextual (C3) Thompson sampling, which achieves 5.7% lower average cumulative regret than the next best algorithm on four OpenML datasets and a 12.4% click lift on the Microsoft News Dataset (MIND) compared with other algorithms.

Contextual bandits are incredibly useful in many practical problems. We go one step further by devising a more realistic problem that combines (1) contextual bandits with dense arm features, (2) non-linear reward functions, and (3) a generalization of correlated bandits in which reward distributions change over time but the degree of correlation is maintained. This formulation lends itself to a wider set of applications, such as recommendation tasks. To solve this problem, we introduce conditionally coupled contextual (C3) Thompson sampling for Bernoulli bandits. It combines an improved Nadaraya-Watson estimator on an embedding space with Thompson sampling, which allows online learning without retraining. Empirical results show that C3 achieves 5.7% lower average cumulative regret than the next best algorithm on four OpenML tabular datasets and a 12.4% click lift on the Microsoft News Dataset (MIND) compared with other algorithms.
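The abstract does not spell out how the estimator and the sampler fit together. One minimal way to picture the idea, under our own assumptions, is a Beta-Bernoulli Thompson sampler whose pseudo-counts are kernel-weighted sums over past (embedding, reward) pairs, so each arm's posterior mean approaches a Nadaraya-Watson estimate as evidence accumulates. This is a hypothetical sketch, not the paper's C3 algorithm or its "improved" estimator: the class `KernelThompsonBernoulli`, the Gaussian kernel and its `bandwidth`, the exponential forgetting factor `discount` (our stand-in for handling drift), the concatenated context/arm embedding, and the toy `simulate` environment (whose shared sinusoidal drift moves every arm's click probability together) are all illustrative choices.

```python
import numpy as np


class KernelThompsonBernoulli:
    """Hypothetical sketch, not the paper's C3: Beta-Bernoulli Thompson sampling
    whose pseudo-counts are Nadaraya-Watson style kernel-weighted sums."""

    def __init__(self, bandwidth=0.7, discount=0.999, prior_a=1.0, prior_b=1.0, seed=0):
        self.bandwidth = bandwidth  # Gaussian kernel width (assumed)
        self.discount = discount    # exponential forgetting for drift (assumed)
        self.prior_a, self.prior_b = prior_a, prior_b
        self.rng = np.random.default_rng(seed)  # policy randomness only
        self.X, self.y, self.times = [], [], []
        self.t = 0

    def pseudo_counts(self, z):
        """Return Beta parameters (a, b) for each row of z, shape (k, d)."""
        k = z.shape[0]
        if not self.X:
            return np.full(k, self.prior_a), np.full(k, self.prior_b)
        X = np.asarray(self.X)
        y = np.asarray(self.y, dtype=float)
        sq_dist = ((z[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)  # (k, n)
        w = np.exp(-sq_dist / (2.0 * self.bandwidth ** 2))
        w = w * self.discount ** (self.t - np.asarray(self.times))  # older points count less
        # a / (a + b) approaches the discount-weighted Nadaraya-Watson estimate
        # sum(w * y) / sum(w) once sum(w) dominates the prior.
        return self.prior_a + w @ y, self.prior_b + w @ (1.0 - y)

    def select(self, z):
        a, b = self.pseudo_counts(z)
        return int(np.argmax(self.rng.beta(a, b)))  # one posterior draw per arm

    def update(self, z_chosen, reward):
        # Online learning: store the observation; nothing is retrained.
        self.t += 1
        self.X.append(np.asarray(z_chosen, dtype=float))
        self.y.append(float(reward))
        self.times.append(self.t)


def simulate(T=2000, n_arms=5, d=2, policy="kernel_ts", seed=1):
    """Toy non-stationary environment (hypothetical); returns cumulative regret."""
    rng = np.random.default_rng(seed)  # environment randomness only
    arm_feats = rng.normal(size=(n_arms, d))  # dense arm features
    agent = KernelThompsonBernoulli(seed=seed)
    regret = 0.0
    for t in range(T):
        x = rng.normal(size=d)  # context for this round
        drift = np.sin(2.0 * np.pi * t / T)  # shared shift applied to every arm
        p = 1.0 / (1.0 + np.exp(-(2.0 * np.tanh(arm_feats @ x) + drift)))
        z = np.hstack([np.tile(x, (n_arms, 1)), arm_feats])  # one embedding per arm
        if policy == "kernel_ts":
            arm = agent.select(z)
        else:
            arm = int(agent.rng.integers(n_arms))  # uniform-random baseline
        reward = rng.random() < p[arm]
        agent.update(z[arm], reward)
        regret += p.max() - p[arm]
    return regret


print(simulate(policy="kernel_ts"), simulate(policy="random"))
```

Because `update` only appends an observation, learning is online with no retraining step; the price is that every decision scans the full history, which a practical system would bound, for example with a sliding window. Policy randomness comes from the agent's own generator, so the last line compares the sketch with a uniform-random baseline on the same context and drift sequence.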

