Non-Stationary Contextual Bandit Learning via Neural Predictive Ensemble Sampling
This work addresses non-stationary contextual bandits for applications such as recommendation systems, where existing methods either explore excessively or fail to scale, and it reports a strong gain specific to this domain.
The paper tackles non-stationarity in contextual bandit learning, which arises from factors such as seasonality and evolving trends. It introduces a novel algorithm that combines a scalable deep neural network architecture with an exploration mechanism that prioritizes information of lasting value. On two real-world recommendation datasets, the algorithm demonstrates significant performance improvements over state-of-the-art baselines.
Real-world applications of contextual bandits often exhibit non-stationarity due to seasonality, serendipity, and evolving social trends. While a number of non-stationary contextual bandit learning algorithms have been proposed in the literature, they explore excessively because they do not prioritize information of enduring value, or they are designed in ways that do not scale to modern applications with high-dimensional user-specific features and large action sets, or both. In this paper, we introduce a novel non-stationary contextual bandit algorithm that addresses these concerns. It combines a scalable, deep-neural-network-based architecture with a carefully designed exploration mechanism that strategically prioritizes collecting information with the most lasting value in a non-stationary environment. Through empirical evaluations on two real-world recommendation datasets, both of which exhibit pronounced non-stationarity, we demonstrate that our approach significantly outperforms state-of-the-art baselines.
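For readers unfamiliar with the family of methods the title refers to, the following is a minimal sketch of generic ensemble sampling for a contextual bandit. It is not the algorithm proposed in this paper: it uses linear models instead of deep networks, and it does not implement the mechanism that prioritizes information of lasting value. The class name `LinearEnsembleSampler`, its parameters, and the use of a constant step size as a crude way to track drifting rewards are all assumptions made for illustration.

```python
import numpy as np


class LinearEnsembleSampler:
    """Generic ensemble sampling with linear reward models (illustrative only)."""

    def __init__(self, n_actions, dim, n_models=20, lr=0.1,
                 noise_scale=0.1, seed=0):
        self.rng = np.random.default_rng(seed)
        # Each ensemble member holds one weight vector per action, drawn
        # from a random prior so that members disagree at the start.
        self.weights = self.rng.standard_normal((n_models, n_actions, dim))
        self.lr = lr
        self.noise_scale = noise_scale

    def act(self, context):
        # Sample one member uniformly and act greedily with respect to it.
        member = self.rng.integers(self.weights.shape[0])
        scores = self.weights[member] @ context
        return int(np.argmax(scores))

    def update(self, context, action, reward):
        # Every member takes one SGD step on an independently perturbed
        # copy of the reward, which keeps the ensemble diverse.
        n_models = self.weights.shape[0]
        targets = reward + self.noise_scale * self.rng.standard_normal(n_models)
        preds = self.weights[:, action, :] @ context
        grad = (preds - targets)[:, None] * context[None, :]
        # A constant step size weights recent data more heavily than old
        # data (an assumption of this sketch, not the paper's approach).
        self.weights[:, action, :] -= self.lr * grad
```

In use, an agent would call `act(context)` to choose an action, observe the reward, and then call `update(context, action, reward)`. Disagreement among ensemble members drives exploration, and it shrinks for actions that have been tried often.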