Non-Stationary Bandits with Auto-Regressive Temporal Dependency
This addresses the challenge of temporal dynamics in real-world applications like recommendation systems and online advertising, representing a novel method for a known bottleneck.
The paper tackles the problem of non-stationary multi-armed bandits with temporal dependencies by introducing a novel framework with an auto-regressive reward structure and an algorithm that balances exploration-exploitation and discards outdated information, achieving a regret upper bound nearly matching the lower bound and demonstrating efficacy in a tourism demand prediction case study.
Traditional multi-armed bandit (MAB) frameworks, predominantly examined under stochastic or adversarial settings, often overlook the temporal dynamics inherent in many real-world applications such as recommendation systems and online advertising. This paper introduces a novel non-stationary MAB framework that captures the temporal structure of these real-world dynamics through an auto-regressive (AR) reward structure. We propose an algorithm that integrates two key mechanisms: (i) an alternation mechanism adept at leveraging temporal dependencies to dynamically balance exploration and exploitation, and (ii) a restarting mechanism designed to discard out-of-date information. Our algorithm achieves a regret upper bound that nearly matches the lower bound, with regret measured against a robust dynamic benchmark. Finally, via a real-world case study on tourism demand prediction, we demonstrate both the efficacy of our algorithm and the broader applicability of our techniques to more complex, rapidly evolving time series.