Exploration via Feature Perturbation in Contextual Bandits
This provides a computationally efficient and theoretically near-optimal solution for exploration in contextual bandits, applicable to non-parametric or neural models, with potential impact on recommendation systems and online decision-making.
The paper tackles the exploration-exploitation trade-off in contextual bandits by proposing feature perturbation, which injects randomness into feature inputs rather than parameters or rewards, achieving a worst-case regret bound of Õ(d√T) that improves upon the typical Õ(d^{3/2}√T) of existing methods.
We propose feature perturbation, a simple yet effective exploration strategy for contextual bandits that injects randomness directly into feature inputs, instead of randomizing unknown parameters or adding noise to rewards. Remarkably, this algorithm achieves $\tilde{\mathcal{O}}(d\sqrt{T})$ worst-case regret bound for generalized linear contextual bandits, while avoiding the $\tilde{\mathcal{O}}(d^{3/2}\sqrt{T})$ regret typical of existing randomized bandit algorithms. Because our algorithm eschews parameter sampling, it is both computationally efficient and naturally extends to non-parametric or neural network models. We verify these advantages through empirical evaluations, demonstrating that feature perturbation not only surpasses existing methods but also unifies strong practical performance with the near-optimal regret guarantees.