LG MLDec 12, 2023

Contextual Bandits with Online Neural Regression

Rohan Deb, Yikun Ban, Shiliang Zuo, Jingrui He, Arindam Banerjee

arXiv:2312.07145v111.58 citationsh-index: 14

Originality Incremental advance

AI Analysis

This work addresses the challenge of achieving better regret guarantees in contextual bandits with neural networks, which is significant for online learning and decision-making applications, though it builds incrementally on prior reductions and network theories.

The paper tackles the problem of improving regret bounds for Neural Contextual Bandits (NeuCBs) by developing a novel approach using perturbed neural networks to achieve logarithmic regret for online regression, which translates to improved regret bounds of $ ilde{\mathcal{O}}(\sqrt{KT})$ and $ ilde{\mathcal{O}}(\sqrt{KL^*} + K)$ for NeuCBs, outperforming existing algorithms in experiments.

Recent works have shown a reduction from contextual bandits to online regression under a realizability assumption [Foster and Rakhlin, 2020, Foster and Krishnamurthy, 2021]. In this work, we investigate the use of neural networks for such online regression and associated Neural Contextual Bandits (NeuCBs). Using existing results for wide networks, one can readily show a ${\mathcal{O}}(\sqrt{T})$ regret for online regression with square loss, which via the reduction implies a ${\mathcal{O}}(\sqrt{K} T^{3/4})$ regret for NeuCBs. Departing from this standard approach, we first show a $\mathcal{O}(\log T)$ regret for online regression with almost convex losses that satisfy QG (Quadratic Growth) condition, a generalization of the PL (Polyak-Łojasiewicz) condition, and that have a unique minima. Although not directly applicable to wide networks since they do not have unique minima, we show that adding a suitable small random perturbation to the network predictions surprisingly makes the loss satisfy QG with unique minima. Based on such a perturbed prediction, we show a ${\mathcal{O}}(\log T)$ regret for online regression with both squared loss and KL loss, and subsequently convert these respectively to $\tilde{\mathcal{O}}(\sqrt{KT})$ and $\tilde{\mathcal{O}}(\sqrt{KL^*} + K)$ regret for NeuCB, where $L^*$ is the loss of the best policy. Separately, we also show that existing regret bounds for NeuCBs are $Ω(T)$ or assume i.i.d. contexts, unlike this work. Finally, our experimental results on various datasets demonstrate that our algorithms, especially the one based on KL loss, persistently outperform existing algorithms.

View on arXiv PDF

Similar