Neural Contextual Bandits with Deep Representation and Shallow Exploration
This work addresses the computational inefficiency in neural contextual bandits for researchers and practitioners, though it is incremental as it builds on existing UCB and deep learning approaches.
The paper tackles the problem of contextual bandits with unknown reward functions by proposing a learning algorithm that uses deep representation learning for feature transformation and shallow exploration in the last linear layer, achieving a finite-time regret of $ ilde{O}(\sqrt{T})$ and improved computational efficiency compared to existing methods.
We study a general class of contextual bandits, where each context-action pair is associated with a raw feature vector, but the reward generating function is unknown. We propose a novel learning algorithm that transforms the raw feature vector using the last hidden layer of a deep ReLU neural network (deep representation learning), and uses an upper confidence bound (UCB) approach to explore in the last linear layer (shallow exploration). We prove that under standard assumptions, our proposed algorithm achieves $\tilde{O}(\sqrt{T})$ finite-time regret, where $T$ is the learning time horizon. Compared with existing neural contextual bandit algorithms, our approach is computationally much more efficient since it only needs to explore in the last layer of the deep neural network.