LG · ML · Mar 8, 2022

Reward-Biased Maximum Likelihood Estimation for Neural Contextual Bandits

arXiv:2203.04192v2 · 2 citations · h-index: 11
AI Analysis

This work addresses the explore-exploit trade-off in contextual bandits for applications with non-linear reward functions. Its neural-network-based approach is an incremental contribution that adapts a classic principle to this setting.

The paper tackles the stochastic contextual bandit problem with general bounded reward functions by proposing NeuralRBMLE, which adapts reward-biased maximum likelihood estimation to enforce exploration using neural networks, achieving $\widetilde{\mathcal{O}}(\sqrt{T})$ regret and showing comparable or better empirical performance than state-of-the-art methods on real-world datasets.

Reward-biased maximum likelihood estimation (RBMLE) is a classic principle in the adaptive control literature for tackling explore-exploit trade-offs. This paper studies the stochastic contextual bandit problem with general bounded reward functions and proposes NeuralRBMLE, which adapts the RBMLE principle by adding a bias term to the log-likelihood to enforce exploration. NeuralRBMLE leverages the representation power of neural networks and directly encodes exploratory behavior in the parameter space, without constructing confidence intervals of the estimated rewards. We propose two variants of NeuralRBMLE algorithms: The first variant directly obtains the RBMLE estimator by gradient ascent, and the second variant simplifies RBMLE to a simple index policy through an approximation. We show that both algorithms achieve $\widetilde{\mathcal{O}}(\sqrt{T})$ regret. Through extensive experiments, we demonstrate that the NeuralRBMLE algorithms achieve comparable or better empirical regrets than the state-of-the-art methods on real-world datasets with non-linear reward functions.
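The bias-term idea can be illustrated in a much simpler setting than the one studied in the paper. The following is a minimal sketch, assuming a linear reward model with unit-variance Gaussian noise and a ridge penalty; it is not the neural variant described above. Under these assumptions, maximizing the penalized log-likelihood plus a bias $\alpha(t)\,\theta^\top x_a$ over $\theta$ has a closed form, and comparing arms reduces to the index $\hat{\theta}^\top x_a + \tfrac{\alpha(t)}{2}\, x_a^\top V^{-1} x_a$, where $V = \lambda I + \sum_s x_s x_s^\top$ and $\hat{\theta}$ is the ridge estimate. The class name `LinearRBMLESketch`, the schedule $\alpha(t) = \sqrt{t+1}$, and the default $\lambda = 1$ are illustrative assumptions, not values taken from the paper.

```python
import math
import random


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


class LinearRBMLESketch:
    """Reward-biased index policy for a LINEAR model (illustrative only)."""

    def __init__(self, dim, lam=1.0):
        self.dim = dim
        # V^{-1}, where V = lam * I + sum of x x^T over played contexts.
        self.V_inv = [[(1.0 / lam if i == j else 0.0) for j in range(dim)]
                      for i in range(dim)]
        self.b = [0.0] * dim  # sum of reward * context
        self.t = 0            # number of rounds played so far

    def bias(self):
        # alpha(t): hypothetical schedule chosen for this sketch.
        return math.sqrt(self.t + 1)

    def index(self, x):
        theta_hat = matvec(self.V_inv, self.b)  # ridge estimate
        Vx = matvec(self.V_inv, x)
        # Estimated reward plus the exploration term induced by the bias.
        return dot(theta_hat, x) + 0.5 * self.bias() * dot(x, Vx)

    def select(self, contexts):
        scores = [self.index(x) for x in contexts]
        return max(range(len(contexts)), key=scores.__getitem__)

    def update(self, x, reward):
        # Sherman-Morrison rank-one update of V^{-1} (V_inv is symmetric).
        Vx = matvec(self.V_inv, x)
        denom = 1.0 + dot(x, Vx)
        for i in range(self.dim):
            for j in range(self.dim):
                self.V_inv[i][j] -= Vx[i] * Vx[j] / denom
        self.b = [b_i + reward * x_i for b_i, x_i in zip(self.b, x)]
        self.t += 1


if __name__ == "__main__":
    # Toy simulation with a made-up parameter vector and random contexts.
    rng = random.Random(0)
    theta_true = [0.6, -0.2, 0.4]
    policy = LinearRBMLESketch(dim=3)
    for _ in range(200):
        contexts = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(5)]
        arm = policy.select(contexts)
        reward = dot(theta_true, contexts[arm]) + rng.gauss(0.0, 0.1)
        policy.update(contexts[arm], reward)
    print("ridge estimate:", matvec(policy.V_inv, policy.b))
```

Note that no confidence interval is built explicitly: the exploration term falls out of biasing the likelihood toward parameters with a high reward for the candidate arm. With a neural reward model there is no such closed form, which is consistent with the abstract's description of one variant that runs gradient ascent and another that uses an approximation.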
