Conservative Contextual Bandits: Beyond Linear Representations
This work addresses safety constraints in sequential decision-making for applications such as online advertising and healthcare. Its contribution is incremental: it extends prior conservative bandit methods from the linear setting to non-linear settings.
The paper tackles safety in contextual bandits with non-linear cost functions by developing two algorithms, C-SquareCB and C-FastCB. With a neural network for function approximation and online gradient descent as the regression oracle, they achieve regret bounds of $\tilde{O}(\sqrt{KT} + K/\alpha)$ and $\tilde{O}(\sqrt{KL^*} + K(1 + 1/\alpha))$, respectively, while satisfying the safety constraint with high probability.
Conservative Contextual Bandits (CCBs) address safety in sequential decision-making by requiring that an agent's policy, in addition to minimizing regret, also satisfies a safety constraint: its performance must not be worse than that of a baseline policy (e.g., the policy that the company has in production) by more than a $(1+\alpha)$ factor. Prior work developed UCB-style algorithms in the multi-armed [Wu et al., 2016] and contextual linear [Kazerouni et al., 2017] settings. However, in practice the cost of the arms is often a non-linear function, and therefore existing UCB algorithms are ineffective in such settings. In this paper, we consider CCBs beyond the linear case and develop two algorithms, $\mathtt{C-SquareCB}$ and $\mathtt{C-FastCB}$, using Inverse Gap Weighting (IGW) based exploration and an online regression oracle. We show that the safety constraint is satisfied with high probability and that the regret of $\mathtt{C-SquareCB}$ is sub-linear in the horizon $T$, while the regret of $\mathtt{C-FastCB}$ is first-order and sub-linear in $L^*$, the cumulative loss of the optimal policy. Subsequently, we use a neural network for function approximation and online gradient descent as the regression oracle to provide $\tilde{O}(\sqrt{KT} + K/\alpha)$ and $\tilde{O}(\sqrt{KL^*} + K(1 + 1/\alpha))$ regret bounds, respectively. Finally, we demonstrate the efficacy of our algorithms on real-world data and show that they significantly outperform the existing baseline while maintaining the performance guarantee.
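To make the two ingredients named in the abstract concrete, the following is a minimal sketch, not the paper's algorithm. `igw_distribution` implements the standard Inverse Gap Weighting rule for predicted costs as used in SquareCB-style methods. `within_budget` is a hypothetical, simplified reading of the $(1+\alpha)$ safety constraint; the exact condition used by C-SquareCB and C-FastCB is not given here, and the function names, the exploration parameter `gamma`, and the cost upper bound `next_cost_upper` are all assumptions made for illustration.

```python
def igw_distribution(pred_costs, gamma):
    """Inverse Gap Weighting over K arms, given predicted costs (lower is better).

    Each non-greedy arm a gets probability 1 / (K + gamma * gap_a), where
    gap_a is its predicted cost minus the smallest predicted cost. The
    greedy arm receives the remaining probability mass.
    """
    K = len(pred_costs)
    best = min(range(K), key=lambda a: pred_costs[a])
    probs = [0.0] * K
    for a in range(K):
        if a != best:
            gap = pred_costs[a] - pred_costs[best]
            probs[a] = 1.0 / (K + gamma * gap)
    probs[best] = 1.0 - sum(probs)
    return probs


def within_budget(cum_cost, next_cost_upper, cum_baseline_cost,
                  next_baseline_cost, alpha):
    """Hypothetical safety check (an assumption, not the paper's condition).

    Returns True if, even when the next exploratory pull costs as much as
    next_cost_upper, the cumulative cost stays within (1 + alpha) times the
    baseline's cumulative cost.
    """
    total = cum_cost + next_cost_upper
    baseline_total = cum_baseline_cost + next_baseline_cost
    return total <= (1.0 + alpha) * baseline_total


# Usage: sample from IGW only when the check passes; otherwise a conservative
# agent would fall back to the baseline action.
probs = igw_distribution([0.2, 0.5, 0.9], gamma=10.0)
explore = within_budget(10.0, 1.0, 10.0, 1.0, alpha=0.1)
```

A larger `gamma` concentrates probability on the arm with the lowest predicted cost, while a smaller `alpha` makes the budget check fail more often, so the agent plays the baseline more frequently.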