A One-Size-Fits-All Solution to Conservative Bandit Problems
This work provides a more robust and theoretically stronger solution for conservative bandit problems, which is important for applications requiring strict performance guarantees over time, such as in finance or healthcare, by ensuring performance at any given time rather than just in expectation.
This paper addresses conservative bandit problems (CBPs) with sample-path reward constraints, where the learner's reward must always exceed a baseline. The authors propose a unified solution that achieves T-independent additive regrets for CMAB, CLB, and CCCB, outperforming prior T-dependent results. They also extend this to a novel conservative mean-variance bandit problem, achieving O(1/T) normalized additive regrets.
In this paper, we study a family of conservative bandit problems (CBPs) with sample-path reward constraints, i.e., the learner's reward performance must be at least as well as a given baseline at any time. We propose a One-Size-Fits-All solution to CBPs and present its applications to three encompassed problems, i.e. conservative multi-armed bandits (CMAB), conservative linear bandits (CLB) and conservative contextual combinatorial bandits (CCCB). Different from previous works which consider high probability constraints on the expected reward, we focus on a sample-path constraint on the actually received reward, and achieve better theoretical guarantees ($T$-independent additive regrets instead of $T$-dependent) and empirical performance. Furthermore, we extend the results and consider a novel conservative mean-variance bandit problem (MV-CBP), which measures the learning performance with both the expected reward and variability. For this extended problem, we provide a novel algorithm with $O(1/T)$ normalized additive regrets ($T$-independent in the cumulative form) and validate this result through empirical evaluation.