Risk-averse Contextual Multi-armed Bandit Problem with Linear Payoffs
This addresses risk management in sequential decision-making for applications like portfolio selection, but it is incremental as it adapts existing methods to a specific risk criterion.
The paper tackles the contextual multi-armed bandit problem with linear payoffs under a risk-averse mean-variance criterion, proposing a Thompson Sampling variant and proving a regret bound of O((1+ρ+1/ρ) d ln T ln(K/δ) √(d K T^{1+2ε} ln(K/δ)/ε)) with high probability.
In this paper we consider the contextual multi-armed bandit problem for linear payoffs under a risk-averse criterion. At each round, contexts are revealed for each arm, and the decision maker chooses one arm to pull and receives the corresponding reward. In particular, we consider mean-variance as the risk criterion, and the best arm is the one with the largest mean-variance reward. We apply the Thompson Sampling algorithm for the disjoint model, and provide a comprehensive regret analysis for a variant of the proposed algorithm. For $T$ rounds, $K$ actions, and $d$-dimensional feature vectors, we prove a regret bound of $O((1+ρ+\frac{1}ρ) d\ln T \ln \frac{K}δ\sqrt{d K T^{1+2ε} \ln \frac{K}δ \frac{1}ε})$ that holds with probability $1-δ$ under the mean-variance criterion with risk tolerance $ρ$, for any $0<ε<\frac{1}{2}$, $0<δ<1$. The empirical performance of our proposed algorithms is demonstrated via a portfolio selection problem.