Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR
This provides theoretical guarantees for risk-aware decision-making in RL, addressing a specific challenge in domains like finance or robotics, though it is incremental in refining existing risk-sensitive frameworks.
The paper tackles risk-sensitive reinforcement learning with Conditional Value at Risk (CVaR) by establishing minimax regret lower bounds and proposing algorithms that achieve near-optimal rates, such as $\widetilde O(\sqrt{ au^{-1}SAK})$ for tabular MDPs, improving on prior bounds.
In this paper, we study risk-sensitive Reinforcement Learning (RL), focusing on the objective of Conditional Value at Risk (CVaR) with risk tolerance $τ$. Starting with multi-arm bandits (MABs), we show the minimax CVaR regret rate is $Ω(\sqrt{τ^{-1}AK})$, where $A$ is the number of actions and $K$ is the number of episodes, and that it is achieved by an Upper Confidence Bound algorithm with a novel Bernstein bonus. For online RL in tabular Markov Decision Processes (MDPs), we show a minimax regret lower bound of $Ω(\sqrt{τ^{-1}SAK})$ (with normalized cumulative rewards), where $S$ is the number of states, and we propose a novel bonus-driven Value Iteration procedure. We show that our algorithm achieves the optimal regret of $\widetilde O(\sqrt{τ^{-1}SAK})$ under a continuity assumption and in general attains a near-optimal regret of $\widetilde O(τ^{-1}\sqrt{SAK})$, which is minimax-optimal for constant $τ$. This improves on the best available bounds. By discretizing rewards appropriately, our algorithms are computationally efficient.