LGOCSTMLFeb 7, 2023

Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR

arXiv:2302.03201v238 citationsh-index: 38
AI Analysis

This provides theoretical guarantees for risk-aware decision-making in RL, addressing a specific challenge in domains like finance or robotics, though it is incremental in refining existing risk-sensitive frameworks.

The paper tackles risk-sensitive reinforcement learning with Conditional Value at Risk (CVaR) by establishing minimax regret lower bounds and proposing algorithms that achieve near-optimal rates, such as $\widetilde O(\sqrt{ au^{-1}SAK})$ for tabular MDPs, improving on prior bounds.

In this paper, we study risk-sensitive Reinforcement Learning (RL), focusing on the objective of Conditional Value at Risk (CVaR) with risk tolerance $τ$. Starting with multi-arm bandits (MABs), we show the minimax CVaR regret rate is $Ω(\sqrt{τ^{-1}AK})$, where $A$ is the number of actions and $K$ is the number of episodes, and that it is achieved by an Upper Confidence Bound algorithm with a novel Bernstein bonus. For online RL in tabular Markov Decision Processes (MDPs), we show a minimax regret lower bound of $Ω(\sqrt{τ^{-1}SAK})$ (with normalized cumulative rewards), where $S$ is the number of states, and we propose a novel bonus-driven Value Iteration procedure. We show that our algorithm achieves the optimal regret of $\widetilde O(\sqrt{τ^{-1}SAK})$ under a continuity assumption and in general attains a near-optimal regret of $\widetilde O(τ^{-1}\sqrt{SAK})$, which is minimax-optimal for constant $τ$. This improves on the best available bounds. By discretizing rewards appropriately, our algorithms are computationally efficient.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes