Near-Optimal Sample Complexity for Iterated CVaR Reinforcement Learning with a Generative Model
This work addresses sample efficiency for risk-sensitive RL, providing tight bounds that are optimal for fixed risk levels, which is incremental but important for applications requiring robust decision-making under uncertainty.
The paper tackles the sample complexity problem of risk-sensitive reinforcement learning with a generative model, focusing on maximizing Conditional Value at Risk (CVaR) at each step, and establishes nearly matching upper and lower bounds, such as an upper bound of Õ(SA/((1-γ)^4τ^2ε^2)) for an ε-optimal policy.
In this work, we study the sample complexity problem of risk-sensitive Reinforcement Learning (RL) with a generative model, where we aim to maximize the Conditional Value at Risk (CVaR) with risk tolerance level $τ$ at each step, a criterion we refer to as Iterated CVaR. We first build a connection between Iterated CVaR RL and $(s, a)$-rectangular distributional robust RL with a specific uncertainty set for CVaR. We establish nearly matching upper and lower bounds on the sample complexity of this problem. Specifically, we first prove that a value iteration-based algorithm, ICVaR-VI, achieves an $ε$-optimal policy with at most $\tilde{O} \left(\frac{SA}{(1-γ)^4τ^2ε^2} \right)$ samples, where $γ$ is the discount factor, and $S, A$ are the sizes of the state and action spaces. Furthermore, when $τ\geq γ$, the sample complexity improves to $\tilde{O} \left( \frac{SA}{(1-γ)^3ε^2} \right)$. We further show a minimax lower bound of $\tilde{O} \left(\frac{(1-γτ)SA}{(1-γ)^4τε^2} \right)$. For a fixed risk level $τ\in (0,1]$, our upper and lower bounds match, demonstrating the tightness and optimality of our analysis. We also investigate a limiting case with a small risk level $τ$, called Worst-Path RL, where the objective is to maximize the minimum possible cumulative reward. We develop matching upper and lower bounds of $\tilde{O} \left(\frac{SA}{p_{\min}} \right)$, where $p_{\min}$ denotes the minimum non-zero reaching probability of the transition kernel.