LGAIMLOct 25, 2022

Bridging Distributional and Risk-sensitive Reinforcement Learning with Provable Regret Bounds

arXiv:2210.14051v319 citationsh-index: 4
Originality Incremental advance
AI Analysis

This work addresses sample complexity challenges in risk-sensitive RL for applications requiring safety or robustness, though it is incremental as it builds on existing regret bounds with novel distributional insights.

The paper tackles the problem of risk-sensitive reinforcement learning (RSRL) by bridging it with distributional reinforcement learning (DRL) methods, establishing a risk-sensitive distributional dynamic programming framework and proposing novel algorithms that achieve a regret upper bound of Õ((exp(|β|H)-1)/|β| * H√(S²AK)), matching prior work with new distributional analysis.

We study the regret guarantee for risk-sensitive reinforcement learning (RSRL) via distributional reinforcement learning (DRL) methods. In particular, we consider finite episodic Markov decision processes whose objective is the entropic risk measure (EntRM) of return. By leveraging a key property of the EntRM, the independence property, we establish the risk-sensitive distributional dynamic programming framework. We then propose two novel DRL algorithms that implement optimism through two different schemes, including a model-free one and a model-based one. We prove that they both attain $\tilde{\mathcal{O}}(\frac{\exp(|β| H)-1}{|β|}H\sqrt{S^2AK})$ regret upper bound, where $S$, $A$, $K$, and $H$ represent the number of states, actions, episodes, and the time horizon, respectively. It matches RSVI2 proposed in \cite{fei2021exponential}, with novel distributional analysis. To the best of our knowledge, this is the first regret analysis that bridges DRL and RSRL in terms of sample complexity. Acknowledging the computational inefficiency associated with the model-free DRL algorithm, we propose an alternative DRL algorithm with distribution representation. This approach not only maintains the established regret bounds but also significantly amplifies computational efficiency. We also prove a tighter minimax lower bound of $Ω(\frac{\exp(βH/6)-1}{βH}H\sqrt{SAT})$ for the $β>0$ case, which recovers the tight lower bound $Ω(H\sqrt{SAT})$ in the risk-neutral setting.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes