LGAIOCMLMay 30, 2025

Entropic Risk Optimization in Discounted MDPs: Sample Complexity Bounds with a Generative Model

arXiv:2506.00286v22 citationsh-index: 14
Originality Incremental advance
AI Analysis

This work addresses risk-sensitive reinforcement learning for agents with specific risk preferences, offering theoretical insights but is incremental in extending existing bounds to risk-aware settings.

The paper tackles the problem of learning optimal policies in discounted Markov decision processes with entropic risk preferences, providing sample complexity bounds for a model-based algorithm and showing that exponential dependence on risk sensitivity and horizon is unavoidable.

In this paper, we analyze the sample complexities of learning the optimal state-action value function $Q^*$ and an optimal policy $π^*$ in a finite discounted Markov decision process (MDP) where the agent has recursive entropic risk-preferences with risk-parameter $β\neq 0$ and where a generative model of the MDP is available. We provide and analyze a simple model based approach which we call model-based risk-sensitive $Q$-value-iteration (MB-RS-QVI) which leads to $(\varepsilon,δ)$-PAC-bounds on $\|Q^*-Q^k\|$, and $\|V^*-V^{π_k}\|$ where $Q_k$ is the output of MB-RS-QVI after k iterations and $π_k$ is the greedy policy with respect to $Q_k$. Both PAC-bounds have exponential dependence on the effective horizon $\frac{1}{1-γ}$ and the strength of this dependence grows with the learners risk-sensitivity $|β|$. We also provide two lower bounds which shows that exponential dependence on $|β|\frac{1}{1-γ}$ is unavoidable in both cases. The lower bounds reveal that the PAC-bounds are tight in the parameters $S,A,δ,\varepsilon$ and that unlike in the classical setting it is not possible to have polynomial dependence in all model parameters.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes