Reinforcement Learning with Markov Risk Measures and Multipattern Risk Approximation

arXiv:2605.0065411.2h-index: 3

AI Analysis

This work provides a theoretical framework and algorithm for risk-averse reinforcement learning with provable regret guarantees, addressing a known bottleneck in handling risk measures in finite-horizon MDPs.

The paper introduces mini-batch Markov coherent risk measures and multipattern risk-averse problems, proposing a feature-based Q-learning method with multipattern Q-factor approximation that achieves a regret bound of O(H^2 N^H sqrt(K)). The method is validated on stochastic assignment and multi-armed bandit problems.

For a risk-averse finite-horizon Markov Decision Problem, we introduce a special class of Markov coherent risk measures, called mini-batch measures. We also define the class of multipattern risk-averse problems that generalizes the class of linear systems. We use both concepts in a feature-based $Q$-learning method with multipattern $Q$-factor approximation and we prove a high-probability regret bound of $\mathcal{O}\big(H^2 N^H \sqrt{ K}\big)$, where $H$ is the horizon, $N$ is the mini-batch size, and $K$ is the number of episodes. We also propose an economical version of the $Q$-learning method that streamlines the policy evaluation (backward) step. The theoretical results are illustrated on a stochastic assignment problem and a short-horizon multi-armed bandit problem.

View on arXiv PDF

Similar