LG · AI · Nov 12, 2025

Optimistic Reinforcement Learning with Quantile Objectives

arXiv:2511.09652v1
h-index: 10
Originality: Incremental advance
AI Analysis

This work addresses risk sensitivity in RL for applications such as healthcare and finance. It is an incremental advance: it builds on existing quantile-based methods and contributes a new algorithm.

The paper tackles the problem of risk-sensitive reinforcement learning by developing UCB-QRL, an optimistic algorithm for optimizing quantile objectives in finite-horizon MDPs, achieving a high-probability regret bound of O((2/κ)^{H+1}H√(SATH log(2SATH/δ))).
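To get a feel for how the stated bound scales, the expression can be evaluated numerically. The sketch below simply plugs numbers into the formula quoted above; the function name `regret_bound_rate` is hypothetical, the constant hidden by the O-notation is dropped, and the log is assumed to be the natural logarithm, so only ratios between values are meaningful.

```python
import math


def regret_bound_rate(S, A, T, H, kappa, delta):
    """Evaluate (2/kappa)^(H+1) * H * sqrt(SATH * log(2*SATH/delta)).

    Hypothetical helper: the constant hidden in the O-notation is dropped
    and the natural logarithm is assumed.
    """
    if kappa <= 0 or not (0 < delta < 1):
        raise ValueError("need kappa > 0 and 0 < delta < 1")
    n = S * A * T * H
    return (2.0 / kappa) ** (H + 1) * H * math.sqrt(n * math.log(2 * n / delta))


# Halving kappa multiplies the bound by 2^(H+1): the dependence on the
# horizon is exponential whenever kappa < 2.
ratio = regret_bound_rate(5, 2, 100, 3, 1.0, 0.1) / regret_bound_rate(5, 2, 100, 3, 2.0, 0.1)

# The bound grows like sqrt(T log T), so the per-episode average shrinks with T.
avg_small_T = regret_bound_rate(5, 2, 100, 3, 1.0, 0.1) / 100
avg_large_T = regret_bound_rate(5, 2, 10_000, 3, 1.0, 0.1) / 10_000
```

The two comparisons illustrate the two features of the bound: sublinear growth in the number of episodes $T$, and a factor $(2/κ)^{H+1}$ that can dominate for long horizons when $κ$ is small.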

Reinforcement Learning (RL) has achieved tremendous success in recent years. However, the classical foundations of RL do not account for the risk sensitivity of the objective function, which is critical in various fields, including healthcare and finance. A popular approach to incorporate risk sensitivity is to optimize a specific quantile of the cumulative reward distribution. In this paper, we develop UCB-QRL, an optimistic learning algorithm for the $τ$-quantile objective in finite-horizon Markov decision processes (MDPs). UCB-QRL is an iterative algorithm in which, at each iteration, we first estimate the underlying transition probability and then optimize the quantile value function over a confidence ball around this estimate. We show that UCB-QRL yields a high-probability regret bound $\mathcal O\left((2/κ)^{H+1}H\sqrt{SATH\log(2SATH/δ)}\right)$ in the episodic setting with $S$ states, $A$ actions, $T$ episodes, and horizon $H$. Here, $κ>0$ is a problem-dependent constant that captures the sensitivity of the underlying MDP's quantile value.
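Two ingredients named in the abstract can be illustrated in isolation: the empirical transition estimate and the $τ$-quantile of a cumulative-reward sample. This is a minimal sketch, not the paper's algorithm. The helper names, the uniform fallback for unvisited state-action pairs, and the lower-quantile convention $\inf\{x : F(x) \ge τ\}$ are all assumptions made for illustration, and the confidence ball and the optimistic optimization step are omitted entirely.

```python
import numpy as np


def empirical_transitions(counts):
    """Turn visit counts[s, a, s_next] into an estimate P_hat[s, a, s_next].

    Assumption: unvisited (s, a) pairs fall back to a uniform distribution.
    """
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=-1, keepdims=True)
    num_states = counts.shape[-1]
    return np.where(totals > 0, counts / np.maximum(totals, 1.0), 1.0 / num_states)


def lower_quantile(returns, tau):
    """Empirical tau-quantile, taken as inf{x : F(x) >= tau} (an assumed convention)."""
    x = np.sort(np.asarray(returns, dtype=float))
    k = int(np.ceil(tau * len(x)))  # smallest k with k / n >= tau
    return x[max(k, 1) - 1]


# One state, two actions: action 0 was tried four times, action 1 never.
p_hat = empirical_transitions([[[3, 1], [0, 0]]])

# Cumulative rewards observed over four hypothetical episodes.
median_return = lower_quantile([4.0, 1.0, 3.0, 2.0], 0.5)
```

Unlike the expected return, the quantile depends on the whole distribution of cumulative rewards, which is why the objective is evaluated on sampled or computed return distributions rather than on per-step averages.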

