Q-learning for Quantile MDPs: A Decomposition, Performance, and Convergence Analysis
This work addresses risk-sensitive decision-making in reinforcement learning, offering a more efficient, model-free approach to quantile optimization. The contribution is incremental: it builds on prior methods and improves on their convergence and performance.
The paper tackles the problem of optimizing quantile risk measures in Markov decision processes by proposing a new Q-learning algorithm with a simple dynamic program decomposition, achieving convergence to its DP variant and outperforming earlier algorithms in tabular domains.
In Markov decision processes (MDPs), quantile risk measures such as Value-at-Risk are a standard metric for modeling RL agents' preferences for certain outcomes. This paper proposes a new Q-learning algorithm for quantile optimization in MDPs with strong convergence and performance guarantees. The algorithm leverages a new, simple dynamic program (DP) decomposition for quantile MDPs. Compared with prior work, our DP decomposition requires neither known transition probabilities nor solving complex saddle point equations and serves as a suitable foundation for other model-free RL algorithms. Our numerical results in tabular domains show that our Q-learning algorithm converges to its DP variant and outperforms earlier algorithms.
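To make the objective concrete, the following is a minimal sketch of the quantile (Value-at-Risk) of a set of sampled returns. It is not the paper's Q-learning algorithm or its DP decomposition. The function name `empirical_quantile`, the lower-quantile convention inf{z : P(X <= z) >= alpha}, and the two sample return lists are illustrative assumptions and do not come from the paper.

```python
import math


def empirical_quantile(samples, alpha):
    """Lower alpha-quantile of a finite sample: the smallest value z such
    that at least an alpha fraction of the samples is <= z.

    This convention is an assumption made for illustration; the paper may
    define Value-at-Risk differently.
    """
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    ordered = sorted(samples)
    # Smallest count k with k / n >= alpha.
    k = math.ceil(alpha * len(ordered))
    return ordered[k - 1]


# Hypothetical returns of two policies over eight episodes each.
returns_a = [-10.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # risky, higher mean
returns_b = [1.0] * 8                                    # safe, lower mean

mean_a = sum(returns_a) / len(returns_a)
mean_b = sum(returns_b) / len(returns_b)
var_a = empirical_quantile(returns_a, 0.25)
var_b = empirical_quantile(returns_b, 0.25)

print(f"policy A: mean={mean_a}, 0.25-quantile={var_a}")
print(f"policy B: mean={mean_b}, 0.25-quantile={var_b}")
```

In this toy data, policy A has the higher mean return while policy B has the higher 0.25-quantile, so an expectation-maximizing agent and a quantile-maximizing agent would rank the two policies differently. This is the kind of preference that quantile optimization is meant to capture.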