LG · Jul 2, 2023

Is Risk-Sensitive Reinforcement Learning Properly Resolved?

arXiv:2307.00547v2 · 3 citations · h-index: 17
AI Analysis

This paper addresses a foundational issue in risk-sensitive reinforcement learning and offers a novel solution with theoretical guarantees, although its improvement over existing methods is incremental.

The paper argues that existing risk-sensitive reinforcement learning methods do not properly optimize risk measures, proving that they neither optimize without bias nor guarantee optimality. It then proposes Trajectory Q-Learning (TQL), which achieves provable policy improvement and better performance in experiments.

Due to the importance of risk management in learning applicable policies, risk-sensitive reinforcement learning (RSRL) has been recognized as an important research direction. RSRL is usually achieved by learning risk-sensitive objectives characterized by various risk measures within the framework of distributional reinforcement learning. However, it remains unclear whether the distributional Bellman operator properly optimizes the RSRL objective in the sense of risk measures. In this paper, we prove that existing RSRL methods do not achieve unbiased optimization and cannot guarantee optimality, or even improvement, with respect to risk measures over accumulated return distributions. To remedy this issue, we propose a novel algorithm, Trajectory Q-Learning (TQL), for RSRL problems with provable policy improvement towards the optimal policy. Our new learning architecture allows a general and practical implementation for different risk measures, which can be used to learn disparate risk-sensitive policies. In the experiments, we verify the learnability of our algorithm and show that our method achieves better performance on risk-sensitive objectives.
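To make the bias described in the abstract concrete, here is a minimal sketch on a hypothetical two-step toy problem. The rewards, the risk level `ALPHA = 0.5`, and the helper names (`cvar`, `total_return_dist`, `ACTIONS`, `R1`) are illustrative assumptions, not taken from the paper. The step-1 reward is random but leads to the same state, so a Markov policy that ranks actions by the CVaR of the reward-to-go alone (risk-greedy action selection over a return-to-go distribution) must use one action regardless of the reward already collected. A brute-force search over policies that condition on the step-1 reward stands in for a trajectory-aware learner; it is not an implementation of TQL.

```python
from itertools import product

ALPHA = 0.5  # hypothetical risk level: CVaR averages the worst 50% of outcomes

# Step 1 (hypothetical): a random reward that does NOT change the state,
# so both outcomes lead to the same state s1.
R1 = [(0.0, 0.5), (10.0, 0.5)]

# Step 2 (hypothetical): per-action reward distributions at s1,
# written as (value, probability) atoms.
ACTIONS = {
    "safe": [(5.0, 1.0)],
    "risky": [(0.0, 0.5), (12.0, 0.5)],
}


def cvar(dist, alpha):
    """Lower-tail CVaR: mean of the worst alpha-fraction of probability mass."""
    remaining, total = alpha, 0.0
    for value, prob in sorted(dist):
        take = min(prob, remaining)  # take part of an atom if it straddles the cutoff
        total += take * value
        remaining -= take
        if remaining <= 1e-12:
            break
    return total / alpha


def total_return_dist(policy):
    """Distribution of r1 + r2 when the step-2 action may depend on r1."""
    return [
        (r1 + r2, p1 * p2)
        for r1, p1 in R1
        for r2, p2 in ACTIONS[policy[r1]]
    ]


# Markov, risk-greedy choice: rank actions by CVaR of the reward-to-go only,
# so the same action is used after every step-1 outcome.
greedy_action = max(ACTIONS, key=lambda a: cvar(ACTIONS[a], ALPHA))
markov_policy = {r1: greedy_action for r1, _ in R1}
markov_value = cvar(total_return_dist(markov_policy), ALPHA)

# Trajectory-aware choice: exhaustively search policies that condition on r1.
candidates = [
    dict(zip([r1 for r1, _ in R1], choice))
    for choice in product(ACTIONS, repeat=len(R1))
]
best_policy = max(candidates, key=lambda pi: cvar(total_return_dist(pi), ALPHA))
best_value = cvar(total_return_dist(best_policy), ALPHA)

print(greedy_action, markov_value)  # safe 5.0
print(best_policy, best_value)      # {0.0: 'risky', 10.0: 'safe'} 6.0
```

In this toy case, the risk-greedy Markov policy reaches a CVaR of 5.0 on the accumulated return, while conditioning on the step-1 reward reaches 6.0. After the low first reward, that whole branch lies in the worst 50% of outcomes, so the higher-mean risky action is the better choice for the accumulated return even though its own reward-to-go CVaR is lower.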

