ML, AI, LG, SY, OC · Feb 27, 2020

Cautious Reinforcement Learning via Distributional Risk in the Dual Domain

arXiv:2002.12475v1 · 29 citations
AI Analysis

This work addresses computational inefficiencies in risk-sensitive reinforcement learning for finite MDPs, offering an incremental improvement in method efficiency.

The paper addresses the computational challenges of risk-sensitive reinforcement learning by proposing a new risk definition, called caution, that is added as a penalty to the dual objective of the linear programming formulation. It also introduces a stochastic primal-dual method whose iteration/sample complexity matches tight dependencies on the state and action space cardinalities, and its experiments show more reliable reward accumulation without extra computational burden.
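As a rough illustration of what "a penalty added to the dual objective" could look like, the block below writes the standard dual LP of a discounted finite MDP with a caution term subtracted. The notation is an assumption made here, not taken from this page: $\lambda$ is the state-action occupancy measure, $r$ the reward, $P$ the transition kernel, $\xi$ the initial state distribution, $\gamma \in (0,1)$ a discount factor, $\rho$ the caution function, and $c \ge 0$ a penalty weight.

```latex
\begin{aligned}
\max_{\lambda \ge 0} \quad
  & \sum_{s,a} \lambda(s,a)\, r(s,a) \;-\; c\,\rho(\lambda) \\
\text{s.t.} \quad
  & \sum_{a} \lambda(s',a)
    \;=\; (1-\gamma)\,\xi(s')
    \;+\; \gamma \sum_{s,a} P(s' \mid s,a)\,\lambda(s,a)
    \qquad \text{for all } s'.
\end{aligned}
```

With $c = 0$ this reduces to the usual risk-neutral dual LP. Under this normalization the constraints force $\lambda$ to sum to one, so it can be read as a distribution over state-action pairs.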

We study the estimation of risk-sensitive policies in reinforcement learning problems defined by Markov Decision Processes (MDPs) whose state and action spaces are finite. Prior efforts are predominantly afflicted by computational challenges associated with the fact that risk-sensitive MDPs are time-inconsistent. To ameliorate this issue, we propose a new definition of risk, which we call caution, as a penalty function added to the dual objective of the linear programming (LP) formulation of reinforcement learning. The caution measures the distributional risk of a policy, which is a function of the policy's long-term state occupancy distribution. To solve this problem in an online model-free manner, we propose a stochastic variant of the primal-dual method that uses Kullback-Leibler (KL) divergence as its proximal term. We establish that the number of iterations/samples required to attain approximately optimal solutions of this scheme matches tight dependencies on the cardinality of the state and action spaces, but differs in its dependence on the infinity norm of the gradient of the risk measure. Experiments demonstrate the merits of this approach for improving the reliability of reward accumulation without additional computational burdens.
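To make the role of the KL proximal term more concrete, here is a minimal sketch of a single KL-proximal (mirror-ascent) step on an occupancy distribution. It is not the paper's algorithm: the function names, the quadratic caution penalty, the `target` distribution, and the step size are all hypothetical choices for illustration, and the Bellman flow constraints and the stochastic primal (value) update are omitted.

```python
import numpy as np


def kl_prox_step(lam, grad, step):
    """One mirror-ascent step on the probability simplex with a KL proximal term.

    Solves argmax_x <grad, x> - (1/step) * KL(x || lam) over the simplex.
    The closed form is the multiplicative update x ∝ lam * exp(step * grad).
    Assumes every entry of lam is strictly positive.
    """
    logits = np.log(lam) + step * grad
    logits -= logits.max()  # shift for numerical stability; cancels on normalising
    new = np.exp(logits)
    return new / new.sum()


def caution_gradient(lam, reward, target, c):
    """Gradient of <lam, reward> - c * 0.5 * ||lam - target||^2.

    The quadratic penalty is a hypothetical stand-in for a caution function.
    """
    return reward - c * (lam - target)


# Toy example: 4 state-action pairs, uniform starting occupancy.
lam = np.full(4, 0.25)
reward = np.array([1.0, 0.0, 0.0, 0.0])
target = np.full(4, 0.25)  # hypothetical "safe" occupancy to stay close to

grad = caution_gradient(lam, reward, target, c=0.5)
lam_next = kl_prox_step(lam, grad, step=0.1)
```

Because the KL term yields a multiplicative update followed by normalisation, the iterate stays strictly positive and sums to one without an explicit projection, which is the usual motivation for choosing it over a Euclidean proximal term on the simplex.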
