Policy Gradient Methods for Distortion Risk Measures
This work addresses risk-sensitive reinforcement learning for applications that require safety or robustness, but it appears incremental, since it extends existing policy gradient methods to the family of distortion risk measures.
The authors address the problem of learning risk-sensitive policies in reinforcement learning by maximizing a distortion risk measure of the cumulative reward. They derive policy gradient algorithms and give non-asymptotic bounds on convergence to an approximate stationary point.
We propose policy gradient algorithms that learn risk-sensitive policies in a reinforcement learning (RL) framework. Our proposed algorithms maximize the distortion risk measure (DRM) of the cumulative reward in an episodic Markov decision process, in both on-policy and off-policy RL settings. We derive a variant of the policy gradient theorem that caters to the DRM objective, and integrate it with a likelihood ratio-based gradient estimation scheme. We derive non-asymptotic bounds that establish the convergence of our proposed algorithms to an approximate stationary point of the DRM objective.
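To make the DRM objective concrete, the following is a minimal sketch of how a DRM of the cumulative reward could be estimated from sampled episode returns. It uses the standard order-statistic form of the empirical Choquet integral and is not necessarily the estimator used in the paper. The function name `empirical_drm` and the example distortion function are hypothetical, introduced only for illustration. The distortion function g is assumed to be nondecreasing on [0, 1] with g(0) = 0 and g(1) = 1.

```python
from typing import Callable, Sequence


def empirical_drm(returns: Sequence[float],
                  distortion: Callable[[float], float]) -> float:
    """Estimate a distortion risk measure from sampled cumulative rewards.

    Hypothetical helper (not from the paper). With sorted samples
    x_(1) <= ... <= x_(n), the estimate is
        sum_i x_(i) * [g((n - i + 1) / n) - g((n - i) / n)],
    which is the Choquet integral of the empirical distribution.
    """
    x = sorted(returns)
    n = len(x)
    if n == 0:
        raise ValueError("need at least one sampled return")
    total = 0.0
    for i, value in enumerate(x, start=1):
        # Weight given to the i-th smallest return by the distortion g.
        weight = distortion((n - i + 1) / n) - distortion((n - i) / n)
        total += weight * value
    return total


# Example episode returns (made-up numbers for illustration only).
sample_returns = [1.0, 2.0, 3.0, 4.0]

# Identity distortion: the DRM reduces to the sample mean.
risk_neutral = empirical_drm(sample_returns, lambda u: u)

# Assumed example distortion g(u) = u**2, which shifts weight
# toward the lower returns and so yields a value below the mean.
risk_averse = empirical_drm(sample_returns, lambda u: u ** 2)
```

With the identity distortion the estimate equals the sample mean (2.5 here), while g(u) = u² gives 1.875, reflecting the extra weight placed on poor outcomes. A policy gradient method for this objective would then adjust the policy parameters to increase such an estimate.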