LG, AI, CP, PM, ML · Jul 8, 2020

A Natural Actor-Critic Algorithm with Downside Risk Constraints

arXiv:2007.04203v1 · 1 citation
Originality: Incremental advance
AI Analysis

This work addresses risk-sensitive reinforcement learning for applications that require aversion to downside risk. The advance is incremental: it builds on an existing constrained policy optimization method, Reward Constrained Policy Optimization, rather than introducing a new one.

The paper tackles the problem of high variance and low sample efficiency in risk-sensitive reinforcement learning by introducing a new Bellman equation that upper bounds the lower partial moment for downside risk, enabling sample-efficient online estimation. It demonstrates effectiveness on three benchmark problems by integrating this proxy into a natural actor-critic method for constrained policies.

Existing work on risk-sensitive reinforcement learning - both for symmetric and downside risk measures - has typically used direct Monte-Carlo estimation of policy gradients. While this approach yields unbiased gradient estimates, it also suffers from high variance and decreased sample efficiency compared to temporal-difference methods. In this paper, we study prediction and control with aversion to downside risk which we gauge by the lower partial moment of the return. We introduce a new Bellman equation that upper bounds the lower partial moment, circumventing its non-linearity. We prove that this proxy for the lower partial moment is a contraction, and provide intuition into the stability of the algorithm by variance decomposition. This allows sample-efficient, on-line estimation of partial moments. For risk-sensitive control, we instantiate Reward Constrained Policy Optimization, a recent actor-critic method for finding constrained policies, with our proxy for the lower partial moment. We extend the method to use natural policy gradients and demonstrate the effectiveness of our approach on three benchmark problems for risk-sensitive reinforcement learning.
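
For readers unfamiliar with the risk measure, the lower partial moment of order m about a target t is commonly defined as E[max(t - G, 0)^m], where G is the return. It penalizes only outcomes that fall below the target. The sketch below estimates it by direct Monte-Carlo averaging over sampled episode returns. This is the high-variance baseline the abstract contrasts against, not the paper's Bellman-equation proxy. The function names, the target of 0, the order of 2, and the toy episodes are illustrative assumptions, not taken from the paper.

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted sum of one episode's rewards, accumulated backwards."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def lower_partial_moment(returns, target=0.0, order=2):
    """Monte-Carlo estimate of E[max(target - G, 0) ** order].

    Only returns below `target` contribute; returns at or above it add zero.
    """
    shortfalls = [max(target - g, 0.0) ** order for g in returns]
    return sum(shortfalls) / len(shortfalls)


# Toy data (hypothetical): three short episodes of rewards.
episodes = [[1.0, -2.0], [0.5, 0.5], [-1.0, -1.0]]
returns = [discounted_return(ep, gamma=1.0) for ep in episodes]
lpm = lower_partial_moment(returns, target=0.0, order=2)
```

With these toy episodes the undiscounted returns are -1, 1 and -2, so only the first and third episodes contribute to the estimate. An estimator of this kind needs complete episodes before it can be updated, which is one reason the abstract argues for a temporal-difference alternative.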

