On the Global Convergence of Risk-Averse Natural Policy Gradient Methods with Expected Conditional Risk Measures
This work addresses reliable performance in highly stochastic sequential decision-making, a practical concern for reinforcement learning practitioners. The contribution is incremental: it extends existing policy gradient methods to a risk-averse setting.
The paper tackles the lack of global convergence guarantees for risk-averse policy gradient methods in reinforcement learning. It proposes a natural policy gradient algorithm based on Expected Conditional Risk Measures, establishes global optimality and iteration complexity results, and validates the algorithm empirically on a stochastic Cliffwalk environment.
Risk-sensitive reinforcement learning (RL) has become a popular tool for controlling the risk of uncertain outcomes and ensuring reliable performance in highly stochastic sequential decision-making problems. While it has been shown that policy gradient methods can find globally optimal policies in the risk-neutral setting, it remains unclear whether their risk-averse variants enjoy the same global convergence guarantees. In this paper, we consider a class of dynamic time-consistent risk measures, named Expected Conditional Risk Measures (ECRMs), and derive natural policy gradient (NPG) updates for ECRM-based RL problems. We establish global optimality and iteration complexity guarantees for the proposed risk-averse NPG algorithm with softmax parameterization and entropy regularization under both exact and inexact policy evaluation. Furthermore, we test our risk-averse NPG algorithm on a stochastic Cliffwalk environment to demonstrate the efficacy of our method.
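To make the ingredients named in the abstract concrete, the following is a minimal sketch of an NPG update with softmax parameterization and entropy regularization under exact policy evaluation. It implements the standard risk-neutral, reward-maximizing form of the update for tabular policies, not the ECRM-based risk-averse update derived in the paper, which the abstract does not spell out. The random MDP, the function names (`soft_q`, `npg_step`, `run_npg`, `random_mdp`), and the values of the discount factor `gamma`, regularization weight `tau`, and step size `eta` are all illustrative assumptions.

```python
import numpy as np


def soft_q(P, r, pi, gamma, tau):
    """Exact entropy-regularized Q-function of a tabular policy pi.

    P has shape (S, A, S), r and pi have shape (S, A).
    """
    S = r.shape[0]
    # Expected one-step reward plus entropy bonus in each state under pi.
    r_pi = np.sum(pi * (r - tau * np.log(pi)), axis=1)
    # State-to-state transition matrix induced by pi.
    P_pi = np.einsum("sa,sat->st", pi, P)
    # Solve the linear system V = r_pi + gamma * P_pi @ V.
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return r + gamma * P @ V


def npg_step(pi, Q, eta, gamma, tau):
    """One entropy-regularized NPG step for a softmax policy.

    In closed form:
    pi_new(a|s) is proportional to
    pi(a|s)^(1 - eta*tau/(1-gamma)) * exp(eta * Q(s,a) / (1-gamma)).
    """
    logits = (1.0 - eta * tau / (1.0 - gamma)) * np.log(pi) + eta * Q / (1.0 - gamma)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)


def run_npg(P, r, gamma=0.8, tau=0.5, eta=None, iters=300):
    """Run NPG from the uniform policy with exact policy evaluation."""
    S, A = r.shape
    if eta is None:
        # Assumed step size: half of the largest value (1 - gamma) / tau
        # for which the exponent on pi above stays nonnegative.
        eta = 0.5 * (1.0 - gamma) / tau
    pi = np.full((S, A), 1.0 / A)
    for _ in range(iters):
        Q = soft_q(P, r, pi, gamma, tau)
        pi = npg_step(pi, Q, eta, gamma, tau)
    return pi


def random_mdp(S=4, A=3, seed=0):
    """Hypothetical random MDP used only for illustration."""
    rng = np.random.default_rng(seed)
    P = rng.random((S, A, S))
    P /= P.sum(axis=2, keepdims=True)  # normalize to valid transition kernels
    r = rng.random((S, A))
    return P, r


if __name__ == "__main__":
    P, r = random_mdp()
    print(np.round(run_npg(P, r), 4))
```

With the step size set to `(1 - gamma) / tau`, the exponent on the previous policy vanishes and the update reduces to soft policy iteration. The inexact-evaluation setting mentioned in the abstract could be mimicked by perturbing the output of `soft_q` before calling `npg_step`.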