Actor-Critic Algorithm for Dynamic Expectile and CVaR
This work provides a model-free solution for dynamic risk optimization in reinforcement learning, addressing a known bottleneck in risk-sensitive policy optimization.
The paper proposes a model-free, off-policy actor-critic algorithm for optimizing dynamic expectile and CVaR. It combines a surrogate policy gradient that avoids transition perturbation with value learning based on elicitability. Empirical results show that it outperforms existing methods on risk-averse tasks.
Optimizing dynamic risk with stochastic policies is challenging in both policy updates and value learning. The former typically requires transition perturbation, while the latter may rely on model-based approaches. To address these challenges, we propose a surrogate policy gradient without transition perturbation under softmax policy parameterization. We further develop model-free value learning methods for dynamic expectile and conditional value-at-risk (CVaR) by leveraging elicitability. Finally, inspired by Expected SARSA and Expected Policy Gradient, we construct a model-free off-policy actor-critic algorithm. Empirical results in domains with verifiable risk-averse behavior show that our algorithm learns risk-averse policies and consistently outperforms existing methods.
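To make the role of elicitability more concrete, the following is a minimal sketch of the static (single-step) versions of the two risk measures, estimated directly from samples. It is not the paper's dynamic, recursive algorithm. The function names `expectile` and `lower_cvar` are hypothetical, and the sketch assumes the samples are returns, so lower values are worse. The expectile is the minimizer of an asymmetric squared loss, which is what allows it to be learned from samples without a model; the lower-tail CVaR is computed here simply as the mean of the worst fraction of outcomes.

```python
import numpy as np

def expectile(samples, tau, iters=100):
    """Estimate the tau-expectile of the samples.

    The tau-expectile minimizes the asymmetric squared loss
    E[|tau - 1{x <= q}| * (x - q)^2]. Its first-order condition gives
    q = sum(w * x) / sum(w), with weight tau above q and (1 - tau)
    at or below q, which is iterated here as a fixed point.
    """
    x = np.asarray(samples, dtype=float)
    q = x.mean()  # tau = 0.5 recovers the mean, so start there
    for _ in range(iters):
        w = np.where(x > q, tau, 1.0 - tau)
        q = np.sum(w * x) / np.sum(w)
    return q

def lower_cvar(samples, alpha):
    """Estimate lower-tail CVaR: the mean of the worst alpha-fraction."""
    x = np.sort(np.asarray(samples, dtype=float))
    k = max(1, int(np.ceil(alpha * x.size)))  # number of tail samples
    return x[:k].mean()

# Illustrative returns (assumed data, not from the paper).
print(expectile([0.0, 1.0], 0.25))              # 0.25
print(lower_cvar([4.0, 1.0, 3.0, 2.0], 0.5))    # 1.5
```

With `tau` below 0.5 the expectile falls below the mean, and with `alpha` below 1 the CVaR averages only the worst outcomes, so both summarize a return distribution in a risk-averse way under the stated assumption that higher returns are better.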