Boosting CVaR Policy Optimization with Quantile Gradients
The work addresses a sample-inefficiency problem relevant to practitioners of risk-averse reinforcement learning, though it appears incremental because it builds on prior CVaR-PG methods.
The paper tackles the sample inefficiency of Conditional Value-at-Risk (CVaR) policy optimization by augmenting CVaR with an expected quantile term, which yields substantial improvements over existing methods in risk-averse domains.
Optimizing Conditional Value-at-Risk (CVaR) using policy gradient (a.k.a. CVaR-PG) suffers from significant sample inefficiency. This inefficiency stems from the fact that CVaR-PG focuses on tail-end performance and overlooks many sampled trajectories. We address this problem by augmenting CVaR with an expected quantile term. Quantile optimization admits a dynamic programming formulation that leverages all sampled data, thus improving sample efficiency. This does not alter the CVaR objective, since CVaR corresponds to the expectation of the quantile over the tail. Empirical results in domains with verifiable risk-averse behavior show that our algorithm within the Markovian policy class substantially improves upon CVaR-PG and consistently outperforms other existing methods.
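The two claims in the abstract (that a tail-focused estimator overlooks most sampled trajectories, and that CVaR equals the expectation of the quantile over the tail) can be checked numerically. The sketch below is only an illustration, not the paper's algorithm: it assumes CVaR is taken over the lower tail of returns at a risk level alpha, uses the standard identity CVaR_alpha(Z) = (1/alpha) * integral of q_tau(Z) for tau in (0, alpha], and draws synthetic Gaussian "returns" in place of real trajectories. All function names are hypothetical.

```python
import math
import random


def empirical_quantile(sorted_returns, tau):
    """Lower empirical quantile: the ceil(tau * n)-th smallest sample."""
    n = len(sorted_returns)
    k = max(1, math.ceil(tau * n - 1e-9))  # 1-indexed rank; epsilon guards float noise
    return sorted_returns[k - 1]


def cvar_tail_mean(returns, alpha):
    """CVaR estimate as the mean of the worst ceil(alpha * n) returns.

    Also returns how many samples contribute to the estimate.
    """
    ordered = sorted(returns)
    k = max(1, math.ceil(alpha * len(ordered) - 1e-9))
    return sum(ordered[:k]) / k, k


def cvar_from_quantiles(returns, alpha, grid=1000):
    """CVaR estimate as the average of q_tau over tau in (0, alpha] (midpoint rule)."""
    ordered = sorted(returns)
    taus = [alpha * (i + 0.5) / grid for i in range(grid)]
    return sum(empirical_quantile(ordered, t) for t in taus) / grid


# Synthetic stand-in for trajectory returns (assumption: i.i.d. standard normal).
rng = random.Random(0)
returns = [rng.gauss(0.0, 1.0) for _ in range(2000)]
alpha = 0.1  # assumed risk level

tail_mean, used = cvar_tail_mean(returns, alpha)
quantile_avg = cvar_from_quantiles(returns, alpha)
fraction_used = used / len(returns)

print(f"CVaR (tail mean):        {tail_mean:.4f}")
print(f"CVaR (quantile average): {quantile_avg:.4f}")
print(f"Fraction of samples used by the tail estimator: {fraction_used:.2f}")
```

With alpha = 0.1, only one trajectory in ten enters the tail-mean estimate, which is the inefficiency the abstract describes. The two CVaR estimates agree, which is why adding an expected quantile term is consistent with the CVaR objective.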