LG OC | Mar 17, 2024

A Simple Mixture Policy Parameterization for Improving Sample Efficiency of CVaR Optimization

Yudong Luo, Yangchen Pan, Han Wang, Philip Torr, Pascal Poupart

arXiv:2403.11062v3 | 12.57 citations | h-index: 17 | RLJ

Originality: Incremental advance

AI Analysis

This work addresses a practical limitation of risk-averse reinforcement learning, namely sample inefficiency, which matters for applications that require efficient sample use. The advance is incremental because it builds on existing CVaR-PG methods.

The paper tackles the sample inefficiency of reinforcement learning algorithms that use policy gradients to optimize Conditional Value at Risk (CVaR). It proposes a simple mixture policy parameterization that uses all collected trajectories for policy updates and counteracts vanishing gradients. The resulting method finds risk-averse policies in some MuJoCo environments where traditional CVaR-PG fails to learn a reasonable policy.

Reinforcement learning algorithms utilizing policy gradients (PG) to optimize Conditional Value at Risk (CVaR) face significant challenges with sample inefficiency, hindering their practical applications. This inefficiency stems from two main facts: a focus on tail-end performance that overlooks many sampled trajectories, and the potential of gradient vanishing when the lower tail of the return distribution is overly flat. To address these challenges, we propose a simple mixture policy parameterization. This method integrates a risk-neutral policy with an adjustable policy to form a risk-averse policy. By employing this strategy, all collected trajectories can be utilized for policy updating, and the issue of vanishing gradients is counteracted by stimulating higher returns through the risk-neutral component, thus lifting the tail and preventing flatness. Our empirical study reveals that this mixture parameterization is uniquely effective across a variety of benchmark domains. Specifically, it excels in identifying risk-averse CVaR policies in some Mujoco environments where the traditional CVaR-PG fails to learn a reasonable policy.
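
The two inefficiencies named in the abstract can be illustrated with a small numerical sketch. It is not the authors' implementation. It assumes the standard sampled CVaR policy gradient weighting, in which trajectory i receives weight (R_i - VaR) * 1[R_i <= VaR] / (alpha * N), and it assumes a discrete-action mixture of the form w * adjustable + (1 - w) * risk-neutral. The function names `empirical_cvar`, `cvar_pg_weights`, and `mixture_probs`, and the scalar weight `w`, are hypothetical and chosen only for illustration.

```python
import numpy as np


def empirical_cvar(returns, alpha):
    """Return (VaR, CVaR) of the lower alpha-tail of sampled returns."""
    returns = np.asarray(returns, dtype=float)
    var = np.quantile(returns, alpha)      # empirical alpha-quantile (VaR)
    tail = returns[returns <= var]         # worst-case trajectories only
    return var, tail.mean()


def cvar_pg_weights(returns, alpha):
    """Per-trajectory weights multiplying grad log-prob in sampled CVaR-PG.

    Assumed form: (R_i - VaR) * 1[R_i <= VaR] / (alpha * N).
    Trajectories above the VaR get weight zero, so they are unused.
    """
    returns = np.asarray(returns, dtype=float)
    var = np.quantile(returns, alpha)
    in_tail = returns <= var
    return np.where(in_tail, returns - var, 0.0) / (alpha * len(returns))


def mixture_probs(w, adjustable_probs, risk_neutral_probs):
    """Action probabilities of a two-component mixture policy.

    Hypothetical parameterization: w weights the adjustable component,
    (1 - w) weights the risk-neutral component.
    """
    adjustable_probs = np.asarray(adjustable_probs, dtype=float)
    risk_neutral_probs = np.asarray(risk_neutral_probs, dtype=float)
    return w * adjustable_probs + (1.0 - w) * risk_neutral_probs


if __name__ == "__main__":
    alpha = 0.2

    # Case 1: only the tail trajectories carry a nonzero weight.
    spread = np.arange(10.0)
    print("nonzero weights:", np.count_nonzero(cvar_pg_weights(spread, alpha)))

    # Case 2: a flat lower tail makes every weight zero (vanishing gradient).
    flat = [0, 0, 0, 0, 5, 6, 7, 8, 9, 10]
    print("flat-tail weights:", cvar_pg_weights(flat, alpha))

    # Mixture of an adjustable and a risk-neutral action distribution.
    print("mixture:", mixture_probs(0.5, [0.9, 0.1], [0.2, 0.8]))
```

In the first case most sampled trajectories contribute nothing to the update, and in the second case even the tail trajectories contribute nothing because their returns all equal the VaR. Under the stated assumptions, the risk-neutral component of the mixture can still be trained on every trajectory, which is the intuition the abstract gives for lifting the tail.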

