Return Capping: Sample-Efficient CVaR Policy Gradient Optimisation
This work addresses a sample-efficiency bottleneck for researchers and practitioners in risk-sensitive reinforcement learning, though it appears incremental because it builds on existing CVaR methods.
The paper tackles the poor sample efficiency of conditional value at risk (CVaR) optimisation with policy gradients. It proposes a reformulation that caps the total return of trajectories instead of discarding them, and reports consistently improved performance over baselines in empirical tests across several environments.
When optimising for conditional value at risk (CVaR) using policy gradients (PG), current methods rely on discarding a large proportion of trajectories, resulting in poor sample efficiency. We propose a reformulation of the CVaR optimisation problem that caps the total return of trajectories used in training, rather than simply discarding them, and show that this is equivalent to the original problem if the cap is set appropriately. We show, with empirical results in a number of environments, that this reformulation results in consistently improved performance compared to baselines. All our code is available at https://github.com/HarryMJMead/cvar-return-capping.
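
The equivalence claimed above can be illustrated numerically. The sketch below is not taken from the paper or its repository. It assumes the standard Rockafellar-Uryasev representation of lower-tail CVaR, CVaR_alpha(R) = max_C { C + (E[min(R, C)] - C) / alpha }, whose maximiser is the value at risk (VaR). The function names `tail_mean_cvar` and `capped_return_cvar`, the Gaussian returns, and alpha = 0.1 are all hypothetical choices made for illustration.

```python
import random


def tail_mean_cvar(returns, alpha):
    """Empirical CVaR as the mean of the worst alpha-fraction of returns.

    Also returns the empirical VaR (the largest return inside that tail).
    Every return above the VaR is discarded by this estimator.
    """
    ordered = sorted(returns)
    k = max(1, round(alpha * len(ordered)))  # number of tail samples kept
    var = ordered[k - 1]
    return sum(ordered[:k]) / k, var


def capped_return_cvar(returns, alpha, cap):
    """Rockafellar-Uryasev objective written in terms of capped returns.

    Every trajectory contributes min(return, cap), so none is discarded.
    """
    mean_capped = sum(min(r, cap) for r in returns) / len(returns)
    return cap + (mean_capped - cap) / alpha


# Hypothetical data: 1000 sampled trajectory returns (an assumption).
random.seed(0)
returns = [random.gauss(0.0, 1.0) for _ in range(1000)]
alpha = 0.1

cvar_discard, var = tail_mean_cvar(returns, alpha)
cvar_capped = capped_return_cvar(returns, alpha, cap=var)
cvar_low_cap = capped_return_cvar(returns, alpha, cap=var - 0.5)

print(f"tail-mean CVaR:          {cvar_discard:.6f}")
print(f"capped CVaR (cap = VaR): {cvar_capped:.6f}")
print(f"capped, cap too low:     {cvar_low_cap:.6f}")
```

With the cap set to the empirical VaR, the two estimates agree up to floating-point error, while a cap set below the VaR gives a smaller value. This is consistent with the statement that the reformulation matches the original problem only when the cap is set appropriately. How the cap is chosen and updated during training is not specified in the abstract and is not modelled here.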