LG ML, Jan 30

Value-at-Risk Constrained Policy Optimization

arXiv:2601.22993v11.4h-index: 10

Originality: Highly original

AI Analysis

This paper addresses safe exploration in reinforcement learning for applications that require risk-aware decision-making, and proposes a novel method for a known bottleneck.

The paper tackles policy optimization under Value-at-Risk constraints by introducing VaR-CPO, which achieves zero constraint violations during training in feasible environments, outperforming baseline methods.

We introduce the Value-at-Risk Constrained Policy Optimization algorithm (VaR-CPO), a sample-efficient and conservative method designed to optimize Value-at-Risk (VaR) constraints directly. Empirically, we demonstrate that VaR-CPO is capable of safe exploration, achieving zero constraint violations during training in feasible environments, a critical property that baseline methods fail to uphold. To overcome the inherent non-differentiability of the VaR constraint, we employ the one-sided Chebyshev inequality to obtain a tractable surrogate based on the first two moments of the cost return. Additionally, by extending the trust-region framework of the Constrained Policy Optimization (CPO) method, we provide rigorous worst-case bounds for both policy improvement and constraint violation during the training process.
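
To make the Chebyshev step concrete, here is a minimal sketch of how a moment-based surrogate for a VaR constraint can be formed. It assumes the constraint has the form P(C >= d) <= epsilon for a cost return C, and applies the one-sided Chebyshev (Cantelli) inequality, which gives the sufficient condition mean + std * sqrt((1 - epsilon) / epsilon) <= d. The function names, the sample-based moment estimates, and the Gaussian test data are illustrative assumptions, not the paper's implementation, and the exact surrogate used in VaR-CPO may differ.

```python
import math
import random


def cantelli_var_surrogate(costs, epsilon):
    """Conservative upper bound on the (1 - epsilon)-VaR of the cost return.

    Uses only the first two moments: mean + std * sqrt((1 - epsilon) / epsilon).
    By the one-sided Chebyshev (Cantelli) inequality, the probability that the
    cost return reaches this value is at most epsilon.
    """
    n = len(costs)
    mean = sum(costs) / n
    # Population variance (divide by n), so the bound holds exactly for the
    # empirical distribution of the samples.
    var = sum((c - mean) ** 2 for c in costs) / n
    return mean + math.sqrt(var * (1.0 - epsilon) / epsilon)


def empirical_violation_rate(costs, threshold):
    """Fraction of sampled cost returns at or above the threshold."""
    return sum(c >= threshold for c in costs) / len(costs)


# Hypothetical cost returns; in practice these would come from policy rollouts.
random.seed(0)
costs = [random.gauss(10.0, 2.0) for _ in range(10_000)]
epsilon = 0.05

surrogate = cantelli_var_surrogate(costs, epsilon)
rate = empirical_violation_rate(costs, surrogate)

# If surrogate <= d for a cost limit d, then P(C >= d) <= epsilon is implied.
print(f"surrogate bound: {surrogate:.3f}, empirical tail rate: {rate:.4f}")
```

Because the bound is distribution-free, it is conservative: for well-behaved cost distributions the actual tail probability at the surrogate value is typically far below epsilon. The surrogate is a smooth function of the mean and variance, which is what makes it usable where the VaR constraint itself is non-differentiable.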
