LG Aug 24, 2025

Rectified Robust Policy Optimization for Model-Uncertain Constrained Reinforcement Learning without Strong Duality

Shaocong Ma, Ziyi Chen, Yi Zhou, Heng Huang

arXiv:2508.17448v214.48 citations h-index: 12 Trans. Mach. Learn. Res.

Originality: Incremental advance

AI Analysis

This paper addresses the challenge of ensuring both safety and optimality in reinforcement learning under model uncertainty, which is critical for real-world applications such as robotics. The advance is incremental: it builds on existing robust constrained RL methods by removing their reliance on strong duality.

The paper studies robust constrained reinforcement learning under model uncertainty and shows that strong duality does not generally hold, so traditional primal-dual methods can fail to find optimal feasible policies. It proposes a primal-only algorithm, Rectified Robust Policy Optimization (RRPO), with convergence guarantees whose iteration complexity matches the best-known lower bound when the diameter of the uncertainty set is suitably controlled. In a grid-world environment, RRPO achieves robust and safe performance.

The goal of robust constrained reinforcement learning (RL) is to optimize an agent's performance under worst-case model uncertainty while satisfying safety or resource constraints. In this paper, we demonstrate that strong duality does not generally hold in robust constrained RL, indicating that traditional primal-dual methods may fail to find optimal feasible policies. To overcome this limitation, we propose a novel primal-only algorithm called Rectified Robust Policy Optimization (RRPO), which operates directly on the primal problem without relying on dual formulations. We provide theoretical convergence guarantees under mild regularity assumptions, showing convergence to an approximately optimal feasible policy with iteration complexity matching the best-known lower bound when the uncertainty set diameter is controlled at a specific level. Empirical results in a grid-world environment validate the effectiveness of our approach, demonstrating that RRPO achieves robust and safe performance under model uncertainty, whereas the non-robust method can violate the worst-case safety constraints.
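The abstract does not spell out the RRPO update rule, so the following is only a toy sketch of one common primal-only pattern that fits the description: at each iteration, check the worst-case constraint value; if it falls below the threshold (minus a tolerance), take a policy step that improves the constraint, otherwise take a step that improves the worst-case reward. No dual variable appears anywhere. Everything concrete here is an assumption made for illustration and is not taken from the paper: the two-state MDP, the finite uncertainty set `kernels`, the softmax parameterization, the finite-difference gradients, and the names `threshold`, `tol`, and `rectified_robust_sketch`.

```python
import numpy as np

def policy_value(theta, P, payoff, gamma=0.9):
    """Exact discounted value from state 0 of a softmax policy under kernel P."""
    pi = np.exp(theta - theta.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)           # pi[s, a]
    P_pi = np.einsum("sa,sat->st", pi, P)         # state-to-state kernel under pi
    r_pi = (pi * payoff).sum(axis=1)              # expected one-step payoff per state
    v = np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)
    return v[0]

def worst_case(theta, kernels, payoff):
    """Minimum value over a finite (hypothetical) uncertainty set, and the kernel attaining it."""
    values = [policy_value(theta, P, payoff) for P in kernels]
    k = int(np.argmin(values))
    return values[k], kernels[k]

def num_grad(f, theta, eps=1e-5):
    """Central finite-difference gradient; used only to keep the sketch short."""
    g = np.zeros_like(theta)
    for idx in np.ndindex(theta.shape):
        d = np.zeros_like(theta)
        d[idx] = eps
        g[idx] = (f(theta + d) - f(theta - d)) / (2 * eps)
    return g

def rectified_robust_sketch(kernels, reward, utility, threshold,
                            tol=0.05, lr=0.5, iters=300):
    theta = np.zeros(reward.shape)                # uniform initial policy
    best_theta, best_value = None, -np.inf
    for _ in range(iters):
        u_val, u_P = worst_case(theta, kernels, utility)
        if u_val < threshold - tol:
            # Worst-case constraint violated: ascend the utility under its worst kernel.
            theta = theta + lr * num_grad(lambda t: policy_value(t, u_P, utility), theta)
        else:
            # Constraint satisfied: record the iterate, then ascend the worst-case reward.
            r_val, r_P = worst_case(theta, kernels, reward)
            if r_val > best_value:
                best_theta, best_value = theta.copy(), r_val
            theta = theta + lr * num_grad(lambda t: policy_value(t, r_P, reward), theta)
    return best_theta, best_value

# Toy problem (hypothetical): state 0 is safe, state 1 is a hazard;
# action 0 is cautious, action 1 is risky but rewarding.
reward = np.array([[0.0, 1.0], [0.0, 1.0]])       # reward[s, a]
utility = np.array([[1.0, 1.0], [0.0, 0.0]])      # utility accrues only in the safe state
nominal = np.array([[[1.0, 0.0], [0.8, 0.2]],
                    [[0.9, 0.1], [0.5, 0.5]]])    # P[s, a, s']
perturbed = np.array([[[1.0, 0.0], [0.5, 0.5]],
                      [[0.7, 0.3], [0.3, 0.7]]])
kernels = [nominal, perturbed]
threshold, tol = 8.0, 0.05

best_theta, best_value = rectified_robust_sketch(kernels, reward, utility, threshold, tol=tol)
print("worst-case reward:", best_value)
print("worst-case utility:", worst_case(best_theta, kernels, utility)[0])
```

Only iterates that pass the worst-case constraint check are eligible to be returned, which is what makes the output approximately feasible by construction. Evaluating the constraint on `nominal` alone instead of taking the minimum over `kernels` would give a non-robust variant of the same loop, which can accept policies that violate the constraint under `perturbed`.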
