Exchange Policy Optimization Algorithm for Semi-Infinite Safe Reinforcement Learning
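This work addresses safety constraints that must hold across a continuous parameter space, such as adequate resource distribution at every spatial location, which makes it relevant to safe RL applications. The contribution is incremental, however, because it builds on existing safe RL methods.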
The paper tackles safe reinforcement learning with infinitely many constraints, known as semi-infinite safe RL, by proposing the exchange policy optimization (EPO) algorithm. EPO is reported to achieve optimal policy performance and deterministic bounded safety, with global constraint violations remaining strictly within a prescribed bound.
Safe reinforcement learning (safe RL) aims to respect safety requirements while optimizing long-term performance. In many practical applications, however, the problem involves an infinite number of constraints, known as semi-infinite safe RL (SI-safe RL). Such constraints typically appear when safety conditions must be enforced across an entire continuous parameter space, such as ensuring adequate resource distribution at every spatial location. In this paper, we propose exchange policy optimization (EPO), an algorithmic framework that achieves optimal policy performance and deterministic bounded safety. EPO works by iteratively solving safe RL subproblems with finite constraint sets and adaptively adjusting the active set through constraint expansion and deletion. At each iteration, constraints with violations exceeding the predefined tolerance are added to refine the policy, while those with zero Lagrange multipliers are removed after the policy update. This exchange rule prevents uncontrolled growth of the working set and supports effective policy training. Our theoretical analysis demonstrates that, under mild assumptions, strategies trained via EPO achieve performance comparable to optimal solutions with global constraint violations strictly remaining within a prescribed bound.
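The exchange rule described above can be illustrated on a toy problem. The sketch below is not the paper's implementation: it replaces the policy with a single scalar parameter `theta`, replaces the safe RL subproblem with a one-dimensional program solvable in closed form, and searches for violated constraints on a fixed grid over the constraint parameter `y` in [0, 1]. The functions `a`, `b`, `violation`, `solve_subproblem`, and `exchange`, as well as the values of `eps` and `theta_max`, are all hypothetical choices made for illustration.

```python
import numpy as np

# Toy semi-infinite problem (an assumption, not from the paper):
#   maximize theta  subject to  theta * a(y) - b(y) <= 0  for every y in [0, 1].
def a(y):
    return 1.0 + y

def b(y):
    return 2.0 + 4.0 * (y - 0.5) ** 2

def violation(theta, y):
    """Constraint value at parameter y; positive means the constraint is violated."""
    return theta * a(y) - b(y)

def solve_subproblem(working_set, theta_max):
    """Solve the finite-constraint subproblem in closed form.

    Returns the optimal theta and one Lagrange multiplier per working-set
    constraint. In one dimension only the binding constraint has a
    nonzero multiplier.
    """
    if not working_set:
        return theta_max, []
    ys = np.array(working_set)
    bounds = b(ys) / a(ys)            # each constraint implies theta <= b(y) / a(y)
    k = int(np.argmin(bounds))        # tightest (binding) constraint
    multipliers = np.zeros(len(ys))
    if bounds[k] < theta_max:
        theta = float(bounds[k])
        multipliers[k] = 1.0 / a(ys[k])   # from stationarity: 1 = lambda * a(y_k)
    else:
        theta = theta_max             # the box bound binds; all multipliers are zero
    return theta, multipliers.tolist()

def exchange(eps=1e-3, theta_max=5.0, max_iters=50, grid_size=1001):
    grid = np.linspace(0.0, 1.0, grid_size)
    working_set = []
    theta, _ = solve_subproblem(working_set, theta_max)
    iterations = 0
    for _ in range(max_iters):
        viol = violation(theta, grid)
        worst = int(np.argmax(viol))
        if viol[worst] <= eps:        # every grid constraint is within tolerance
            break
        # Expansion: add the most violated constraint to the working set.
        working_set.append(float(grid[worst]))
        theta, lam = solve_subproblem(working_set, theta_max)
        # Deletion: drop constraints whose multiplier is zero after the update.
        working_set = [y for y, l in zip(working_set, lam) if l > 0.0]
        iterations += 1
    return theta, working_set, iterations

if __name__ == "__main__":
    theta, working_set, iterations = exchange()
    print(f"theta={theta:.4f}, working set={working_set}, iterations={iterations}")
```

In this toy setting the working set never holds more than one constraint after each deletion step, which mirrors the stated purpose of the exchange rule: keeping the active set from growing without bound. A real SI-safe RL implementation would replace `solve_subproblem` with a constrained policy optimization routine and the grid search with whatever violation oracle the application provides.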