Automatic Constraint Policy Optimization based on Continuous Constraint Interpolation Framework for Offline Reinforcement Learning
This work addresses the challenge of choosing an appropriate policy constraint in offline RL, a choice that is crucial for limiting extrapolation error when learning from a fixed dataset.
The paper proposes Continuous Constraint Interpolation (CCI), a unified framework that connects three constraint families (weighted behavior cloning, density regularization, and support constraints), and develops Automatic Constraint Policy Optimization (ACPO), a primal-dual algorithm that automatically adapts the interpolation parameter governing the constraint type. Experiments on the D4RL and NeoRL2 benchmarks show robust gains across domains and state-of-the-art performance overall.
Offline reinforcement learning (RL) relies on policy constraints to mitigate extrapolation error, and both the form and the strength of the constraint critically shape performance. However, most existing methods commit to a single constraint family (weighted behavior cloning, density regularization, or support constraints) without a unified principle that explains how the families relate or what trade-offs they involve. In this work, we propose Continuous Constraint Interpolation (CCI), a unified optimization framework in which these three constraint families arise as special cases along a common constraint spectrum. The CCI framework introduces a single interpolation parameter that enables smooth transitions and principled combinations across constraint types. Building on CCI, we develop Automatic Constraint Policy Optimization (ACPO), a practical primal-dual algorithm that adapts the interpolation parameter via a Lagrangian dual update. Moreover, we establish a maximum-entropy performance difference lemma and derive performance lower bounds for both the closed-form optimal policy and its parametric projection. Experiments on D4RL and NeoRL2 demonstrate robust gains across diverse domains, achieving state-of-the-art performance overall.
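For readers unfamiliar with primal-dual updates, the following is a minimal, generic sketch of the idea on a toy problem. It is not the ACPO algorithm: the objective, the constraint, the function name `primal_dual`, and the step sizes are all assumptions chosen for illustration. It only shows how a primal variable can be updated by gradient descent on a Lagrangian while a dual variable is updated by projected gradient ascent.

```python
# Toy primal-dual (gradient descent-ascent) sketch. NOT the paper's ACPO
# algorithm: the problem below is a hypothetical stand-in chosen so the
# answer can be checked by hand.
#
# Problem:  minimize (x - 3)^2  subject to  x <= 1
# Lagrangian:  L(x, lam) = (x - 3)^2 + lam * (x - 1),  with lam >= 0


def primal_dual(steps=5000, lr_x=0.01, lr_lam=0.01):
    """Alternate a primal descent step on x with a dual ascent step on lam."""
    x, lam = 0.0, 0.0
    for _ in range(steps):
        # Primal step: descend the Lagrangian with respect to x.
        grad_x = 2.0 * (x - 3.0) + lam
        x -= lr_x * grad_x
        # Dual step: ascend with respect to lam using the constraint
        # violation (x - 1), then project back onto lam >= 0.
        lam = max(0.0, lam + lr_lam * (x - 1.0))
    return x, lam


if __name__ == "__main__":
    x, lam = primal_dual()
    print(f"x = {x:.4f}, lam = {lam:.4f}")
```

In this toy problem the constraint is active at the optimum, so the iterates approach x = 1 and, from the stationarity condition 2(x - 3) + lam = 0, lam = 4. The dual variable grows while the constraint is violated and shrinks otherwise, which is the general mechanism a Lagrangian dual update uses to tune a quantity automatically.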