LG · CL · RO · Feb 5

Constrained Group Relative Policy Optimization

arXiv:2602.05863v2 · 1 citation · h-index: 9
Originality: Highly original
AI Analysis

This work provides a simple and effective method for constrained policy optimization in embodied AI domains, which is crucial for the safe and reliable deployment of large multimodal foundation models.

This paper extends Group Relative Policy Optimization (GRPO) to handle explicit behavioral constraints by introducing Constrained GRPO, a Lagrangian-based framework. It identifies a failure mode in which mismatched component-wise standard deviations in advantage estimation corrupt the Lagrangian signal and prevent effective constraint enforcement, and it addresses this with a scalarized advantage construction. On robotics tasks, the method improves constraint satisfaction while increasing task success.

While Group Relative Policy Optimization (GRPO) has emerged as a scalable framework for critic-free policy learning, extending it to settings with explicit behavioral constraints remains underexplored. We introduce Constrained GRPO, a Lagrangian-based extension of GRPO for constrained policy optimization. Constraints are specified via indicator cost functions, enabling direct optimization of violation rates through a Lagrangian relaxation. We show that a naive multi-component treatment in advantage estimation can break constrained learning: mismatched component-wise standard deviations distort the relative importance of the different objective terms, which in turn corrupts the Lagrangian signal and prevents meaningful constraint enforcement. We formally derive this effect to motivate our scalarized advantage construction that preserves the intended trade-off between reward and constraint terms. Experiments in a toy gridworld confirm the predicted optimization pathology and demonstrate that scalarizing advantages restores stable constraint control. In addition, we evaluate Constrained GRPO on robotics tasks, where it improves constraint satisfaction while increasing task success, establishing a simple and effective recipe for constrained policy optimization in embodied AI domains that increasingly rely on large multimodal foundation models.
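
The distortion described in the abstract can be illustrated with a small NumPy sketch. This is not the paper's implementation: the function names, the toy rollout group, the `eps` constant, and the projected dual-ascent step are all assumptions made for illustration. The sketch compares normalizing the reward and cost components separately within a group against normalizing a single Lagrangian-scalarized return, and shows that the component-wise version behaves as if the multiplier were rescaled by the ratio of the two standard deviations.

```python
import numpy as np

EPS = 1e-8  # assumed numerical-stability constant, not taken from the paper


def componentwise_advantage(rewards, costs, lam):
    """Naive variant: z-score reward and cost separately, then combine."""
    a_r = (rewards - rewards.mean()) / (rewards.std() + EPS)
    a_c = (costs - costs.mean()) / (costs.std() + EPS)
    return a_r - lam * a_c


def scalarized_advantage(rewards, costs, lam):
    """Scalarized variant: form r - lam * c first, then z-score once."""
    scalar = rewards - lam * costs
    return (scalar - scalar.mean()) / (scalar.std() + EPS)


def effective_multiplier(rewards, costs, lam):
    """Multiplier the naive variant implicitly applies: lam * std(r) / std(c)."""
    return lam * rewards.std() / costs.std()


def dual_update(lam, costs, budget, lr=0.1):
    """Assumed projected dual ascent on the multiplier (hypothetical step rule)."""
    return max(0.0, lam + lr * (costs.mean() - budget))


# Toy group of 8 rollouts: rollout 1 has the highest reward but violates
# the constraint (indicator cost = 1).
rewards = np.array([10.0, 12.0, 9.0, 11.0, 10.0, 8.0, 11.0, 9.0])
costs = np.array([0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
lam = 0.5

naive = componentwise_advantage(rewards, costs, lam)
scalar = scalarized_advantage(rewards, costs, lam)
lam_eff = effective_multiplier(rewards, costs, lam)

print("best rollout (scalarized):", int(np.argmax(scalar)))
print("best rollout (component-wise):", int(np.argmax(naive)))
print("intended multiplier:", lam, "effective multiplier:", round(float(lam_eff), 4))
```

In this toy group the intended multiplier of 0.5 still leaves rollout 1 with the best scalarized return, whereas the component-wise variant ranks a different rollout first because the rare violation has a small standard deviation and its penalty is inflated relative to the reward term. Which direction the distortion takes depends on the reward and cost spreads in each group.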

