Clipping-Free Policy Optimization for Large Language Models
This work addresses training instability and reward hacking in LLM post-training and offers a drop-in alternative to clipping-based methods. The contribution appears incremental: it modifies an existing approach rather than introducing a new paradigm.
The paper tackles optimization issues in reinforcement learning for large language models by proposing Clipping-Free Policy Optimization (CFPO), which replaces clipping with a convex quadratic penalty so that policy updates stay stable without hard boundaries. CFPO achieves competitive performance on reasoning and alignment tasks while mitigating verbosity exploitation and capability degradation.
Reinforcement learning has become central to post-training large language models, yet dominant algorithms rely on clipping mechanisms that introduce optimization issues at scale, including zero-gradient regions, reward hacking, and training instability. We propose Clipping-Free Policy Optimization (CFPO), which replaces heuristic clipping with a convex quadratic penalty derived from Total Variation divergence constraints, yielding an everywhere-differentiable objective that enforces stable policy updates without hard boundaries. We evaluate CFPO across both reasoning and alignment settings. In reasoning, CFPO matches clipping-based methods on downstream benchmarks while extending the stable training regime. In alignment, CFPO mitigates verbosity exploitation and reduces capability degradation, while achieving competitive instruction-following performance. CFPO requires only a one-line code change and no additional hyperparameters. Our results suggest that CFPO is a promising drop-in alternative to clipping-based methods for LLM post-training.
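The abstract says the clipped surrogate is replaced by a convex quadratic penalty but does not state the formula. The sketch below is therefore an illustration under an assumption, not the paper's implementation: it assumes a per-token objective of the form r·A − |A|/(2ε)·(r − 1)², where r is the probability ratio, A the advantage, and ε the usual clip range. This form was chosen because its maximum sits at the standard clip boundary; the function names are hypothetical. The sketch contrasts the zero-gradient region of clipping with the restoring gradient of the penalty.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Standard PPO-style clipped surrogate for a single token."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


def quadratic_penalty_surrogate(ratio, advantage, eps=0.2):
    """ASSUMED clipping-free form: linear term minus a convex quadratic penalty.

    The exact objective used by CFPO is not given in the abstract; this
    form is an assumption made for illustration only.
    """
    penalty = abs(advantage) / (2.0 * eps) * (ratio - 1.0) ** 2
    return ratio * advantage - penalty


def numeric_grad(fn, ratio, advantage, eps=0.2, h=1e-6):
    """Central-difference derivative of fn with respect to the ratio."""
    upper = fn(ratio + h, advantage, eps)
    lower = fn(ratio - h, advantage, eps)
    return (upper - lower) / (2.0 * h)


if __name__ == "__main__":
    # Ratios inside, at, and beyond the clip range for a positive advantage.
    for ratio in (1.0, 1.2, 1.5):
        g_clip = numeric_grad(clipped_surrogate, ratio, 1.0)
        g_quad = numeric_grad(quadratic_penalty_surrogate, ratio, 1.0)
        print(f"ratio={ratio:.1f}  clip grad={g_clip:+.3f}  penalty grad={g_quad:+.3f}")
```

Under this assumed form, the only difference from the clipped objective is the expression that computes the per-token surrogate, which is consistent with the abstract's claim of a one-line change with no additional hyperparameters. Once the ratio leaves the clip range in the favorable direction, the clipped surrogate returns a zero gradient, while the penalty returns a gradient that pushes the ratio back toward 1.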