LG AI CL · Aug 6, 2025

COPO: Consistency-Aware Policy Optimization

Jinghang Han, Jiawei Chen, Hang Shao, Hao Ma, Mingcheng Li, Xintian Shen, Lihao Zheng, Wei Chen, Tao Wei, Lihua Zhang

arXiv:2508.04138v1 · 1 citation · h-index: 16 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses a specific bottleneck in rule-based reward methods for LLM reasoning, offering an incremental improvement to enhance training efficiency and downstream performance in complex problem-solving tasks.

The paper tackles vanishing gradients in reinforcement learning for LLMs: when multiple responses to a prompt converge to identical outcomes, the group-based advantage becomes zero, which limits training efficiency and performance. It proposes a consistency-aware policy optimization framework that introduces a structured global reward based on outcome consistency, and it reports substantial performance gains on multiple mathematical reasoning benchmarks.

Reinforcement learning has significantly enhanced the reasoning capabilities of Large Language Models (LLMs) in complex problem-solving tasks. Recently, the introduction of DeepSeek R1 has inspired a surge of interest in leveraging rule-based rewards as a low-cost alternative for computing advantage functions and guiding policy optimization. However, a common challenge observed across many replication and extension efforts is that when multiple sampled responses to a single prompt converge to identical outcomes, whether correct or incorrect, the group-based advantage degenerates to zero. This leads to vanishing gradients and renders the corresponding samples ineffective for learning, ultimately limiting training efficiency and downstream performance. To address this issue, we propose a consistency-aware policy optimization framework that introduces a structured global reward based on outcome consistency. The global loss built on this reward ensures that, even when model outputs show high intra-group consistency, the training process still receives meaningful learning signals, which encourages the generation of correct and self-consistent reasoning paths from a global perspective. Furthermore, we incorporate an entropy-based soft blending mechanism that adaptively balances local advantage estimation with global optimization, enabling dynamic transitions between exploration and convergence throughout training. Our method introduces several key innovations in both reward design and optimization strategy. We validate its effectiveness through substantial performance gains on multiple mathematical reasoning benchmarks, highlighting the proposed framework's robustness and general applicability. The code for this work has been released at https://github.com/hijih/copo-code.git.
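The degenerate-advantage problem described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the released code. It assumes a standard group-normalized advantage of the form (reward - group mean) / (group std + eps), and the function names `group_advantages`, `outcome_entropy`, and `blend_weights` are hypothetical. The linear blend in particular is only an assumed stand-in for the paper's entropy-based soft blending, whose exact form the abstract does not give.

```python
import math
from collections import Counter


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps).

    Assumed form of a group-based advantage; when every reward in the
    group is identical, every advantage is exactly zero.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def outcome_entropy(answers):
    """Shannon entropy of the final answers in a group, scaled to [0, 1].

    0.0 means all responses agree; 1.0 means all responses differ.
    """
    counts = Counter(answers)
    n = len(answers)
    if len(counts) <= 1:
        return 0.0
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(n)


def blend_weights(answers):
    """Hypothetical linear blend (an assumption, not the paper's formula).

    High outcome entropy favors the local (group) advantage term;
    low entropy shifts weight to a global consistency-based term.
    """
    h = outcome_entropy(answers)
    return h, 1.0 - h


# All four responses are correct: the group signal vanishes.
print(group_advantages([1.0, 1.0, 1.0, 1.0]))
# Mixed outcomes: the group signal is informative.
print(group_advantages([1.0, 0.0, 0.0, 1.0]))
# Fully consistent answers put all weight on the global term.
print(blend_weights(["4", "4", "4", "4"]))
```

In the first call every advantage is zero, so those samples contribute no gradient under a purely group-based objective. That is the case the proposed global reward is meant to cover.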
