CLIPO: Contrastive Learning in Policy Optimization Generalizes RLVR

Sijia Cui, Pengyu Cheng, Jiajun Song, Yongbo Gai, Guojun Zhang, Zhechao Yu, Jianhe Lin, Xiaoxi Jiang, Guanjun Jiang

arXiv:2603.10101v116.13 citations · h-index: 1 · Has Code

Predicted impact: top 2% in LG · last 90 days · Originality: Incremental advance

AI Analysis

CLIPO addresses step-level reasoning inconsistencies in RLVR for LLMs, improving generalization and robustness. The advance is incremental because it builds on existing RLVR methods.

The paper tackles a limitation of RLVR: it rewards only the final answer, which neglects the correctness of intermediate reasoning and can cause hallucination and answer-copying. The authors incorporate contrastive learning into policy optimization to capture the invariant structure shared across correct reasoning paths, and report consistent improvements over multiple RLVR baselines on diverse reasoning benchmarks.

Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capacity of Large Language Models (LLMs). However, RLVR solely relies on final answers as outcome rewards, neglecting the correctness of intermediate reasoning steps. Training on these process-wrong but outcome-correct rollouts can lead to hallucination and answer-copying, severely undermining the model's generalization and robustness. To address this, we incorporate a Contrastive Learning mechanism into the Policy Optimization (CLIPO) to generalize the RLVR process. By optimizing a contrastive loss over successful rollouts, CLIPO steers the LLM to capture the invariant structure shared across correct reasoning paths. This provides a more robust cross-trajectory regularization than the original single-path supervision in RLVR, effectively mitigating step-level reasoning inconsistencies and suppressing hallucinatory artifacts. In experiments, CLIPO consistently improves multiple RLVR baselines across diverse reasoning benchmarks, demonstrating uniform improvements in generalization and robustness for policy optimization of LLMs. Our code and training recipes are available at https://github.com/Qwen-Applications/CLIPO.
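To make the idea of a contrastive loss over successful rollouts concrete, here is a minimal sketch. It is not the authors' implementation: the abstract does not specify the loss, so this uses a standard supervised-contrastive (InfoNCE-style) form as a stand-in. The function names, the temperature value, and the assumption that each rollout is summarized by one fixed-size embedding vector are all hypothetical. Rollouts sampled for the same prompt are compared: correct rollouts act as anchors and positives for each other, and every other rollout appears in the denominator.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(embeddings, is_correct, temperature=0.1):
    """Supervised-contrastive loss over the rollouts of one prompt.

    Assumption (not from the paper): embeddings[i] is a fixed-size vector
    summarizing rollout i, and is_correct[i] is its verifiable outcome reward.
    Only correct rollouts serve as anchors; their positives are the other
    correct rollouts, and all remaining rollouts enter the denominator.
    """
    n = len(embeddings)
    anchors = [i for i in range(n) if is_correct[i]]
    total, count = 0.0, 0
    for i in anchors:
        positives = [j for j in anchors if j != i]
        if not positives:
            continue  # a lone correct rollout has nothing to be pulled toward
        logits = {
            k: cosine(embeddings[i], embeddings[k]) / temperature
            for k in range(n)
            if k != i
        }
        # Numerically stable log-sum-exp over all other rollouts.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits.values()))
        total += -sum(logits[p] - log_denom for p in positives) / len(positives)
        count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    correct = [True, True, False]
    aligned = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]    # correct rollouts agree
    scattered = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]  # correct rollouts disagree
    print(contrastive_loss(aligned, correct))
    print(contrastive_loss(scattered, correct))
```

The loss is small when the correct rollouts sit close together in embedding space and large when a correct rollout resembles an incorrect one more than its correct peers. In a training loop, a term like this would presumably be added to the RLVR policy objective with a weighting coefficient; how CLIPO actually combines the two is described in the paper and repository, not here.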
