LG · AI · CL · Mar 10

CLIPO: Contrastive Learning in Policy Optimization Generalizes RLVR

arXiv:2603.10101v141.43 citations · h-index: 3 · Has Code
Predicted impact top 2% in LG · last 90 days · Originality: Incremental advance
AI Analysis

This paper addresses step-level reasoning inconsistencies in RLVR for LLMs and reports improved generalization and robustness. It is rated incremental because it builds on existing RLVR methods.

The paper tackles RLVR's reliance on final-answer rewards alone, which ignores whether the intermediate reasoning is correct and can lead to hallucination and answer-copying. It adds contrastive learning to policy optimization so that the model captures the invariant structure shared across correct reasoning paths, and it reports consistent improvements over multiple RLVR baselines on diverse reasoning benchmarks.

Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capacity of Large Language Models (LLMs). However, RLVR solely relies on final answers as outcome rewards, neglecting the correctness of intermediate reasoning steps. Training on these process-wrong but outcome-correct rollouts can lead to hallucination and answer-copying, severely undermining the model's generalization and robustness. To address this, we incorporate a Contrastive Learning mechanism into the Policy Optimization (CLIPO) to generalize the RLVR process. By optimizing a contrastive loss over successful rollouts, CLIPO steers the LLM to capture the invariant structure shared across correct reasoning paths. This provides a more robust cross-trajectory regularization than the original single-path supervision in RLVR, effectively mitigating step-level reasoning inconsistencies and suppressing hallucinatory artifacts. In experiments, CLIPO consistently improves multiple RLVR baselines across diverse reasoning benchmarks, demonstrating uniform improvements in generalization and robustness for policy optimization of LLMs. Our code and training recipes are available at https://github.com/Qwen-Applications/CLIPO.
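The abstract does not spell out the form of the contrastive loss, so the following is only a minimal sketch of the general idea, assuming a supervised-contrastive (InfoNCE-style) objective: rollouts with a verified-correct final answer are pulled together in an embedding space, and failed rollouts act as negatives. The function name `contrastive_loss_over_successes`, the `temperature` parameter, and the use of one fixed-size embedding per rollout are all hypothetical choices made for illustration; they are not taken from the paper or its repository.

```python
import math


def _cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def contrastive_loss_over_successes(embeddings, rewards, temperature=0.1):
    """Hypothetical supervised-contrastive loss over rollouts of one prompt.

    embeddings: list of vectors, one per rollout (assumed to come from the
                policy, e.g. a pooled hidden state; this is an assumption).
    rewards:    list of 0/1 verifiable outcome rewards, one per rollout.

    Each successful rollout is an anchor, the other successful rollouts are
    its positives, and every other rollout appears in the denominator.
    """
    n = len(embeddings)
    successes = [i for i in range(n) if rewards[i] == 1]
    if len(successes) < 2:
        # No positive pair exists, so there is nothing to contrast.
        return 0.0

    total = 0.0
    for i in successes:
        # Temperature-scaled similarity of the anchor to every other rollout.
        logits = {
            k: _cosine(embeddings[i], embeddings[k]) / temperature
            for k in range(n)
            if k != i
        }
        # Numerically stable log-sum-exp for the softmax denominator.
        max_logit = max(logits.values())
        log_denom = max_logit + math.log(
            sum(math.exp(v - max_logit) for v in logits.values())
        )
        positives = [p for p in successes if p != i]
        # Average negative log-probability of picking a positive.
        total += -sum(logits[p] - log_denom for p in positives) / len(positives)

    return total / len(successes)


# Two successful rollouts that agree, plus one failed rollout pointing elsewhere.
aligned = contrastive_loss_over_successes(
    [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]], [1, 1, 0]
)
# Two successful rollouts that disagree, with a failed one matching the first.
scattered = contrastive_loss_over_successes(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]], [1, 1, 0]
)
print(aligned < scattered)
```

In a sketch like this, the term would presumably be added to the usual policy-optimization objective with some weighting coefficient; how CLIPO actually combines the two is specified in the paper and code, not here.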

