AI | Oct 16, 2025

SimKO: Simple Pass@K Policy Optimization

Ruotian Peng, Yi Ren, Zhouliang Yu, Weiyang Liu, Yandong Wen

arXiv:2510.14807v2 | 11 citations | h-index: 10

Originality: Incremental advance

AI Analysis

This work addresses a key limitation of RLVR for LLMs by improving exploration in reasoning tasks, though it is an incremental advance over existing methods.

The paper tackles a systematic bias in reinforcement learning with verifiable rewards (RLVR) for large language models: prevailing methods improve pass@1 but reduce pass@K (K>1) because probability mass concentrates on the top-1 token candidate. SimKO mitigates this by adjusting token probabilities asymmetrically, and it consistently yields higher pass@K across math and logical-reasoning benchmarks.
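For readers unfamiliar with the metric, pass@K is the probability that at least one of K sampled responses is verified correct. The sketch below uses the widely used unbiased combinatorial estimator, 1 - C(n-c, K) / C(n, K), computed from n samples of which c are correct. The function name `pass_at_k` and the sample counts are illustrative assumptions; this listing does not say which estimator the paper uses.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled responses, c of which are correct.

    Assumption: the standard unbiased estimator 1 - C(n-c, k) / C(n, k),
    i.e. one minus the chance that k draws without replacement are all wrong.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical counts: 16 samples for one problem, 2 of them correct.
    for k in (1, 4, 16):
        print(f"pass@{k} = {pass_at_k(16, 2, k):.3f}")
```

The estimate for a fixed problem can only grow with K, so a policy that gains pass@1 while losing pass@K has typically lost correct answers on some problems it previously solved occasionally.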

Reinforcement learning with verifiable rewards (RLVR) has advanced the reasoning capabilities of large language models (LLMs). However, prevailing RLVR methods exhibit a systematic bias toward exploitation over exploration, as evidenced by improved pass@1 but reduced pass@K (K>1) performance. To understand this issue, we analyze training dynamics of RLVR methods by tracking the token-level probability distributions over vocabulary candidates. Our analysis reveals a consistent probability concentration effect where the top-1 candidate increasingly accumulates probability mass and suppresses that of other candidates. More importantly, stronger over-concentration correlates with worse pass@K performance. Inspired by this finding, we propose Simple Pass@K Optimization (SimKO), a method designed to mitigate the over-concentration issue, thereby encouraging exploration. SimKO operates in an asymmetrical manner. For verified-correct responses, it boosts the probabilities of the top-K candidates. For verified-incorrect responses, it applies stronger penalties to the top-1 candidate. We observe that this asymmetric design is particularly effective at mitigating over-concentration when applied at tokens with high entropy. Across various math and logical-reasoning benchmarks, SimKO consistently yields higher pass@K for a wide range of K, providing a simple way to improve RLVR's exploration.
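The probability concentration effect described in the abstract can be monitored with simple per-token statistics. The following is a minimal sketch, not the authors' code: it assumes access to the logits over vocabulary candidates at a single token position and reports the top-1 probability, the top-K probability mass, and the entropy. The helper names `softmax` and `concentration_stats` and the example logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def concentration_stats(logits, top_k=3):
    """Summarize how concentrated one next-token distribution is.

    Returns the top-1 probability, the mass of the top_k candidates,
    and the Shannon entropy in nats. Illustrative diagnostics only.
    """
    probs = sorted(softmax(logits), reverse=True)
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return {
        "top1": probs[0],
        "topk_mass": sum(probs[:top_k]),
        "entropy": entropy,
    }


if __name__ == "__main__":
    # Hypothetical logits at the same position early and late in training.
    early = [2.0, 1.5, 1.0, 0.5]
    late = [6.0, 1.5, 1.0, 0.5]
    print("early:", concentration_stats(early))
    print("late: ", concentration_stats(late))
```

A rising top-1 probability with falling entropy at the same positions over training would be one signal of the over-concentration the abstract associates with worse pass@K.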
