LG, AI, CL | Aug 14, 2025

Pass@k Training for Adaptively Balancing Exploration and Exploitation of Large Reasoning Models

arXiv:2508.10751v1 | 113 citations | h-index: 12
Originality: Incremental advance
AI Analysis

This work addresses a key challenge in optimizing large language models for reasoning tasks: it offers a training method under which exploration and exploitation need not conflict. It appears incremental, however, because it builds on the existing Pass@k evaluation metric.

The paper tackles the problem of balancing exploration and exploitation in reinforcement learning with verifiable rewards (RLVR) for large reasoning models. It proposes Pass@k Training, which uses Pass@k as the reward to improve exploration ability, and derives an analytical solution for the advantage so that training is efficient.

Reinforcement learning with verifiable rewards (RLVR), which typically adopts Pass@1 as the reward, struggles to balance exploration and exploitation: policies come to prefer conservative actions and converge to a local optimum. Identifying an appropriate reward metric is therefore crucial. Although prior work has used Pass@k in evaluation, its connection to LLM exploration ability in RLVR remains largely overlooked. To investigate this, we first use Pass@k as the reward to train the policy model (i.e., Pass@k Training) and observe an improvement in its exploration ability. Next, we derive an analytical solution for the advantage of Pass@k Training, leading to an efficient and effective process. Building on this, our analysis reveals that exploration and exploitation are not inherently conflicting objectives; rather, they can mutually enhance each other. Moreover, Pass@k Training with analytical derivation essentially involves directly designing the advantage function. Inspired by this, we preliminarily explore advantage design for RLVR, showing promising results and highlighting a potential future direction.
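
For readers unfamiliar with the metric, the sketch below computes the standard unbiased Pass@k estimate commonly used in evaluation: given n sampled responses to a prompt, c of which are verified correct, it is the probability that a random subset of k responses contains at least one correct answer. This is an illustration of the metric only, not the paper's training procedure or its analytical advantage; the function name `pass_at_k` and its arguments are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n sampled responses, c of them correct.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset contains no correct response. math.comb returns 0
    # when k > n - c, in which case the estimate is exactly 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 8 rollouts for one prompt, 2 of them correct.
    print(pass_at_k(8, 2, 1))  # Pass@1
    print(pass_at_k(8, 2, 4))  # Pass@4
```

With 8 rollouts and 2 correct, Pass@1 is 0.25 while Pass@4 is 1 - 15/70, roughly 0.79. A prompt that is only occasionally solved therefore scores much higher under Pass@k than under Pass@1, which is one way to see why a Pass@k reward might penalize diverse, mostly unsuccessful attempts less harshly.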

