LG | Oct 27, 2025

The Best of N Worlds: Aligning Reinforcement Learning with Best-of-N Sampling via max@k Optimisation

arXiv:2510.23393v1 | 6 citations | h-index: 11
Originality: Incremental advance
AI Analysis

This work addresses a specific bottleneck in RL fine-tuning of LLMs for mathematical and coding domains, though it appears incremental, extending existing RLVR methods.

The paper tackles the problem that reinforcement learning fine-tuning harms a model's exploration ability: generation diversity decreases, which degrades Best-of-N sampling performance at large N. The proposed method directly optimizes the max@k metric and is shown to align the model with Best-of-N inference in off-policy scenarios.

The application of Reinforcement Learning with Verifiable Rewards (RLVR) to mathematical and coding domains has demonstrated significant improvements in the reasoning and problem-solving abilities of Large Language Models. Despite its success in single-generation problem solving, the reinforcement learning fine-tuning process may harm the model's exploration ability, as reflected in decreased diversity of generations and a resulting degradation of performance during Best-of-N sampling for large N values. In this work, we focus on optimizing the max@k metric, a continuous generalization of pass@k. We derive an unbiased on-policy gradient estimate for direct optimization of this metric. Furthermore, we extend our derivations to off-policy updates, a common element in modern RLVR algorithms that allows better sample efficiency. Empirically, we show that our objective effectively optimizes the max@k metric in off-policy scenarios, aligning the model with the Best-of-N inference strategy.
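
To make the metric concrete, the sketch below estimates max@k from n sampled rewards, assuming max@k means the expected maximum reward among k independent generations. It uses the standard order-statistics estimator (an average of the maximum over all size-k subsets of the n samples), which may differ from the exact estimator derived in the paper. The function names `max_at_k`, `max_at_k_bruteforce`, and `pass_at_k` are hypothetical and chosen for illustration only.

```python
from itertools import combinations
from math import comb


def max_at_k(rewards, k):
    """Estimate E[max of k rewards] from n >= k sampled rewards.

    Assumption: this is the textbook subset-averaging estimator,
    not necessarily the estimator used in the paper.
    """
    n = len(rewards)
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    ordered = sorted(rewards)
    # The i-th smallest reward (1-indexed) is the maximum of exactly
    # C(i-1, k-1) of the C(n, k) size-k subsets; comb() returns 0 for i < k.
    weighted = sum(comb(i - 1, k - 1) * r for i, r in enumerate(ordered, start=1))
    return weighted / comb(n, k)


def max_at_k_bruteforce(rewards, k):
    """Same quantity by explicit enumeration of subsets (check only)."""
    subsets = list(combinations(rewards, k))
    return sum(max(s) for s in subsets) / len(subsets)


def pass_at_k(n, c, k):
    """Standard pass@k estimate for c correct samples out of n."""
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    continuous = [0.2, 0.9, 0.5, 0.0]   # illustrative rewards in [0, 1]
    binary = [1, 0, 0, 1, 0]            # illustrative pass/fail rewards
    print(max_at_k(continuous, 2), max_at_k_bruteforce(continuous, 2))
    print(max_at_k(binary, 2), pass_at_k(len(binary), sum(binary), 2))
```

With binary rewards the estimate coincides with pass@k, which is the sense in which max@k generalizes it to continuous rewards.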

