ML, LG, Jul 8, 2025

Best-of-N through the Smoothing Lens: KL Divergence and Regret Analysis

arXiv:2507.05913v1, 9 citations, h-index: 12
Originality: Incremental advance
AI Analysis

This work addresses the challenge of aligning generative models with human preferences at inference time. It is an incremental advance: it builds on existing BoN methods by analyzing a smoothed variant and deriving theoretical bounds for it.

The paper studies inference-time alignment of generative models with Best-of-N (BoN) and its smoothed variant, Soft Best-of-N (SBoN). It analyzes the KL divergence from the reference policy and the regret of the SBoN policy, and shows that smoothing mitigates reward overoptimization, especially when the proxy reward is of low quality.

A simple yet effective method for inference-time alignment of generative models is Best-of-$N$ (BoN), where $N$ outcomes are sampled from a reference policy, evaluated using a proxy reward model, and the highest-scoring one is selected. While prior work argues that BoN is almost optimal in reward vs KL tradeoffs, the effectiveness of BoN depends critically on the quality of the proxy reward model used for selection. For this purpose, we study BoN through a smooth version known as Soft Best-of-N (SBoN) and develop a theoretical framework to address this gap. We analyze the scaling behaviour of BoN by providing bounds on the KL divergence between the SBoN policy and the reference policy, offering insights into how performance varies with the number of samples. We also study the regret gap, i.e., the gap between the expected true reward under the optimal policy and the SBoN policy. Our theoretical and empirical findings show that smoothing helps SBoN mitigate reward overoptimization, especially when the quality of the proxy reward is low.
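The selection step described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. The abstract does not give the SBoN sampling rule, so the sketch assumes the common softmax form: pick candidate i with probability proportional to exp(beta * r(y_i)), where r is the proxy reward. The names `best_of_n`, `soft_best_of_n`, `proxy_reward`, and the inverse-temperature parameter `beta` are hypothetical.

```python
import math
import random


def best_of_n(candidates, proxy_reward):
    """Hard BoN: return the candidate with the highest proxy reward."""
    return max(candidates, key=proxy_reward)


def soft_best_of_n(candidates, proxy_reward, beta, rng=random):
    """Soft BoN (assumed softmax form): sample one candidate with
    probability proportional to exp(beta * proxy_reward(candidate))."""
    scores = [beta * proxy_reward(c) for c in candidates]
    top = max(scores)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp(s - top) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy reference policy: N draws from a standard normal (an assumption).
    samples = [rng.gauss(0.0, 1.0) for _ in range(8)]

    def toy_reward(y):
        # Toy proxy reward, chosen only for illustration.
        return y

    print("BoN: ", best_of_n(samples, toy_reward))
    print("SBoN:", soft_best_of_n(samples, toy_reward, beta=2.0, rng=rng))
```

In this sketch, `beta = 0` gives uniform weights, so the output is just a draw from the reference policy, while a very large `beta` recovers hard BoN. Intermediate values trust the proxy reward only partially, which is the smoothing the abstract credits with mitigating reward overoptimization.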
