CL, LG · Jun 16, 2025

Min-p, Max Exaggeration: A Critical Analysis of Min-p Sampling in Language Models

Rylan Schaeffer, Joshua Kazdan, Yegor Denisov-Blanch

arXiv:2506.13681v2 · 7 citations · h-index: 19

Originality: Synthesis-oriented

AI Analysis

The paper addresses the reliability of claims in ML research, which matters to researchers and practitioners who rely on sampling methods. Its contribution is incremental, since it critiques an existing study rather than proposing a new method.

This paper critically re-examines the evidence for min-p sampling in language models, finding that it does not outperform established baselines in quality, diversity, or their trade-off, based on reanalysis of human evaluations, benchmarks, and community adoption claims.

Sampling from language models impacts the quality and diversity of outputs, affecting both research and real-world applications. Recently, Nguyen et al. (2024), in "Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs", introduced a new sampler called min-p, claiming it achieves superior quality and diversity over established samplers such as basic, top-k, and top-p sampling. The significance of these claims was underscored by the paper's recognition as the 18th highest-scoring submission to ICLR 2025 and its selection for an Oral presentation. This paper conducts a comprehensive re-examination of the evidence supporting min-p and reaches different conclusions from the original paper's four lines of evidence. First, the original paper's human evaluations omitted data, conducted statistical tests incorrectly, and described qualitative feedback inaccurately. Our reanalysis demonstrates that min-p did not outperform baselines in quality, diversity, or a trade-off between quality and diversity. In response to our findings, the authors of the original paper conducted a new human evaluation using a different implementation, task, and rubric, which nevertheless provides further evidence that min-p does not improve over baselines. Second, comprehensively sweeping the original paper's NLP benchmarks reveals that min-p does not surpass baselines when controlling for the number of hyperparameters. Third, the original paper's LLM-as-a-Judge evaluations lack methodological clarity and appear inconsistently reported. Fourth, community adoption claims (49k GitHub repositories, 1.1M GitHub stars) were found to be unsubstantiated, leading to their removal; the revised adoption claim remains misleading. We conclude that the evidence presented in the original paper fails to support claims that min-p improves quality, diversity, or a trade-off between quality and diversity.
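
For readers unfamiliar with the samplers being compared, the following is a minimal sketch of how top-k, top-p, and min-p truncate a next-token distribution before sampling. It is not taken from either paper's code: the function names are hypothetical, and the min-p rule shown (keep tokens whose probability is at least `p_base` times the largest probability) is an assumption based on how the method is commonly described.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; higher temperature flattens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def renormalize(probs, keep):
    """Zero out tokens not marked in `keep` and rescale the rest to sum to 1."""
    total = sum(p for p, k in zip(probs, keep) if k)
    return [p / total if k else 0.0 for p, k in zip(probs, keep)]


def ranked_indices(probs):
    """Token indices sorted from most to least probable."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)


def top_k_filter(probs, k):
    """Keep only the k most probable tokens."""
    kept = set(ranked_indices(probs)[:k])
    return renormalize(probs, [i in kept for i in range(len(probs))])


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    kept, cumulative = set(), 0.0
    for i in ranked_indices(probs):
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize(probs, [i in kept for i in range(len(probs))])


def min_p_filter(probs, p_base):
    """Keep tokens with probability >= p_base * max(probs) (assumed min-p rule)."""
    threshold = p_base * max(probs)
    return renormalize(probs, [q >= threshold for q in probs])


def sample(probs, rng=random):
    """Draw one token index according to the (filtered) probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    probs = softmax([2.0, 1.5, 0.5, -1.0], temperature=1.0)
    for name, filtered in [
        ("top-k (k=2)", top_k_filter(probs, 2)),
        ("top-p (p=0.9)", top_p_filter(probs, 0.9)),
        ("min-p (p_base=0.2)", min_p_filter(probs, 0.2)),
    ]:
        print(name, [round(q, 3) for q in filtered], "->", sample(filtered))
```

Each truncation rule has one hyperparameter in addition to temperature, which is why the benchmark comparison in the abstract is framed as controlling for the number of hyperparameters.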
