Optimal Bayesian Stopping for Efficient Inference of Consistent LLM Answers
This work addresses efficiency in LLM inference for math and reasoning tasks, offering a practical cost-saving method with incremental improvements over prior approaches.
The paper tackles the problem of reducing sampling costs in LLM inference by introducing a Bayesian stopping strategy that halts sampling once sufficient answer consistency is reached, cutting the number of LLM calls by up to 50% while maintaining similar accuracy.
A simple strategy for improving LLM accuracy, especially in math and reasoning problems, is to sample multiple responses and submit the answer most consistently reached. In this paper we leverage Bayesian prior information to save on sampling costs, stopping once sufficient consistency is reached. Because the exact posterior is computationally intractable, we further introduce an efficient "L-aggregated" stopping policy that tracks only the L-1 most frequent answer counts. Theoretically, we prove that L=3 is all you need: this coarse approximation is sufficient to achieve asymptotic optimality and strictly dominates prior-free baselines, while keeping the posterior computation fast. Empirically, this policy identifies the most consistent (i.e., mode) LLM answer using fewer samples, and can achieve similar answer accuracy while cutting the number of LLM calls (i.e., saving on LLM inference costs) by up to 50%.
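To make the sample-then-stop idea concrete, the following is a minimal sketch rather than the paper's policy. It tracks only the two most frequent answer counts (matching L-1 with L=3) and stops once a simple Beta posterior says the leader is likely ahead of the runner-up. The uniform Beta(1, 1) prior, the 0.95 threshold, and the names `prob_leader_ahead`, `sample_until_consistent`, and `sample_answer` are all illustrative assumptions; the paper's actual prior and L-aggregated posterior are not reproduced here.

```python
from collections import Counter
from math import comb


def prob_leader_ahead(c1, c2):
    """P(p > 0.5) for p ~ Beta(c1 + 1, c2 + 1).

    Assumes a uniform Beta(1, 1) prior on the leader's share of the
    leader-vs-runner-up samples (an illustrative choice, not the paper's prior).
    For integer parameters the Beta tail reduces to a binomial sum.
    """
    a, b = c1 + 1, c2 + 1
    n = a + b - 1
    return sum(comb(n, k) for k in range(a)) / 2 ** n


def sample_until_consistent(sample_answer, max_samples=20, threshold=0.95):
    """Draw answers until the leading answer is confidently ahead.

    `sample_answer` is a hypothetical zero-argument callable standing in
    for one LLM call that returns a final answer. Returns the most
    frequent answer and the number of samples drawn.
    """
    counts = Counter()
    for n in range(1, max_samples + 1):
        counts[sample_answer()] += 1
        top = counts.most_common(2)  # only the two largest counts are used
        c1 = top[0][1]
        c2 = top[1][1] if len(top) > 1 else 0
        if prob_leader_ahead(c1, c2) >= threshold:
            break
    return counts.most_common(1)[0][0], n
```

Under these assumptions, a sampler that always returns the same answer stops after four draws instead of using the full budget, while a sampler whose top two answers stay tied runs until `max_samples`.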