ReliableEval: A Recipe for Stochastic LLM Evaluation via Method of Moments
This work addresses evaluation reliability for LLM developers and researchers, though it appears incremental, since it builds on existing stochastic evaluation concepts.
To address unreliable LLM evaluation caused by prompt sensitivity, the authors propose a stochastic method-of-moments evaluation over meaning-preserving prompt perturbations. They find that even top models such as GPT-4o and Claude-3.7-Sonnet exhibit substantial sensitivity.
LLMs are highly sensitive to prompt phrasing, yet standard benchmarks typically report performance using a single prompt, raising concerns about the reliability of such evaluations. In this work, we argue for a stochastic method of moments evaluation over the space of meaning-preserving prompt perturbations. We introduce a formal definition of reliable evaluation that accounts for prompt sensitivity, and suggest ReliableEval, a method for estimating the number of prompt resamplings needed to obtain meaningful results. Using our framework, we stochastically evaluate five frontier LLMs and find that even top-performing models like GPT-4o and Claude-3.7-Sonnet exhibit substantial prompt sensitivity. Our approach is model-, task-, and metric-agnostic, offering a recipe for meaningful and robust LLM evaluation.
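The abstract does not spell out the estimator, so the following is only a minimal sketch of what a moment-based evaluation over prompt resamplings could look like. It assumes that each resampled prompt yields one scalar score, that the first two moments (mean and variance) are the quantities of interest, and that the number of resamplings is chosen with a normal-approximation confidence interval. The names `moment_estimates`, `resamplings_needed`, `stochastic_eval`, `score_fn`, and `perturb_fn` are hypothetical and do not come from the paper.

```python
import math
import random
import statistics


def moment_estimates(scores):
    """Return the sample mean and unbiased sample variance of the scores."""
    mean = statistics.fmean(scores)
    # Variance is undefined for a single sample; report 0.0 in that case.
    var = statistics.variance(scores) if len(scores) > 1 else 0.0
    return mean, var


def resamplings_needed(variance, half_width, z=1.96):
    """Resamplings needed so the mean's confidence interval has the given half-width.

    Assumption: normal approximation, i.e. n >= z^2 * variance / half_width^2.
    """
    if half_width <= 0:
        raise ValueError("half_width must be positive")
    return max(1, math.ceil(z * z * variance / half_width ** 2))


def stochastic_eval(score_fn, perturb_fn, n_resamples, seed=0):
    """Score a model on n_resamples randomly perturbed prompts.

    perturb_fn(rng) is a hypothetical callable returning one meaning-preserving
    paraphrase; score_fn(prompt) is a hypothetical callable returning its score.
    """
    rng = random.Random(seed)
    scores = [score_fn(perturb_fn(rng)) for _ in range(n_resamples)]
    return moment_estimates(scores)


# Pilot scores from four paraphrases of the same task (made-up numbers).
pilot_mean, pilot_var = moment_estimates([0.6, 0.8, 0.7, 0.9])
n_needed = resamplings_needed(pilot_var, half_width=0.05)
```

In this sketch, a small pilot run supplies the variance estimate, which then sets how many resamplings a full evaluation would need under the stated approximation. A model with higher prompt sensitivity has higher variance and therefore needs more resamplings to reach the same precision.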