CL · May 28, 2025

ReliableEval: A Recipe for Stochastic LLM Evaluation via Method of Moments

Gili Lior, Eliya Habba, Shahar Levy, Avi Caciularu, Gabriel Stanovsky

arXiv:2505.22169v2 · 13.97 citations · h-index: 7 · EMNLP

Originality: Incremental advance

AI Analysis

The paper addresses evaluation reliability, a concern for LLM developers and researchers, though it appears incremental because it builds on existing stochastic evaluation concepts.

The authors address unreliable LLM evaluation caused by prompt sensitivity. They propose a stochastic method-of-moments evaluation over meaning-preserving prompt perturbations and find that even top models like GPT-4o and Claude-3.7-Sonnet exhibit substantial sensitivity.

LLMs are highly sensitive to prompt phrasing, yet standard benchmarks typically report performance using a single prompt, raising concerns about the reliability of such evaluations. In this work, we argue for a stochastic method of moments evaluation over the space of meaning-preserving prompt perturbations. We introduce a formal definition of reliable evaluation that accounts for prompt sensitivity, and suggest ReliableEval, a method for estimating the number of prompt resamplings needed to obtain meaningful results. Using our framework, we stochastically evaluate five frontier LLMs and find that even top-performing models like GPT-4o and Claude-3.7-Sonnet exhibit substantial prompt sensitivity. Our approach is model-, task-, and metric-agnostic, offering a recipe for meaningful and robust LLM evaluation.
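As an illustration only, the idea of estimating moments over prompt resamplings can be sketched in Python. This is not the authors' procedure: the function names (`moment_estimates`, `resamplings_needed`, `simulate_scores`), the normal-approximation stopping rule, and the simulated scores are all assumptions made for this sketch. In practice, each score would come from running the benchmark once with a different meaning-preserving paraphrase of the prompt.

```python
import math
import random
import statistics


def moment_estimates(scores):
    """Return the sample mean and unbiased sample variance of per-resampling scores."""
    mean = statistics.fmean(scores)
    var = statistics.variance(scores) if len(scores) > 1 else 0.0
    return mean, var


def resamplings_needed(variance, half_width, z=1.96):
    """Smallest n with z * sqrt(variance / n) <= half_width.

    Assumption: a normal approximation for the mean score, chosen here
    for simplicity; the paper may use a different criterion.
    """
    if half_width <= 0:
        raise ValueError("half_width must be positive")
    if variance == 0:
        return 1
    return max(1, math.ceil(variance * (z / half_width) ** 2))


def simulate_scores(n, seed=0):
    """Hypothetical stand-in for n benchmark runs, one per prompt paraphrase."""
    rng = random.Random(seed)
    # Simulated accuracies around 0.70, clipped to the valid range [0, 1].
    return [min(1.0, max(0.0, rng.gauss(0.70, 0.05))) for _ in range(n)]


# Pilot study: a small number of resamplings to estimate the first two moments.
pilot = simulate_scores(20)
mean, var = moment_estimates(pilot)

# Resamplings needed for an approximate 95% interval of +/- 1 accuracy point.
n_needed = resamplings_needed(var, half_width=0.01)
print(f"pilot mean={mean:.3f}, variance={var:.5f}, resamplings needed={n_needed}")
```

Under these assumptions, a model whose score varies more across paraphrases needs more resamplings before its reported mean can be considered stable, which is the intuition behind reporting a distribution rather than a single-prompt score.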

