CL | AI | LG | Sep 1, 2025

Flaw or Artifact? Rethinking Prompt Sensitivity in Evaluating LLMs

arXiv:2509.01790v1 | 13 citations | h-index: 3 | EMNLP
Originality: Incremental advance
AI Analysis

This challenges the widely held assumption that prompt sensitivity is a core limitation of LLMs, which could change how researchers and practitioners assess model robustness.

The study re-evaluates prompt sensitivity in large language models (LLMs) and finds that much of the reported sensitivity is due to heuristic evaluation methods, with LLM-as-a-Judge evaluations reducing performance variance and improving model ranking consistency across prompts.

Prompt sensitivity, referring to the phenomenon where paraphrasing (i.e., repeating something written or spoken using different words) leads to significant changes in large language model (LLM) performance, has been widely accepted as a core limitation of LLMs. In this work, we revisit this issue and ask: Is the widely reported high prompt sensitivity truly an inherent weakness of LLMs, or is it largely an artifact of evaluation processes? To answer this question, we systematically evaluate 7 LLMs (e.g., GPT and Gemini family) across 6 benchmarks, including both multiple-choice and open-ended tasks on 12 diverse prompt templates. We find that much of the prompt sensitivity stems from heuristic evaluation methods, including log-likelihood scoring and rigid answer matching, which often overlook semantically correct responses expressed through alternative phrasings, such as synonyms or paraphrases. When we adopt LLM-as-a-Judge evaluations, we observe a substantial reduction in performance variance and a consistently higher correlation in model rankings across prompts. Our findings suggest that modern LLMs are more robust to prompt templates than previously believed, and that prompt sensitivity may be more an artifact of evaluation than a flaw in the models.
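The two quantities the abstract compares, performance spread across prompt templates and agreement of model rankings across templates, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the abstract does not say which statistics the authors use, so the population standard deviation and the mean pairwise Spearman correlation are assumptions, and the model names, template names, and accuracy values are made up for the example.

```python
from itertools import combinations
from statistics import mean, pstdev


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def spread_per_model(scores):
    """scores[model][template] -> accuracy. Std dev across templates, per model."""
    return {m: pstdev(list(by_t.values())) for m, by_t in scores.items()}


def mean_rank_correlation(scores):
    """Average Spearman correlation of model scores over all template pairs."""
    models = sorted(scores)
    templates = sorted(next(iter(scores.values())))
    cols = {t: [scores[m][t] for m in models] for t in templates}
    return mean(spearman(cols[a], cols[b]) for a, b in combinations(templates, 2))


# Hypothetical accuracies (invented for illustration, not results from the paper).
rigid_match = {
    "model_a": {"t1": 0.60, "t2": 0.75, "t3": 0.50},
    "model_b": {"t1": 0.65, "t2": 0.55, "t3": 0.70},
    "model_c": {"t1": 0.40, "t2": 0.60, "t3": 0.45},
}
llm_judge = {
    "model_a": {"t1": 0.80, "t2": 0.82, "t3": 0.79},
    "model_b": {"t1": 0.74, "t2": 0.75, "t3": 0.73},
    "model_c": {"t1": 0.60, "t2": 0.62, "t3": 0.61},
}

for name, table in [("rigid matching", rigid_match), ("LLM-as-a-Judge", llm_judge)]:
    print(name, spread_per_model(table), mean_rank_correlation(table))
```

Under this reading, the abstract's claim corresponds to a smaller per-model spread and a rank correlation closer to 1 when the same responses are scored by a judge model instead of by rigid answer matching.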

