Within-Model vs Between-Prompt Variability in Large Language Models for Creative Tasks
This work addresses how researchers and practitioners should evaluate LLM performance. It shows that single-sample evaluations risk conflating sampling noise with genuine prompt or model effects; the contribution is incremental but offers concrete guidance.
The study quantifies the sources of output variance in large language models on creative tasks. Prompts explain 36.43% of the variance in originality, comparable to model choice at 40.94%, whereas model choice dominates fluency variance at 51.25%, with prompts explaining only 4.22%.
How much of LLM output variance is explained by prompts versus model choice versus stochasticity through sampling? We answer this by evaluating 12 LLMs on 10 creativity prompts with 100 samples each (N = 12,000). For output quality (originality), prompts explain 36.43% of variance, comparable to model choice (40.94%). But for output quantity (fluency), model choice (51.25%) and within-LLM variance (33.70%) dominate, with prompts explaining only 4.22%. Prompts are powerful levers for steering output quality, but given the substantial within-LLM variance (10-34%), single-sample evaluations risk conflating sampling noise with genuine prompt or model effects.
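To make the idea of attributing variance to prompts, models, and sampling concrete, here is a minimal sketch of one common way to do it: a sum-of-squares decomposition for a balanced models × prompts × samples design. This is an illustration under stated assumptions, not necessarily the estimator used in the study, which may rely on a different method such as mixed-model variance components. The function name `variance_shares`, the effect sizes, and the synthetic scores are all hypothetical; only the 12 × 10 × 100 design shape is taken from the text.

```python
import random
from statistics import fmean


def variance_shares(scores):
    """Split the total sum of squares of a balanced crossed design.

    scores[m][p] is the list of per-sample scores for model m on prompt p.
    Every cell must hold the same number of samples, and the scores must
    not all be identical (otherwise the total sum of squares is zero).
    Returns each source's share of the total sum of squares.
    """
    n_models = len(scores)
    n_prompts = len(scores[0])
    n_samples = len(scores[0][0])

    cell_means = [[fmean(cell) for cell in row] for row in scores]
    model_means = [fmean(row) for row in cell_means]
    prompt_means = [
        fmean([cell_means[m][p] for m in range(n_models)])
        for p in range(n_prompts)
    ]
    # Balanced design: the mean of the model means is the grand mean.
    grand = fmean(model_means)

    ss_model = n_prompts * n_samples * sum((x - grand) ** 2 for x in model_means)
    ss_prompt = n_models * n_samples * sum((x - grand) ** 2 for x in prompt_means)
    # Within-cell spread: repeated samples from the same model and prompt.
    ss_within = sum(
        (y - cell_means[m][p]) ** 2
        for m in range(n_models)
        for p in range(n_prompts)
        for y in scores[m][p]
    )
    ss_total = sum((y - grand) ** 2 for row in scores for cell in row for y in cell)
    # Whatever remains is the model-by-prompt interaction.
    ss_interaction = ss_total - ss_model - ss_prompt - ss_within

    parts = {
        "model": ss_model,
        "prompt": ss_prompt,
        "interaction": ss_interaction,
        "within": ss_within,
    }
    return {name: value / ss_total for name, value in parts.items()}


# Synthetic scores only (NOT the study's data): 12 models, 10 prompts,
# 100 samples per cell, with made-up effect sizes.
rng = random.Random(0)
model_effect = [rng.gauss(0, 1.0) for _ in range(12)]
prompt_effect = [rng.gauss(0, 1.0) for _ in range(10)]
scores = [
    [[model_effect[m] + prompt_effect[p] + rng.gauss(0, 1.0) for _ in range(100)]
     for p in range(10)]
    for m in range(12)
]

shares = variance_shares(scores)
for name, share in shares.items():
    print(f"{name}: {share:.2%}")
```

With one sample per cell, the within-cell term is zero by construction and sampling noise is folded into the interaction term, which is one way to see why single-sample evaluations cannot separate noise from genuine prompt or model effects.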