CL AI · Sep 18, 2025

The Inadequacy of Offline LLM Evaluations: A Need to Account for Personalization in Model Behavior

Angelina Wang, Daniel E. Ho, Sanmi Koyejo

arXiv:2509.19364v1 · 8 citations · h-index: 39

Originality: Incremental advance

AI Analysis

The paper highlights a critical flaw in current LLM evaluation methods, one that could mislead researchers and practitioners about model performance.

The paper demonstrates that standard offline evaluations fail to capture real-world language model behavior due to personalization, showing that identical questions yield different responses across users in field tests with 800 users of ChatGPT and Gemini.

Standard offline evaluations for language models -- a series of independent, state-less inferences made by models -- fail to capture how language models actually behave in practice, where personalization fundamentally alters model behavior. For instance, identical benchmark questions to the same language model can produce markedly different responses when prompted to a state-less system, in one user's chat session, or in a different user's chat session. In this work, we provide empirical evidence showcasing this phenomenon by comparing offline evaluations to field evaluations conducted by having 800 real users of ChatGPT and Gemini pose benchmark and other provided questions to their chat interfaces.
