From Feelings to Metrics: Understanding and Formalizing How Users Vibe-Test LLMs
For researchers and practitioners evaluating LLMs, this work offers a formalized way to capture user-centric evaluation that benchmarks often miss. The contribution is incremental, however, since it builds on existing concepts of personalization and subjective evaluation.
The paper studies how users informally evaluate LLMs through 'vibe-testing' and formalizes this process into a two-part framework involving personalized prompts and user-aware evaluation. Experiments on coding benchmarks show that this approach can change model preferences, indicating its potential to bridge benchmark scores and real-world user experience.
Evaluating LLMs is challenging, as benchmark scores often fail to capture models' real-world usefulness. Instead, users frequently rely on "vibe-testing": informal, experience-based evaluation, such as comparing models on coding tasks related to their own workflow. While prevalent, vibe-testing is typically too ad hoc and unstructured to analyze or reproduce at scale. In this work, we study how vibe-testing works in practice and then formalize it to support systematic analysis. We first analyze two empirical resources: (1) a survey of user evaluation practices and (2) a collection of in-the-wild model comparison reports from blogs and social media. Based on these resources, we formalize vibe-testing as a two-part process: users personalize both what they test and how they judge responses. We then introduce a proof-of-concept evaluation pipeline that follows this formulation by generating personalized prompts and comparing model outputs using user-aware subjective criteria. In experiments on coding benchmarks, we find that combining personalized prompts and user-aware evaluation can change which model is preferred, reflecting the role of vibe-testing in practice. These findings suggest that formalized vibe-testing can serve as a useful approach for bridging benchmark scores and real-world experience.
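The two-part formulation in the abstract (personalizing what is tested and how responses are judged) can be illustrated with a toy sketch. This is not the paper's pipeline: the `UserProfile` structure, the prompt template, the criterion names, the weighted-sum scoring rule, and all numeric scores below are hypothetical placeholders chosen only to show how the same model outputs can yield different preferences for different users.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class UserProfile:
    """Hypothetical user description (an assumption, not from the paper)."""
    name: str
    context: str                         # e.g. the user's typical workflow
    criterion_weights: Dict[str, float]  # importance of each subjective criterion


def personalize_prompt(base_prompt: str, user: UserProfile) -> str:
    """Part 1: adapt WHAT is tested by attaching the user's context."""
    return f"{base_prompt}\n\nUser context: {user.context}"


def user_aware_score(criterion_scores: Dict[str, float], user: UserProfile) -> float:
    """Part 2: adapt HOW responses are judged via a normalized weighted sum.

    The weighted sum is an illustrative assumption; criteria missing from
    a response's scores count as 0.0.
    """
    total_weight = sum(user.criterion_weights.values())
    weighted = sum(
        weight * criterion_scores.get(criterion, 0.0)
        for criterion, weight in user.criterion_weights.items()
    )
    return weighted / total_weight


def preferred_model(scores_by_model: Dict[str, Dict[str, float]], user: UserProfile) -> str:
    """Return the model whose response scores highest for this user."""
    return max(scores_by_model, key=lambda m: user_aware_score(scores_by_model[m], user))


# Made-up per-criterion scores for two models' responses to one prompt.
scores = {
    "model_a": {"correctness": 0.9, "clarity": 0.4},
    "model_b": {"correctness": 0.7, "clarity": 0.9},
}

engineer = UserProfile(
    name="engineer",
    context="maintains a large production codebase",
    criterion_weights={"correctness": 0.9, "clarity": 0.1},
)
learner = UserProfile(
    name="learner",
    context="is learning to program and wants readable explanations",
    criterion_weights={"correctness": 0.3, "clarity": 0.7},
)

prompt = personalize_prompt("Write a function that merges two sorted lists.", learner)
print(prompt)
print(preferred_model(scores, engineer))
print(preferred_model(scores, learner))
```

In this sketch the two users see identical per-criterion scores but prefer different models, which mirrors the abstract's observation that user-aware evaluation can change which model is preferred. In a real setup the per-criterion scores would come from human ratings or some automated judge rather than hard-coded values.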