HC · AI · Apr 7

RealUserSim: Bridging the Reality Gap in Agent Benchmarking via Grounded User Simulation

arXiv:2605.2020498.51 citations
Predicted impact: top 1% in HC · last 90 days
Originality: Highly original
AI Analysis

For researchers evaluating LLM-based agents, this work provides a more realistic user simulation framework that exposes critical failure modes masked by existing cooperative simulators.

RealUserSim grounds LLM-based user simulation in real behavioral data from 14,000+ human-LLM conversations, raising the behavioral match rate from 24.2% to 45.3% across five dimensions. In agent evaluation, it reveals failure mechanisms that are invisible to cooperative simulators, with mean task success dropping by 3.2% to 3.5%.

LLM-based user simulation is the primary mechanism for end-to-end agent evaluation, yet simulated users are poor proxies for real humans: unconstrained LLM defaults produce a Formalism Ceiling (style match rates of 6-8% against real users), while hand-crafted behavioral directives trigger Directive Amplification, where models hyper-interpret instructions into unnatural behavioral extremes that vary dramatically across simulator models. We present RealUserSim, the first user simulation framework grounded in real behavioral data. From 14,000+ authentic human-LLM conversations (WildChat), we extract 7,275 executable behavioral profiles and use them to ground LLM simulators. A fidelity benchmark (PT3) on 600 conversations across 71+ domains with anti-leakage controls shows that grounded simulation raises match rate from 24.2% to 45.3% across five behavioral dimensions. Agent evaluation on TauBench with 6 simulator models and extensive analysis shows that grounded simulation acts as a realistic stress test, surfacing three failure mechanisms invisible to cooperative simulators (mean -3.2% to -3.5% task success degradation), while Directive Amplification in existing benchmarks produces unrealistic behavior that compromises the validity of agent evaluation.
