CL · Apr 28

Can LLM Agents Simulate Multi-Turn Human Behavior? Evidence from Real Online Customer Behavior Data

Yuxuan Lu, Jing Huang, Yan Han, Bingsheng Yao, Sisong Bei, Jiri Gesi, Yaochen Xie, Yisi Sang, Zheshen Wang, Qi He, Dakuo Wang

arXiv:2503.20749 · 91.113 citations · h-index: 16

AI Analysis

For researchers and practitioners using LLM agents to simulate human behavior in applications like e-commerce, this work provides the first large-scale quantitative benchmark revealing poor accuracy of prompt-based methods and demonstrating that fine-tuning on real data yields substantial improvements.

The paper evaluates LLM agents' ability to simulate multi-turn human behavior using real online shopping data (31,865 sessions, 230,965 actions), finding prompt-based LLMs achieve only 11.86% accuracy. Fine-tuning Qwen2.5-7B on real data with reasoning traces improves accuracy to 17.26% and purchase prediction F1 to 33.86%.
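The two headline metrics can be illustrated with a minimal sketch. It assumes that action accuracy is the fraction of steps where the generated action exactly matches the logged human action, and that purchase prediction is scored as a binary F1 per session; the summary does not state the paper's exact matching rule. The action strings and session labels below are hypothetical toy data, not taken from the dataset.

```python
from typing import Sequence


def action_accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Fraction of steps where the predicted action equals the logged action.

    Exact string match is an assumption made for this sketch.
    """
    if len(predicted) != len(actual):
        raise ValueError("need exactly one prediction per logged action")
    if not actual:
        return 0.0
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)


def binary_f1(predicted: Sequence[bool], actual: Sequence[bool]) -> float:
    """F1 score for a binary outcome such as 'session ends in a purchase'."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical single session: four logged actions and four generated ones.
logged_actions = ["search:shoes", "click:item_3", "add_to_cart", "purchase"]
generated_actions = ["search:shoes", "click:item_1", "add_to_cart", "terminate"]
toy_accuracy = action_accuracy(generated_actions, logged_actions)

# Hypothetical purchase outcomes for four sessions.
logged_purchase = [True, True, False, False]
predicted_purchase = [True, False, True, False]
toy_f1 = binary_f1(predicted_purchase, logged_purchase)

print(f"action accuracy: {toy_accuracy:.2f}, purchase F1: {toy_f1:.2f}")
```

In this toy case two of the four generated actions match the log, so a single mismatched click already costs 25 points of accuracy. That gives some intuition for why step-level accuracy on real sessions can be low even when the overall trajectory looks plausible.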

Recent research shows that LLM agents can generate "believable" human behaviors via prompt-only methods, and such agents have been increasingly adopted in downstream applications. However, existing evaluations of these agents focus only on qualitative believability (whether human raters think they are accurate), leaving open the question of whether LLM agents can accurately generate step-by-step actions mimicking a particular human's behavior in a multi-turn interaction task. In this work, we take shopping as a case study and present the first large-scale quantitative evaluation of state-of-the-art LLMs' ability to accurately simulate human behavior. Using real-world data from 31,865 online shopping sessions containing 230,965 user actions, our evaluation reveals that prompt-based LLMs (DeepSeek-R1, Llama, Claude) achieve only 11.86% accuracy in generating human actions, highlighting a substantial gap in actual behavioral accuracy. Through experiments, we also show that strategies as simple as fine-tuning LLMs on real human click-through data augmented with synthesized reasoning traces can greatly enhance models' performance. The fine-tuned Qwen2.5-7B achieves 17.26% action generation accuracy and 33.86% F1 score on final purchase prediction, representing substantial improvements of 5.4% and 13.85% over prompt-only baselines. This work establishes the first rigorous benchmark for human behavior simulation and provides actionable insights for developing more accurate LLM agents for future downstream applications.
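The reported accuracy gain can be checked with simple arithmetic. The sketch below uses only figures quoted in the abstract; it shows that the accuracy improvement is an absolute difference in percentage points (17.26 minus 11.86), not a relative increase. The implied prompt-only F1 baseline is an assumption: it is derived by treating the 13.85 figure as an absolute difference too, which the abstract does not state explicitly.

```python
# Figures quoted in the abstract (all in percent).
prompt_accuracy = 11.86      # prompt-based action accuracy
finetuned_accuracy = 17.26   # fine-tuned Qwen2.5-7B action accuracy
finetuned_f1 = 33.86         # fine-tuned purchase-prediction F1
reported_f1_gain = 13.85     # F1 improvement as reported

# Absolute gain in percentage points; matches the quoted 5.4 figure.
accuracy_gain_points = finetuned_accuracy - prompt_accuracy

# Relative gain over the prompt-only accuracy, roughly 45.5%.
accuracy_gain_relative = accuracy_gain_points / prompt_accuracy * 100

# ASSUMPTION: the F1 gain is also an absolute difference, which would
# imply a prompt-only F1 baseline of about 20.01.
implied_baseline_f1 = finetuned_f1 - reported_f1_gain

print(f"accuracy gain: {accuracy_gain_points:.2f} points "
      f"({accuracy_gain_relative:.1f}% relative)")
print(f"implied prompt-only F1 (assumed absolute gain): {implied_baseline_f1:.2f}")
```

Read this way, the fine-tuned model is about 45% better than the prompt-only baseline in relative terms, while still predicting fewer than one in five actions correctly.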
