CL · Oct 29, 2025

TwinVoice: A Multi-dimensional Benchmark Towards Digital Twins via LLM Persona Simulation

Bangde Du, Minghao Guo, Songming He, Ziyi Ye, Xi Zhu, Weihang Su, Shuqi Zhu, Yujia Zhou, Yongfeng Zhang, Qingyao Ai, Yiqun Liu

arXiv:2510.25536v2 · 7 citations · h-index: 19

Originality: Incremental advance

AI Analysis

This work addresses the need for systematic evaluation of LLM-based persona simulation, which is crucial for developing digital twins, but it is incremental as it builds on existing benchmarking efforts.

The authors address the evaluation of persona simulation in large language models by introducing TwinVoice, a multi-dimensional benchmark spanning social, interpersonal, and narrative contexts. They find that advanced models achieve moderate accuracy but fall short on capabilities such as syntactic style and memory recall, with average performance considerably below the human baseline.

Large Language Models (LLMs) are exhibiting emergent human-like abilities and are increasingly envisioned as the foundation for simulating an individual's communication style, behavioral tendencies, and personality traits. However, current evaluations of LLM-based persona simulation remain limited: most rely on synthetic dialogues, lack a systematic framework, and do not analyze which capabilities the task requires. To address these limitations, we introduce TwinVoice, a comprehensive benchmark for assessing persona simulation across diverse real-world contexts. TwinVoice encompasses three dimensions: Social Persona (public social interactions), Interpersonal Persona (private dialogues), and Narrative Persona (role-based expression). It further decomposes the evaluation of LLM performance into six fundamental capabilities: opinion consistency, memory recall, logical reasoning, lexical fidelity, persona tone, and syntactic style. Experimental results reveal that while advanced models achieve moderate accuracy in persona simulation, they still fall short on capabilities such as syntactic style and memory recall. Consequently, the average performance achieved by LLMs remains considerably below the human baseline.
