CL · AI · Oct 31, 2025

Consistently Simulating Human Personas with Multi-Turn Reinforcement Learning

arXiv:2511.00222v1 · 30 citations · h-index: 6
Originality: Incremental advance
AI Analysis

The paper addresses the need for more coherent and faithful simulated human users in applications such as therapy and education, and represents an incremental improvement.

The paper tackles the problem of LLMs drifting from their assigned personas in interactive simulations. It introduces a framework that combines automatic consistency metrics with multi-turn reinforcement learning, reducing inconsistency by over 55%.

Large Language Models (LLMs) are increasingly used to simulate human users in interactive settings such as therapy, education, and social role-play. While these simulations enable scalable training and evaluation of AI agents, off-the-shelf LLMs often drift from their assigned personas, contradict earlier statements, or abandon role-appropriate behavior. We introduce a unified framework for evaluating and improving persona consistency in LLM-generated dialogue. We define three automatic metrics (prompt-to-line consistency, line-to-line consistency, and Q&A consistency) that capture different types of persona drift, and we validate each against human annotations. Using these metrics as reward signals, we apply multi-turn reinforcement learning to fine-tune LLMs for three user roles: a patient, a student, and a social chat partner. Our method reduces inconsistency by over 55%, resulting in more coherent and faithful simulated users.
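The three metrics and their use as a reward signal can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `judge` callable, the toy age-based judge, the exact form of Q&A consistency (comparing answers to the same persona questions asked at two points in a dialogue), and averaging the three scores into a single reward are all hypothetical choices made for the example.

```python
import re
from typing import Callable, Sequence

# Hypothetical judge interface: judge(context, line) -> True if `line` is
# consistent with `context`. In practice this would be a model-based judge
# (the paper validates its metrics against human annotations).
Judge = Callable[[str, str], bool]


def prompt_to_line(persona: str, lines: Sequence[str], judge: Judge) -> float:
    """Share of the simulated user's lines judged consistent with the persona prompt."""
    if not lines:
        return 1.0
    return sum(judge(persona, line) for line in lines) / len(lines)


def line_to_line(lines: Sequence[str], judge: Judge) -> float:
    """Share of lines (after the first) that agree with every earlier line."""
    if len(lines) < 2:
        return 1.0
    ok = [all(judge(prev, lines[i]) for prev in lines[:i]) for i in range(1, len(lines))]
    return sum(ok) / len(ok)


def qa_consistency(early: Sequence[str], late: Sequence[str], judge: Judge) -> float:
    """Assumed form: answers to the same persona questions, asked at two points
    in the dialogue, should agree with each other."""
    pairs = list(zip(early, late))
    if not pairs:
        return 1.0
    return sum(judge(a, b) for a, b in pairs) / len(pairs)


def episode_reward(persona, lines, early, late, judge: Judge) -> float:
    """Hypothetical scalar reward for multi-turn RL: the mean of the three scores."""
    scores = (
        prompt_to_line(persona, lines, judge),
        line_to_line(lines, judge),
        qa_consistency(early, late, judge),
    )
    return sum(scores) / len(scores)


def relative_reduction(baseline_inconsistency: float, tuned_inconsistency: float) -> float:
    """Relative drop in inconsistency (1 - consistency) after fine-tuning."""
    return (baseline_inconsistency - tuned_inconsistency) / baseline_inconsistency


def toy_age_judge(context: str, line: str) -> bool:
    """Toy stand-in for a real judge: flags a clash only when both texts state an age."""
    a = set(re.findall(r"(\d+) years old", context))
    b = set(re.findall(r"(\d+) years old", line))
    return not a or not b or a == b


persona = "Maria is 34 years old and works night shifts as a nurse."
lines = [
    "I'm 34 years old and exhausted after my shift.",
    "Work has been stressful lately.",
    "Honestly, at 52 years old I can't keep this pace.",  # contradicts the persona
]
early = ["I am 34 years old.", "I work as a nurse."]
late = ["I am 52 years old.", "I'm a nurse on nights."]

print(round(prompt_to_line(persona, lines, toy_age_judge), 3))  # 0.667
print(line_to_line(lines, toy_age_judge))  # 0.5
print(qa_consistency(early, late, toy_age_judge))  # 0.5
print(round(episode_reward(persona, lines, early, late, toy_age_judge), 3))  # 0.556
```

In a real setup, `toy_age_judge` would be replaced by a model-based judge, and `episode_reward` (or the individual scores) would be passed to a multi-turn RL trainer. `relative_reduction` shows how an "over 55%" figure would be read if it denotes a relative drop in the inconsistency rate, which is itself an assumption here.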

