CL | AI | Sep 22, 2025

Evaluating LLM-Generated Versus Human-Authored Responses in Role-Play Dialogues

arXiv:2509.17694v2 | 2 citations | h-index: 2
Originality: Incremental advance
AI Analysis

The paper addresses the challenge of evaluating LLMs in long-form, knowledge-grounded role-play dialogues for professional training. It provides a benchmark and a hybrid evaluation framework, though its assessment of degradation is an incremental advance.

This study compared LLM-generated and human-authored responses in multi-turn professional training simulations. LLM response quality degraded significantly over turns in naturalness and context maintenance, while human-authored responses improved, and participants consistently preferred human-authored dialogue.

Evaluating large language models (LLMs) in long-form, knowledge-grounded role-play dialogues remains challenging. This study compares LLM-generated and human-authored responses in multi-turn professional training simulations through human evaluation ($N=38$) and automated LLM-as-a-judge assessment. Human evaluation revealed significant degradation in LLM-generated response quality across turns, particularly in naturalness, context maintenance and overall quality, while human-authored responses progressively improved. In line with this finding, participants also indicated a consistent preference for human-authored dialogue. These human judgements were validated by our automated LLM-as-a-judge evaluation, where Gemini 2.0 Flash achieved strong alignment with human evaluators on both zero-shot pairwise preference and stochastic 6-shot construct ratings, confirming the widening quality gap between LLM and human responses over time. Our work contributes a multi-turn benchmark exposing LLM degradation in knowledge-grounded role-play dialogues and provides a validated hybrid evaluation framework to guide the reliable integration of LLMs in training simulations.

