CL AI | Jun 19, 2024

DialSim: A Dialogue Simulator for Evaluating Long-Term Multi-Party Dialogue Understanding of Conversational Agents

Jiho Kim, Woosog Chay, Hyeonji Hwang, Daeun Kyung, Hyunseung Chung, Eunbyeol Cho, Yeonsu Kwon, Yohan Jo, Edward Choi

arXiv:2406.13144v67.79 citations | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the need for more realistic benchmarks in conversational AI, particularly for applications in education and entertainment. It is an incremental advance, however, because it builds on existing evaluation methods.

The paper tackles the evaluation of conversational agents on long-term, multi-party dialogue understanding. It introduces DialSim, a simulation-based evaluation framework, and LongDialQA, a QA dataset built from long-running TV shows with over 1,300 dialogue sessions, each paired with more than 1,000 questions. The evaluation finds that state-of-the-art LLM-based agents struggle to maintain accurate comprehension in such scenarios.

Recent advancements in Large Language Models (LLMs) have significantly enhanced conversational agents, making them applicable to various fields (e.g., education, entertainment). Despite their progress, the evaluation of the agents often overlooks the complexities of real-world conversations, such as multi-party dialogues and extended contextual dependencies. To bridge this gap, we introduce DialSim, a dialogue simulation-based evaluation framework. In DialSim, an agent assumes the role of a character in a scripted conversation and is evaluated on their ability to answer spontaneous questions using only the dialogue history, while recognizing when they lack sufficient information. To support this framework, we introduce LongDialQA, a new QA dataset constructed from long-running TV shows, comprising over 1,300 dialogue sessions, each paired with more than 1,000 carefully curated questions, totaling over 352,000 tokens. To minimize reliance on prior knowledge, all character names are anonymized or swapped. Our evaluation of state-of-the-art LLM-based conversational agents using DialSim reveals that even models with large context windows or RAG capabilities struggle to maintain accurate comprehension over long-term, multi-party interactions, underscoring the need for more realistic and challenging benchmarks in conversational AI.
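The evaluation setup described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Utterance`, `Question`, `simulate`, and `toy_agent`, the `UNANSWERABLE` string, and the exact-match scoring are all assumptions made for illustration. The sketch replays a scripted session turn by turn, asks questions mid-conversation, gives the agent only the dialogue history so far, and expects an explicit "don't know" reply when the history does not contain the answer.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical marker an agent should return when the history lacks the answer.
UNANSWERABLE = "I don't know"


@dataclass
class Utterance:
    speaker: str
    text: str


@dataclass
class Question:
    after_turn: int         # index of the utterance after which the question is asked
    asker: str
    text: str
    answer: Optional[str]   # None means the history so far cannot answer it


# An agent sees only the dialogue history up to the question and returns a reply.
Agent = Callable[[list[Utterance], Question], str]


def simulate(session: list[Utterance], questions: list[Question], agent: Agent) -> float:
    """Replay the session and return exact-match accuracy (an assumed metric)."""
    by_turn: dict[int, list[Question]] = {}
    for q in questions:
        by_turn.setdefault(q.after_turn, []).append(q)

    history: list[Utterance] = []
    correct = total = 0
    for i, utt in enumerate(session):
        history.append(utt)
        for q in by_turn.get(i, []):
            reply = agent(list(history), q)  # copy so the agent cannot mutate history
            expected = q.answer if q.answer is not None else UNANSWERABLE
            correct += int(reply.strip().lower() == expected.lower())
            total += 1
    return correct / total if total else 0.0


def toy_agent(history: list[Utterance], q: Question) -> str:
    """Toy stand-in for an LLM agent: looks for a 'moving to <place>' mention."""
    for utt in history:
        if "moving to" in utt.text:
            return utt.text.split("moving to")[1].strip(" .")
    return UNANSWERABLE


# Invented two-turn session with anonymized speaker names.
session = [
    Utterance("A", "I adopted a cat named Miso."),
    Utterance("B", "Nice, I am moving to Lisbon."),
]
questions = [
    Question(0, "B", "Where is B moving?", None),      # asked before the fact is mentioned
    Question(1, "A", "Where is B moving?", "Lisbon"),  # asked after it is mentioned
]
accuracy = simulate(session, questions, toy_agent)
```

The same question appears twice on purpose: asked too early, the correct behaviour is to admit a lack of information, which is the part of the task that an agent relying on prior knowledge or guessing would fail.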

