CL | AI | Nov 15, 2024

Large Language Models as User-Agents for Evaluating Task-Oriented-Dialogue Systems

arXiv:2411.09972v1 | 11 citations | h-index: 22 | SLT
Originality: Incremental advance
AI Analysis

This work addresses the need for more realistic evaluation benchmarks in conversational AI, though it builds incrementally on prior research using LLMs for user-agents.

The paper tackles the problem of evaluating task-oriented dialogue systems by using large language models as context-aware user-agents, showing improved performance in diversity and task completion metrics with better prompts.

Traditionally, offline datasets have been used to evaluate task-oriented dialogue (TOD) models. These datasets lack context awareness, making them suboptimal benchmarks for conversational systems. In contrast, user-agents, which are context-aware, can simulate the variability and unpredictability of human conversations, making them better alternatives as evaluators. Prior research has utilized large language models (LLMs) to develop user-agents. Our work builds upon this by using LLMs to create user-agents for the evaluation of TOD systems. This involves prompting an LLM, using in-context examples as guidance, and tracking the user-goal state. Our evaluation of diversity and task completion metrics for the user-agents shows improved performance with the use of better prompts. Additionally, we propose methodologies for the automatic evaluation of TOD models within this dynamic framework.
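The three ingredients named in the abstract (prompting an LLM, in-context examples, and user-goal state tracking) can be sketched as a small simulation loop. This is only an illustrative sketch, not the paper's implementation: `UserGoal`, `build_prompt`, `simulate`, the prompt wording, and the string-matching tracker are all hypothetical, and `llm_fn` and `system_fn` are placeholders for whatever LLM call and TOD system are under test.

```python
from dataclasses import dataclass, field


@dataclass
class UserGoal:
    """Hypothetical user-goal state: the slots the simulated user must convey."""
    domain: str
    constraints: dict
    informed: set = field(default_factory=set)

    def pending(self):
        # Slots that have not yet been mentioned by the simulated user.
        return {k: v for k, v in self.constraints.items() if k not in self.informed}

    def update(self, user_utterance):
        # Naive tracker (an assumption): a slot counts as informed once its
        # value appears verbatim in the user's turn.
        text = user_utterance.lower()
        for slot, value in self.constraints.items():
            if str(value).lower() in text:
                self.informed.add(slot)

    def is_complete(self):
        return not self.pending()


def build_prompt(goal, history, examples):
    """Assemble a prompt from in-context examples, the goal state, and the dialogue so far."""
    lines = ["You are a user talking to a task-oriented assistant.", ""]
    for example in examples:
        lines += ["Example dialogue:", example, ""]
    lines.append(f"Domain: {goal.domain}")
    lines.append(f"Still to convey: {goal.pending()}")
    lines.append("Conversation so far:")
    lines += [f"{speaker}: {text}" for speaker, text in history]
    lines.append("USER:")
    return "\n".join(lines)


def simulate(goal, system_fn, llm_fn, examples, max_turns=10):
    """Alternate user-agent and system turns until the goal is conveyed or turns run out."""
    history = []
    for _ in range(max_turns):
        user_turn = llm_fn(build_prompt(goal, history, examples))
        history.append(("USER", user_turn))
        goal.update(user_turn)
        history.append(("SYSTEM", system_fn(history)))
        if goal.is_complete():
            break
    return history
```

Because the prompt is rebuilt every turn from the current goal state, the user-agent only sees the slots it has yet to convey, which is one simple way to keep the simulated user on task. A real tracker would likely need something more robust than substring matching.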

