CL · AI · Aug 19, 2024

X-TURING: Towards an Enhanced and Efficient Turing Test for Long-Term Dialogue Agents

arXiv:2408.09853v2 · 6 citations · h-index: 9
Originality: Incremental advance
AI Analysis

The paper addresses the challenge of assessing human-like behavior in AI dialogue systems over prolonged interactions. The advance is incremental, since it builds on the traditional Turing test.

The paper proposes X-Turing, an enhanced Turing test for evaluating long-term dialogue agents that uses burst dialogue and pseudo-dialogues to reduce human workload. It finds that LLMs like GPT-4 achieve pass rates of 51.9% at 3 turns and 38.9% at 10 turns, so performance drops as the dialogue lengthens.

The Turing test examines whether AIs exhibit human-like behaviour in natural language conversations. The traditional setting limits each participant to one message at a time and requires constant human participation. This fails to reflect a natural conversational style and hinders the evaluation of dialogue agents based on Large Language Models (LLMs) in complex and prolonged interactions. This paper proposes X-Turing, which enhances the original test with a burst dialogue pattern, allowing more dynamic exchanges using consecutive messages. It further reduces human workload by iteratively generating dialogues that simulate the long-term interaction between the agent and a human to compose the majority of the test process. With the pseudo-dialogue history, the agent then engages in a shorter dialogue with a real human, which is paired with a human-human conversation on the same topic to be judged using questionnaires. We introduce the X-Turn Pass-Rate metric to assess the human likeness of LLMs across varying durations. While LLMs like GPT-4 initially perform well, achieving pass rates of 51.9% and 38.9% during 3 turns and 10 turns of dialogues respectively, their performance drops as the dialogue progresses, which underscores the difficulty in maintaining consistency in the long term.
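The abstract names the X-Turn Pass-Rate metric but does not give its formula. As a rough sketch only, the Python below assumes the pass rate at X turns is the fraction of questionnaire judgments in which the agent's dialogue was judged human. The function name `x_turn_pass_rate`, the `(turns, judged_human)` record layout, and the toy judgments are all hypothetical illustrations, not the paper's definition or data.

```python
from collections import defaultdict


def x_turn_pass_rate(judgments):
    """Return {turns: pass_rate} from (turns, judged_human) pairs.

    Assumption (not from the paper): the pass rate at X turns is the
    share of judgments at that dialogue length that labelled the
    agent's dialogue as human.
    """
    totals = defaultdict(int)   # number of judgments per dialogue length
    passes = defaultdict(int)   # judgments that labelled the agent human
    for turns, judged_human in judgments:
        totals[turns] += 1
        if judged_human:
            passes[turns] += 1
    return {x: passes[x] / totals[x] for x in sorted(totals)}


# Toy, made-up judgments: four at 3 turns and four at 10 turns.
toy_judgments = [
    (3, True), (3, False), (3, True), (3, True),
    (10, True), (10, False), (10, False), (10, False),
]
rates = x_turn_pass_rate(toy_judgments)
print(rates)
```

Grouping by dialogue length makes it easy to compare rates across durations, which is the comparison the abstract reports for 3 and 10 turns.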

