X-TURING: Towards an Enhanced and Efficient Turing Test for Long-Term Dialogue Agents
This paper addresses the challenge of assessing human-like behavior in AI dialogue systems over prolonged interactions. The contribution is incremental in that it builds on the traditional Turing test.
The paper proposes X-Turing, an enhanced Turing test for evaluating long-term dialogue agents. It uses a burst dialogue pattern to allow more natural exchanges and generated pseudo-dialogues to reduce human workload. It finds that LLMs like GPT-4 achieve pass rates of 51.9% at 3 turns and 38.9% at 10 turns, with performance dropping as the dialogue progresses.
The Turing test examines whether AIs exhibit human-like behaviour in natural language conversations. The traditional setting limits each participant to one message at a time and requires constant human participation. This fails to reflect a natural conversational style and hinders the evaluation of dialogue agents based on Large Language Models (LLMs) in complex and prolonged interactions. This paper proposes \textbf{\textsc{X-Turing}}, which enhances the original test with a \textit{burst dialogue} pattern, allowing more dynamic exchanges using consecutive messages. It further reduces human workload by iteratively generating dialogues that simulate the long-term interaction between the agent and a human to compose the majority of the test process. With the \textit{pseudo-dialogue} history, the agent then engages in a shorter dialogue with a real human, which is paired with a human-human conversation on the same topic to be judged using questionnaires. We introduce the \textit{X-Turn Pass-Rate} metric to assess the human likeness of LLMs across varying durations. While LLMs like GPT-4 initially perform well, achieving pass rates of 51.9\% and 38.9\% in 3-turn and 10-turn dialogues, respectively, their performance drops as the dialogue progresses, which underscores the difficulty in maintaining consistency in the long term.
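To make the X-Turn Pass-Rate concrete, the following is a minimal sketch of how such a metric could be computed from questionnaire verdicts. The abstract does not give the exact formula, so this assumes the pass rate at X turns is the fraction of judgments at that dialogue length in which the agent was taken for a human. The `Judgment` record, its field names, and the `x_turn_pass_rate` function are hypothetical, and the toy data is invented for illustration rather than taken from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgment:
    """One questionnaire verdict on an agent-human dialogue (hypothetical record)."""
    turns: int          # dialogue length X at which the verdict was given
    judged_human: bool  # True if the judge took the agent for a human


def x_turn_pass_rate(judgments):
    """Return {X: fraction of verdicts at X turns that judged the agent human}.

    Assumed definition: the abstract names the metric but not its formula.
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for j in judgments:
        total[j.turns] += 1
        passed[j.turns] += int(j.judged_human)
    return {x: passed[x] / total[x] for x in sorted(total)}


# Toy verdicts, invented for illustration only (not results from the paper).
toy = [
    Judgment(3, True), Judgment(3, True), Judgment(3, False), Judgment(3, False),
    Judgment(10, True), Judgment(10, False), Judgment(10, False), Judgment(10, False),
]
rates = x_turn_pass_rate(toy)
print(rates)
```

Grouping verdicts by dialogue length keeps the comparison across durations explicit: a pass rate that falls as X grows corresponds to the kind of degradation the abstract reports for longer dialogues.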