CL · AI · Jan 29, 2025

MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs

arXiv:2501.17399v2 · 138 citations · h-index: 11 · ACL
Originality: Incremental advance
AI Analysis

The paper addresses an underexamined problem, the ability of LLMs to hold realistic multi-turn conversations, which is crucial for their applications. The advance is incremental because it builds on existing evaluation methods.

The paper introduces MultiChallenge, a benchmark for evaluating large language models on multi-turn conversations, revealing that despite near-perfect scores on existing benchmarks, frontier models achieve less than 50% accuracy, with the best model scoring 41.4%.

We present MultiChallenge, a pioneering benchmark evaluating large language models (LLMs) on conducting multi-turn conversations with human users, a crucial yet underexamined capability for their applications. MultiChallenge identifies four categories of challenges in multi-turn conversations that are not only common and realistic among current human-LLM interactions, but are also challenging to all current frontier LLMs. All 4 challenges require accurate instruction-following, context allocation, and in-context reasoning at the same time. We also develop LLM as judge with instance-level rubrics to facilitate an automatic evaluation method with fair agreement with experienced human raters. Despite achieving near-perfect scores on existing multi-turn evaluation benchmarks, all frontier models have less than 50% accuracy on MultiChallenge, with the top-performing Claude 3.5 Sonnet (June 2024) achieving just a 41.4% average accuracy.
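The abstract describes judging each response against an instance-level rubric and reporting an average accuracy. The sketch below illustrates one way such scoring could work; it is not the paper's implementation. The `Instance` fields, the category labels, the `judge` callable, and the choice of an unweighted mean over categories are all assumptions made for illustration, and `toy_judge` is a keyword check standing in for a real LLM judge.

```python
from collections import defaultdict
from typing import Callable, Dict, List, NamedTuple, Tuple


class Instance(NamedTuple):
    """One evaluated conversation (hypothetical schema, not from the paper)."""
    category: str   # placeholder challenge-category label
    response: str   # the model's final-turn response
    rubric: str     # instance-specific yes/no question for the judge
    expected: bool  # the judge answer that counts as a pass


def score(
    instances: List[Instance],
    judge: Callable[[str, str], bool],
) -> Tuple[Dict[str, float], float]:
    """Return per-category accuracy and their unweighted mean (an assumption)."""
    if not instances:
        raise ValueError("no instances to score")
    passed: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for inst in instances:
        total[inst.category] += 1
        # An instance passes when the judge's yes/no matches the expected answer.
        if judge(inst.response, inst.rubric) == inst.expected:
            passed[inst.category] += 1
    per_category = {c: passed[c] / total[c] for c in total}
    average = sum(per_category.values()) / len(per_category)
    return per_category, average


def toy_judge(response: str, rubric: str) -> bool:
    """Stand-in for an LLM judge: answers 'yes' if the response mentions peanuts."""
    return "peanut" in response.lower()


demo = [
    Instance("category_a", "Here is a recipe with peanuts.", "Does it use peanuts?", False),
    Instance("category_a", "Here is a nut-free recipe.", "Does it use peanuts?", False),
    Instance("category_b", "Sure, peanut butter it is.", "Does it use peanuts?", True),
]
per_category, average = score(demo, toy_judge)
```

In practice the `judge` callable would wrap a call to a judge model that receives the response and the rubric question; keeping it as a parameter lets the aggregation logic be tested without any model access.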

Code Implementations: 1 repo
