Mar 17

TurnWise: The Gap between Single- and Multi-turn Language Model Capabilities

arXiv:2603.16759
55.71 citations
h-index: 33
AI Analysis

This paper addresses inadequate multi-turn conversational ability in language models, a problem for users who rely on extended interactions. The contribution is incremental, since it builds on existing training methods.

The paper tackles the gap between single- and multi-turn language model capabilities with two contributions: TurnWiseEval, a benchmark for multi-turn evaluation, and TurnWiseData, a pipeline for generating synthetic multi-turn training data. It shows that including as few as 10k multi-turn conversations during post-training can improve performance on TurnWiseEval by 12%.

Multi-turn conversations are a common and critical mode of language model interaction. However, current open training and evaluation data focus on single-turn settings, failing to capture the additional dimension of these longer interactions. To understand this multi-/single-turn gap, we first introduce a new benchmark, TurnWiseEval, for multi-turn capabilities that is directly comparable to single-turn chat evaluation. Our evaluation isolates multi-turn-specific conversational ability through pairwise comparison to equivalent single-turn settings. We additionally introduce our synthetic multi-turn data pipeline, TurnWiseData, which allows the scalable generation of multi-turn training data. Our experiments with Olmo 3 show that training with multi-turn data is vital to achieving strong multi-turn chat performance, and that including as little as 10k multi-turn conversations during post-training can lead to a 12% improvement on TurnWiseEval.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
