CL, May 7

SEQUOR: A Multi-Turn Benchmark for Realistic Constraint Following

arXiv:2605.06353 47.31 citations
AI Analysis

SEQUOR provides a more realistic evaluation of instruction following in multi-turn conversations, revealing that current models struggle with long-horizon tasks.

The paper introduces SEQUOR, a benchmark for evaluating constraint adherence in long multi-turn conversations. It finds that instruction-following accuracy drops by over 11% as conversations lengthen, even with a single constraint; by over 40% when multiple constraints must be followed simultaneously; and by more than 9% when constraints are added or replaced mid-conversation.

In a conversation, a helpful assistant must reliably follow user directives, even as users refine, modify, or contradict earlier requests. Yet most instruction-following benchmarks focus on single-turn or short multi-turn scenarios, leaving open how well models handle long-horizon instruction-following tasks. To bridge this gap, we present SEQUOR, an automatic benchmark for evaluating constraint adherence in long multi-turn conversations. SEQUOR consists of simulated persona-driven interactions built with constraints extracted from real-world conversations. Our results show that even when following a single constraint, instruction-following accuracy consistently decreases as the conversation grows longer, with drops exceeding 11%. This decline becomes larger when models have to follow multiple constraints simultaneously, reducing their accuracy by over 40%. In scenarios where constraints are added or replaced at arbitrary points of the conversation, model accuracy decreases by more than 9%. Taken together, our results reveal that current models still struggle to follow user instructions in multi-turn conversations, and they provide a better way to measure the instruction-following capabilities of assistants.
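As a rough illustration of the kind of measurement the abstract describes, the following minimal sketch computes constraint-adherence accuracy per turn and the drop between the first and last turn. It is not SEQUOR's evaluation code: the function names, the `(turn, followed)` record format, the toy data, and the choice to report the drop as an absolute difference in percentage points are all assumptions made for illustration.

```python
from collections import defaultdict


def accuracy_by_turn(records):
    """Return {turn: fraction of responses that followed the constraint}.

    `records` is an iterable of (turn_index, followed) pairs, where
    `followed` is True if the response at that turn satisfied the
    constraint. This record format is hypothetical.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for turn, followed in records:
        totals[turn] += 1
        hits[turn] += int(followed)
    return {turn: hits[turn] / totals[turn] for turn in sorted(totals)}


def accuracy_drop(per_turn):
    """Absolute drop, in percentage points, from the first to the last turn."""
    turns = sorted(per_turn)
    return 100.0 * (per_turn[turns[0]] - per_turn[turns[-1]])


# Toy, made-up data: two conversations of three turns each.
records = [
    (1, True), (2, True), (3, False),
    (1, True), (2, False), (3, False),
]
per_turn = accuracy_by_turn(records)
drop = accuracy_drop(per_turn)
print(per_turn, drop)
```

Whether the percentages reported in the abstract are absolute or relative drops is not stated here, so the percentage-point convention above should be read as one possible choice.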

