CL LG Oct 18, 2025

When Models Can't Follow: Testing Instruction Adherence Across 256 LLMs

Richard J. Young, Brandon Gillins, Alice M. Matthews

arXiv:2510.18892v1 · 3 citations · h-index: 1

Originality Synthesis-oriented

AI Analysis

The paper provides a practical diagnostic tool that researchers and practitioners can use to assess LLM instruction adherence. The contribution is incremental, since it builds on existing evaluation approaches.

The paper addresses the problem of systematically evaluating instruction-following capabilities in Large Language Models. It develops a streamlined framework of 20 prompts and applies it in a large-scale empirical study of 256 models, which reveals consistent failure modes and identifies the instruction types that models find most challenging.

Despite widespread deployment of Large Language Models, systematic evaluation of instruction-following capabilities remains challenging. While comprehensive benchmarks exist, focused assessments that quickly diagnose specific instruction adherence patterns are valuable. As newer models may be trained on existing benchmarks, novel evaluation approaches are needed to assess genuine capabilities rather than memorized performance. This paper presents a streamlined evaluation framework using twenty carefully designed prompts to assess LLM instruction-following across diverse task categories. We demonstrate this framework through a large-scale empirical study conducted on October 14, 2025, testing 256 verified working models from 331 available via OpenRouter. To ensure methodological rigor and prevent selection bias, we first verified each model's basic functionality before inclusion. Unlike large-scale benchmarks requiring extensive computational resources, our approach offers a practical diagnostic tool researchers and practitioners can readily apply. Our methodology builds upon verifiable instructions while introducing a compact test suite balancing comprehensiveness with efficiency. Each prompt targets distinct aspects of instruction following, including format compliance, content constraints, logical sequencing, and multi-step task execution. We evaluate models from major providers (OpenAI, Anthropic, Google, Meta, Mistral) and emerging implementations (Qwen, DeepSeek, community models), providing comparative performance analysis. Our findings reveal consistent failure modes and identify specific instruction types posing particular challenges. This work contributes both a practical evaluation tool and one of the most comprehensive empirical analyses of instruction-following capabilities across the contemporary LLM landscape.
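To make the idea of "verifiable instructions" concrete, the following is a minimal sketch of how programmatic checks for format compliance and content constraints might look. The check names, the three example constraints, and the `pass_rate` helper are illustrative assumptions; they are not the paper's twenty prompts or its scoring code.

```python
import json
import re
from typing import Callable, Dict


def is_valid_json_object(response: str) -> bool:
    """Format compliance: the response must parse as a JSON object."""
    try:
        return isinstance(json.loads(response), dict)
    except json.JSONDecodeError:
        return False


def has_exact_word_count(response: str, n: int = 5) -> bool:
    """Content constraint: the response must contain exactly n words."""
    return len(response.split()) == n


def is_numbered_list_in_order(response: str, n: int = 3) -> bool:
    """Logical sequencing: exactly n non-empty lines numbered 1., 2., ... in order."""
    lines = [ln.strip() for ln in response.strip().splitlines() if ln.strip()]
    if len(lines) != n:
        return False
    return all(re.match(rf"{i}\.\s+\S", ln) for i, ln in enumerate(lines, start=1))


# Hypothetical mapping from prompt id to its verifier (assumed names).
CHECKS: Dict[str, Callable[[str], bool]] = {
    "json_object": is_valid_json_object,
    "five_words": has_exact_word_count,
    "numbered_list": is_numbered_list_in_order,
}


def pass_rate(responses: Dict[str, str]) -> float:
    """Fraction of checks passed; a missing response counts as a failure."""
    passed = sum(1 for name, check in CHECKS.items() if check(responses.get(name, "")))
    return passed / len(CHECKS)
```

Because each check is a deterministic function of the response text, a suite like this can be scored without a judge model, which is one plausible reason a compact prompt set stays cheap to run across hundreds of models.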
