CL, May 1

When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models

arXiv:2605.00817
AI Analysis

For researchers and practitioners using LLMs for multi-step reasoning, this diagnostic study exposes a critical gap between apparent reasoning performance and actual procedural fidelity.

LLMs' ability to faithfully execute step-wise procedures degrades sharply as procedures get longer: average first-answer accuracy drops from 61% on 5-step arithmetic procedures to 20% on 95-step ones. This suggests that final-answer accuracy on reasoning benchmarks can mask failures in faithful execution.

Large language models (LLMs) often achieve strong performance on reasoning benchmarks, but final-answer accuracy alone does not show whether they faithfully execute the procedure specified in a prompt. We study this question through a controlled diagnostic benchmark for procedural execution, where models are given a step-wise arithmetic algorithm and two numeric inputs, and must return the final computed value. The benchmark uses simple arithmetic operations but increases complexity through algorithm length and look-back dependencies over intermediate variables. Across 14 models and 55 datasets, average first-answer accuracy drops from 61% on 5-step procedures to 20% on 95-step procedures. Generation-level analysis shows that failures often involve missing answers, premature answers, self-correction after an initial error, under-executed traces, and hallucinated extra steps. These findings suggest that apparent reasoning ability can mask substantial weaknesses in faithful instruction execution.
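
To make the task format concrete, here is a minimal Python sketch of what such a procedure might look like. It is an illustration under stated assumptions, not the paper's generator: the operation set, the `max_lookback` window, the `modulus` used to keep values small, and all function names (`make_procedure`, `execute`, `render`) are hypothetical. Each step defines a new variable from two earlier ones, so longer procedures and wider look-back windows both increase how much intermediate state must be tracked.

```python
import random

# Hypothetical operation set; the abstract only says "simple arithmetic operations".
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}


def make_procedure(n_steps, max_lookback, seed=0):
    """Build a random procedure as a list of (op, left_index, right_index).

    Index 0 is input x, index 1 is input y, and index k + 1 is the variable
    defined by step k (1-based). Operands are drawn only from the last
    `max_lookback` available variables (an assumed way to model look-back).
    """
    rng = random.Random(seed)
    steps = []
    for i in range(n_steps):
        available = i + 2  # x, y, plus the variables defined so far
        lowest = max(0, available - max_lookback)
        left = rng.randrange(lowest, available)
        right = rng.randrange(lowest, available)
        op = rng.choice(sorted(OPS))
        steps.append((op, left, right))
    return steps


def execute(steps, x, y, modulus=1000):
    """Reference executor: run every step in order and return the last value.

    Reducing modulo `modulus` is an assumption made here to keep numbers small.
    """
    values = [x, y]
    for op, left, right in steps:
        values.append(OPS[op](values[left], values[right]) % modulus)
    return values[-1]


def render(steps):
    """Render the procedure as the step-wise text a model might be prompted with."""
    names = ["x", "y"]
    lines = []
    for i, (op, left, right) in enumerate(steps, start=1):
        name = f"v{i}"
        lines.append(f"Step {i}: {name} = {names[left]} {op} {names[right]}")
        names.append(name)
    return "\n".join(lines)


# A hand-written 3-step example with inputs x = 3, y = 4.
example = [("+", 0, 1), ("*", 2, 0), ("-", 3, 1)]
print(render(example))
print(execute(example, 3, 4))  # v1 = 7, v2 = 21, v3 = 17 -> prints 17
```

Comparing a model's first stated answer against `execute(...)` for procedures of increasing `n_steps` would give a length-versus-accuracy curve of the kind the abstract reports, though the paper's exact scoring protocol is not specified here.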
