CL · AI · LO · Mar 19

Correct Chains, Wrong Answers: Dissociating Reasoning from Output in LLM Logic

arXiv:2604.13065 · 84.4 · h-index: 1
Predicted impact top 55% in CL · last 90 days · Originality: Incremental advance
AI Analysis

The paper addresses a challenge relevant to AI researchers: evaluating whether LLMs truly reason. The advance is incremental, since it builds on existing benchmarking methods.

The paper tackles the problem of distinguishing genuine reasoning from pattern retrieval in LLMs. It introduces the Novel Operator Test, a benchmark that separates operator logic from operator name, and shows that models can carry out correct chain-of-thought reasoning yet still produce wrong final answers. For example, at depth 7 all 31 of Claude Sonnet 4's errors have correct reasoning but a wrong declared answer.

LLMs can execute every step of chain-of-thought reasoning correctly and still produce wrong final answers. We introduce the Novel Operator Test, a benchmark that separates operator logic from operator name, enabling rigorous distinction between genuine reasoning and pattern retrieval. By evaluating Boolean operators under unfamiliar names across depths 1-10 on five models (up to 8,100 problems each), we demonstrate a reasoning-output dissociation that existing benchmarks cannot detect. At Claude Sonnet 4's depth 7, all 31 errors have verifiably correct reasoning yet wrong declared answers; 17/19 errors in mixed-operator chains exhibit the same pattern. The benchmark reveals two failure types: strategy failures at depth 2, where models attempt terse retrieval (+62pp from scaffolding), and content failures at depth 7, where models reason fully but err systematically (+8-30pp, 0/300 errors post-intervention). A Trojan operator (XOR's truth table under a novel name) confirms name alone does not gate reasoning (p >= 0.49), while Llama's novelty gap widens to 28pp at depth 8-9 with the Trojan at 92-100%, isolating genuine difficulty with novel logic from name unfamiliarity.
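To make the setup concrete, here is a minimal sketch of what a novel-operator problem and a dissociation check might look like. The operator names (`zorp`, `blick`), the left-nested chain shape, and the three outcome labels are all assumptions made for illustration; this listing does not specify the paper's actual problem format or scoring code.

```python
import random
from typing import Callable, Dict, List, Tuple

# Hypothetical novel names bound to familiar Boolean truth tables.
# "zorp" acts like a Trojan operator: XOR's truth table under a new name.
OPERATORS: Dict[str, Callable[[bool, bool], bool]] = {
    "zorp": lambda a, b: a != b,          # XOR
    "blick": lambda a, b: not (a and b),  # NAND
}


def random_inputs(depth: int, rng: random.Random) -> List[bool]:
    """A chain of `depth` operator applications needs depth + 1 inputs."""
    return [rng.choice([True, False]) for _ in range(depth + 1)]


def evaluate_chain(op_name: str, inputs: List[bool]) -> Tuple[bool, List[bool]]:
    """Fold the operator left to right; return the answer and each step's value."""
    op = OPERATORS[op_name]
    acc = inputs[0]
    trace: List[bool] = []
    for x in inputs[1:]:
        acc = op(acc, x)
        trace.append(acc)
    return acc, trace


def classify(gold_trace: List[bool], gold_answer: bool,
             model_trace: List[bool], model_answer: bool) -> str:
    """Label one response (assumed labels, not the paper's taxonomy)."""
    if model_answer == gold_answer:
        return "correct"
    if model_trace == gold_trace:
        # Every intermediate step matches, but the declared answer is wrong.
        return "dissociation"
    return "reasoning_error"


if __name__ == "__main__":
    rng = random.Random(0)
    inputs = random_inputs(depth=7, rng=rng)
    answer, trace = evaluate_chain("zorp", inputs)
    # Simulate a model that reasons correctly but declares the flipped answer.
    print(classify(trace, answer, list(trace), not answer))
```

In this sketch a response counts as a dissociation only when every intermediate value matches the gold trace while the declared answer does not, which mirrors the "correct reasoning yet wrong declared answers" pattern the abstract describes.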

