Even Heads Fix Odd Errors: Mechanistic Discovery and Surgical Repair in Transformer Attention
This mechanistic finding reveals hidden substructure in transformer attention and offers insights for the interpretability and efficiency of AI models, though its scope is incremental because it focuses on a specific bug in a single model.
The study identifies a format-dependent reasoning failure in Llama-3.1-8B-Instruct, which incorrectly compares numbers like '9.11' and '9.8' in certain prompt formats. It finds that even-indexed attention heads specialize in numerical comparison, with a sharp threshold: at least 8 such heads at a specific layer are required for perfect repair.
We present a mechanistic case study of a format-dependent reasoning failure in Llama-3.1-8B-Instruct, where the model incorrectly judges "9.11" as larger than "9.8" in chat or Q&A formats but answers correctly in a simple format. Through systematic intervention, we discover that transformers implement even/odd attention head specialization: even-indexed heads handle numerical comparison, while odd-indexed heads serve incompatible functions. Repairing the bug requires at least 8 even heads at Layer 10. Any combination of 8 or more even heads succeeds, while 7 or fewer fails completely, revealing a sharp computational threshold with perfect redundancy among the 16 even heads. Sparse autoencoder (SAE) analysis reveals the mechanism: format representations separate (10% feature overlap at Layer 7), then re-entangle with different weightings (80% feature overlap at Layer 10), with specific features showing 1.5x amplification in failing formats. We achieve perfect repair using only 25% of attention heads and identify a 60% pattern replacement threshold, demonstrating that apparent full-module requirements hide sophisticated substructure with implications for interpretability and efficiency. All of our code is available at https://github.com/gussand/surgeon.
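The head-level intervention described above can be illustrated with a minimal sketch. It is not taken from the linked repository: the names `even_heads`, `candidate_subsets`, and `patch_heads` are hypothetical, per-head outputs are stood in for by plain list entries, and the count of 32 heads per layer is an assumption inferred from the 16 even heads mentioned in the abstract. The sketch only shows how subsets of even-indexed heads might be enumerated and swapped in from a correctly answering run; it does not reproduce the reported results.

```python
from itertools import combinations

NUM_HEADS = 32       # assumption: 32 heads per layer, implied by the 16 even heads
MIN_EVEN_HEADS = 8   # threshold reported in the abstract


def even_heads(num_heads=NUM_HEADS):
    """Return the indices of the even-indexed heads in one layer."""
    return [h for h in range(num_heads) if h % 2 == 0]


def candidate_subsets(k, num_heads=NUM_HEADS):
    """Yield every size-k subset of the even-indexed heads."""
    return combinations(even_heads(num_heads), k)


def patch_heads(failing, donor, heads):
    """Copy the failing run's per-head outputs, replacing the selected
    heads with the outputs from the donor (correctly answering) run."""
    chosen = set(heads)
    return [donor[h] if h in chosen else failing[h] for h in range(len(failing))]


if __name__ == "__main__":
    # Placeholder per-head outputs; a real experiment would use tensors
    # captured from the model at the layer of interest.
    failing_run = [f"fail_{h}" for h in range(NUM_HEADS)]
    donor_run = [f"ok_{h}" for h in range(NUM_HEADS)]

    subset = next(candidate_subsets(MIN_EVEN_HEADS))
    patched = patch_heads(failing_run, donor_run, subset)
    replaced = sum(1 for value in patched if value.startswith("ok_"))
    print(f"patched heads: {subset}")
    print(f"fraction of heads replaced: {replaced / NUM_HEADS:.2f}")
```

Under these assumptions, 8 of 32 heads corresponds to the 25% figure quoted in the abstract, and testing every size-8 subset would mean iterating over all items produced by `candidate_subsets(8)`.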