Understanding Addition and Subtraction in Transformers
This paper addresses arithmetic failures in large language models and offers a tractable case study for mechanistic interpretability, but the contribution is incremental because it builds directly on prior work on addition circuits.
The paper tackles the failure of transformers at basic arithmetic tasks such as multidigit addition and subtraction, showing that small transformers trained from scratch can reach 99.999% accuracy on them. In a survey of 180 publicly available LLMs, it also finds that only 7% can reliably perform addition, highlighting a gap between specialized small models and general-purpose ones.
Transformers are widely deployed in large language models (LLMs), yet most models still fail on basic arithmetic tasks such as multidigit addition. In contrast, we show that small transformers trained from scratch can solve n-digit addition and subtraction with 99.999% accuracy. Building directly on prior work that uncovered addition circuits, we extend the analysis to subtraction and present a unified mechanistic account based on cascading carry and borrow circuits. Using a suite of 49 trained models, we apply systematic ablations and node-level constraints to validate the learned mechanisms and release a reproducible interpretability toolkit for studying arithmetic circuits. Finally, surveying 180 publicly available LLMs, we find that only 7% can reliably perform addition, underscoring the gap between specialized small models and general-purpose LLMs. Our results show that arithmetic can be implemented exactly by tiny transformers, offering a tractable case study for mechanistic interpretability and a cautionary contrast with the persistent arithmetic failures of much larger models.
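To make "cascading carry and borrow" concrete, the sketch below performs ordinary digit-by-digit addition and subtraction and records the carry or borrow produced at each position. It illustrates only the arithmetic that such circuits must reproduce; it is not the paper's model, toolkit, or learned mechanism. The function names (`add_digits`, `sub_digits`, `to_int`) are hypothetical and chosen for illustration, and subtraction assumes the first operand is at least as large as the second.

```python
def split_digits(x, n):
    """Return the n lowest decimal digits of x, least significant first."""
    return [(x // 10**i) % 10 for i in range(n)]


def to_int(digits):
    """Rebuild an integer from digits stored least significant first."""
    return sum(d * 10**i for i, d in enumerate(digits))


def add_digits(a, b, n):
    """Add two n-digit non-negative integers one position at a time.

    Returns (answer_digits, carries). The answer has n + 1 digits, and
    carries[i] is the carry passed out of position i.
    """
    out, carries, carry = [], [], 0
    for da, db in zip(split_digits(a, n), split_digits(b, n)):
        s = da + db + carry
        out.append(s % 10)
        carry = s // 10          # 1 if this position overflows, else 0
        carries.append(carry)
    out.append(carry)            # final carry becomes the top digit
    return out, carries


def sub_digits(a, b, n):
    """Subtract b from a (assumes a >= b) one position at a time.

    Returns (answer_digits, borrows), where borrows[i] is the borrow
    that position i takes from position i + 1.
    """
    out, borrows, borrow = [], [], 0
    for da, db in zip(split_digits(a, n), split_digits(b, n)):
        d = da - db - borrow
        borrow = 1 if d < 0 else 0
        out.append(d % 10)       # Python's % maps -1 to 9
        borrows.append(borrow)
    return out, borrows


if __name__ == "__main__":
    # A carry created at the lowest digit cascades through every 9.
    digits, carries = add_digits(999, 1, 3)
    print(to_int(digits), carries)

    # A borrow created at the lowest digit cascades through every 0.
    digits, borrows = sub_digits(1000, 1, 4)
    print(to_int(digits), borrows)
```

The cascading cases are the hard ones: in 999 + 1 the top answer digit depends on the lowest input digits, so the carry decision at one position cannot be made from that position's digits alone.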