CL · Sep 11, 2025

All for One: LLMs Solve Mental Math at the Last Token With Information Transferred From Other Tokens

Siddarth Mamidanna, Daking Rai, Ziyu Yao, Yilun Zhou

arXiv:2509.09650v1 · 4 citations · h-index: 5 · EMNLP

Originality: Incremental advance

AI Analysis

This work offers insight into how LLMs carry out computational tasks. The advance is incremental, but it clarifies the underlying mechanisms for researchers in interpretability and model design.

The paper investigates how large language models (LLMs) perform mental math by identifying a computational subgraph, All-for-One (AF1), in which meaningful computation occurs late and only at the last token. That token receives information from other tokens in a few middle layers, and the subgraph retains high accuracy across various models and arithmetic expressions.

Large language models (LLMs) demonstrate proficiency across numerous computational tasks, yet their inner workings remain unclear. In theory, the combination of causal self-attention and multilayer perceptron layers allows every token to access and compute information based on all preceding tokens. In practice, to what extent are such operations present? In this paper, on mental math tasks (i.e., direct math calculation via next-token prediction without explicit reasoning), we investigate this question in three steps: inhibiting input-specific token computations in the initial layers, restricting the routes of information transfer across token positions in the next few layers, and forcing all computation to happen at the last token in the remaining layers. With two proposed techniques, Context-Aware Mean Ablation (CAMA) and Attention-Based Peeking (ABP), we identify an All-for-One subgraph (AF1) with high accuracy on a wide variety of mental math tasks, where meaningful computation occurs very late (in terms of layer depth) and only at the last token, which receives information from other tokens in a few specific middle layers. Experiments on a variety of models and arithmetic expressions show that this subgraph is sufficient and necessary for high model performance, transfers across different models, and works on a variety of input styles. Ablations on different CAMA and ABP alternatives reveal their unique advantages over other methods, which may be of independent interest.
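The three-step layer schedule described in the abstract can be pictured as a sequence of per-layer attention masks. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the phase boundaries (`n_early`, `n_transfer`), and the exact mask shapes are all hypothetical, and `mean_ablate` is plain mean ablation over a reference batch, used here only as a stand-in because the abstract does not define CAMA or ABP in detail.

```python
import numpy as np

def causal_mask(n_tokens):
    # Standard causal mask: token i may attend to tokens 0..i.
    return np.tril(np.ones((n_tokens, n_tokens), dtype=bool))

def last_token_peek_mask(n_tokens):
    # Assumed transfer-phase mask: every token attends to itself,
    # and only the last token may also read all earlier tokens.
    mask = np.eye(n_tokens, dtype=bool)
    mask[-1, :] = True
    return mask

def self_only_mask(n_tokens):
    # Assumed late-phase mask: no cross-token transfer, so the last
    # token computes using only its own residual stream.
    return np.eye(n_tokens, dtype=bool)

def layer_masks(n_tokens, n_layers, n_early, n_transfer):
    # Hypothetical schedule: n_early initial layers, then n_transfer
    # transfer layers, then last-token-only layers.
    masks = []
    for layer in range(n_layers):
        if layer < n_early:
            # Routing is left unchanged here; the paper suppresses
            # input-specific computation with CAMA instead, which
            # this sketch does not reproduce.
            masks.append(causal_mask(n_tokens))
        elif layer < n_early + n_transfer:
            masks.append(last_token_peek_mask(n_tokens))
        else:
            masks.append(self_only_mask(n_tokens))
    return np.stack(masks)

def mean_ablate(acts, reference_acts):
    # Plain mean ablation (NOT CAMA): replace each position's activation
    # with its mean over a reference batch.
    # acts: (n_tokens, d); reference_acts: (n_examples, n_tokens, d)
    mean = reference_acts.mean(axis=0)
    return np.broadcast_to(mean, acts.shape).copy()
```

For example, `layer_masks(4, 6, 2, 2)` yields a `(6, 4, 4)` boolean array in which layers 0 and 1 are causal, layers 2 and 3 let only the last token read the other positions, and layers 4 and 5 permit no cross-token attention at all.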
