CL, Nov 12, 2025

Where does an LLM begin computing an instruction?

arXiv:2511.10694v2
h-index: 9
Originality: Incremental advance
AI Analysis

This work addresses a fundamental question for researchers studying LLM behavior. It is an incremental advance: it builds on existing activation patching techniques to provide a specific measurement approach.

The paper asks where in a large language model's layer stack instruction following begins. It introduces three simple datasets and uses activation patching to measure layer-wise flip rates. The result is an inflection point, termed 'onset', observed across Llama family models: interventions that change predictions before this point become largely ineffective after it. This gives a replicable way to compare the onset location across tasks and model sizes.

Following an instruction involves distinct sub-processes, such as reading content, reading the instruction, executing it, and producing an answer. We ask where, along the layer stack, instruction following begins: the point where reading gives way to doing. We introduce three simple datasets (Key-Value, Quote Attribution, Letter Selection) and two-hop compositions of these tasks. Using activation patching on minimal-contrast prompt pairs, we measure a layer-wise flip rate that indicates when substituting selected residual activations changes the predicted answer. Across models in the Llama family, we observe an inflection point, which we term onset, where interventions that change predictions before this point become largely ineffective afterward. Multi-hop compositions show a similar onset location. These results provide a simple, replicable way to locate where instruction following begins and to compare this location across tasks and model sizes.
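
The measurement described in the abstract can be illustrated with a toy sketch. This is not the authors' code and uses no real LLM: the two-position "model", the constant READ_LAYER, the flip criterion (patched prediction differs from the clean prediction), and the 0.5 threshold used to read off the onset are all assumptions made for illustration. It shows only the general shape of the procedure: cache residual activations from a contrast prompt, substitute them into the clean run one layer at a time, and record how often the answer flips.

```python
# Toy activation-patching sketch (hypothetical model, not the paper's setup).
N_LAYERS = 8
READ_LAYER = 5   # assumed layer at which the answer position reads the instruction
N_ANSWERS = 4    # toy vocabulary of candidate answers


def run(instr_id, patch=None):
    """Run the toy model on one prompt.

    patch = (layer, vector) replaces the instruction-position residual
    entering that layer. Returns (predicted answer, per-layer cache of the
    instruction-position residual).
    """
    instr = [0.0] * N_ANSWERS
    instr[instr_id] = 1.0            # one-hot "embedding" of the instruction
    answer = [0.0] * N_ANSWERS       # residual at the answer position
    cache = []
    for layer in range(N_LAYERS):
        if patch is not None and patch[0] == layer:
            instr = list(patch[1])   # substitute the cached activation
        cache.append(list(instr))
        if layer == READ_LAYER:
            # Stand-in for attention: the answer position copies the instruction.
            answer = [a + i for a, i in zip(answer, instr)]
    pred = max(range(N_ANSWERS), key=lambda k: answer[k])
    return pred, cache


def flip_rates(pairs):
    """Fraction of (clean, contrast) pairs whose prediction changes per patched layer."""
    rates = [0.0] * N_LAYERS
    for clean, contrast in pairs:
        clean_pred, _ = run(clean)
        _, contrast_cache = run(contrast)
        for layer in range(N_LAYERS):
            patched_pred, _ = run(clean, patch=(layer, contrast_cache[layer]))
            rates[layer] += patched_pred != clean_pred
    return [r / len(pairs) for r in rates]


def onset(rates, threshold=0.5):
    """First layer whose flip rate falls below the (assumed) threshold, else None."""
    for layer, rate in enumerate(rates):
        if rate < threshold:
            return layer
    return None


# Minimal-contrast pairs: prompts that differ only in the instruction id.
PAIRS = [(0, 1), (1, 2), (2, 3), (3, 0)]
rates = flip_rates(PAIRS)
print(rates)
print(onset(rates))
```

In this toy, patching flips every prediction up to and including READ_LAYER and has no effect afterward, so the flip rate drops from 1.0 to 0.0 and the reported onset is layer 6. With a real model the curve would be noisier, and the choice of patched positions and threshold would need to follow the paper's definitions.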
