Distributional Associations vs In-Context Reasoning: A Study of Feed-forward and Attention Layers
This study offers insight into the internal mechanisms of large language models, an incremental contribution for researchers in AI interpretability.
The paper examines how distributional associations and in-context reasoning are divided between the components of Transformer models. It finds that feed-forward layers tend to learn simple distributional associations such as bigrams, while attention layers focus on in-context reasoning, and its theoretical analysis attributes this discrepancy to noise in the gradients.
Large language models have been successful at tasks involving basic forms of in-context reasoning, such as generating coherent language, as well as storing vast amounts of knowledge. At the core of the Transformer architecture behind such models are feed-forward and attention layers, which are often associated with knowledge and reasoning, respectively. In this paper, we study this distinction empirically and theoretically in a controlled synthetic setting where certain next-token predictions involve both distributional and in-context information. We find that feed-forward layers tend to learn simple distributional associations such as bigrams, while attention layers focus on in-context reasoning. Our theoretical analysis identifies the noise in the gradients as a key factor behind this discrepancy. Finally, we illustrate how similar disparities emerge in pre-trained models through ablations on the Pythia model family on simple reasoning tasks.
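The abstract does not spell out the synthetic task, so the following is only a hypothetical illustration of a setting in which one prediction mixes distributional and in-context information. It assumes a designated trigger token whose successor is either a generic token shared by all sequences (a bigram-style association) or a target token specific to the current sequence (recoverable only from context). The names `make_sequence`, `trigger`, `noise_token`, and `noise_prob` are invented for this sketch and are not taken from the paper.

```python
import random


def make_sequence(length, vocab_size, trigger, noise_token, noise_prob, rng):
    """Generate one token sequence for a toy task (hypothetical setup).

    Each sequence draws its own target token. Whenever the trigger appears,
    the next token is the generic noise token with probability `noise_prob`
    (a distributional association shared across sequences), and otherwise the
    sequence-specific target (which can only be inferred from context).
    """
    # Ordinary tokens exclude the trigger and the generic noise token.
    ordinary = [t for t in range(vocab_size) if t not in (trigger, noise_token)]
    target = rng.choice(ordinary)

    seq = []
    while len(seq) < length:
        tok = rng.choice(ordinary + [trigger])
        seq.append(tok)
        # After a trigger, emit either the generic token or the target.
        if tok == trigger and len(seq) < length:
            follow = noise_token if rng.random() < noise_prob else target
            seq.append(follow)
    return seq, target


def successors(seq, trigger):
    """Return the tokens that immediately follow each trigger occurrence."""
    return [b for a, b in zip(seq, seq[1:]) if a == trigger]


if __name__ == "__main__":
    rng = random.Random(0)
    seq, target = make_sequence(
        length=64, vocab_size=16, trigger=0, noise_token=1, noise_prob=0.5, rng=rng
    )
    print("target:", target)
    print("tokens after trigger:", successors(seq, trigger=0))
```

In this sketch, predicting the generic token after the trigger needs only a global bigram statistic, whereas predicting the target requires looking back at earlier trigger occurrences in the same sequence. That is the kind of contrast the abstract draws between feed-forward and attention layers.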