LG · CL · Feb 7, 2025

Paying Attention to Facts: Quantifying the Knowledge Capacity of Attention Layers

arXiv:2502.05076v1 · h-index: 1
Originality: Synthesis-oriented
AI Analysis

This work quantifies the knowledge capacity of attention layers, a question of interest to machine learning researchers. Its contribution is incremental: it builds on existing linear-algebraic frameworks and evaluates on toy datasets.

The paper takes a linear-algebraic approach to how single-layer attention-only transformers memorize facts from databases. It defines tensor ranks that measure database size and attention layer capacity, and shows empirically how these ranks relate on toy models and random databases.

In this paper, we investigate the ability of single-layer attention-only transformers (i.e., attention layers) to memorize facts contained in databases from a linear-algebraic perspective. We associate with each database a 3-tensor, propose the rank of this tensor as a measure of the size of the database, and provide bounds on the rank in terms of properties of the database. We also define a 3-tensor corresponding to an attention layer, and empirically demonstrate the relationship between its rank and database rank on a dataset of toy models and random databases. By highlighting the roles played by the value-output and query-key weights, and the effects of argmax and softmax on rank, our results shed light on the 'additive motif' of factual recall in transformers, while also suggesting a way of increasing layer capacity without increasing the number of parameters.
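As a rough illustration of the linear-algebraic view described in the abstract, the sketch below builds a 0/1 3-tensor from (subject, relation, object) triples and computes simple bounds on its rank. The one-hot encoding, the function names, and the toy facts are assumptions made for illustration and may differ from the paper's construction. The bounds used here (unfolding ranks from below, nonzero count and pairwise dimension products from above) are standard facts about tensor rank, not the paper's own bounds.

```python
import numpy as np


def database_tensor(facts, n_subjects, n_relations, n_objects):
    """Build a 0/1 3-tensor with T[s, r, o] = 1 for each fact (s, r, o).

    Assumption: facts are integer-indexed triples; this encoding is
    hypothetical and not taken from the paper.
    """
    T = np.zeros((n_subjects, n_relations, n_objects))
    for s, r, o in facts:
        T[s, r, o] = 1.0
    return T


def unfolding_ranks(T):
    """Matrix ranks of the three mode unfoldings of a 3-tensor.

    Each unfolding rank is a lower bound on the tensor (CP) rank.
    """
    ranks = []
    for mode in range(3):
        # Move the chosen mode to the front and flatten the other two.
        M = np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)
        ranks.append(int(np.linalg.matrix_rank(M)))
    return ranks


def rank_bounds(T):
    """Return (lower, upper) bounds on the tensor rank of T."""
    n1, n2, n3 = T.shape
    lower = max(unfolding_ranks(T))
    # Each nonzero entry is itself a rank-one tensor, and the rank of an
    # n1 x n2 x n3 tensor never exceeds the product of any two dimensions.
    upper = min(int(np.count_nonzero(T)), n1 * n2, n1 * n3, n2 * n3)
    return lower, upper


# Toy database (hypothetical): 3 subjects, 2 relations, 3 objects, 4 facts.
facts = [(0, 0, 1), (1, 0, 2), (2, 1, 0), (0, 1, 2)]
T = database_tensor(facts, n_subjects=3, n_relations=2, n_objects=3)
lower, upper = rank_bounds(T)
print(f"tensor rank lies between {lower} and {upper}")
```

Computing the exact rank of a 3-tensor is hard in general, which is why this sketch reports only bounds; it does not model the attention-layer tensor or the effects of argmax and softmax discussed in the abstract.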

