AI · LG · Feb 2

Geometric Analysis of Token Selection in Multi-Head Attention

arXiv:2602.01893v1 · h-index: 3
AI Analysis

This work provides interpretability and design insights for attention mechanisms in LLMs, though it is incremental as it builds on existing attention frameworks without altering the mechanism.

The paper studies token selection in multi-head attention in large language models. It develops a geometric framework to analyze separability between selected and non-selected tokens, and reports empirical results showing that top-N selection sharpens separability and that heads in LLaMA-2-7B specialize into distinct regimes: Retriever, Mixer, and Reset.

We present a geometric framework for analysing multi-head attention in large language models (LLMs). Without altering the mechanism, we view standard attention through a top-N selection lens and study its behaviour directly in value-state space. We define geometric metrics - Precision, Recall, and F-score - to quantify separability between selected and non-selected tokens, and derive non-asymptotic bounds with explicit dependence on dimension and margin under empirically motivated assumptions (stable value norms with a compressed sink token, exponential similarity decay, and piecewise attention weight profiles). The theory predicts a small-N operating regime of strongest non-trivial separability and clarifies how sequence length and sink similarity shape the metrics. Empirically, across LLaMA-2-7B, Gemma-7B, and Mistral-7B, measurements closely track the theoretical envelopes: top-N selection sharpens separability, and sink similarity correlates with Recall. We also find that in LLaMA-2-7B, heads specialize into three regimes - Retriever, Mixer, Reset - with distinct geometric signatures. Overall, attention behaves as a structured geometric classifier with measurable criteria for token selection, offering head-level interpretability and informing geometry-aware sparsification and design of attention in LLMs.
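To make the top-N selection lens concrete, here is a minimal Python sketch under stated assumptions. The abstract does not give the paper's exact metric definitions, so the ones below are illustrative guesses rather than the authors' formulas. "Selected" tokens are taken to be the N tokens with the largest attention weights. A token counts as "near" if its value vector lies within a hypothetical `radius` of the attention output. Precision and Recall then compare those two sets. All function names and the toy data are invented for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_n_indices(weights, n):
    """Indices of the n largest attention weights (the 'selected' tokens)."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return set(order[:n])


def attention_output(weights, values):
    """Standard attention output: the weight-averaged value vector."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def separability(weights, values, n, radius):
    """Assumed metric definitions (not taken from the paper):
    precision = share of near tokens that are selected,
    recall    = share of selected tokens that are near."""
    selected = top_n_indices(weights, n)
    centre = attention_output(weights, values)
    near = {i for i, v in enumerate(values) if math.dist(centre, v) <= radius}
    hits = len(selected & near)
    precision = hits / len(near) if near else 0.0
    recall = hits / len(selected) if selected else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score


# Toy head: two tokens with high scores and similar values, two distractors.
values = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0], [0.0, -1.0]]
weights = softmax([4.0, 4.0, 0.0, 0.0])

for n in (1, 2, 3):
    print(n, separability(weights, values, n=n, radius=0.5))
```

In this toy setup the two high-weight tokens both sit close to the attention output, so the loop shows how the assumed Precision and Recall shift as N moves away from the number of genuinely relevant tokens. The sketch says nothing about the paper's bounds or about real model heads.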

