ML · LG · May 19, 2025

Attention-based clustering

arXiv:2505.13112v3 · 3 citations · h-index: 5
Originality: Incremental advance
AI Analysis

This paper provides theoretical insight into the clustering capabilities of transformers. It is an incremental advance, of interest mainly to researchers in machine learning theory.

The paper studies how transformers can extract structure from data in unsupervised settings. It shows that, when the data come from a Gaussian mixture model, the heads of a simplified two-head attention layer can align with the true mixture centroids, and that even a non-trainable attention layer can perform in-context quantization.

Transformers have emerged as a powerful neural network architecture capable of tackling a wide range of learning tasks. In this work, we provide a theoretical analysis of their ability to automatically extract structure from data in an unsupervised setting. In particular, we demonstrate their suitability for clustering when the input data is generated from a Gaussian mixture model. To this end, we study a simplified two-head attention layer and define a population risk whose minimization with unlabeled data drives the head parameters to align with the true mixture centroids. This phenomenon highlights the ability of attention-based layers to capture underlying distributional structure. We further examine an attention layer with key, query, and value matrices fixed to the identity, and show that, even without any trainable parameters, it can perform in-context quantization, revealing the surprising capacity of transformer-based methods to adapt dynamically to input-specific distributions.
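The identity-attention claim in the abstract can be made concrete with a small numerical sketch. The code below is an illustration under stated assumptions, not the paper's construction: the symmetric two-component mixture with centroids at plus and minus `centroid`, the softmax `temperature`, the sample size, and all function names are hypothetical choices made for this example. With the query, key, and value matrices set to the identity, each output token is a softmax-weighted average of the input tokens, with weights given by pairwise inner products.

```python
import numpy as np


def identity_attention(tokens, temperature=1.0):
    """Softmax self-attention with query, key, and value matrices set to the identity.

    `temperature` is an assumption of this sketch, not a parameter taken from the paper.
    """
    scores = tokens @ tokens.T / temperature              # (n, n) pairwise inner products
    scores = scores - scores.max(axis=1, keepdims=True)   # stabilise the softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # each row sums to 1
    return weights @ tokens                               # weighted mean of the input tokens


def sample_symmetric_mixture(n, centroid, noise_std, rng):
    """Draw n points from an equal-weight mixture of N(+centroid, s^2 I) and N(-centroid, s^2 I)."""
    signs = rng.choice([-1.0, 1.0], size=n)
    noise = noise_std * rng.standard_normal((n, centroid.size))
    return signs[:, None] * centroid[None, :] + noise, signs


def mean_distance_to_centroid(points, signs, centroid):
    """Average Euclidean distance from each point to the centroid of its own component."""
    diffs = points - signs[:, None] * centroid[None, :]
    return float(np.linalg.norm(diffs, axis=1).mean())


rng = np.random.default_rng(0)
centroid = np.array([2.0, 0.0])   # hypothetical centroid; the mixture is at +/- this vector
tokens, signs = sample_symmetric_mixture(400, centroid, noise_std=0.3, rng=rng)
outputs = identity_attention(tokens)

dist_before = mean_distance_to_centroid(tokens, signs, centroid)
dist_after = mean_distance_to_centroid(outputs, signs, centroid)
print(f"mean distance to own centroid: input {dist_before:.3f}, output {dist_after:.3f}")
```

In this toy setting the layer has no trainable parameters, yet tokens from the same component attend almost exclusively to each other, so the outputs should collapse into two tight groups near the two centroids. How close they land depends on the noise level and the temperature, both of which are choices of this sketch.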

