LG AI | May 22

Representation Alignment Rests on Linear Structure

Kiril Bangachev, Guy Bresler, Yury Polyanskiy

arXiv:2605.28870 | 67.3 | h-index: 5

AI Analysis

For researchers studying representation learning and alignment in AI, this work offers a mechanistic explanation of why representations align across models. It is incremental in that it builds on the existing Platonic and Linear Representation Hypotheses.

The paper proposes that the alignment described by the Platonic Representation Hypothesis (PRH) arises from linear object-attribute features. It reports that sparse representations extracted with sparse autoencoders often show stronger cross-modal alignment than their dense counterparts, that centering and normalization partially mitigate model-specific bias, and that data scarcity drives representational noise. A statistical model combining signal, bias, and noise refines the Linear Representation Hypothesis (LRH) to explain alignment across diverse AI architectures.

We investigate the Platonic Representation Hypothesis (PRH) through a tripartite statistical framework of representations: signal, bias, and noise. (1) Signal: We propose that Platonic alignment arises from the universal relationship between objects and attributes, which is encoded linearly in representations according to the Linear Representation Hypothesis (LRH). We provide evidence that LRH helps explain PRH by extracting linear object-attribute features with sparse autoencoders and showing that these sparse representations often exhibit stronger cross-modal alignment than their dense counterparts. (2) Bias: Models have different implicit biases due to the diverse architectures and training procedures used. We show that this difference can be partially mitigated: centering and normalization consistently improve cross-model alignment. (3) Noise: Finite-sample training leads to noise in representations. We provide evidence that representational noise is driven by data scarcity by revealing a strong and consistent positive correlation between word frequency and alignment in LLMs and text embedding models. Synthesizing signal, bias, and noise, we propose a statistical model that refines the Linear Representation Hypothesis and explains further phenomena related to the alignment of representations emerging from diverse modern AI architectures.
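The bias point in the abstract can be illustrated with a small sketch. The abstract does not say which alignment metric the authors use, so the mutual k-nearest-neighbor overlap below is an assumption, as are the helper names `center_and_normalize`, `mutual_knn_alignment`, and `toy_model`. The data are synthetic: two hypothetical "models" apply different rotations, offsets, and noise to the same underlying signal. This is a toy under those assumptions, not the paper's experiment.

```python
import numpy as np


def center_and_normalize(X):
    """Subtract the mean row, then rescale every row to unit L2 norm."""
    Xc = X - X.mean(axis=0, keepdims=True)
    norms = np.linalg.norm(Xc, axis=1, keepdims=True)
    return Xc / np.clip(norms, 1e-12, None)


def mutual_knn_alignment(A, B, k=10):
    """Mean fraction of shared k-nearest neighbors (by inner product)
    between two representations of the same n objects."""
    def knn(X):
        S = X @ X.T
        np.fill_diagonal(S, -np.inf)  # an object is not its own neighbor
        return np.argsort(-S, axis=1)[:, :k]

    na, nb = knn(A), knn(B)
    return float(np.mean([len(set(a) & set(b)) / k for a, b in zip(na, nb)]))


def toy_model(Z, rng, bias_scale=5.0, noise_scale=0.1):
    """Hypothetical model: rotate the shared signal, add an offset and noise."""
    d = Z.shape[1]
    Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # random orthogonal map
    bias = bias_scale * rng.normal(size=(1, d))    # model-specific offset
    noise = noise_scale * rng.normal(size=Z.shape)
    return Z @ Q + bias + noise


# Synthetic stand-in data (illustrative only, not from the paper).
rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 16))  # shared "signal" for 200 objects
A = toy_model(Z, rng)
B = toy_model(Z, rng)

before = mutual_knn_alignment(A, B)
after = mutual_knn_alignment(center_and_normalize(A), center_and_normalize(B))
print(f"raw: {before:.3f}  centered+normalized: {after:.3f}")
```

In this toy setup the large model-specific offsets dominate the raw inner products, so each model's neighbor lists are driven mostly by its own offset direction. Centering removes that offset, and the two models' neighbor sets then agree far more closely.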
