Your CLIP has 164 dimensions of noise: Exploring the embeddings covariance eigenspectrum of contrastively pretrained vision-language transformers

Jakub Grzywaczewski, Dawid Płudowski, Przemysław Biecek

arXiv:2605.1489319.9

AI Analysis

For researchers using VLMs as feature extractors, this work reveals that a substantial portion of the latent space is non-semantic noise, enabling more efficient and robust representations.

The paper shows that contrastively pretrained VLMs have a large subspace of shared noise in their embeddings (e.g., 164 dimensions for CLIP), and pruning this noise preserves or improves downstream performance.

Contrastively pre-trained Vision-Language Models (VLMs) serve as powerful feature extractors. Yet, their shared latent spaces are prone to structural anomalies and act as repositories for non-semantic, multi-modal noise. To address this phenomenon, we employ spectral decomposition of covariance matrices to decompose the VLM latent space into a multi-modal semantic signal component and a shared noise subspace. We observe that this noise geometry exhibits strong subgroup invariance across distinct data subsets. Crucially, pruning these shared noise dimensions is mainly harmless, preserving or actively improving downstream task performance. By isolating true semantic signals from artifactual noise, this work provides new mechanistic insights into the representational structure of modern VLMs, suggesting that a substantial fraction of their latent geometry is governed by shared, architecture-level noise rather than task-relevant semantics alone.

View on arXiv PDF

Similar