CV · Jun 2

Beyond Compression: Quantifying Spectral Accessibility in Vision Representations

arXiv:2606.03795 · 32.3 · h-index: 7
Predicted impact: top 84% in CV · last 90 days
Originality: Synthesis-oriented
AI Analysis

For researchers studying representation learning in vision-language models, this work offers a new analysis of how spectral structure is transformed inside the encoder, though the contribution is incremental.

This paper investigates how vision-language models alter the spectral structure of visual representations, finding that spectral accessibility peaks at intermediate layers and that CLIP's projection is spectrally neutral while DINOv2's pooling induces structured spectral loss.

Vision-language models map visual features into a shared embedding space through learned projection layers, yet it remains unclear how these transformations alter the structure of visual information. This study examines changes in representation through spatial-frequency accessibility, measured by the linear recoverability of band-limited Fourier energy from model representations. To isolate effects beyond dimensionality reduction, we introduce Residual Spectral Loss (RSL), which evaluates changes relative to a dimension-matched random projection baseline. To reduce confounding effects from optimization, the analysis uses pretrained models with all parameters frozen. The experimental results show consistent frequency-dependent changes in accessibility across CLIP and DINOv2 on ImageNet and MS-COCO datasets. Spectral accessibility follows a non-monotonic trajectory across depth, peaking at intermediate layers before decreasing toward the output representation. The final transformation differs across architectures: CLIP's learned projection is spectrally neutral, with changes explained by compression, whereas DINOv2's [CLS] pooling induces a structured loss across the spectrum. These findings identify intermediate layers and pooling mechanisms as primary drivers of spectral transformation in modern vision encoders.
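The abstract does not give a formula for Residual Spectral Loss, so the following is only a minimal sketch of how such a measurement could be set up, not the authors' implementation. It assumes that accessibility is the held-out R² of a ridge-regression probe predicting log band energy, and that RSL is the accessibility of a dimension-matched random projection minus the accessibility of the learned output (positive values mean loss beyond compression). The function names `band_energy`, `probe_r2`, and `residual_spectral_loss`, the radial band layout, and the sign convention are all hypothetical.

```python
import numpy as np

def band_energy(images, n_bands=4):
    """Log Fourier energy per image in concentric radial bands (assumed target)."""
    n, h, w = images.shape
    power = np.abs(np.fft.fftshift(np.fft.fft2(images), axes=(-2, -1))) ** 2
    yy, xx = np.indices((h, w))
    radius = np.hypot(yy - h // 2, xx - w // 2)
    edges = np.linspace(0.0, radius.max() + 1e-9, n_bands + 1)
    bands = [power[:, (radius >= lo) & (radius < hi)].sum(axis=1)
             for lo, hi in zip(edges[:-1], edges[1:])]
    return np.log1p(np.stack(bands, axis=1))  # shape (n, n_bands)

def probe_r2(feats, targets, ridge=1e-3, train_frac=0.8):
    """Held-out R^2 per band for a ridge-regression (linear) probe."""
    n = feats.shape[0]
    split = int(train_frac * n)
    mu = feats[:split].mean(axis=0)
    sd = feats[:split].std(axis=0) + 1e-8
    x = np.hstack([(feats - mu) / sd, np.ones((n, 1))])  # standardize, add bias
    xtr, xte = x[:split], x[split:]
    ytr, yte = targets[:split], targets[split:]
    w = np.linalg.solve(xtr.T @ xtr + ridge * np.eye(x.shape[1]), xtr.T @ ytr)
    resid = ((yte - xte @ w) ** 2).sum(axis=0)
    total = ((yte - yte.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - resid / total

def residual_spectral_loss(pre, post, targets, rng):
    """Assumed RSL: random-projection accessibility minus learned-output accessibility."""
    d_in, d_out = pre.shape[1], post.shape[1]
    # Dimension-matched Gaussian random projection of the pre-transform features.
    baseline = pre @ rng.standard_normal((d_in, d_out)) / np.sqrt(d_in)
    return probe_r2(baseline, targets) - probe_r2(post, targets)
```

Under this convention, a transformation whose RSL is near zero in every band loses no more than compression to the same width would (the behaviour reported for CLIP's projection), while consistently positive values across bands would correspond to the structured loss reported for DINOv2's [CLS] pooling. In practice `pre` and `post` would be frozen-model features before and after the final transformation; averaging over several random projections would make the baseline less noisy.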

