LG · AI · Mar 28

On the Spectral Geometry of Cross-Modal Representations: A Functional Map Diagnostic for Multimodal Alignment

arXiv:2604.08579 · 1.3
AI Analysis

For researchers studying multimodal representation alignment, the paper identifies a fundamental structural property, the spectral complexity–orientation gap, that explains the limitations of spectral alignment methods. The findings are primarily diagnostic: the proposed framework does not improve retrieval performance.

The paper investigates cross-modal alignment between vision and language encoders using functional maps, finding that while the eigenvalue spectra of the two modalities are similar (normalized spectral distance 0.043), the eigenvector bases are unaligned (diagonal dominance <0.05, orthogonality error 70.15), revealing a spectral complexity–orientation gap. The functional map framework underperforms existing methods for cross-modal retrieval.
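To make the eigenvalue comparison concrete, here is a minimal NumPy sketch of one way to compute a normalized spectral distance between two sets of embeddings. The summary does not give the paper's exact construction, so everything here is an assumption: the cosine kNN graph, the symmetric normalized Laplacian, the values of `k` and `m`, and the normalization by the sum of the two spectrum norms. The function names are hypothetical.

```python
import numpy as np

def knn_laplacian(X, k=10):
    """Symmetric normalized Laplacian of a cosine kNN graph (assumed construction)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T
    np.fill_diagonal(S, -np.inf)              # exclude self-matches
    idx = np.argsort(-S, axis=1)[:, :k]       # k most similar points per row
    n = X.shape[0]
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k), idx.ravel()] = 1.0
    W = np.maximum(W, W.T)                    # symmetrize the adjacency
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1)) # every degree is at least k
    return np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

def laplacian_spectrum(X, k=10, m=20):
    """Return the m smallest eigenvalues and their eigenvectors."""
    evals, evecs = np.linalg.eigh(knn_laplacian(X, k))
    return evals[:m], evecs[:, :m]

def normalized_spectral_distance(ev_a, ev_b, eps=1e-12):
    """Assumed normalization: ||a - b|| / (||a|| + ||b||), which lies in [0, 1]."""
    num = np.linalg.norm(ev_a - ev_b)
    den = np.linalg.norm(ev_a) + np.linalg.norm(ev_b) + eps
    return float(num / den)
```

Under this convention, a value near 0 means the two graphs have nearly identical low-frequency spectra. The eigenvalues alone carry no information about whether the corresponding eigenvectors line up.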

We study cross-modal alignment between independently pretrained vision (DINOv2) and language (all-MiniLM-L6-v2) encoders using the functional map framework from computational geometry, which represents correspondence between representation manifolds as a compact linear operator between graph Laplacian eigenbases. While the framework underperforms Procrustes alignment and relative representations for cross-modal retrieval across all supervision budgets, it reveals a structural property of multimodal representations. We find that the Laplacian eigenvalue spectra of the two encoders are quantitatively similar (normalized spectral distance 0.043), indicating that independently trained models develop manifolds of comparable intrinsic complexity. However, the functional map exhibits near-zero diagonal dominance (mean below 0.05) and large orthogonality error (70.15), showing that the eigenvector bases are effectively unaligned. We term this decoupling the spectral complexity–orientation gap: models converge in how much structure they capture but not in how they organize it. This gap defines a boundary condition for spectral alignment methods and motivates three diagnostic quantities for characterizing cross-modal representation compatibility: diagonal dominance, orthogonality deviation, and Laplacian commutativity error.
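The three diagnostics can be sketched in a few lines of NumPy. The abstract does not state the paper's exact formulas, so the definitions below are assumptions that follow common functional-map conventions: a least-squares fit of the map from paired anchor points, row-wise diagonal energy for dominance, the Frobenius norm of C^T C - I for orthogonality, and the Frobenius norm of C Λ_a - Λ_b C for commutativity. All function names are hypothetical.

```python
import numpy as np

def fit_functional_map(phi_a, phi_b, anchors):
    """Least-squares functional map C from basis A to basis B.

    phi_a, phi_b: (n, m) Laplacian eigenvectors of the two modalities,
    with row i of each describing the same paired sample.
    Solves phi_a[anchors] @ C.T ~= phi_b[anchors].
    """
    Ct, *_ = np.linalg.lstsq(phi_a[anchors], phi_b[anchors], rcond=None)
    return Ct.T

def diagonal_dominance(C, eps=1e-12):
    """Mean fraction of each row's energy that sits on the diagonal."""
    energy = C ** 2
    return float(np.mean(np.diag(energy) / (energy.sum(axis=1) + eps)))

def orthogonality_error(C):
    """Frobenius norm of C^T C - I; zero for an orthogonal map."""
    return float(np.linalg.norm(C.T @ C - np.eye(C.shape[1]), ord="fro"))

def commutativity_error(C, evals_a, evals_b):
    """Frobenius norm of C diag(evals_a) - diag(evals_b) C."""
    diff = C * evals_a[None, :] - evals_b[:, None] * C
    return float(np.linalg.norm(diff, ord="fro"))
```

As a sanity check on these assumed definitions, fitting a basis to itself gives C = I, with diagonal dominance 1 and both errors 0. Fitting a basis to a randomly rotated copy of itself keeps the eigenvalues identical but lowers the diagonal dominance, which is a toy version of similar spectra with misaligned eigenvectors.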
