CV · AI · LG · Apr 20

Back into Plato's Cave: Examining Cross-modal Representational Convergence at Scale

Berkeley
arXiv:2604.18572 · 98.11 citations · h-index: 111
Predicted impact: top 6% in CV · last 90 days
Originality: Incremental advance
AI Analysis

This work critically re-evaluates a widely cited hypothesis, showing its empirical support is fragile, which matters for researchers relying on cross-modal alignment in multimodal AI.

The authors challenge the Platonic Representation Hypothesis by showing that cross-modal alignment between vision and language models degrades significantly when scaling evaluation datasets from ~1K to millions of samples, and that reported alignment trends do not hold for newer models.

The Platonic Representation Hypothesis suggests that neural networks trained on different modalities (e.g., text and images) align and eventually converge toward the same representation of reality. If true, this has significant implications for whether modality choice matters at all. We show that the experimental evidence for this hypothesis is fragile and depends critically on the evaluation regime. Alignment is measured using mutual nearest neighbors on small datasets ($\approx$1K samples) and degrades substantially as the dataset is scaled to millions of samples. The alignment that remains between model representations reflects coarse semantic overlap rather than consistent fine-grained structure. Moreover, the evaluations in Huh et al. are done in a one-to-one image-caption setting, a constraint that breaks down in realistic many-to-many settings and further reduces alignment. We also find that the reported trend of stronger language models increasingly aligning with vision does not appear to hold for newer models. Overall, our findings suggest that the current evidence for cross-modal representational convergence is considerably weaker than subsequent works have taken it to be. Models trained on different modalities may learn equally rich representations of the world, just not the same one.
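
The mutual nearest-neighbor measurement mentioned in the abstract can be illustrated with a minimal sketch. It assumes paired samples (row i of each feature matrix describes the same item), cosine similarity, and a fixed neighborhood size k; the function names `knn_indices` and `mutual_knn_alignment` are hypothetical and this is not the authors' implementation. For each item, it takes the k nearest neighbors in each representation space and reports the average fraction of neighbors the two spaces share.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def knn_indices(feats, k):
    """For each row, return the set of indices of its k most cosine-similar rows."""
    unit = [_normalize(v) for v in feats]
    neighbors = []
    for i, u in enumerate(unit):
        sims = [
            (sum(a * b for a, b in zip(u, w)), j)
            for j, w in enumerate(unit)
            if j != i  # an item is never its own neighbor
        ]
        # Highest similarity first; the index breaks ties deterministically.
        sims.sort(key=lambda t: (-t[0], t[1]))
        neighbors.append({j for _, j in sims[:k]})
    return neighbors


def mutual_knn_alignment(feats_a, feats_b, k):
    """Mean fraction of k-nearest neighbors shared between two feature spaces.

    Assumption: feats_a[i] and feats_b[i] are representations of the same
    item (for example an image and its caption).
    """
    if len(feats_a) != len(feats_b):
        raise ValueError("feature sets must contain the same number of items")
    if not 0 < k < len(feats_a):
        raise ValueError("k must satisfy 0 < k < number of items")
    nn_a = knn_indices(feats_a, k)
    nn_b = knn_indices(feats_b, k)
    return sum(len(a & b) / k for a, b in zip(nn_a, nn_b)) / len(nn_a)


# Toy data: two tight pairs of points in space A ...
space_a = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
# ... and the same points paired up differently in space B.
space_b = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]

same = mutual_knn_alignment(space_a, space_a, k=1)      # identical spaces: 1.0
shuffled = mutual_knn_alignment(space_a, space_b, k=1)  # no shared neighbors: 0.0
```

The score depends on the candidate pool as well as on the two models: every item's neighbors are drawn from all other items in the evaluation set, so growing the set from about 1K samples to millions changes which neighbors compete. That pool size is the evaluation setting the abstract reports as critical.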

