AI | Oct 5, 2025

Representation Potentials of Foundation Models for Multimodal Alignment: A Survey

arXiv:2510.05184v1 | 9 citations | h-index: 5 | EMNLP
Originality: Synthesis-oriented
AI Analysis

This is an incremental survey synthesizing existing evidence on representation potentials for researchers in multimodal AI.

This survey investigates the representation potentials of foundation models, finding that their learned representations exhibit structural regularities and semantic consistencies across modalities, positioning them as strong candidates for cross-modal alignment and transfer.

Foundation models learn highly transferable representations through large-scale pretraining on diverse data. An increasing body of research indicates that these representations exhibit a remarkable degree of similarity across architectures and modalities. In this survey, we investigate the representation potentials of foundation models, defined as the latent capacity of their learned representations to capture task-specific information within a single modality while also providing a transferable basis for alignment and unification across modalities. We begin by reviewing representative foundation models and the key metrics that make alignment measurable. We then synthesize empirical evidence of representation potentials from studies in vision, language, speech, multimodality, and neuroscience. The evidence suggests that foundation models often exhibit structural regularities and semantic consistencies in their representation spaces, positioning them as strong candidates for cross-modal transfer and alignment. We further analyze the key factors that foster representation potentials, discuss open questions, and highlight potential challenges.
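As an illustration of what a metric that makes alignment measurable can look like, the sketch below computes linear centered kernel alignment (CKA) between two sets of representations of the same inputs. The abstract does not name specific metrics, so the choice of linear CKA is an assumption made for illustration, as are the function name `linear_cka` and the synthetic data.

```python
import numpy as np


def linear_cka(X, Y):
    """Linear CKA between two representation matrices.

    X has shape (n_samples, d1), Y has shape (n_samples, d2); row i of each
    matrix must correspond to the same input. Returns a value in [0, 1].
    """
    # Center each feature so the score ignores constant offsets.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # Squared Frobenius norm of the cross-covariance, normalized by the
    # Frobenius norms of the two self-covariances.
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return cross / (norm_x * norm_y)


# Synthetic stand-ins for two models' embeddings (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
Q, _ = np.linalg.qr(rng.normal(size=(16, 16)))  # random orthogonal matrix
Y_rotated = 2.0 * X @ Q                 # same geometry, rotated and rescaled
Y_unrelated = rng.normal(size=(200, 16))  # independent features

print(linear_cka(X, Y_rotated))    # close to 1
print(linear_cka(X, Y_unrelated))  # much lower
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of either representation, which is why the rotated and rescaled copy scores close to 1 while independent features score much lower.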

