CV · Nov 26, 2024

Multimodal Alignment and Fusion: A Survey

arXiv:2411.17040v2139 citations · h-index: 1 · Int J Comput Vis
Originality: Synthesis-oriented
AI Analysis

The paper addresses the need for generalizable techniques in multimodal learning that improve scalability and robustness in applications such as social media analysis and medical imaging. As a survey, its contribution is incremental rather than novel.

This survey provides a comprehensive overview of recent advances in multimodal alignment and fusion in machine learning, categorizing approaches through structural and methodological frameworks based on over 260 studies.

This survey provides a comprehensive overview of recent advances in multimodal alignment and fusion within the field of machine learning, driven by the increasing availability and diversity of data modalities such as text, images, audio, and video. Unlike previous surveys that often focus on specific modalities or limited fusion strategies, our work presents a structure-centric and method-driven framework that emphasizes generalizable techniques. We systematically categorize and analyze key approaches to alignment and fusion through both structural perspectives -- data-level, feature-level, and output-level fusion -- and methodological paradigms -- including statistical, kernel-based, graphical, generative, contrastive, attention-based, and large language model (LLM)-based methods, drawing insights from an extensive review of over 260 relevant studies. Furthermore, this survey highlights critical challenges such as cross-modal misalignment, computational bottlenecks, data quality issues, and the modality gap, along with recent efforts to address them. Applications ranging from social media analysis and medical imaging to emotion recognition and embodied AI are explored to illustrate the real-world impact of robust multimodal systems. The insights provided aim to guide future research toward optimizing multimodal learning systems for improved scalability, robustness, and generalizability across diverse domains.
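The three structural perspectives named in the abstract (data-level, feature-level, and output-level fusion) differ mainly in where the modalities are combined. The toy sketch below illustrates that distinction only. It is not taken from the survey: the encoders, the linear scorer, and all function names are hypothetical stand-ins chosen to keep the example self-contained.

```python
from statistics import fmean


def text_encoder(tokens):
    # Hypothetical toy text features: [token count, mean token length].
    return [float(len(tokens)), fmean(len(t) for t in tokens)]


def image_encoder(pixels):
    # Hypothetical toy image features: [mean intensity, max intensity].
    return [fmean(pixels), max(pixels)]


def linear_score(features, weights):
    # Stand-in for a per-modality or joint predictor.
    return sum(f * w for f, w in zip(features, weights))


def data_level_fusion(tokens, pixels):
    # Combine (lightly numericized) raw inputs before any encoder runs.
    return [float(len(t)) for t in tokens] + list(pixels)


def feature_level_fusion(tokens, pixels):
    # Encode each modality separately, then concatenate the features.
    return text_encoder(tokens) + image_encoder(pixels)


def output_level_fusion(tokens, pixels, w_text, w_image):
    # Score each modality independently, then average the predictions.
    s_text = linear_score(text_encoder(tokens), w_text)
    s_image = linear_score(image_encoder(pixels), w_image)
    return (s_text + s_image) / 2


tokens = ["a", "cat"]
pixels = [0.0, 0.5, 1.0]
print(data_level_fusion(tokens, pixels))
print(feature_level_fusion(tokens, pixels))
print(output_level_fusion(tokens, pixels, [1.0, 0.0], [0.0, 1.0]))
```

In practice the encoders and scorer would be learned models; the point here is only that the fusion step moves from the input, to the intermediate representation, to the final prediction.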

