Sparsely Multimodal Data Fusion
The paper addresses multimodal integration for real-world applications with sparse data, but the contribution is incremental: it compares existing methods rather than introducing a fundamentally new approach.
This paper tackles the problem of multimodal data fusion with incomplete or sparse modalities by comparing three embedding techniques. It finds that Modal Channel Attention (MCA) outperforms Zorro and Everything at Once (EAO) on regression and classification tasks on the CMU-MOSEI and TCGA datasets.
Multimodal data fusion is essential for applications requiring the integration of diverse data sources, especially in the presence of incomplete or sparsely available modalities. This paper presents a comparative study of three multimodal embedding techniques, Modal Channel Attention (MCA), Zorro, and Everything at Once (EAO), to evaluate their performance on sparsely multimodal data. MCA introduces fusion embeddings for all combinations of input modalities and uses attention masking to create distinct attention channels, enabling flexible and efficient data fusion. Experiments on two datasets with four modalities each, CMU-MOSEI and TCGA, demonstrate that MCA outperforms Zorro across ranking, recall, regression, and classification tasks and outperforms EAO across regression and classification tasks. MCA achieves superior performance by maintaining robust uniformity across unimodal and fusion embeddings. While EAO performs best in ranking metrics due to its approach of forming fusion embeddings post-inference, it underperforms in downstream tasks requiring multimodal interactions. These results highlight the importance of contrasting all modality combinations in constructing embedding spaces and offer insights into the design of multimodal architectures for real-world applications with incomplete data.
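The abstract's description of MCA (one fusion embedding per combination of modalities, separated into channels by attention masking) can be made concrete with a small sketch. The code below is an illustration under stated assumptions, not the paper's implementation: the function names, the placeholder modality names `m1` to `m4`, the choice to create fusion tokens only for combinations of two or more modalities, and the rule that a fusion token attends only to itself and to tokens of its own modalities are all hypothetical.

```python
from itertools import combinations


def modality_subsets(modalities, min_size=2):
    """Return every combination of modalities with at least `min_size` members.

    Assumption: unimodal embeddings come from the modality tokens themselves,
    so fusion tokens are only created for combinations of two or more.
    """
    subsets = []
    for k in range(min_size, len(modalities) + 1):
        subsets.extend(combinations(modalities, k))
    return subsets


def build_channel_mask(tokens_per_modality, present=None):
    """Build a boolean attention mask; mask[q][k] is True if query q may attend to key k.

    Sequence layout (hypothetical): all modality tokens first, in dict order,
    followed by one fusion token per modality combination.
    `present` lists the modalities available for this sample; channels that
    need a missing modality are left fully masked.
    """
    modalities = list(tokens_per_modality)
    present = set(modalities) if present is None else set(present)

    # owners[i] is the modality that input token i belongs to.
    owners = []
    for m in modalities:
        owners.extend([m] * tokens_per_modality[m])

    fusion = modality_subsets(modalities)
    n_tok = len(owners)
    n = n_tok + len(fusion)
    mask = [[False] * n for _ in range(n)]

    # Modality tokens attend only within their own modality.
    for q in range(n_tok):
        if owners[q] not in present:
            continue
        for k in range(n_tok):
            mask[q][k] = owners[k] == owners[q]

    # Each fusion token attends to itself and to the tokens of its modalities.
    for i, subset in enumerate(fusion):
        q = n_tok + i
        if not set(subset) <= present:
            continue  # channel unavailable for this sample
        mask[q][q] = True
        for k in range(n_tok):
            mask[q][k] = owners[k] in subset

    return mask, fusion


# Hypothetical example: four modalities, the first with two tokens.
tokens = {"m1": 2, "m2": 1, "m3": 1, "m4": 1}
full_mask, fusion_channels = build_channel_mask(tokens)
sparse_mask, _ = build_channel_mask(tokens, present=["m1", "m2"])
```

With four modalities this yields eleven fusion channels (six pairs, four triples, one quadruple). In the sparse example only the `("m1", "m2")` channel stays active. Fully masked rows would need special handling before a softmax, for instance by dropping those tokens from the sequence; how the paper treats missing modalities is not specified in the abstract.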