CV AI Nov 26, 2025

The More, the Merrier: Contrastive Fusion for Higher-Order Multimodal Alignment

Stefanos Koutoupis, Michaela Areti Zervou, Konstantinos Kontras, Maarten De Vos, Panagiotis Tsakalides, Grigorios Tsagatakis

arXiv:2511.21331v1 3.6 h-index: 5

Originality: Incremental advance

AI Analysis

This work addresses the insufficient modeling of higher-order interactions in multimodal machine learning. It appears incremental, since it builds on existing contrastive methods.

The paper tackles the challenge of learning joint representations across multiple modalities. It introduces Contrastive Fusion (ConFu), a framework that aligns individual modalities and their fused combinations in a unified space. This lets it capture higher-order dependencies, such as XOR-like relationships, while maintaining pairwise correspondence. Results show competitive performance on retrieval and classification tasks.
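To make the "XOR-like" dependency concrete, here is a minimal sketch on a toy setup that is not taken from the paper: two binary "modalities" `a` and `b` are uniform and independent, and a third, `c`, equals `a XOR b`. The helper `mutual_information` and the variable names are illustrative assumptions. Neither input alone carries any information about `c`, but the pair determines it completely, which is the kind of structure pairwise alignment alone cannot recover.

```python
from collections import Counter
from itertools import product
from math import log2


def mutual_information(pairs):
    """Mutual information in bits between x and y, treating the
    given (x, y) samples as equally likely outcomes."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(
        (count / n) * log2((count / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), count in joint.items()
    )


# Toy assumption: a and b are uniform binary variables, c = a XOR b.
samples = [(a, b, a ^ b) for a, b in product((0, 1), repeat=2)]

mi_a_c = mutual_information([(a, c) for a, b, c in samples])        # a vs c
mi_b_c = mutual_information([(b, c) for a, b, c in samples])        # b vs c
mi_ab_c = mutual_information([((a, b), c) for a, b, c in samples])  # (a, b) vs c

print(mi_a_c, mi_b_c, mi_ab_c)  # prints: 0.0 0.0 1.0
```

Each single input shares zero bits with `c`, while the fused pair shares a full bit, so any method that only compares modalities two at a time sees no signal here.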

Learning joint representations across multiple modalities remains a central challenge in multimodal machine learning. Prevailing approaches predominantly operate in pairwise settings, aligning two modalities at a time. While some recent methods aim to capture higher-order interactions among multiple modalities, they often overlook or insufficiently preserve pairwise relationships, limiting their effectiveness on single-modality tasks. In this work, we introduce Contrastive Fusion (ConFu), a framework that jointly embeds both individual modalities and their fused combinations into a unified representation space, where modalities and their fused counterparts are aligned. ConFu extends traditional pairwise contrastive objectives with an additional fused-modality contrastive term, encouraging the joint embedding of modality pairs with a third modality. This formulation enables ConFu to capture higher-order dependencies, such as XOR-like relationships, that cannot be recovered through pairwise alignment alone, while still maintaining strong pairwise correspondence. We evaluate ConFu on synthetic and real-world multimodal benchmarks, assessing its ability to exploit cross-modal complementarity, capture higher-order dependencies, and scale with increasing multimodal complexity. Across these settings, ConFu demonstrates competitive performance on retrieval and classification tasks, while supporting unified one-to-one and two-to-one retrieval within a single contrastive framework.
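The objective described in the abstract, pairwise contrastive terms plus a fused-modality term that aligns a fused pair with a third modality, could be sketched roughly as follows for three modalities. This is an illustration under stated assumptions, not the authors' implementation: the symmetric InfoNCE form, the temperature, the weight `lam`, the choice to sum over all three fused pairs, and the placeholder `fuse` function (a fixed random linear map over concatenated embeddings) are all hypothetical, since the abstract does not specify them.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def info_nce(u, v, temperature=0.1):
    """Symmetric InfoNCE: row i of u and row i of v form the positive pair,
    all other rows in the batch act as negatives."""
    logits = l2_normalize(u) @ l2_normalize(v).T / temperature

    def cross_entropy_on_diagonal(scores):
        scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))


def contrastive_fusion_loss(z1, z2, z3, fuse, lam=1.0):
    """Pairwise terms plus fused-pair-vs-third-modality terms.
    Equal weighting and the use of all three fused pairs are assumptions."""
    pairwise = info_nce(z1, z2) + info_nce(z1, z3) + info_nce(z2, z3)
    fused = (info_nce(fuse(z1, z2), z3)
             + info_nce(fuse(z1, z3), z2)
             + info_nce(fuse(z2, z3), z1))
    return pairwise + lam * fused


# Hypothetical stand-ins for encoder outputs: batch of 8, embedding dim 16.
rng = np.random.default_rng(0)
z1, z2, z3 = (rng.normal(size=(8, 16)) for _ in range(3))

# Placeholder fusion module: a fixed random linear map over the concatenation.
# In practice this would be a learned network.
W = rng.normal(size=(32, 16))


def fuse(a, b):
    return np.concatenate([a, b], axis=1) @ W


loss = contrastive_fusion_loss(z1, z2, z3, fuse)
print(float(loss))
```

Because the fused embeddings live in the same space as the single-modality embeddings, the same similarity scores could in principle serve both one-to-one retrieval (`z1` against `z2`) and two-to-one retrieval (`fuse(z1, z2)` against `z3`).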
