CV · AI · Mar 3

ITO: Images and Texts as One via Synergizing Multiple Alignment and Training-Time Fusion

arXiv:2603.02767v2 · h-index: 5
Originality: Highly original
AI Analysis

This work addresses the modality gap left by image-text contrastive pretraining, a problem relevant to the computer vision and natural language processing communities, with the aim of improving visual representation learning.

The authors tackle a limitation of existing image-text contrastive pretraining methods: the learned representations often remain partially organized by modality. Their framework, ITO, consistently outperforms strong baselines across classification, retrieval, and multimodal benchmarks. According to the authors' analysis, its training-time fusion module eliminates the modality gap and stabilizes training dynamics.

Image-text contrastive pretraining has become a dominant paradigm for visual representation learning, yet existing methods often yield representations that remain partially organized by modality. We propose ITO, a framework addressing this limitation through two synergistic mechanisms. Multimodal multiple alignment enriches supervision by mining diverse image-text correspondences, while a lightweight training-time multimodal fusion module enforces structured cross-modal interaction. Crucially, the fusion module is discarded at inference, preserving the efficiency of standard dual-encoder architectures. Extensive experiments show that ITO consistently outperforms strong baselines across classification, retrieval, and multimodal benchmarks. Our analysis reveals that while multiple alignment drives discriminative power, training-time fusion acts as a critical structural regularizer -- eliminating the modality gap and stabilizing training dynamics to prevent the early saturation often observed in aggressive contrastive learning.
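For readers unfamiliar with the terms in the abstract, the following is a minimal sketch of the standard dual-encoder setup that ITO builds on, not the authors' implementation. It shows a symmetric image-text contrastive (InfoNCE-style) loss and one common way to quantify a modality gap, namely the distance between the centroids of the normalized image and text embeddings. The function names, the temperature value of 0.07, and the centroid-based gap measure are all assumptions chosen for illustration; the abstract does not specify them. ITO's multiple alignment and training-time fusion module are not reproduced here.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_contrastive_loss(img, txt, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy.

    img[k] and txt[k] are assumed to be a matched pair; every other
    combination in the batch is treated as a negative.
    The temperature default is an illustrative assumption.
    """
    img = [l2_normalize(v) for v in img]
    txt = [l2_normalize(v) for v in txt]
    n = len(img)
    logits = [[dot(i, t) / temperature for t in txt] for i in img]
    # Image-to-text: each row should put its mass on the diagonal entry.
    i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image: same computation over the columns.
    cols = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax(cols[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)


def modality_gap(img, txt):
    """Euclidean distance between the centroids of the two modalities.

    This centroid-based measure is an assumption for illustration; the
    abstract does not state how the gap is measured.
    """
    def centroid(vectors):
        vectors = [l2_normalize(v) for v in vectors]
        return [sum(col) / len(vectors) for col in zip(*vectors)]

    ci, ct = centroid(img), centroid(txt)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(ci, ct)))
```

Under this measure, a gap of zero means the image and text embeddings share the same centroid on the unit sphere, while a large gap means the two modalities occupy separate regions even when matched pairs are ranked correctly. That is one reading of what the abstract calls representations that "remain partially organized by modality".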

