CV AIMar 3

ITO: Images and Texts as One via Synergizing Multiple Alignment and Training-Time Fusion

HanZpeng Liu, Yaqian Li, Zidan Wang, Shuoxi Zhang, Zonglin Zhao, Zihao Bo, Rinyoichi Takezoe, Kaiwen Long, Kun He

arXiv:2603.02767v21.5h-index: 9

Originality Highly original

AI Analysis

This work addresses a problem for the computer vision and natural language processing communities by improving visual representation learning.

The authors tackled the limitation of existing image-text contrastive pretraining methods, which often yield partially modality-organized representations, and achieved consistent outperformance across multiple benchmarks. Their framework, ITO, eliminated the modality gap and stabilized training dynamics.

Image-text contrastive pretraining has become a dominant paradigm for visual representation learning, yet existing methods often yield representations that remain partially organized by modality. We propose ITO, a framework addressing this limitation through two synergistic mechanisms. Multimodal multiple alignment enriches supervision by mining diverse image-text correspondences, while a lightweight training-time multimodal fusion module enforces structured cross-modal interaction. Crucially, the fusion module is discarded at inference, preserving the efficiency of standard dual-encoder architectures. Extensive experiments show that ITO consistently outperforms strong baselines across classification, retrieval, and multimodal benchmarks. Our analysis reveals that while multiple alignment drives discriminative power, training-time fusion acts as a critical structural regularizer -- eliminating the modality gap and stabilizing training dynamics to prevent the early saturation often observed in aggressive contrastive learning.

View on arXiv PDF

Similar