CV · Oct 21, 2025

Exploring a Unified Vision-Centric Contrastive Alternative on Multi-Modal Web Documents

arXiv:2510.18703v1 · 1 citation · h-index: 34 · Has Code
Originality: Incremental advance
AI Analysis

This paper addresses multimodal representation learning for web documents. It offers a scalable approach that eliminates the need for OCR or modality fusion, though the advance is incremental because it builds on existing contrastive learning paradigms.

The paper tackles the limited ability of contrastive vision-language models such as CLIP to handle complex, real-world web documents in which text and images are interleaved or loosely aligned. It proposes Vision-Centric Contrastive Learning (VC2L), which renders all inputs as images and trains with a snippet-level contrastive objective. VC2L achieves competitive or superior performance relative to CLIP-style models on both newly proposed and established benchmarks.

Contrastive vision-language models such as CLIP have demonstrated strong performance across a wide range of multimodal tasks by learning from aligned image-text pairs. However, their ability to handle complex, real-world web documents remains limited, particularly in scenarios where text and images are interleaved, loosely aligned, or embedded in visual form. To address these challenges, we propose Vision-Centric Contrastive Learning (VC2L), a unified framework that models text, images, and their combinations using a single vision transformer. VC2L operates entirely in pixel space by rendering all inputs, whether textual, visual, or combined, as images, thus eliminating the need for OCR, text tokenization, or modality fusion strategy. To capture complex cross-modal relationships in multimodal web documents, VC2L employs a snippet-level contrastive learning objective that aligns consecutive multimodal segments, leveraging the inherent coherence of documents without requiring explicitly paired image-text data. To assess the effectiveness of this approach, we introduce three retrieval benchmarks, AnyCIR, SeqCIR, and CSR, designed to evaluate cross-modal retrieval, fine-grained sequential understanding, and generalization to unseen data, respectively. Empirical results show that VC2L achieves competitive or superior performance compared to CLIP-style models on both the proposed benchmarks and established datasets such as M-BEIR and MTEB. These findings underscore the potential of multimodal web data as a valuable training resource for contrastive learning and illustrate the scalability of a unified, vision-centric approach for multimodal representation learning. Code and models are available at: https://github.com/showlab/VC2L.
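The snippet-level objective described in the abstract can be illustrated with a minimal sketch. It assumes a symmetric InfoNCE loss of the kind used by CLIP-style models; the abstract does not state the exact loss, so this is an illustration under that assumption rather than the authors' implementation. The function names, the temperature of 0.07, and the toy two-dimensional embeddings (standing in for the vision transformer's output on rendered snippets) are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def snippet_contrastive_loss(anchors, successors, temperature=0.07):
    """Symmetric InfoNCE over (snippet, next snippet) embedding pairs.

    anchors[i] and successors[i] are assumed to embed two consecutive
    snippets of the same document; every other pairing in the batch
    serves as a negative. The temperature value is an assumption.
    """
    a = [l2_normalize(v) for v in anchors]
    b = [l2_normalize(v) for v in successors]
    # Pairwise cosine similarities, scaled by the temperature.
    logits = [[sum(x * y for x, y in zip(u, w)) / temperature for w in b] for u in a]
    transposed = [list(col) for col in zip(*logits)]
    # Average the anchor-to-successor and successor-to-anchor directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))


# Toy embeddings: consecutive snippets that agree versus ones that are swapped.
aligned = snippet_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]])
shuffled = snippet_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.1, 0.9], [0.9, 0.1]])
print(aligned < shuffled)  # prints True
```

The loss is small when each snippet's embedding is closest to that of its true successor and large when successors are mismatched, which is how document coherence can supply supervision without explicitly paired image-text data.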

