CV · Oct 18, 2024

ViCToR: Improving Visual Comprehension via Token Reconstruction for Pretraining LMMs

arXiv:2410.14332v4 · 3 citations · h-index: 12 · Has Code
Originality: Highly original
AI Analysis

The work addresses a key bottleneck in LMMs, the gap between visual and language representations, and offers a novel method to improve visual understanding, though it is incremental relative to existing pretraining frameworks.

The paper tackles the modality representation gap in Large Multimodal Models (LMMs) by introducing ViCToR, a pretraining framework that improves visual comprehension. It reports state-of-the-art results, improving over LLaVA-NeXT-8B by 10.4%, 3.2%, and 7.2% on MMStar, SEED^I, and RealWorldQA, respectively.

Large Multimodal Models (LMMs) often face a modality representation gap during pretraining: while language embeddings remain stable, visual representations are highly sensitive to contextual noise (e.g., background clutter). To address this issue, we introduce a visual comprehension stage, which we call ViCToR (Visual Comprehension via Token Reconstruction), a novel pretraining framework for LMMs. ViCToR employs a learnable visual token pool and utilizes the Hungarian matching algorithm to select semantically relevant tokens from this pool for visual token replacement. Furthermore, by integrating a visual token reconstruction loss with dense semantic supervision, ViCToR can learn tokens which retain high visual detail, thereby enhancing the large language model's (LLM's) understanding of visual information. After pretraining on 3 million publicly accessible images and captions, ViCToR achieves state-of-the-art results, improving over LLaVA-NeXT-8B by 10.4%, 3.2%, and 7.2% on the MMStar, SEED$^I$, and RealWorldQA benchmarks, respectively. Code is available at https://github.com/deepglint/Victor.
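
The token-selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `hungarian` and `select_pool_tokens`, the use of negative cosine similarity as the matching cost, and the toy 2-dimensional tokens are all assumptions made for illustration. The abstract states only that Hungarian matching selects semantically relevant tokens from a learnable pool.

```python
import math


def hungarian(cost):
    """Minimum-cost assignment (Hungarian / Kuhn-Munkres, O(n^2 * m)).

    cost is an n x m list of lists with n <= m.
    Returns a list where entry i is the column assigned to row i.
    """
    n, m = len(cost), len(cost[0])
    INF = float("inf")
    u = [0.0] * (n + 1)   # row potentials
    v = [0.0] * (m + 1)   # column potentials
    p = [0] * (m + 1)     # p[j] = row (1-based) matched to column j
    way = [0] * (m + 1)   # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], INF, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:  # reached a free column
                break
        while True:  # flip matches along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
            if j0 == 0:
                break
    assignment = [None] * n
    for j in range(1, m + 1):
        if p[j] != 0:
            assignment[p[j] - 1] = j - 1
    return assignment


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def select_pool_tokens(visual_tokens, pool):
    """Pick one distinct pool token per visual token.

    Assumption: the matching cost is negative cosine similarity, and the
    pool holds at least as many tokens as there are visual tokens to replace.
    """
    cost = [[-cosine(t, q) for q in pool] for t in visual_tokens]
    return hungarian(cost)


# Toy example: two visual tokens, a pool of three candidates.
visual_tokens = [[1.0, 0.0], [0.0, 1.0]]
pool = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
print(select_pool_tokens(visual_tokens, pool))  # [1, 0]
```

In this toy case each visual token is matched to the pool entry pointing in the same direction. Unlike a per-token nearest-neighbour lookup, the one-to-one matching guarantees that no pool token is selected twice.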

Code Implementations: 1 repo
