CV · CL · Mar 23, 2025

Expanding the Boundaries of Vision Prior Knowledge in Multi-modal Large Language Models

arXiv:2503.18034v2 · 1 citation · h-index: 29
Originality: Incremental advance
AI Analysis

The paper addresses a bottleneck in MLLM performance, the limited prior knowledge of the vision encoder, and offers researchers and practitioners an incremental improvement.

The paper argues that the vision encoder's prior knowledge constrains multi-modal large language models (MLLMs). It proposes VisPRE, a two-stage training framework that explicitly incorporates prior knowledge at the vision encoder level. The authors report that this substantially boosts visual understanding, especially for uncommon visual entities.

Does the prior knowledge of the vision encoder constrain the capability boundary of Multi-modal Large Language Models (MLLMs)? While most existing research treats MLLMs as unified systems optimized through end-to-end training, the impact of the vision encoder's prior knowledge is seldom investigated. In this work, we introduce a novel metric, $Rank_e$, to quantify the effect of prior knowledge of the vision encoder on MLLM performance. Our analysis reveals a positive correlation between prior knowledge and MLLM performance. Moreover, we find that domain-specific fine-tuning using solely end-to-end visual question answering (VQA) data is insufficient, particularly for entities with low inherent visual prior knowledge. To address this issue, we propose VisPRE (Vision Prior Remediation), a two-stage training framework that explicitly incorporates prior knowledge at the vision encoder level. Experimental results demonstrate that augmenting the vision encoder's prior knowledge substantially boosts the visual understanding capabilities of MLLMs, offering a novel and effective strategy for improving performance, especially in scenarios involving uncommon visual entities.
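The abstract reports a positive correlation between the vision encoder's prior knowledge and MLLM performance, but this page does not give the definition of $Rank_e$. As a rough illustration only, the sketch below shows one common way to measure such a relationship: a Spearman rank correlation between a per-entity prior-knowledge score and per-entity VQA accuracy. The names `prior_score` and `vqa_accuracy` and all numbers are hypothetical placeholders, not the paper's metric or results.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical toy values, invented for illustration; NOT taken from the paper.
prior_score = [0.9, 0.7, 0.4, 0.2, 0.1]        # stand-in for encoder prior knowledge per entity
vqa_accuracy = [0.85, 0.80, 0.55, 0.50, 0.30]  # stand-in for MLLM accuracy per entity

rho = spearman(prior_score, vqa_accuracy)
print(f"Spearman rho = {rho:.2f}")
```

A value of `rho` near 1 would indicate that entities the encoder knows better also tend to be answered more accurately, which is the kind of relationship the abstract describes. The paper's own analysis may use a different statistic.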

