CV · Feb 17, 2024

CoLLaVO: Crayon Large Language and Vision mOdel

arXiv:2402.11248v4 · 37 citations · h-index: 11 · ACL
Originality: Incremental advance
AI Analysis

This work addresses a bottleneck in VLMs that matters to researchers and developers: limited object-level image understanding. The advance is incremental, since it builds on existing methods and adds new tuning schemes.

The paper tackles limited object-level image understanding in Vision Language Models (VLMs). It reports that strengthening this capability improves zero-shot performance on vision-language tasks, with what the authors describe as a 'significant leap' on numerous benchmarks.

The remarkable success of Large Language Models (LLMs) and instruction tuning drives the evolution of Vision Language Models (VLMs) towards a versatile general-purpose model. Yet, it remains unexplored whether current VLMs genuinely possess quality object-level image understanding capabilities determined from 'what objects are in the image?' or 'which object corresponds to a specified bounding box?'. Our findings reveal that the image understanding capabilities of current VLMs are strongly correlated with their zero-shot performance on vision language (VL) tasks. This suggests that prioritizing basic image understanding is crucial for VLMs to excel at VL tasks. To enhance object-level image understanding, we propose Crayon Large Language and Vision mOdel (CoLLaVO), which incorporates instruction tuning with Crayon Prompt as a new visual prompt tuning scheme based on panoptic color maps. Furthermore, we present a learning strategy of Dual QLoRA to preserve object-level image understanding without forgetting it during visual instruction tuning, thereby achieving a significant leap in numerous VL benchmarks in a zero-shot setting.
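
The abstract says Crayon Prompt is based on panoptic color maps but gives no construction details. As a rough illustration of the general idea only, the sketch below turns a toy panoptic segmentation (one segment id per pixel) into a color map with one color per segment. The function names, the palette formula, and the toy input are all assumptions made for this example; this is not the CoLLaVO implementation.

```python
# Illustrative only: a toy "panoptic color map", NOT the CoLLaVO implementation.
# A panoptic segmentation assigns every pixel a segment id; here each segment
# id is mapped to one RGB color so that objects become visually separable.

def build_palette(segment_ids):
    """Assign a deterministic RGB color to each segment id (hypothetical scheme)."""
    palette = {}
    for index, seg_id in enumerate(sorted(segment_ids)):
        # 97 is coprime with 256, so the red channel alone keeps colors
        # distinct for up to 256 segments.
        palette[seg_id] = (
            (index * 97) % 256,
            (index * 57 + 80) % 256,
            (index * 193 + 160) % 256,
        )
    return palette


def panoptic_to_color_map(segment_map):
    """Replace every segment id in a 2D grid with that segment's RGB color."""
    ids = {seg_id for row in segment_map for seg_id in row}
    palette = build_palette(ids)
    color_map = [[palette[seg_id] for seg_id in row] for row in segment_map]
    return color_map, palette


# Toy 3x4 "image" with four segments (ids 0..3); values are made up.
segment_map = [
    [0, 0, 1, 1],
    [0, 2, 2, 1],
    [3, 3, 2, 1],
]
color_map, palette = panoptic_to_color_map(segment_map)
```

Pixels that belong to the same segment share a color, and different segments get different colors. How the paper actually encodes and injects such maps into the model is not described in this abstract.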

Code Implementations: 1 repo
