CV · CL · Feb 18, 2024

Visual In-Context Learning for Large Vision-Language Models

arXiv:2402.11574v1 · 142 citations · h-index: 21 · ACL
Originality: Incremental advance
AI Analysis

This work addresses cross-modal interaction challenges in vision-language models, offering an incremental improvement for tasks like visual reasoning.

The paper tackles the limited efficacy of In-Context Learning in Large Vision-Language Models by introducing a Visual In-Context Learning method, which improves performance on five visual reasoning datasets through techniques like visual demonstration retrieval and intent-oriented image summarization.

In Large Visual Language Models (LVLMs), the efficacy of In-Context Learning (ICL) remains limited by challenges in cross-modal interactions and representation disparities. To overcome these challenges, we introduce a novel Visual In-Context Learning (VICL) method comprising Visual Demonstration Retrieval, Intent-Oriented Image Summarization, and Intent-Oriented Demonstration Composition. Our approach retrieves images via a "Retrieval & Rerank" paradigm, summarises images with task intent and task-specific visual parsing, and composes language-based demonstrations that reduce token count and alleviate the cross-modal interaction problem. Experimental evaluations on five visual reasoning datasets demonstrate the effectiveness of our method. Moreover, our extensive experiments leverage information flow analysis to elucidate the effectiveness of our method and investigate the impact of the length and position of demonstrations for LVLMs. The use of in-context unlearning further shows promise in resetting specific model knowledge without retraining.
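The three stages named in the abstract (retrieve, rerank, compose language-based demonstrations) can be illustrated with a small Python sketch. This is not the authors' implementation. It assumes cosine similarity over toy image vectors for retrieval and token overlap between text summaries for reranking, both of which are illustrative stand-ins for real encoders. The names `Demo`, `retrieve_and_rerank`, and `compose_prompt` are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Demo:
    image_vec: List[float]  # stand-in for an image embedding (assumption)
    summary: str            # stand-in for an intent-oriented image summary
    answer: str             # ground-truth answer for this demonstration


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lower-cased whitespace tokens (toy text similarity)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve_and_rerank(query_vec, query_summary, pool, n_retrieve=3, k=2):
    # Stage 1 (retrieval): keep the n_retrieve demos closest in image space.
    candidates = sorted(
        pool, key=lambda d: cosine(query_vec, d.image_vec), reverse=True
    )[:n_retrieve]
    # Stage 2 (rerank): reorder candidates by summary similarity, keep top k.
    return sorted(
        candidates, key=lambda d: token_overlap(query_summary, d.summary),
        reverse=True,
    )[:k]


def compose_prompt(demos, query_summary, task_intent):
    # Text-only demonstrations: each image is replaced by its summary,
    # so the prompt carries no image tokens for the demonstrations.
    blocks = [f"Task: {task_intent}"]
    for d in demos:
        blocks.append(f"Image summary: {d.summary}\nAnswer: {d.answer}")
    blocks.append(f"Image summary: {query_summary}\nAnswer:")
    return "\n\n".join(blocks)


pool = [
    Demo([1.0, 0.0], "a red sports car on a road", "red"),
    Demo([0.9, 0.1], "a red fire truck parked outside", "red"),
    Demo([0.8, 0.2], "a blue sports car on a wet track", "blue"),
    Demo([0.0, 1.0], "a cat on a sofa", "grey"),
]
query_summary = "a blue sports car on a wet road"
top = retrieve_and_rerank([1.0, 0.05], query_summary, pool)
print(compose_prompt(top, query_summary, "Name the vehicle colour."))
```

In this toy pool, the blue car is only the third-closest candidate in image space, but the rerank stage moves it to the front because its summary shares the most tokens with the query summary. The unrelated cat image is dropped at the retrieval stage.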

