CV · AI · CL · LG · Jun 9, 2025

Mimicking or Reasoning: Rethinking Multi-Modal In-Context Learning in Vision-Language Models

arXiv:2506.07936v1 · 6 citations · h-index: 48 · Has Code
Originality: Incremental advance
AI Analysis

This work examines whether VLMs actually reason over in-context demonstrations, a question that matters to researchers in multimodal AI. It finds that their in-context learning is often shallow; its contribution to model evaluation is an incremental advance.

The paper investigates whether vision-language models (VLMs) truly perform multimodal in-context learning (MM-ICL) by evaluating them under distribution shifts, finding that performance often degrades with more demonstrations and models rely on copying rather than learning. It proposes a new MM-ICL with Reasoning pipeline that adds generated rationales to demonstrations, but experiments show limited sensitivity to factors like shot count and rationale quality, indicating current VLMs do not effectively utilize demonstration information.
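One way to make the "copying rather than learning" observation concrete is to measure how often a prediction simply repeats an answer from its demonstrations. The sketch below is illustrative only: `copy_rate` and `majority_agreement` are hypothetical helper names, exact string matching is an assumption, and nothing here is taken from the paper's own evaluation code.

```python
from collections import Counter
from typing import Sequence


def copy_rate(predictions: Sequence[str],
              demo_answers: Sequence[Sequence[str]]) -> float:
    """Fraction of predictions that exactly match any demonstration answer.

    demo_answers[i] holds the answers shown in the context of query i.
    Exact string matching is an assumption made for this sketch.
    """
    if not predictions:
        return 0.0
    hits = sum(1 for pred, answers in zip(predictions, demo_answers)
               if pred in answers)
    return hits / len(predictions)


def majority_agreement(predictions: Sequence[str],
                       demo_answers: Sequence[Sequence[str]]) -> float:
    """Fraction of predictions equal to the most frequent demonstration answer."""
    if not predictions:
        return 0.0
    hits = 0
    for pred, answers in zip(predictions, demo_answers):
        if answers and pred == Counter(answers).most_common(1)[0][0]:
            hits += 1
    return hits / len(predictions)


if __name__ == "__main__":
    preds = ["cat", "dog", "bird"]
    demos = [["cat", "cat", "dog"], ["cat", "cat", "dog"], ["cat", "dog"]]
    print(copy_rate(preds, demos), majority_agreement(preds, demos))
```

A high value on either measure is only suggestive, since the correct answer can legitimately appear among the demonstrations. Comparing these rates with accuracy under a distribution shift, where support and query come from different datasets, helps separate the two cases.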

Vision-language models (VLMs) are widely assumed to exhibit in-context learning (ICL), a property similar to that of their language-only counterparts. While recent work suggests VLMs can perform multimodal ICL (MM-ICL), studies show they often rely on shallow heuristics -- such as copying or majority voting -- rather than true task understanding. We revisit this assumption by evaluating VLMs under distribution shifts, where support examples come from a dataset different from the query. Surprisingly, performance often degrades with more demonstrations, and models tend to copy answers rather than learn from them. To investigate further, we propose a new MM-ICL with Reasoning pipeline that augments each demonstration with a generated rationale alongside the answer. We conduct extensive and comprehensive experiments on both perception- and reasoning-required datasets with open-source VLMs ranging from 3B to 72B and proprietary models such as Gemini 2.0. We conduct controlled studies varying shot count, retrieval method, rationale quality, and distribution. Our results show limited performance sensitivity across these factors, suggesting that current VLMs do not effectively utilize demonstration-level information as intended in MM-ICL.
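The idea of augmenting each demonstration with a rationale alongside the answer can be sketched as a prompt builder. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Demo` class, the `build_mm_icl_prompt` function, the `<image:...>` placeholder syntax, and the field labels are all hypothetical, and a real VLM would receive image inputs through its own API rather than as text tokens.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Demo:
    image_ref: str                    # hypothetical image identifier or path
    question: str
    answer: str
    rationale: Optional[str] = None   # generated rationale, if one exists


def build_mm_icl_prompt(demos: List[Demo],
                        query_image_ref: str,
                        query_question: str,
                        with_rationale: bool = True) -> str:
    """Assemble a few-shot prompt; rationales precede answers when enabled."""
    parts: List[str] = []
    for i, d in enumerate(demos, start=1):
        parts.append(f"Example {i}:")
        parts.append(f"<image:{d.image_ref}>")   # placeholder, not a real token
        parts.append(f"Question: {d.question}")
        if with_rationale and d.rationale:
            parts.append(f"Rationale: {d.rationale}")
        parts.append(f"Answer: {d.answer}")
        parts.append("")                          # blank line between shots
    parts.append("Query:")
    parts.append(f"<image:{query_image_ref}>")
    parts.append(f"Question: {query_question}")
    # End on the field the model is expected to produce first.
    parts.append("Rationale:" if with_rationale else "Answer:")
    return "\n".join(parts)


if __name__ == "__main__":
    shots = [Demo("a.png", "What color is the apple?", "red",
                  "The fruit in the image is an apple with red skin.")]
    print(build_mm_icl_prompt(shots, "q.png", "How many apples are there?"))
```

Setting `with_rationale=False` gives the plain answer-only baseline, so the same demonstrations can be compared with and without rationales while varying shot count or the dataset the shots are drawn from.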

