CV · AI · Jul 21, 2025

True Multimodal In-Context Learning Needs Attention to the Visual Context

Shuo Chen, Jianzhe Liu, Zhen Han, Yan Xia, Daniel Cremers, Philip Torr, Volker Tresp, Jindong Gu

DeepMind, Oxford

arXiv:2507.15807v218.210 citations · h-index: 21

Originality: Incremental advance

AI Analysis

This work addresses a critical limitation of multimodal AI in practical applications by improving how models integrate visual information, though the advance is incremental because it builds on existing MLLM frameworks.

The paper tackles the problem that Multimodal Large Language Models (MLLMs) often neglect visual cues in Multimodal In-Context Learning (MICL), leading to text imitation rather than genuine multimodal adaptation. It introduces Dynamic Attention Reallocation (DARA) and the TrueMICL dataset, resulting in substantial improvements in true multimodal in-context learning capabilities.

Multimodal Large Language Models (MLLMs), built on powerful language backbones, have enabled Multimodal In-Context Learning (MICL): adapting to new tasks from a few multimodal demonstrations consisting of images, questions, and answers. Despite showing noticeable improvement on standard vision-language datasets, current MLLMs struggle to leverage visual information in the demonstrations. Specifically, they tend to neglect visual cues and over-rely on textual patterns, leading to mere text imitation rather than genuine multimodal adaptation. This behavior makes MICL still unimodal and largely restricts its practical utility. More importantly, this limitation is often concealed by the improved performance on tasks that do not require understanding the visual context. As a result, how to effectively enhance MICL ability and reliably evaluate MICL performance remains underexplored. To address these issues, we first introduce Dynamic Attention Reallocation (DARA), an efficient fine-tuning strategy that encourages models to attend to the visual context by rebalancing attention across visual and textual tokens. In addition, we present TrueMICL, an MICL-dedicated dataset with both support and test sets that explicitly requires the integration of multimodal information, particularly visual content, for correct task completion. Extensive experiments demonstrate the effectiveness of our holistic solution, showcasing substantial improvements in true multimodal in-context learning capabilities. Code and datasets are available at https://chenxshuo.github.io/true-micl-colm.
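
The abstract describes DARA only as "rebalancing attention across visual and textual tokens" and gives no formula. The sketch below is one simple way to picture that idea, not the authors' implementation: it multiplies the attention weights of visual tokens by a scaling factor and renormalizes. The names `rebalanced_attention`, `visual_mass`, and `visual_factor` are hypothetical, and in a fine-tuning setting the factor would presumably be a learned parameter rather than a constant.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rebalanced_attention(scores, is_visual, visual_factor):
    """Scale the attention weights of visual tokens, then renormalize.

    scores:        pre-softmax attention scores of one query over the context
    is_visual:     booleans marking which context tokens are image tokens
    visual_factor: hypothetical scaling factor (assumption, not from the
                   paper); values above 1 shift attention toward image tokens

    Scaling a post-softmax weight by f is equivalent to adding log(f) to the
    corresponding pre-softmax score.
    """
    weights = softmax(scores)
    scaled = [w * visual_factor if v else w for w, v in zip(weights, is_visual)]
    total = sum(scaled)
    return [s / total for s in scaled]


def visual_mass(weights, is_visual):
    """Total attention weight that lands on visual tokens."""
    return sum(w for w, v in zip(weights, is_visual) if v)


if __name__ == "__main__":
    # Toy context: two image tokens followed by two text tokens whose raw
    # scores are higher, mimicking a model that favors textual patterns.
    scores = [1.0, 1.0, 2.0, 2.0]
    is_visual = [True, True, False, False]

    before = softmax(scores)
    after = rebalanced_attention(scores, is_visual, visual_factor=2.0)
    print("visual attention before:", round(visual_mass(before, is_visual), 3))
    print("visual attention after: ", round(visual_mass(after, is_visual), 3))
```

With a factor of 1 the function reduces to an ordinary softmax, so a tuned factor can only move attention away from that baseline, which is the sense of "rebalancing" assumed here.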
