CV | AI | LG | Nov 29, 2023

Can Multimodal Large Language Models Truly Perform Multimodal In-Context Learning?

arXiv:2311.18021v2 | 24 citations | h-index: 17 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses a key limitation in multimodal AI by showing that current MLLMs may not fully leverage the visual information in ICL demonstrations. The advance is incremental, but it matters for improving model efficiency and accuracy in tasks such as image-text understanding.

The paper investigates whether multimodal large language models (MLLMs) genuinely perform multimodal in-context learning (ICL) or rely primarily on text. It finds that multimodal ICL is driven mainly by textual content, with the visual information in the demos having little influence, although visual content still helps when selecting demos. Based on this, it proposes a method, MMICES, that improves results by selecting demos using both modalities.

Large Language Models (LLMs) with in-context learning (ICL) ability can quickly adapt to a specific context given a few demonstrations (demos). Recently, Multimodal Large Language Models (MLLMs) built upon LLMs have also shown multimodal ICL ability, i.e., responding to queries given a few multimodal demos, each consisting of an image, a query, and an answer. While ICL has been extensively studied in LLMs, research on ICL in MLLMs remains limited. One essential question is whether these MLLMs can truly conduct multimodal ICL, or whether only the textual modality is necessary. We investigate this question by examining two primary factors that influence ICL: 1) demo content, i.e., understanding the influence of demo content in different modalities; and 2) demo selection strategy, i.e., how to select better multimodal demos for improved performance. Experiments reveal that multimodal ICL is predominantly driven by the textual content, whereas the visual information in the demos has little influence. Interestingly, visual content is still necessary and useful for selecting demos to increase performance. Motivated by our analysis, we propose a simple yet effective approach, termed Mixed Modality In-Context Example Selection (MMICES), which considers both visual and language modalities when selecting demos. Extensive experiments are conducted to support our findings and verify the improvement brought by our method. Code is available at https://chenxshuo.github.io/m-icl/.
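
The abstract says only that MMICES considers both visual and language modalities when selecting demos; it does not spell out the procedure. The sketch below shows one plausible reading, assumed here for illustration: pre-filter the demo pool by image similarity, then re-rank the survivors by text similarity. The function names (`cosine`, `select_demos`), the parameters `k_visual` and `n_demos`, and the toy embedding vectors are all hypothetical and are not taken from the authors' code.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_demos(
    query_img: Sequence[float],
    query_txt: Sequence[float],
    pool_img: List[Sequence[float]],
    pool_txt: List[Sequence[float]],
    k_visual: int,
    n_demos: int,
) -> List[int]:
    """Return indices of demos chosen with both modalities (assumed two-stage scheme)."""
    # Stage 1 (assumption): keep the k_visual demos whose images are most
    # similar to the query image.
    by_visual = sorted(
        range(len(pool_img)),
        key=lambda i: cosine(query_img, pool_img[i]),
        reverse=True,
    )
    candidates = by_visual[:k_visual]

    # Stage 2 (assumption): re-rank those candidates by similarity between
    # the query text and each demo's text, then keep the top n_demos.
    reranked = sorted(
        candidates,
        key=lambda i: cosine(query_txt, pool_txt[i]),
        reverse=True,
    )
    return reranked[:n_demos]


# Toy embeddings (hypothetical): four demos with 2-d image and text vectors.
pool_img = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.2]]
pool_txt = [[0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [0.7, 0.3]]

selected = select_demos(
    query_img=[1.0, 0.0],
    query_txt=[1.0, 0.0],
    pool_img=pool_img,
    pool_txt=pool_txt,
    k_visual=3,
    n_demos=2,
)
print(selected)  # [1, 3]
```

In this toy pool, demo 2 matches the query text perfectly but is dropped in the visual stage because its image is dissimilar, which is the kind of interaction between modalities that a text-only selector would miss. In practice the vectors would come from image and text encoders, which are outside the scope of this sketch.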

