CVAIApr 24, 2024

What Makes Multimodal In-Context Learning Work?

arXiv:2404.15736v255 citationsh-index: 32Has Code2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
Originality Incremental advance
AI Analysis

This work addresses the effectiveness and limitations of multimodal in-context learning for researchers and practitioners in AI, revealing incremental insights into its mechanisms.

The study investigates Multimodal In-Context Learning (M-ICL) in Large Multimodal Models, finding that it primarily relies on text-driven mechanisms with minimal image influence and that advanced strategies like RICES do not outperform simple majority voting.

Large Language Models have demonstrated remarkable performance across various tasks, exhibiting the capacity to swiftly acquire new skills, such as through In-Context Learning (ICL) with minimal demonstration examples. In this work, we present a comprehensive framework for investigating Multimodal ICL (M-ICL) in the context of Large Multimodal Models. We consider the best open-source multimodal models (e.g., IDEFICS, OpenFlamingo) and a wide range of multimodal tasks. Our study unveils several noteworthy findings: (1) M-ICL primarily relies on text-driven mechanisms, showing little to no influence from the image modality. (2) When used with advanced-ICL strategy (like RICES), M-ICL is not better than a simple strategy based on majority voting over context examples. Moreover, we identify several biases and limitations of M-ICL that warrant consideration prior to deployment. Code available at https://gitlab.com/folbaeni/multimodal-icl

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes