LG, CL | Feb 2, 2024

Can MLLMs Perform Text-to-Image In-Context Learning?

arXiv:2402.01293v3 | 15 citations | h-index: 7 | Has Code
AI Analysis

This work addresses a gap in multimodal AI by defining and benchmarking T2I-ICL. The contribution is incremental: it extends existing ICL research to a new modality direction.

The paper tackles the underexplored problem of Text-to-Image In-Context Learning (T2I-ICL) by introducing CoBSAT, the first benchmark dataset for the task, comprising ten tasks. It finds that state-of-the-art MLLMs struggle significantly with T2I-ICL, and that strategies such as fine-tuning and Chain-of-Thought prompting notably improve performance.

The evolution from Large Language Models (LLMs) to Multimodal Large Language Models (MLLMs) has spurred research into extending In-Context Learning (ICL) to its multimodal counterpart. Existing such studies have primarily concentrated on image-to-text ICL. However, the Text-to-Image ICL (T2I-ICL), with its unique characteristics and potential applications, remains underexplored. To address this gap, we formally define the task of T2I-ICL and present CoBSAT, the first T2I-ICL benchmark dataset, encompassing ten tasks. Utilizing our dataset to benchmark six state-of-the-art MLLMs, we uncover considerable difficulties MLLMs encounter in solving T2I-ICL. We identify the primary challenges as the inherent complexity of multimodality and image generation, and show that strategies such as fine-tuning and Chain-of-Thought prompting help to mitigate these difficulties, leading to notable improvements in performance. Our code and dataset are available at https://github.com/UW-Madison-Lee-Lab/CoBSAT.
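The abstract does not spell out what a T2I-ICL prompt looks like. As a rough illustration only, the sketch below assumes a prompt is an interleaved sequence of text-image demonstration pairs followed by a text-only query, to which the model should respond with an image. `ImageRef`, `build_t2i_icl_prompt`, and the example demonstrations are hypothetical names chosen for this sketch; they are not taken from the CoBSAT code.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class ImageRef:
    """Hypothetical placeholder for an image; a real pipeline would hold pixels or a path."""
    description: str


# A prompt is modeled here as an interleaved sequence of text segments and images.
Prompt = List[Union[str, ImageRef]]


def build_t2i_icl_prompt(demos: List[Tuple[str, ImageRef]], query: str) -> Prompt:
    """Interleave (text, image) demonstrations, then append the text-only query."""
    prompt: Prompt = []
    for text, image in demos:
        prompt.append(text)
        prompt.append(image)
    # The query has no paired image: the model is expected to generate one.
    prompt.append(query)
    return prompt


# Assumed example: both demonstration images share a latent attribute (the color red),
# which the model would have to infer and apply to the query object.
demos = [
    ("car: ", ImageRef("a red car")),
    ("apple: ", ImageRef("a red apple")),
]
prompt = build_t2i_icl_prompt(demos, "box: ")
```

Under this assumption, a model that has picked up the pattern would answer the final `"box: "` segment with an image of a red box. The attribute is never stated in text, so it has to be inferred from the demonstrations.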

Code Implementations: 1 repo
