CL · Mar 19

Multimodal Task Interference: A Benchmark and Analysis of History-Target Mismatch in Multimodal LLMs

arXiv:2603.18425 · 76.7 · h-index: 28
Predicted impact: top 73% in CL · last 90 days
Originality: Incremental advance
AI Analysis

This paper addresses a critical issue for developers of multimodal dialogue systems by highlighting performance bottlenecks in task switching. It is incremental, however, because it extends known text-only interference findings to multimodal settings.

The paper tackles the problem of task interference in multimodal LLMs, introducing a benchmark that reveals performance drops are highly directional, with severe degradation when switching from text-only to image-based tasks, especially when mismatches co-occur across modalities and answer formats.

Task interference, the performance degradation caused by task switches within a single conversation, has been studied exclusively in text-only settings despite the growing prevalence of multimodal dialogue systems. We introduce a benchmark for evaluating this phenomenon in multimodal LLMs, covering six tasks across text and vision with systematic variation of history-target mismatch along three axes: modality mismatch, reasoning mismatch, and answer format mismatch. Experiments on both open-weights and proprietary models reveal that task interference is highly directional: switching from text-only to image-based targets causes severe performance drops, while the reverse transition yields minimal degradation. Interference is further amplified when mismatches co-occur across multiple dimensions, and is driven most strongly by modality differences, followed by answer format, while reasoning requirement shifts cause minimal degradation.
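
As a rough illustration of the setup the abstract describes, the sketch below tags an ordered (history, target) task pair with the axes on which it mismatches and computes interference as an accuracy drop relative to a matched-history baseline. Everything here is an assumption for illustration: the `Task` fields, the task names, the answer-format labels, and the choice of a simple accuracy difference as the metric are hypothetical and are not taken from the paper.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical task descriptor; the field names are assumptions."""
    name: str
    modality: str        # "text" or "image"
    needs_reasoning: bool
    answer_format: str   # e.g. "multiple_choice" or "free_form"


def mismatch_axes(history: Task, target: Task) -> set[str]:
    """Return the axes on which the history task differs from the target task."""
    axes: set[str] = set()
    if history.modality != target.modality:
        axes.add("modality")
    if history.needs_reasoning != target.needs_reasoning:
        axes.add("reasoning")
    if history.answer_format != target.answer_format:
        axes.add("answer_format")
    return axes


def interference(matched_acc: float, mismatched_acc: float) -> float:
    """Accuracy drop on the target task; positive values mean degradation."""
    return matched_acc - mismatched_acc


# Hypothetical tasks, not the benchmark's actual six.
text_qa = Task("text_qa", "text", needs_reasoning=False, answer_format="free_form")
image_mcq = Task("image_mcq", "image", needs_reasoning=False, answer_format="multiple_choice")

# The axis set is symmetric, but the measured drop need not be, so results
# should be stored per ordered (history, target) pair.
axes_text_to_image = mismatch_axes(history=text_qa, target=image_mcq)
axes_image_to_text = mismatch_axes(history=image_mcq, target=text_qa)
```

Keeping the pair ordered is the point of the design: the abstract reports that text-to-image switches degrade performance far more than image-to-text switches, even though both directions share the same set of mismatched axes.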

