CAMA: Enhancing Multimodal In-Context Learning with Context-Aware Modulated Attention
This addresses the problem of unreliable task adaptation without parameter updates for users of large vision-language models, representing an incremental improvement through attention modulation.
The paper tackles the instability of multimodal in-context learning in large vision-language models by identifying deficits in self-attention and proposing Context-Aware Modulated Attention (CAMA), a plug-and-play method that dynamically modulates attention logits, resulting in consistent performance improvements across four models and seven benchmarks.
Multimodal in-context learning (ICL) is emerging as a key capability that enables large vision-language models (LVLMs) to adapt to novel tasks without parameter updates, expanding their utility across various real-world applications. However, ICL remains unstable, even with well-matched in-context demonstrations (ICDs), suggesting that LVLMs struggle to fully utilize the provided context. While existing efforts focus on prompt engineering or post-hoc logit calibration, we instead investigate the underlying attention dynamics to overcome LVLMs' inherent limitations. We identify two critical deficits in their self-attention that impair effective ICL. To bridge the gap, we propose \textbf{Context-Aware Modulated Attention} (CAMA), a plug-and-play and training-free method that dynamically modulates LVLM's attention logits based on the input in-context sequence. CAMA employs a two-stage attention modulation to address both identified deficits, enhancing the focus on semantically significant tokens, particularly visual ones. Across four LVLMs and seven benchmarks, CAMA consistently outperforms vanilla models and baselines, demonstrating great effectiveness and generalization. It can also activate the desired effects of prompt engineering methods and remains robust under diverse sequence configurations. Thus, CAMA paves the way for deeper explorations of attention dynamics to advance multimodal reasoning.