AI · Dec 9, 2023

Causal-CoG: A Causal-Effect Look at Context Generation for Boosting Multi-modal Language Models

arXiv:2312.06685v1 · 16 citations · h-index: 12 · CVPR
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of improving the accuracy of multi-modal language models on visual question answering. The advance is incremental because it builds on existing prompting strategies.

The paper tackles the tendency of multi-modal language models to give imprecise or non-factual responses in visual question answering. It proposes Causal-CoG, a prompting strategy that combines generated contexts with causality filtering, and reports improvements over direct decoding such as +6.30% on POPE and +13.69% on VizWiz.

While Multi-modal Language Models (MLMs) demonstrate impressive multimodal ability, they still struggle to provide factual and precise responses for tasks like visual question answering (VQA). In this paper, we address this challenge from the perspective of contextual information. We propose Causal Context Generation, Causal-CoG, which is a prompting strategy that engages contextual information to enhance precise VQA during inference. Specifically, we prompt MLMs to generate contexts, i.e., text descriptions of an image, and engage the generated contexts for question answering. Moreover, we investigate the advantage of contexts on VQA from a causality perspective, introducing causality filtering to select samples for which contextual information is helpful. To show the effectiveness of Causal-CoG, we run extensive experiments on 10 multimodal benchmarks and show consistent improvements, e.g., +6.30% on POPE, +13.69% on VizWiz and +6.43% on VQAv2 compared to direct decoding, surpassing existing methods. We hope Causal-CoG inspires explorations of context knowledge in multimodal models, and serves as a plug-and-play strategy for MLM decoding.
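
The two-step idea in the abstract (generate a context, then decide per sample whether to use it) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the callables `generate_context` and `answer_dist`, the `threshold` parameter, and the use of total variation distance as a stand-in for the effect of context on the answer. The abstract does not specify the paper's actual causal-effect measure or filtering rule, so this should not be read as the authors' implementation.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical type: maps each candidate answer to its probability.
Dist = Dict[str, float]


def total_variation(p: Dist, q: Dist) -> float:
    """Total variation distance between two answer distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def answer_with_context_filter(
    image: str,
    question: str,
    generate_context: Callable[[str], str],
    answer_dist: Callable[[str, str, Optional[str]], Dist],
    threshold: float = 0.2,
) -> Tuple[str, float]:
    """Answer a question, using a generated context only when it matters.

    Assumption: the shift in the answer distribution caused by adding the
    context is used as a simple proxy for its effect. The threshold rule
    below is illustrative, not the paper's filtering criterion.
    """
    # Direct decoding: no context.
    direct = answer_dist(image, question, None)
    # Context generation: ask the model to describe the image.
    context = generate_context(image)
    # Context-conditioned decoding.
    with_context = answer_dist(image, question, context)
    # Proxy for the effect of the context on the answer.
    effect = total_variation(direct, with_context)
    chosen = with_context if effect >= threshold else direct
    return max(chosen, key=chosen.get), effect


# Toy stand-ins for a real multi-modal model (hypothetical, fixed outputs).
def toy_generate_context(image: str) -> str:
    return f"A photo showing {image}."


def toy_answer_dist(image: str, question: str, context: Optional[str]) -> Dist:
    if context is None:
        return {"yes": 0.55, "no": 0.45}
    return {"yes": 0.10, "no": 0.90}


if __name__ == "__main__":
    answer, effect = answer_with_context_filter(
        "an empty street", "Is there a car?", toy_generate_context, toy_answer_dist
    )
    print(answer, round(effect, 2))
```

In a real setting, the two callables would wrap calls to an MLM, and the threshold would need to be tuned on held-out data. Raising the threshold makes the sketch fall back to direct decoding more often.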
