CVAIAug 15, 2024

Multimodal Causal Reasoning Benchmark: Challenging Vision Large Language Models to Discern Causal Links Across Modalities

arXiv:2408.08105v414 citationsh-index: 17Has Code
Originality Incremental advance
AI Analysis

This work addresses the challenge of robust causal inference across text and vision for AI researchers, though it is incremental as it builds on existing MLLM capabilities.

The authors tackled the problem of multimodal causal reasoning by introducing the MuCR benchmark, which uses synthetic siamese images and text pairs to challenge Multimodal Large Language Models (MLLMs), revealing that current models fall short compared to textual settings and proposing a VcCoT strategy that improves performance.

Multimodal Large Language Models (MLLMs) have showcased exceptional Chain-of-Thought (CoT) reasoning ability in complex textual inference tasks including causal reasoning. However, will these causalities remain straightforward when crucial hints hide in visual details? If not, what factors might influence cross-modal generalization? Whether we can effectively enhance their capacity for robust causal inference across both text and vision? Motivated by these, we introduce MuCR - a novel Multimodal Causal Reasoning benchmark that leverages synthetic siamese images and text pairs to challenge MLLMs. Additionally, we develop tailored metrics from multiple perspectives, including image-level match, phrase-level understanding, and sentence-level explanation, to comprehensively assess MLLMs' comprehension abilities. Our experiments reveal that current MLLMs fall short in multimodal causal reasoning compared to their performance in purely textual settings. Additionally, we find that identifying visual cues across images is key to effective cross-modal generalization. Finally, we propose a VcCoT strategy that better highlights visual cues, and our results confirm its efficacy in enhancing multimodal causal reasoning. The project is available at: https://github.com/Zhiyuan-Li-John/MuCR

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes