AI, Feb 24

Diagnosing Causal Reasoning in Vision-Language Models via Structured Relevance Graphs

arXiv:2602.20878v1
h-index: 1
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of diagnosing causal reasoning limitations in vision-language models, a problem of interest to AI researchers. The advance is incremental, since it builds on existing evaluation methods.

The paper tackles the problem of assessing whether vision-language models rely on spurious correlations rather than genuine causal reasoning. It introduces Vision-Language Causal Graphs (VLCGs) and the ViLCaR benchmark, and reports that injecting structured relevance information significantly improves attribution and inference consistency in state-of-the-art models.

Large Vision-Language Models (LVLMs) achieve strong performance on visual question answering benchmarks, yet often rely on spurious correlations rather than genuine causal reasoning. Existing evaluations primarily assess the correctness of the answers, making it unclear whether failures arise from limited reasoning capability or from misidentifying causally relevant information. We introduce Vision-Language Causal Graphs (VLCGs), a structured, query-conditioned representation that explicitly encodes causally relevant objects, attributes, relations, and scene-grounded assumptions. Building on this representation, we present ViLCaR, a diagnostic benchmark comprising tasks for Causal Attribution, Causal Inference, and Question Answering, along with graph-aligned evaluation metrics that assess relevance identification beyond final answer accuracy. Experiments on state-of-the-art LVLMs show that injecting structured relevance information significantly improves attribution and inference consistency compared to zero-shot and standard in-context learning. These findings suggest that current limitations in LVLM causal reasoning stem primarily from insufficient structural guidance rather than a lack of reasoning capacity.
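
To make the idea of a query-conditioned graph and a graph-aligned metric more concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the VLCG schema or the exact ViLCaR metrics, so the class names (`CausalGraph`, `Relation`), the element labelling scheme, and the precision/recall/F1 scoring in `relevance_scores` are all assumptions chosen for illustration. The sketch stores the four element types named in the abstract and scores how well a model's predicted set of relevant elements overlaps with the graph.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relation:
    """Hypothetical directed relation between two scene entities."""
    subject: str
    predicate: str
    obj: str


@dataclass
class CausalGraph:
    """Hypothetical container for one query-conditioned graph.

    The four fields mirror the element types listed in the abstract;
    the concrete layout is an assumption, not the paper's schema.
    """
    query: str
    objects: set[str] = field(default_factory=set)
    attributes: set[tuple[str, str]] = field(default_factory=set)  # (object, attribute)
    relations: set[Relation] = field(default_factory=set)
    assumptions: set[str] = field(default_factory=set)

    def relevant_elements(self) -> set[str]:
        """Flatten every element into a typed string label for set comparison."""
        elements = {f"object:{o}" for o in self.objects}
        elements |= {f"attribute:{o}.{a}" for o, a in self.attributes}
        elements |= {
            f"relation:{r.subject}-{r.predicate}-{r.obj}" for r in self.relations
        }
        elements |= {f"assumption:{a}" for a in self.assumptions}
        return elements


def relevance_scores(predicted: set[str], graph: CausalGraph) -> dict[str, float]:
    """Set-overlap precision, recall, and F1 against the graph's elements.

    This is an illustrative stand-in for a graph-aligned metric.
    """
    gold = graph.relevant_elements()
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy example with made-up scene content.
graph = CausalGraph(
    query="Why is the road wet?",
    objects={"road", "cloud"},
    attributes={("road", "wet")},
    relations={Relation("rain", "falls_on", "road")},
    assumptions={"it rained recently"},
)
predicted = {"object:road", "attribute:road.wet", "object:car"}
scores = relevance_scores(predicted, graph)
print(scores)
```

In this toy case the model names two of the five graph elements plus one irrelevant object, so a correct final answer could still coincide with a low relevance score. That gap between answer accuracy and relevance identification is what the abstract says the benchmark is designed to expose.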

