AI LG · May 21

Measuring Cross-Modal Synergy: A Benchmark for VLM Explainability

Joël Roman Ky, Salah Ghamizi, Maxime Cordy

arXiv:2605.22168 · 72.6

Predicted impact: top 46% in AI · last 90 days · Originality: Highly original

AI Analysis

For researchers and practitioners evaluating VLM explainability, this work provides a rigorous metric to decouple visual plausibility from cross-modal faithfulness, addressing a critical evaluation collapse in high-stakes deployments.

The paper identifies a failure in existing VLM explainability evaluation: because VLMs can often answer from one modality alone (cross-modal redundancy), unimodal metrics penalize faithful explainers, and the visual and textual rankings of explainers contradict each other (Kendall's τ = -0.06). The authors propose Synergistic Faithfulness (F_syn), a metric based on the Shapley Interaction Index that serves as an accurate surrogate (ρ = 0.92) at a 24× computational speedup. They show that explainers designed for VLMs over-index on visual salience, while adapted attention-based methods better capture cross-modal synergy.

Vision-Language Models (VLMs) map complex visual inputs to semantic spaces, but interpreting the cross-modal reasoning of VLMs currently relies on post-hoc explainers evaluated via unimodal perturbation metrics. We expose a limitation in this paradigm: because multimodal datasets contain language priors and modality biases, VLMs frequently exhibit cross-modal redundancy, allowing them to answer visual queries using text alone. Consequently, unimodal metrics penalize faithful explainers, triggering an evaluation collapse where visual and textual rankings fundamentally contradict each other (Kendall's $\tau = -0.06$). To resolve this, we introduce Synergistic Faithfulness ($\mathcal{F}_{syn}$), a scalable metric rooted in the Shapley Interaction Index that strictly isolates the joint Harsanyi dividend between modalities, serving as a highly accurate surrogate ($\rho = 0.92$) while achieving a $24\times$ computational speedup. An evaluation of 8 distinct XAI methods across 3 VLM architectures and 3 benchmark datasets reveals that explainers proposed for VLMs heavily over-index on visual salience and significantly underperform adapted attention-based methods in capturing true cross-modal synergy. By decoupling visual plausibility from cross-modal faithfulness, this work provides a rigorous evaluation framework required to safely audit VLM reasoning in high-stakes deployments.
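
To make the "joint Harsanyi dividend between modalities" concrete, the sketch below works through the simplest possible case: a game with exactly two players, the image and the text, where the dividend of the joint coalition is v({I,T}) - v({I}) - v({T}) + v(∅). This is only an illustration of the underlying game-theoretic quantity, not the paper's $\mathcal{F}_{syn}$ implementation, which the abstract does not specify. The function name `joint_dividend`, the player labels, and the toy value tables (standing in for a model's confidence under each masking configuration) are all assumptions made for the example.

```python
from typing import Callable, FrozenSet

Coalition = FrozenSet[str]

# Hypothetical player labels: each modality is treated as a single player.
EMPTY: Coalition = frozenset()
IMAGE: Coalition = frozenset({"image"})
TEXT: Coalition = frozenset({"text"})
BOTH: Coalition = IMAGE | TEXT


def joint_dividend(v: Callable[[Coalition], float]) -> float:
    """Harsanyi dividend of the {image, text} coalition in a two-player game.

    v(S) is the model's score when only the modalities in S are kept and
    the rest are masked. The dividend is what remains of v(BOTH) after
    subtracting what each modality contributes on its own.
    """
    return v(BOTH) - v(IMAGE) - v(TEXT) + v(EMPTY)


# Toy value tables (made-up numbers, not results from the paper).
# Redundant case: either modality alone already yields the full score.
redundant = {EMPTY: 0.125, IMAGE: 0.875, TEXT: 0.875, BOTH: 0.875}
# Synergistic case: the score rises only when both modalities are present.
synergistic = {EMPTY: 0.125, IMAGE: 0.125, TEXT: 0.125, BOTH: 0.875}

redundant_dividend = joint_dividend(redundant.__getitem__)
synergistic_dividend = joint_dividend(synergistic.__getitem__)

print("redundant:", redundant_dividend)
print("synergistic:", synergistic_dividend)
```

In the redundant table the joint dividend is negative, since each modality alone already explains the full score, while in the synergistic table it is positive. A unimodal perturbation metric sees only v({I}) or v({T}) against v({I,T}) and so cannot tell these two situations apart in the way the interaction term does.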
