CV, AI | Jun 23, 2025

GLIMPSE: Holistic Cross-Modal Explainability for Large Vision-Language Models

arXiv:2506.18985v3 | 3 citations

Originality: Incremental advance

AI Analysis

The paper addresses explainability for large vision-language models, which is essential for understanding model behavior. The contribution appears incremental, since it builds on existing gradient-based attribution methods.

The authors tackle the problem of interpreting where large vision-language models direct their visual attention. They introduce GLIMPSE, a lightweight framework that jointly attributes model outputs to visual and textual signals, and they report that it outperforms prior methods in faithfulness and human-attention alignment.

Recent large vision-language models (LVLMs) have advanced capabilities in visual question answering (VQA). However, interpreting where LVLMs direct their visual attention remains a significant challenge, even though it is essential for understanding model behavior. We introduce GLIMPSE (Gradient-Layer Importance Mapping for Prompted Visual Saliency Explanation), a lightweight, model-agnostic framework that jointly attributes LVLM outputs to the most relevant visual evidence and textual signals that support open-ended generation. GLIMPSE fuses gradient-weighted attention, adaptive layer propagation, and relevance-weighted token aggregation to produce holistic response-level heat maps for interpreting cross-modal reasoning, outperforming prior methods in faithfulness and pushing the state-of-the-art in human-attention alignment. We demonstrate an analytic approach to uncover fine-grained insights into LVLM cross-modal attribution, trace reasoning dynamics, analyze systematic misalignment, diagnose hallucination and bias, and ensure transparency.
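The abstract names three ingredients (gradient-weighted attention, layer fusion, and relevance-weighted token aggregation) but does not give their formulas. The following is a minimal NumPy sketch of how such a pipeline could be wired together, not the authors' implementation. The function name `response_heatmap`, the array layout, and in particular the mass-based layer weighting (a simple stand-in for the paper's adaptive layer propagation) are all assumptions made for illustration.

```python
import numpy as np


def response_heatmap(attn, grad, token_relevance, eps=1e-12):
    """Hypothetical sketch of a response-level saliency map.

    attn, grad: arrays of shape (tokens, layers, heads, patches) holding, for
        each generated token, the attention paid to each image patch and the
        gradient of that token's score with respect to the attention.
    token_relevance: non-negative array of shape (tokens,).
    Returns a non-negative vector of shape (patches,) that sums to 1.
    """
    attn = np.asarray(attn, dtype=float)
    grad = np.asarray(grad, dtype=float)

    # 1. Gradient-weighted attention: keep positive contributions only,
    #    then average over heads -> (tokens, layers, patches).
    gw = np.maximum(attn * grad, 0.0).mean(axis=2)

    # 2. Layer fusion (ASSUMPTION): weight each layer by its share of the
    #    total positive mass for that token -> (tokens, patches).
    layer_mass = gw.sum(axis=2)
    layer_w = layer_mass / (layer_mass.sum(axis=1, keepdims=True) + eps)
    per_token = (layer_w[..., None] * gw).sum(axis=1)

    # 3. Relevance-weighted token aggregation -> (patches,).
    rel = np.asarray(token_relevance, dtype=float)
    rel = rel / (rel.sum() + eps)
    heat = rel @ per_token

    # Normalize so the map can be read as a distribution over patches.
    return heat / (heat.sum() + eps)


# Usage with random stand-in data: 4 tokens, 3 layers, 2 heads, 16 patches.
rng = np.random.default_rng(0)
demo_attn = rng.random((4, 3, 2, 16))
demo_grad = rng.standard_normal((4, 3, 2, 16))
demo_heat = response_heatmap(demo_attn, demo_grad, rng.random(4))
demo_grid = demo_heat.reshape(4, 4)  # view as a 4x4 patch grid for plotting
```

In a real setting the attention tensors and their gradients would come from a forward and backward pass through the LVLM, and the patch vector would be reshaped to the vision encoder's patch grid before being overlaid on the image.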
