CV · AI · Jun 23, 2025

GLIMPSE: Holistic Cross-Modal Explainability for Large Vision-Language Models

arXiv:2506.18985v3 · 3 citations
Originality: Incremental advance
AI Analysis

This paper addresses explainability for large vision-language models, which is essential for understanding model behavior. The contribution appears incremental because it builds on existing gradient-based attribution methods.

The authors tackled the problem of interpreting where large vision-language models direct their visual attention by introducing GLIMPSE, a lightweight framework that jointly attributes outputs to visual and textual signals, outperforming prior methods in faithfulness and human-attention alignment.

Recent large vision-language models (LVLMs) have advanced capabilities in visual question answering (VQA). However, interpreting where LVLMs direct their visual attention remains a significant challenge, yet is essential for understanding model behavior. We introduce GLIMPSE (Gradient-Layer Importance Mapping for Prompted Visual Saliency Explanation), a lightweight, model-agnostic framework that jointly attributes LVLM outputs to the most relevant visual evidence and textual signals that support open-ended generation. GLIMPSE fuses gradient-weighted attention, adaptive layer propagation, and relevance-weighted token aggregation to produce holistic response-level heat maps for interpreting cross-modal reasoning, outperforming prior methods in faithfulness and pushing the state-of-the-art in human-attention alignment. We demonstrate an analytic approach to uncover fine-grained insights into LVLM cross-modal attribution, trace reasoning dynamics, analyze systematic misalignment, diagnose hallucination and bias, and ensure transparency.
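The abstract names three ingredients: gradient-weighted attention, layer-wise fusion, and relevance-weighted token aggregation. The sketch below illustrates how such a pipeline could be wired together in plain Python. It is not the paper's formulation: the function names, the `[token][layer][head][patch]` input layout, the ReLU on the attention-gradient product, and the rule that a layer's weight is its share of positive relevance mass are all assumptions made for illustration. The per-token relevance scores are taken as given inputs.

```python
from typing import List

Vector = List[float]


def normalize(weights: Vector) -> Vector:
    """Scale non-negative weights to sum to 1; fall back to uniform if all are zero."""
    total = sum(weights)
    if total <= 0.0:
        return [1.0 / len(weights)] * len(weights)
    return [w / total for w in weights]


def grad_weighted_attention(attn: List[Vector], grad: List[Vector]) -> Vector:
    """Head-averaged ReLU(attention * gradient) for one layer; inputs are [head][patch]."""
    n_heads, n_patches = len(attn), len(attn[0])
    out = [0.0] * n_patches
    for h in range(n_heads):
        for p in range(n_patches):
            # Keep only attention that positively supports the generated token.
            out[p] += max(0.0, attn[h][p] * grad[h][p])
    return [v / n_heads for v in out]


def weighted_sum(maps: List[Vector], weights: Vector) -> Vector:
    """Combine several patch maps into one using the given weights."""
    n_patches = len(maps[0])
    return [sum(w * m[p] for w, m in zip(weights, maps)) for p in range(n_patches)]


def response_heatmap(attn, grad, token_relevance: Vector) -> Vector:
    """Build one patch-level map for a whole response.

    attn and grad are nested lists indexed [token][layer][head][patch].
    token_relevance holds one non-negative score per generated token.
    """
    token_maps = []
    for tok_attn, tok_grad in zip(attn, grad):
        layer_maps = [grad_weighted_attention(a, g) for a, g in zip(tok_attn, tok_grad)]
        # Assumption: a layer's weight is its share of total positive relevance mass.
        layer_weights = normalize([sum(m) for m in layer_maps])
        token_maps.append(weighted_sum(layer_maps, layer_weights))
    # Tokens with higher relevance contribute more to the response-level map.
    heat = weighted_sum(token_maps, normalize(token_relevance))
    return normalize(heat)


# Toy input: 1 generated token, 1 layer, 2 heads, 2 image patches.
attn = [[[[0.5, 0.5], [0.2, 0.8]]]]
grad = [[[[1.0, -1.0], [0.0, 1.0]]]]
heat = response_heatmap(attn, grad, token_relevance=[1.0])
```

In the toy input, the negative gradient on the first head's second patch is clipped to zero, so that head contributes only to the first patch. The returned map is normalized to sum to 1 over patches.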
