CV | AI | May 23, 2024

Explaining Multi-modal Large Language Models by Analyzing their Vision Perception

arXiv:2405.14612v2 | 4 citations | h-index: 3 | BMVC
Originality: Highly original
AI Analysis

This paper addresses the problem of interpretability for users in critical applications, offering a novel method for analyzing vision perception in MLLMs.

The paper tackles the interpretability challenge in Multi-modal Large Language Models (MLLMs) by proposing a novel architecture that integrates an open-world localization model with an MLLM, so that text generation and object localization are produced simultaneously from the same vision embedding. This enhances interpretability through saliency maps, hallucination identification, and bias assessment.

Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities in understanding and generating content across various modalities, such as images and text. However, their interpretability remains a challenge, hindering their adoption in critical applications. This research proposes a novel approach to enhance the interpretability of MLLMs by focusing on the image embedding component. We combine an open-world localization model with an MLLM, thus creating a new architecture able to simultaneously produce text and object localization outputs from the same vision embedding. The proposed architecture greatly promotes interpretability, enabling us to design a novel saliency map to explain any output token, to identify model hallucinations, and to assess model biases through semantic adversarial perturbations.

Code Implementations: 1 repo
