Exploring Implicit Visual Misunderstandings in Multimodal Large Language Models through Attention Analysis
This paper addresses the problem of evaluating genuine visual understanding in MLLMs, which is relevant to AI researchers; its contribution is incremental in that it builds on existing attention analysis methods.
The paper tackles the problem of Multimodal Large Language Models (MLLMs) giving correct answers without fully comprehending the visual input. It defines implicit visual misunderstanding (IVM) and introduces a scale-agnostic metric, attention accuracy, to quantify it; the metric remains robust to positional biases.
Recent advancements have enhanced the capability of Multimodal Large Language Models (MLLMs) to comprehend multi-image information. However, existing benchmarks primarily evaluate answer correctness, overlooking whether models genuinely comprehend the visual input. To address this, we define implicit visual misunderstanding (IVM), where MLLMs provide correct answers without fully comprehending the visual input. Through our analysis, we decouple the visual and textual modalities within the causal attention module, revealing that attention distribution increasingly converges on the image associated with the correct answer as the network layers deepen. This insight leads to the introduction of a scale-agnostic metric, \textit{attention accuracy}, and a novel benchmark for quantifying IVMs. Attention accuracy directly evaluates the model's visual understanding via internal mechanisms, remaining robust to positional biases for more reliable assessments. Furthermore, we extend our approach to finer granularities and demonstrate its effectiveness in unimodal scenarios, underscoring its versatility and generalizability.
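The abstract does not give a formula for attention accuracy, so the following is only a minimal sketch under stated assumptions: attention from one query position (for example, the answer token) to the visual tokens has already been extracted at a single layer and averaged over heads, each image occupies a known span of visual tokens, and a sample counts as correct when the most-attended image is the one associated with the correct answer. The names `attention_per_image` and `attention_accuracy` and the data layout are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

# One sample: (attention over visual tokens, per-image token spans, target image index).
Sample = Tuple[Sequence[float], Sequence[Tuple[int, int]], int]


def attention_per_image(
    attn_to_visual: Sequence[float],
    image_spans: Sequence[Tuple[int, int]],
) -> List[float]:
    """Mean attention weight inside each image's half-open (start, end) span.

    Averaging (rather than summing) keeps images with different token
    counts comparable. This is an assumption of the sketch.
    """
    scores = []
    for start, end in image_spans:
        span = attn_to_visual[start:end]
        scores.append(sum(span) / len(span))
    return scores


def attention_accuracy(samples: Sequence[Sample]) -> float:
    """Fraction of samples whose most-attended image is the target image."""
    hits = 0
    for attn, spans, target in samples:
        scores = attention_per_image(attn, spans)
        # Index of the image with the highest mean attention (first on ties).
        predicted = max(range(len(scores)), key=scores.__getitem__)
        hits += int(predicted == target)
    return hits / len(samples)


# Toy data: two images of two visual tokens each; the target is image 1.
samples = [
    ([0.1, 0.1, 0.4, 0.4], [(0, 2), (2, 4)], 1),  # attention favors image 1
    ([0.3, 0.3, 0.2, 0.2], [(0, 2), (2, 4)], 1),  # attention favors image 0
]
accuracy = attention_accuracy(samples)
```

Because the prediction depends only on which image ranks highest, uniformly rescaling the attention weights leaves the result unchanged. Whether this matches the paper's exact definition, and which layers or heads it aggregates over, cannot be determined from the abstract alone.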