CV · CL · Dec 13, 2024

Enhancing Multimodal Large Language Models Complex Reason via Similarity Computation

arXiv:2412.09817v1 · 5 citations · h-index: 3 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses interpretability and efficiency issues in multimodal models, which matter to both researchers and practitioners, but it is an incremental advance because it builds on existing models such as LLaVA1.5.

The paper tackles the problem of improving complex reasoning in multimodal large language models by proposing Simignore, a method that reduces irrelevant image tokens based on similarity computation, leading to enhanced performance on complex reasoning tasks.

Multimodal large language models have experienced rapid growth, and numerous different models have emerged. The interpretability of large vision-language models (LVLMs) remains an under-explored area. Especially when faced with more complex tasks such as chain-of-thought reasoning, their internal mechanisms still resemble a black box that is difficult to decipher. By studying the interaction and information flow between images and text, we noticed that in models such as LLaVA1.5, image tokens that are semantically related to the text are more likely to show information flow convergence in the LLM decoding layers, and these image tokens receive higher attention scores. In contrast, image tokens that are less relevant to the text show no information flow convergence and receive only very small attention scores. To use the image information efficiently, we propose a new image token reduction method, Simignore, which aims to improve the complex reasoning ability of LVLMs by computing the similarity between image and text embeddings and ignoring image tokens that are irrelevant and unimportant to the text. Through extensive experiments, we demonstrate the effectiveness of our method on complex reasoning tasks. The paper's source code can be accessed at https://github.com/FanshuoZeng/Simignore.
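The idea in the abstract, scoring image tokens by their similarity to the text and ignoring the least relevant ones, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `cosine` and `select_image_tokens`, the use of cosine similarity, the max-over-text-tokens aggregation, and the fixed `keep` budget are all assumptions made for illustration, and the toy 2-D embeddings are invented.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def select_image_tokens(image_emb, text_emb, keep):
    """Return indices of the `keep` image tokens most similar to the text.

    Assumption: an image token's score is its best cosine similarity
    against any text token; the paper may aggregate differently.
    """
    scores = [max(cosine(img, txt) for txt in text_emb) for img in image_emb]
    # Rank image tokens from most to least text-relevant.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Keep the top tokens, restored to their original sequence order.
    return sorted(ranked[:keep])

# Toy example: four image tokens and one text token in a 2-D embedding space.
image_emb = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
text_emb = [[1.0, 0.1]]
kept = select_image_tokens(image_emb, text_emb, keep=2)
print(kept)
```

In this toy case the two image tokens pointing closest to the text direction are kept and the other two would be ignored. The released repository linked above is the reference for how the method actually selects tokens.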

Code Implementations: 1 repo
