CV | Oct 15, 2024

MMFuser: Multimodal Multi-Layer Feature Fuser for Fine-Grained Vision-Language Understanding

arXiv:2410.11829v1 | 21 citations | h-index: 26 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses fine-grained vision-language understanding in MLLMs, offering a more flexible and lightweight solution than previous multi-encoder methods. It is incremental in that it builds on existing models such as LLaVA-1.5.

The paper tackles the challenge of capturing intricate image details in Multimodal Large Language Models (MLLMs) by proposing MMFuser, a multi-layer feature fuser that integrates deep and shallow features from Vision Transformers. Applied to the LLaVA-1.5 model, it achieves significant improvements in visual representation and benchmark performance.

Despite significant advancements in Multimodal Large Language Models (MLLMs) for understanding complex human intentions through cross-modal interactions, capturing intricate image details remains challenging. Previous methods integrating multiple vision encoders to enhance visual detail introduce redundancy and computational overhead. We observe that most MLLMs utilize only the last-layer feature map of the vision encoder for visual representation, neglecting the rich fine-grained information in shallow feature maps. To address this issue, we propose MMFuser, a simple yet effective multi-layer feature fuser that efficiently integrates deep and shallow features from Vision Transformers (ViTs). Specifically, it leverages semantically aligned deep features as queries to dynamically extract missing details from shallow features, thus preserving semantic alignment while enriching the representation with fine-grained information. Applied to the LLaVA-1.5 model, MMFuser achieves significant improvements in visual representation and benchmark performance, providing a more flexible and lightweight solution compared to multi-encoder ensemble methods. The code and model have been released at https://github.com/yuecao0119/MMFuser.
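
The idea of using deep features as queries over shallow features can be illustrated with a small cross-attention sketch. This is not the released MMFuser implementation: the function name `cross_attention_fuse`, the absence of learned projections, the plain scaled dot-product attention, and the residual addition are all assumptions made for illustration, since the abstract does not specify the attention mechanism, projections, or normalization used.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_fuse(deep, shallow):
    """Fuse deep (last-layer) tokens with shallow-layer tokens.

    deep:    list of N token vectors from the last ViT layer (the queries).
    shallow: list of M token vectors gathered from earlier layers
             (used as both keys and values; an illustrative assumption).
    Returns N fused vectors: each deep token plus the detail it attends to.
    """
    dim = len(deep[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for query in deep:
        # Similarity between this deep token and every shallow token.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in shallow]
        weights = softmax(scores)
        # Weighted sum of shallow tokens: the "missing detail" for this query.
        detail = [
            sum(w * value[d] for w, value in zip(weights, shallow))
            for d in range(dim)
        ]
        # Residual addition keeps the deep feature as the base (assumption).
        fused.append([q + x for q, x in zip(query, detail)])
    return fused


if __name__ == "__main__":
    # Toy 2-dimensional tokens; real ViT features have hundreds of channels.
    deep_tokens = [[1.0, 0.0], [0.0, 1.0]]
    shallow_tokens = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
    for token in cross_attention_fuse(deep_tokens, shallow_tokens):
        print(token)
```

Because the output keeps one vector per deep token, the fused representation has the same number of tokens as the last-layer feature map. In this sketch the deep features also remain the base of the result, which is one simple way to read "preserving semantic alignment while enriching the representation with fine-grained information".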

Code Implementations: 1 repo