CV | AI | LG | Sep 9, 2024

Shaking Up VLMs: Comparing Transformers and Structured State Space Models for Vision & Language Modeling

arXiv:2409.05395v2 | 25 citations | h-index: 20
Originality: Incremental advance
AI Analysis

This work addresses the efficiency and performance trade-offs in vision-language modeling for AI researchers, though it is incremental as it compares existing architectures rather than introducing a new paradigm.

This study tackled the problem of replacing Transformers with Mamba structured state space models in Visual Language Models, finding that Mamba-based models up to 3B parameters outperform Transformers in captioning, question answering, and reading comprehension, but Transformers achieve greater performance in visual grounding, with the gap widening with scale.

This study explores replacing Transformers in Visual Language Models (VLMs) with Mamba, a recent structured state space model (SSM) that demonstrates promising performance in sequence modeling. We test models up to 3B parameters under controlled conditions, showing that Mamba-based VLMs outperform Transformer-based VLMs in captioning, question answering, and reading comprehension. However, we find that Transformers achieve greater performance in visual grounding, and the performance gap widens with scale. We explore two hypotheses to explain this phenomenon: 1) the effect of task-agnostic visual encoding on the updates of the hidden states, and 2) the difficulty in performing visual grounding from the perspective of in-context multimodal retrieval. Our results indicate that a task-aware encoding yields minimal performance gains on grounding; however, Transformers significantly outperform Mamba at in-context multimodal retrieval. Overall, Mamba shows promising performance on tasks where the correct output relies on a summary of the image but struggles when retrieval of explicit information from the context is required.

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
