CV | Mar 21, 2024

Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference

arXiv:2403.14520v4 | 126 citations | h-index: 18 | Has Code | AAAI
AI Analysis

This work addresses efficiency problems for users of multimodal AI models, offering a more scalable alternative with reduced computational costs, though it is incremental as it builds on existing Mamba and fusion techniques.

The paper tackles the inefficiency of Transformer-based multimodal large language models (MLLMs) by proposing Cobra, an MLLM with linear computational complexity that extends the Mamba language model to the visual modality. Cobra performs competitively with state-of-the-art efficient methods and achieves results comparable to LLaVA with about 43% of the parameters.

In recent years, the application of multimodal large language models (MLLM) in various fields has achieved remarkable success. However, as the foundation model for many downstream tasks, current MLLMs are composed of the well-known Transformer network, which has a less efficient quadratic computation complexity. To improve the efficiency of such basic models, we propose Cobra, a linear computational complexity MLLM. Specifically, Cobra integrates the efficient Mamba language model into the visual modality. Moreover, we explore and study various modal fusion schemes to create an effective multi-modal Mamba. Extensive experiments demonstrate that (1) Cobra achieves extremely competitive performance with current computationally efficient state-of-the-art methods, e.g., LLaVA-Phi, TinyLLaVA, and MobileVLM v2, and has faster speed due to Cobra's linear sequential modeling. (2) Interestingly, the results of closed-set challenging prediction benchmarks show that Cobra performs well in overcoming visual illusions and spatial relationship judgments. (3) Notably, Cobra even achieves comparable performance to LLaVA with about 43% of the number of parameters. We will make all codes of Cobra open-source and hope that the proposed method can facilitate future research on complexity problems in MLLM. Our project page is available at: https://sites.google.com/view/cobravlm.
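The quadratic-versus-linear contrast in the abstract can be illustrated with a toy sketch. This is not Cobra's or Mamba's actual implementation: Mamba uses input-dependent (selective) state-space parameters and a hardware-aware scan, whereas the code below uses a fixed scalar recurrence purely to show how the work grows with sequence length. The function names and the constants `a` and `b` are hypothetical, chosen only for illustration.

```python
def attention_pair_count(seq_len: int) -> int:
    """Count query-key score computations in full self-attention.

    Every token attends to every token, so the count grows as seq_len ** 2.
    """
    return seq_len * seq_len


def linear_scan(xs, a=0.9, b=0.1):
    """Toy scalar state-space recurrence: h_t = a * h_{t-1} + b * x_t.

    Assumption: fixed scalars a and b stand in for the learned,
    input-dependent parameters a real selective SSM would use.
    Returns the list of hidden states and the number of update steps.
    """
    h = 0.0
    states = []
    steps = 0
    for x in xs:
        h = a * h + b * x  # one constant-cost update per token
        states.append(h)
        steps += 1
    return states, steps


def scan_step_count(seq_len: int) -> int:
    """Number of recurrence updates needed for a sequence of seq_len tokens."""
    _, steps = linear_scan([1.0] * seq_len)
    return steps


if __name__ == "__main__":
    for n in (256, 1024, 4096):
        print(n, attention_pair_count(n), scan_step_count(n))
```

Quadrupling the sequence length multiplies the attention pair count by sixteen but the scan step count only by four, which is the scaling behaviour the abstract refers to as linear sequential modeling.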

Code Implementations: 1 repo