CV · AI · May 29

Zamba2-VL Technical Report

arXiv:2606.00390 · 92.2 · h-index: 26 · Has Code
Predicted impact: top 13% in CV · last 90 days · Originality: Incremental advance
AI Analysis

For practitioners deploying vision-language models on resource-constrained devices, Zamba2-VL offers a more efficient alternative to Transformer-based VLMs while remaining competitive in accuracy at comparable scale.

Zamba2-VL, a suite of vision-language models using a hybrid Mamba2-transformer architecture, achieves competitive performance with leading Transformer-based VLMs across multiple benchmarks while delivering roughly an order of magnitude lower time-to-first-token, especially at smaller scales suitable for edge deployment.

We present Zamba2-VL, a suite of vision-language models built on Zamba2, a hybrid language-model architecture combining Mamba2 state-space layers with a small number of shared transformer blocks. Across a broad range of image understanding, reasoning, OCR, grounding, and counting benchmarks, Zamba2-VL is competitive with leading Transformer-based open-weight VLMs of comparable scale, including the Molmo2, Qwen3-VL, and InternVL3.5 families, and substantially outperforms prior SSM-based and hybrid VLMs such as VL-Mamba, Cobra, and mmMamba. Inheriting the near-linear prefill compute and small, near-constant recurrent state of its Zamba2 backbone, Zamba2-VL delivers roughly an order of magnitude lower time-to-first-token (TTFT) than these Transformer baselines at matched parameter scale, with the efficiency gap most pronounced at the smaller 1.2B and 2.7B scales most relevant to on-device and edge deployment. We release three models -- 1.2B, 2.7B, and 7B -- together with inference code at https://huggingface.co/collections/Zyphra/zamba2-vl.
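The efficiency claim in the abstract rests on how token-mixing cost and cached state scale with prompt length. The sketch below is a toy cost model of that scaling, not code or measurements from the report: the function names, the dimensions (`d_model=2048`, `d_state=64`), and the FLOP-counting conventions are all illustrative assumptions, and it ignores projections, MLPs, the vision encoder, and kernel-level effects that shape real TTFT. Because Zamba2 keeps a few shared transformer blocks, a real hybrid sits between the two curves, which is presumably why the abstract says "near-linear" rather than linear.

```python
def attention_mixing_flops(n_tokens: int, d_model: int) -> int:
    """Rough prefill FLOPs for token mixing in one self-attention layer.

    Counts the QK^T score matrix and the score-weighted sum over V:
    two (n x n x d) products at 2 FLOPs per multiply-add.
    """
    return 2 * 2 * n_tokens * n_tokens * d_model


def ssm_mixing_flops(n_tokens: int, d_model: int, d_state: int) -> int:
    """Rough prefill FLOPs for token mixing in one linear-recurrence layer.

    Assumes each token updates a (d_model x d_state) state and reads it out,
    at 2 FLOPs per multiply-add. This is a simplification, not Mamba2's kernel.
    """
    return 2 * 2 * n_tokens * d_model * d_state


def kv_cache_elements(n_tokens: int, d_model: int) -> int:
    """Keys plus values cached per attention layer: grows with prompt length."""
    return 2 * n_tokens * d_model


def ssm_state_elements(d_model: int, d_state: int) -> int:
    """Recurrent state per SSM-style layer: independent of prompt length."""
    return d_model * d_state


if __name__ == "__main__":
    d_model, d_state = 2048, 64  # hypothetical sizes, chosen only for illustration
    for n in (1_024, 4_096, 16_384):
        ratio = attention_mixing_flops(n, d_model) / ssm_mixing_flops(n, d_model, d_state)
        print(
            f"n={n:>6}  attention/SSM mixing FLOPs = {ratio:.0f}x  "
            f"KV cache = {kv_cache_elements(n, d_model):,} elems  "
            f"SSM state = {ssm_state_elements(d_model, d_state):,} elems"
        )
```

Under these assumptions the attention-to-SSM cost ratio is simply `n_tokens / d_state`, so the gap widens as image tokens lengthen the prompt, while the recurrent state stays fixed in size.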

