CV · AI · May 8

ZAYA1-VL-8B Technical Report

arXiv:2605.08560 · 96.1 · Has Code
Predicted impact: top 7% in CV · last 90 days
Originality: Incremental advance
AI Analysis

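For researchers and practitioners who need efficient vision-language models, this work shows that a compact MoE model with 1.4B active parameters and two architectural changes (vision-specific LoRA adapters and bidirectional attention over image tokens) can compete with or surpass 3B to 4B-class models, offering a practical trade-off between active compute and performance.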

ZAYA1-VL-8B, a compact MoE vision-language model, achieves competitive performance with leading base models (e.g., Molmo2-4B, InternVL3.5-4B) and surpasses several others (e.g., Qwen2.5-VL-3B) on image understanding, reasoning, and counting benchmarks, using only 1.4B active parameters.

We present ZAYA1-VL-8B, a compact mixture-of-experts vision-language model built upon our in-house language model, ZAYA1-8B. Despite its compact size, ZAYA1-VL achieves performance competitive with leading base models such as Molmo2-4B and InternVL3.5-4B, while surpassing models including Qwen2.5-VL-3B, PLM-3B, and MolmoE-1B across a range of image understanding, reasoning, and counting benchmarks. The architecture incorporates two key innovations: (1) vision-specific LoRA adapters integrated into the LLM to increase modality-specific capacity without increasing the number of experts, and (2) bidirectional attention over image tokens within the LLM to enhance visual understanding. We detail the full training pipeline including data composition at each stage, sequence packing, and the attention masking scheme. The model comprises 9.2B total parameters, with 1.4B active parameters including the vision encoder, and is publicly available at https://huggingface.co/Zyphra/ZAYA1-VL.
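The abstract names bidirectional attention over image tokens, sequence packing, and an attention masking scheme but does not spell out the mask itself. The sketch below shows one common way such a mask can be built, and it is an assumption rather than the authors' implementation: text tokens attend causally, tokens of the same image attend to each other in both directions, and no token attends across the boundary between packed sequences. The function name build_attention_mask and the inputs segment_ids and image_ids are hypothetical names introduced for illustration.

```python
from typing import List


def build_attention_mask(segment_ids: List[int], image_ids: List[int]) -> List[List[bool]]:
    """Return mask[q][k] == True where query position q may attend to key position k.

    segment_ids[i]: index of the packed sequence that token i belongs to.
    image_ids[i]:   0 for a text token, or a positive id shared by all tokens
                    of one image (hypothetical encoding, chosen for this sketch).
    """
    n = len(segment_ids)
    if len(image_ids) != n:
        raise ValueError("segment_ids and image_ids must have the same length")

    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            # Packing: tokens from different packed sequences never interact.
            same_segment = segment_ids[q] == segment_ids[k]
            # Standard causal rule: a token sees itself and earlier positions.
            causal = k <= q
            # Bidirectional rule: tokens of the same image see each other,
            # including later positions within that image.
            same_image = image_ids[q] != 0 and image_ids[q] == image_ids[k]
            mask[q][k] = same_segment and (causal or same_image)
    return mask


# Two packed sequences: [text, img1, img1, text] and [img2, img2, text].
segment_ids = [0, 0, 0, 0, 1, 1, 1]
image_ids = [0, 1, 1, 0, 2, 2, 0]
mask = build_attention_mask(segment_ids, image_ids)

# Print the mask as a grid: '1' means attention is allowed.
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

In a real training setup the same boolean pattern would typically be passed to the attention kernel as a dense or block-sparse mask; the nested loops here are kept only for readability.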
