CL · AI · CV · Dec 3, 2025

Jina-VLM: Small Multilingual Vision Language Model

arXiv:2512.04032v21 citations · h-index: 10 · Has Code
AI Analysis

This work addresses multilingual visual understanding and represents an incremental improvement in the scaling and efficiency of existing model architectures.

The paper tackles multilingual visual question answering by introducing Jina-VLM, a 2.4B-parameter vision-language model that achieves state-of-the-art performance among open 2B-scale models on standard benchmarks.

We present Jina-VLM, a 2.4B-parameter vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language backbone through an attention-pooling connector that enables token-efficient processing of arbitrary-resolution images. The model achieves leading results on standard VQA benchmarks and multilingual evaluations while preserving competitive text-only performance. Model weights and code are publicly released at https://huggingface.co/jinaai/jina-vlm.
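
To make the idea of an attention-pooling connector concrete, here is a minimal NumPy sketch of how a grid of vision-encoder patch tokens could be reduced to fewer tokens before reaching the language model. It is an illustration only, not the released Jina-VLM code: the function name `attention_pool`, the 2x2 window, the mean-of-window query, and the absence of learned projections are all assumptions made for this example.

```python
import numpy as np


def attention_pool(tokens, grid_h, grid_w, window=2):
    """Pool each window x window patch neighbourhood into a single token.

    Hypothetical sketch: the window size and the mean-of-window query are
    illustrative choices, and no learned projection weights are used.
    """
    d = tokens.shape[-1]
    assert tokens.shape[0] == grid_h * grid_w
    assert grid_h % window == 0 and grid_w % window == 0

    # (grid_h * grid_w, d) -> (n_windows, window * window, d)
    x = tokens.reshape(grid_h // window, window, grid_w // window, window, d)
    x = x.transpose(0, 2, 1, 3, 4).reshape(-1, window * window, d)

    # One query per window: the mean of that window's tokens.
    q = x.mean(axis=1, keepdims=True)                    # (n, 1, d)

    # Scaled dot-product attention of the query over the window's tokens.
    scores = q @ x.transpose(0, 2, 1) / np.sqrt(d)       # (n, 1, window^2)
    scores -= scores.max(axis=-1, keepdims=True)         # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)

    # Weighted sum of the window's tokens gives the pooled token.
    return (weights @ x).squeeze(1)                      # (n, d)


rng = np.random.default_rng(0)
patches = rng.standard_normal((16 * 16, 64))  # stand-in for encoder output
pooled = attention_pool(patches, 16, 16)
print(patches.shape, "->", pooled.shape)
```

With these assumed settings, a 16x16 grid of 256 patch tokens becomes 64 pooled tokens, so the language backbone would see a quarter as many visual tokens per image tile. A trained connector would typically add learned query, key, and value projections; they are omitted here to keep the sketch short.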

