CL AI CV

Dec 3, 2025

Jina-VLM: Small Multilingual Vision Language Model

Andreas Koukounas, Georgios Mastrapas, Florian Hönicke, Sedigheh Eslami, Guillaume Roncari, Scott Martens, Han Xiao

arXiv:2512.04032v2

4.91 citations

h-index: 10

Has Code

Originality: Incremental advance

AI Analysis

This work addresses multilingual visual understanding. It is an incremental advance that improves the scaling and efficiency of existing model architectures.

The paper tackles multilingual visual question answering by introducing Jina-VLM, a 2.4B-parameter vision-language model that achieves state-of-the-art performance among open 2B-scale models on standard benchmarks.

We present Jina-VLM, a 2.4B parameter vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language backbone through an attention-pooling connector that enables token-efficient processing of arbitrary-resolution images. The model achieves leading results on standard VQA benchmarks and multilingual evaluations while preserving competitive text-only performance. Model weights and code are publicly released at https://huggingface.co/jinaai/jina-vlm.
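The abstract does not spell out how the attention-pooling connector works, so the following is only a minimal sketch of the general idea: neighbouring patch embeddings are merged into one token by attention, which cuts the number of visual tokens passed to the language model. The 2x2 window, the use of the window mean as the query, and the absence of learned projections are all assumptions made for illustration. The function names `attention_pool` and `pool_grid` are hypothetical and are not taken from the released code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(window):
    """Merge a list of D-dim patch vectors into a single D-dim vector.

    Assumption: the query is the mean of the window, and keys and values
    are the raw patch vectors (no learned projections).
    """
    dim = len(window[0])
    query = [sum(v[i] for v in window) / len(window) for i in range(dim)]
    # Scaled dot-product score of the query against each patch.
    scores = [sum(q * k for q, k in zip(query, v)) / math.sqrt(dim) for v in window]
    weights = softmax(scores)
    # Weighted sum of the patch vectors.
    return [sum(w * v[i] for w, v in zip(weights, window)) for i in range(dim)]


def pool_grid(grid, size=2):
    """Pool an H x W grid of patch vectors over non-overlapping size x size windows.

    Assumes H and W are divisible by `size`.
    """
    rows, cols = len(grid), len(grid[0])
    pooled = []
    for r in range(0, rows, size):
        for c in range(0, cols, size):
            window = [grid[r + dr][c + dc] for dr in range(size) for dc in range(size)]
            pooled.append(attention_pool(window))
    return pooled


# Toy 4x4 grid of 3-dimensional patch embeddings.
grid = [[[float(r), float(c), 1.0] for c in range(4)] for r in range(4)]
tokens = pool_grid(grid)
print(len(grid) * len(grid[0]), "->", len(tokens))  # prints "16 -> 4"
```

With a 2x2 window, the token count drops by a factor of four regardless of image size, which is one way a connector of this kind could keep high-resolution inputs affordable for the language backbone.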
