CV | CL | Sep 12, 2025

VARCO-VISION-2.0 Technical Report

arXiv:2509.10105v2 | 4 citations | h-index: 2
Originality: Incremental advance
AI Analysis

This work advances bilingual vision-language models for practical applications in Korean and English contexts, though it is incremental as it builds on previous models.

The authors introduce VARCO-VISION-2.0, an open-weight bilingual vision-language model for Korean and English that improves on its predecessor, VARCO-VISION-14B, by supporting multi-image understanding and layout-aware OCR. The 14B model ranks 8th on the OpenCompass VLM leaderboard among models of comparable scale.

We introduce VARCO-VISION-2.0, an open-weight bilingual vision-language model (VLM) for Korean and English with improved capabilities compared to the previous model VARCO-VISION-14B. The model supports multi-image understanding for complex inputs such as documents, charts, and tables, and delivers layout-aware OCR by predicting both textual content and its spatial location. Trained with a four-stage curriculum with memory-efficient techniques, the model achieves enhanced multimodal alignment, while preserving core language abilities and improving safety via preference optimization. Extensive benchmark evaluations demonstrate strong spatial grounding and competitive results for both languages, with the 14B model achieving 8th place on the OpenCompass VLM leaderboard among models of comparable scale. Alongside the 14B-scale model, we release a 1.7B version optimized for on-device deployment. We believe these models advance the development of bilingual VLMs and their practical applications. Two variants of VARCO-VISION-2.0 are available at Hugging Face: a full-scale 14B model and a lightweight 1.7B model.

