LG, AI | Oct 22, 2025

TowerVision: Understanding and Improving Multilinguality in Vision-Language Models

arXiv:2510.21849v3 | 1 citation | h-index: 18
Originality: Incremental advance
AI Analysis

This work addresses the limited multilingual effectiveness of vision-language models, an incremental advance of interest to researchers and practitioners in AI and NLP.

The paper tackles the English-centric design of vision-language models by analyzing multilingual design choices and introducing TowerVision, a family of open multilingual VLMs. The models achieve competitive performance on benchmarks such as ALM-Bench and Multi30K and surpass existing approaches trained on substantially larger datasets.

Despite significant advances in vision-language models (VLMs), most existing work follows an English-centric design process, limiting their effectiveness in multilingual settings. In this work, we provide a comprehensive empirical study analyzing the impact of several multilingual design choices, such as training data composition, encoder selection, and text backbones. The result is TowerVision, a family of open multilingual VLMs for both image-text and video-text tasks, built upon the multilingual text-only model Tower+. TowerVision achieves competitive performance on multiple multimodal multilingual benchmarks and shows particular strength in culturally grounded tasks and multimodal translation. By incorporating visual and cultural context during fine-tuning, our models surpass existing approaches trained on substantially larger datasets, as demonstrated on ALM-Bench and Multi30K (image tasks) and ViMUL-Bench (video tasks). Alongside the models, we release VisionBlocks, a high-quality, curated vision-language dataset. Our findings highlight that multilingual vision-language training data substantially improves cross-lingual generalization -- both from high-resource to underrepresented languages and vice versa -- and that instruction-tuned LLMs are not always the optimal initialization point. To support further research, we publicly release all models, data, and training recipes.

