LG AI

Oct 22, 2025

TowerVision: Understanding and Improving Multilinguality in Vision-Language Models

André G. Viveiros, Patrick Fernandes, Saul Santos, Sonal Sannigrahi, Emmanouil Zaranis, Nuno M. Guerreiro, Amin Farajian, Pierre Colombo, Graham Neubig, André F. T. Martins

arXiv:2510.21849v3

1 citation

h-index: 18

Originality: Incremental advance

AI Analysis

This work addresses the limited multilingual effectiveness of vision-language models, an incremental advance relevant to researchers and practitioners in AI and NLP.

The paper tackles the English-centric design of vision-language models by analyzing multilingual design choices and introducing TowerVision, a family of open multilingual VLMs. The models achieve competitive performance on benchmarks such as ALM-Bench and Multi30K and surpass existing approaches trained on substantially larger datasets.

Despite significant advances in vision-language models (VLMs), most existing work follows an English-centric design process, limiting their effectiveness in multilingual settings. In this work, we provide a comprehensive empirical study analyzing the impact of several multilingual design choices, such as training data composition, encoder selection, and text backbones. The result is TowerVision, a family of open multilingual VLMs for both image-text and video-text tasks, built upon the multilingual text-only model Tower+. TowerVision achieves competitive performance on multiple multimodal multilingual benchmarks and shows particular strength in culturally grounded tasks and multimodal translation. By incorporating visual and cultural context during fine-tuning, our models surpass existing approaches trained on substantially larger datasets, as demonstrated on ALM-Bench and Multi30K (image tasks) and ViMUL-Bench (video tasks). Alongside the models, we release VisionBlocks, a high-quality, curated vision-language dataset. Our findings highlight that multilingual vision-language training data substantially improves cross-lingual generalization -- both from high-resource to underrepresented languages and vice versa -- and that instruction-tuned LLMs are not always the optimal initialization point. To support further research, we publicly release all models, data, and training recipes.
