CV | AI | Mar 28, 2025

Breaking Language Barriers in Visual Language Models via Multilingual Textual Regularization

arXiv:2503.22577v2 | 3 citations | h-index: 21 | IJCNLP-AACL
Originality: Incremental advance
AI Analysis

This work addresses the language barrier that limits global adoption of VLMs. It is an incremental advance that extends existing visual instruction tuning with multilingual text data.

The paper tackles Image-induced Fidelity Loss (IFL), the tendency of Visual Language Models (VLMs) to respond in English regardless of the input language. It proposes a continuous multilingual integration strategy that injects text-only multilingual data during visual instruction tuning. The reported result is a significant improvement in linguistic fidelity across languages without degradation in visual performance, which makes the approach a scalable way to mitigate IFL.

Rapid advancements in Visual Language Models (VLMs) have transformed multimodal understanding, but these models often generate English responses regardless of the input language. This phenomenon has been termed Image-induced Fidelity Loss (IFL) and stems from limited multimodal multilingual training data. To address this, we propose a continuous multilingual integration strategy that injects text-only multilingual data during visual instruction tuning, preserving the language model's original multilingual capabilities. Extensive evaluations demonstrate that our approach significantly improves linguistic fidelity across languages without degradation in visual performance. We also explore model merging, which improves language fidelity but comes at the cost of visual performance. In contrast, our core method achieves robust multilingual alignment without trade-offs, offering a scalable and effective path to mitigating IFL for global VLM adoption.
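The data-mixing idea in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `mix_text_only`, the sample schema (a dict with `image`, `prompt`, and `response` keys), and the default `text_ratio` of 0.2 are all assumptions made for illustration, and the abstract does not state the actual mixing proportions. The sketch reads "continuous" as meaning that text-only samples are interleaved throughout the visual instruction tuning data rather than used in a separate stage.

```python
import random
from typing import Dict, List, Optional

# Hypothetical sample schema: image is a file path, or None for text-only data.
Sample = Dict[str, Optional[str]]


def mix_text_only(
    visual_samples: List[Sample],
    text_samples: List[Sample],
    text_ratio: float = 0.2,  # assumed value, not taken from the paper
    seed: int = 0,
) -> List[Sample]:
    """Return a shuffled training list in which about `text_ratio` of the
    samples are text-only multilingual examples (image set to None)."""
    if not 0.0 <= text_ratio < 1.0:
        raise ValueError("text_ratio must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_text / (n_visual + n_text) = text_ratio for n_text.
    n_text = round(len(visual_samples) * text_ratio / (1.0 - text_ratio))
    # Never request more text-only samples than are available.
    n_text = min(n_text, len(text_samples))
    chosen = rng.sample(text_samples, n_text)
    # Copy each chosen sample and mark it as having no image.
    mixed = list(visual_samples) + [dict(s, image=None) for s in chosen]
    # Shuffle so text-only samples are spread across the whole training run.
    rng.shuffle(mixed)
    return mixed
```

Calling `mix_text_only(visual, multilingual_text, text_ratio=0.2)` on 80 image-text samples would add 20 text-only samples, so that one in five training examples carries no image. A real training loop would also need a collator that handles samples without an image, which is outside the scope of this sketch.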

