CL | Apr 30, 2025

Investigating the Effect of Parallel Data in the Cross-Lingual Transfer for Vision-Language Encoders

arXiv:2504.21681v2 | 1 citation | h-index: 4 | TSD
Originality: Incremental advance
AI Analysis

This work addresses the challenge of multilingual vision-language tasks for non-English speakers, but it is incremental as it builds on existing cross-lingual transfer methods.

The study investigated how parallel data affects cross-lingual transfer for vision-language encoders, finding that machine-translated task data performed best on average, but authentic caption-like data outperformed it in some languages, and that multilingual training benefits most languages.

Most pre-trained Vision-Language (VL) models and the training data for their downstream tasks are available only in English. Multilingual VL tasks are therefore solved using cross-lingual transfer: either by fine-tuning a multilingual pre-trained model or by transferring the text encoder using parallel data. We study the alternative approach: transferring an already trained encoder using parallel data. We investigate two properties of the parallel data that previous work did not focus on: its domain and the number of languages. Our results show that although machine-translated task data perform best on average, caption-like authentic parallel data outperform them in some languages. Further, we show that most languages benefit from multilingual training.
