CV, CL | Mar 16, 2021

Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models

arXiv:2103.08849v3739 citations | Has Code
Originality: Incremental advance
AI Analysis

This work addresses the challenge of making vision-language models effective across multiple languages without language-specific annotations, which matters for global AI applications. The advance is incremental because it builds on existing pre-training paradigms.

The paper tackles the performance degradation that vision-language models show in zero-shot cross-lingual transfer when they are queried with non-English sentences. It introduces a multilingual multimodal pre-training strategy and a new dataset (MultiHowTo100M); on VTT, this significantly improves video search in non-English languages without additional annotations. When multilingual annotations are available, the method outperforms recent baselines by a large margin on VTT, VATEX, and Multi30K.

This paper studies zero-shot cross-lingual transfer of vision-language models. Specifically, we focus on multilingual text-to-video search and propose a Transformer-based model that learns contextualized multilingual multimodal embeddings. Under a zero-shot setting, we empirically demonstrate that performance degrades significantly when we query the multilingual text-video model with non-English sentences. To address this problem, we introduce a multilingual multimodal pre-training strategy, and collect a new multilingual instructional video dataset (MultiHowTo100M) for pre-training. Experiments on VTT show that our method significantly improves video search in non-English languages without additional annotations. Furthermore, when multilingual annotations are available, our method outperforms recent baselines by a large margin in multilingual text-to-video search on VTT and VATEX, as well as in multilingual text-to-image search on Multi30K. Our model and Multi-HowTo100M are available at http://github.com/berniebear/Multi-HT100M.
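To make the retrieval setting concrete, here is a minimal sketch of text-to-video search in a shared embedding space, scored with Recall@K. It is not the authors' Transformer model: the function names (`cosine`, `rank_videos`, `recall_at_k`) are hypothetical, and the three-dimensional vectors are hand-written toy values, not outputs of any model or results from the paper. The "non-English" queries are simply assumed to be less well aligned with the video embeddings, to illustrate the kind of degradation the abstract describes.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_videos(query_vec, video_vecs):
    """Return video ids sorted by descending similarity to the query."""
    return sorted(
        video_vecs,
        key=lambda vid: cosine(query_vec, video_vecs[vid]),
        reverse=True,
    )


def recall_at_k(queries, video_vecs, k):
    """Fraction of queries whose ground-truth video is in the top k.

    `queries` is a list of (query_vector, ground_truth_video_id) pairs.
    """
    hits = sum(
        1 for vec, truth in queries if truth in rank_videos(vec, video_vecs)[:k]
    )
    return hits / len(queries)


# Toy video embeddings (made up for illustration).
videos = {
    "cooking": [0.9, 0.1, 0.0],
    "cycling": [0.1, 0.9, 0.1],
    "painting": [0.0, 0.2, 0.9],
}

# Assumed well-aligned English query embeddings.
english_queries = [
    ([1.0, 0.0, 0.1], "cooking"),
    ([0.0, 1.0, 0.0], "cycling"),
]

# Assumed poorly aligned non-English query embeddings (hypothetical).
non_english_queries = [
    ([0.5, 0.6, 0.1], "cooking"),
    ([0.1, 0.8, 0.2], "cycling"),
]

print(recall_at_k(english_queries, videos, k=1))      # 1.0
print(recall_at_k(non_english_queries, videos, k=1))  # 0.5
```

In this toy setup the first non-English query lands closer to the wrong video, so Recall@1 drops from 1.0 to 0.5. Pre-training that aligns multilingual text with video, as the abstract proposes, aims to close this kind of gap.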

Code Implementations: 1 repo
