CLCVJun 7, 2022

cViL: Cross-Lingual Training of Vision-Language Models using Knowledge Distillation

arXiv:2206.03354v24 citationsh-index: 20
Originality Incremental advance
AI Analysis

This work addresses the lack of non-English vision-language models, enabling broader accessibility for tasks like visual question answering, though it is incremental as it builds on existing methods like OSCAR+.

The authors tackled the problem of limited cross-lingual vision-language models by proposing a pipeline that uses knowledge distillation to train monolingual models for target languages like Japanese and Hindi, achieving relative accuracy increases of 4.4% and 13.4% over state-of-the-art models.

Vision-and-language tasks are gaining popularity in the research community, but the focus is still mainly on English. We propose a pipeline that utilizes English-only vision-language models to train a monolingual model for a target language. We propose to extend OSCAR+, a model which leverages object tags as anchor points for learning image-text alignments, to train on visual question answering datasets in different languages. We propose a novel approach to knowledge distillation to train the model in other languages using parallel sentences. Compared to other models that use the target language in the pretraining corpora, we can leverage an existing English model to transfer the knowledge to the target language using significantly lesser resources. We also release a large-scale visual question answering dataset in Japanese and Hindi language. Though we restrict our work to visual question answering, our model can be extended to any sequence-level classification task, and it can be extended to other languages as well. This paper focuses on two languages for the visual question answering task - Japanese and Hindi. Our pipeline outperforms the current state-of-the-art models by a relative increase of 4.4% and 13.4% respectively in accuracy.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes