CV CL LG | Mar 11, 2020

Visual Grounding in Video for Unsupervised Word Translation

Gunnar A. Sigurdsson, Jean-Baptiste Alayrac, Aida Nematzadeh, Lucas Smaira, Mateusz Malinowski, João Carreira, Phil Blunsom, Andrew Zisserman

arXiv:2003.05078v217.752 citations | Has Code

Originality: Highly original

AI Analysis

This paper addresses word translation for low-resource languages by leveraging visual grounding, though it is incremental in that it builds on existing text-based methods.

The paper tackles unsupervised word translation by learning a shared visual representation from unpaired instructional videos, achieving superior performance in translating English to French, Korean, and Japanese without parallel corpora.

There are thousands of actively spoken languages on Earth, but a single visual world. Grounding in this visual world has the potential to bridge the gap between all these languages. Our goal is to use visual grounding to improve unsupervised word mapping between languages. The key idea is to establish a common visual representation between two languages by learning embeddings from unpaired instructional videos narrated in the native language. Given this shared embedding we demonstrate that (i) we can map words between the languages, particularly the 'visual' words; (ii) the shared embedding provides a good initialization for existing unsupervised text-based word translation techniques, forming the basis for our proposed hybrid visual-text mapping algorithm, MUVE; and (iii) our approach achieves superior performance by addressing the shortcomings of text-based methods: it is more robust, handles datasets with less commonality, and is applicable to low-resource languages. We apply these methods to translate words from English to French, Korean, and Japanese, all without any parallel corpora and simply by watching many videos of people speaking while doing things.
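As a rough illustration of point (i), the sketch below translates a word by picking the nearest target-language word in a shared embedding space. It is not MUVE and not the authors' code. The abstract does not say how words are retrieved, so the cosine-similarity nearest-neighbour rule is an assumption, as are the helper names `cosine` and `translate` and the hand-made 3-dimensional toy vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def translate(word, src_emb, tgt_emb):
    """Return the target word whose vector is closest to `word`'s vector.

    Assumes both dictionaries already live in one shared space
    (hypothetical here; in the paper that space is learned from video).
    """
    query = src_emb[word]
    return max(tgt_emb, key=lambda w: cosine(query, tgt_emb[w]))


# Toy, hand-made vectors for illustration only (not from the paper).
english = {
    "knife": [0.9, 0.1, 0.0],
    "water": [0.0, 0.2, 0.9],
    "bowl": [0.1, 0.9, 0.1],
}
french = {
    "couteau": [0.8, 0.2, 0.1],
    "eau": [0.1, 0.1, 0.95],
    "bol": [0.2, 0.85, 0.0],
}

translations = {w: translate(w, english, french) for w in english}
print(translations)
```

With these toy vectors each English word retrieves the intended French word. With real embeddings, retrieval quality depends on how well the two languages are aligned in the shared space.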
