CL, Mar 17, 2022

On Vision Features in Multimodal Machine Translation

arXiv:2203.09173v1655 citations, h-index: 32, Has Code
Originality: Incremental advance
AI Analysis

This work targets translation quality in multimodal machine translation systems. It is an incremental advance: it examines an understudied aspect, the quality of the vision model, rather than proposing a new translation paradigm.

The paper investigates the impact of vision model quality on multimodal machine translation, finding that stronger vision models like Vision Transformer improve translation learning from visual data, as shown through selective attention and probing tasks.

Previous work on multimodal machine translation (MMT) has focused on how to incorporate vision features into translation, but little attention has been paid to the quality of vision models. In this work, we investigate the impact of vision models on MMT. Given that the Transformer is becoming popular in computer vision, we experiment with various strong models (such as Vision Transformer) and enhanced features (such as object detection and image captioning). We develop a selective attention model to study the patch-level contribution of an image in MMT. On detailed probing tasks, we find that stronger vision models are helpful for learning translation from the visual modality. Our results also suggest the need to carefully examine MMT models, especially when current benchmarks are small-scale and biased. Our code can be found at https://github.com/libeineu/fairseq_mmt.
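The abstract does not spell out how the selective attention model works, so the following is only a minimal sketch of one plausible reading: each text position acts as a query over the image patch features, and the resulting attention weights are read as patch-level contributions. The function names, the identity projections (no learned weights), and the toy inputs are assumptions made for illustration and are not taken from the paper or its repository.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def selective_attention(text_states, patch_feats):
    """Single-head scaled dot-product attention from text positions to image patches.

    text_states: list of text vectors (queries), each of length dim.
    patch_feats: list of patch vectors (keys and values), each of length dim.
    Assumption: identity projections stand in for learned query/key/value weights.

    Returns (attended, weights), where weights[i][j] is how much text
    position i draws on patch j, and attended[i] is the weighted patch mixture.
    """
    dim = len(patch_feats[0])
    attended, weights = [], []
    for query in text_states:
        # Similarity of this text position to every patch, scaled by sqrt(dim).
        scores = [
            sum(q * k for q, k in zip(query, patch)) / math.sqrt(dim)
            for patch in patch_feats
        ]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of patch vectors, one output dimension at a time.
        attended.append(
            [sum(w[j] * patch_feats[j][d] for j in range(len(patch_feats)))
             for d in range(dim)]
        )
    return attended, weights
```

Inspecting a row of `weights` shows which patches a given source token relies on most. How the attended vectors are fused back into the text representation is left out here, since the abstract gives no detail on that step.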

Code Implementations: 2 repos