CL · Jan 15, 2016

Multimodal Pivots for Image Caption Translation

arXiv:1601.03916v3 · 101 citations
Originality: Incremental advance
AI Analysis

This addresses the challenge of translating image captions with limited parallel data, offering a domain-specific solution for multimodal applications.

The paper tackles the problem of improving machine translation for image descriptions by using visual similarity to retrieve target-language captions for reranking, achieving a 1 BLEU point improvement over strong baselines.

We present an approach to improve statistical machine translation of image descriptions by multimodal pivots defined in visual space. The key idea is to perform image retrieval over a database of images that are captioned in the target language, and use the captions of the most similar images for crosslingual reranking of translation outputs. Our approach does not depend on the availability of large amounts of in-domain parallel data, but only relies on available large datasets of monolingually captioned images, and on state-of-the-art convolutional neural networks to compute image similarities. Our experimental evaluation shows improvements of 1 BLEU point over strong baselines.
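The retrieve-then-rerank idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes that each database image is already represented by a feature vector (tiny hand-written vectors stand in for CNN features), that similarity is measured by cosine similarity, and that each translation hypothesis is scored by the fraction of its tokens found in the retrieved target-language captions. That unigram-overlap score and the linear `weight` interpolation with the decoder score are simplified stand-ins, not the paper's actual relevance model. All function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def retrieve_captions(query_feat, database, k=2):
    """Return captions of the k images most similar to the query image.

    database: list of (feature_vector, target_language_caption) pairs
    (hypothetical stand-in for CNN features of monolingually captioned images).
    """
    ranked = sorted(database, key=lambda item: cosine(query_feat, item[0]), reverse=True)
    return [caption for _, caption in ranked[:k]]


def overlap_score(hypothesis, captions):
    """Fraction of hypothesis tokens that occur in any retrieved caption (simplified scorer)."""
    vocab = {w for c in captions for w in c.lower().split()}
    tokens = hypothesis.lower().split()
    if not tokens:
        return 0.0
    return sum(t in vocab for t in tokens) / len(tokens)


def rerank(nbest, query_feat, database, k=2, weight=1.0):
    """Rerank (hypothesis, decoder_score) pairs using the retrieved captions as pivots."""
    captions = retrieve_captions(query_feat, database, k)
    scored = [(hyp, score + weight * overlap_score(hyp, captions)) for hyp, score in nbest]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy usage: two "dog on the beach" images are closest to the query image.
database = [
    ([1.0, 0.0, 0.2], "a dog runs on the beach"),
    ([0.9, 0.1, 0.3], "a brown dog plays in the sand"),
    ([0.0, 1.0, 0.0], "a man rides a bicycle"),
]
query = [1.0, 0.05, 0.25]
nbest = [("a dog walks at the beach", -1.2), ("a hound goes at the shore", -1.1)]
print(rerank(nbest, query, database)[0][0])  # a dog walks at the beach
```

In this toy run the baseline decoder prefers the second hypothesis, but the visually retrieved captions share more words with the first, so reranking promotes it. This mirrors the abstract's point that only monolingually captioned images, not in-domain parallel data, are needed for the reranking step.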
