CL · Nov 1, 2018

Latent Variable Model for Multi-modal Translation

arXiv:1811.00357v21111 citations
Originality: Incremental advance
AI Analysis

This work addresses multi-modal machine translation. It is an incremental advance because it builds on existing latent variable and variational auto-encoder approaches.

The authors tackle multi-modal neural machine translation by modeling the interaction between visual and textual features with a latent variable model. The model improves over strong baselines and does not require images at test time.

In this work, we propose to model the interaction between visual and textual features for multi-modal neural machine translation (MMT) through a latent variable model. This latent variable can be seen as a multi-modal stochastic embedding of an image and its description in a foreign language. It is used in a target-language decoder and also to predict image features. Importantly, our model formulation utilises visual and textual inputs during training but does not require that images be available at test time. We show that our latent variable MMT formulation improves considerably over strong baselines, including a multi-task learning approach (Elliott and Kádár, 2017) and a conditional variational auto-encoder approach (Toyama et al., 2016). Finally, we show improvements due to (i) predicting image features in addition to only conditioning on them, (ii) imposing a constraint on the minimum amount of information encoded in the latent variable, and (iii) training on additional target-language image descriptions (i.e. synthetic data).
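The training objective described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes diagonal Gaussian distributions for the latent variable, a squared-error loss for image feature prediction, and a simple floor on the KL term (a technique often called "free bits") as one possible way to enforce a minimum amount of information in the latent variable. The function names, the `min_kl` parameter, and the idea of a text-only prior paired with an image-aware posterior are all assumptions made for illustration.

```python
import numpy as np


def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between diagonal Gaussians, summed over latent dimensions."""
    var_q = np.exp(logvar_q)
    var_p = np.exp(logvar_p)
    kl = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return kl.sum(axis=-1)


def latent_mmt_loss(translation_nll, image_pred, image_feats,
                    mu_q, logvar_q, mu_p, logvar_p, min_kl=2.0):
    """Hypothetical per-batch loss for a latent variable MMT model.

    translation_nll: negative log-likelihood of each target sentence, shape (batch,)
    image_pred, image_feats: predicted and true image features, shape (batch, d_img)
    mu_q, logvar_q: posterior parameters (assumed to see the image during training)
    mu_p, logvar_p: prior parameters (assumed to depend on text only, so that
                    no image is needed at test time)
    min_kl: assumed floor on the KL term, in nats
    """
    # Predicting image features from the latent variable (squared error, assumed).
    image_loss = 0.5 * ((image_pred - image_feats) ** 2).sum(axis=-1)
    kl = gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)
    # The floor removes the incentive to push the KL below min_kl.
    kl_floored = np.maximum(kl, min_kl)
    return float((translation_nll + image_loss + kl_floored).mean())
```

With this floor, the optimiser gains nothing from shrinking the KL term below `min_kl`, which discourages the model from ignoring the latent variable. At test time such a model would sample or take the mean from the text-only prior, which is consistent with the abstract's statement that images are not required.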

Code Implementations: 1 repo
