CL, Oct 7, 2019

Adversarial reconstruction for Multi-modal Machine Translation

arXiv:1910.02766v1, 1 citation
Originality: Highly original
AI Analysis

This work aims to improve multi-modal translation accuracy, a problem that spans computer vision and natural language processing. It is an incremental advance built on a novel method.

The paper tackles the challenge of grounding structured descriptions in images for multi-modal machine translation. It proposes a model that learns grounding through adversarial reconstruction of visual features, used as an auxiliary task. The authors report the highest BLEU and METEOR scores on the datasets they evaluate.

Despite the growing interest in problems at the intersection of Computer Vision and Natural Language, grounding (i.e. identifying) the components of a structured description in an image remains a challenging task. This contribution proposes a model which learns grounding by reconstructing the visual features for the multi-modal translation task. Previous works have partially investigated standard approaches, such as regression methods, to approximate the reconstruction of a visual input. In this paper, we propose a different and novel approach which learns grounding by adversarial feedback. To do so, we modulate our network following recent promising adversarial architectures and evaluate how the adversarial response from a visual reconstruction, used as an auxiliary task, helps the model in its learning. We report the highest scores in terms of BLEU and METEOR metrics on the different datasets.
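The contrast the abstract draws between regression-based reconstruction and adversarial feedback can be illustrated with a minimal sketch. This is not the paper's formulation: the function names, the binary cross-entropy objective, the non-saturating generator loss, and the `aux_weight` parameter are all assumptions chosen for illustration, and scalar logits stand in for the output of a hypothetical discriminator network.

```python
import math


def sigmoid(x):
    """Map a raw discriminator score (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def mse_reconstruction_loss(predicted, target):
    """Regression baseline: mean squared error between two feature vectors."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def discriminator_loss(real_logit, fake_logit):
    """Binary cross-entropy for a hypothetical discriminator:
    real visual features are labelled 1, reconstructions are labelled 0."""
    return -math.log(sigmoid(real_logit)) - math.log(1.0 - sigmoid(fake_logit))


def adversarial_reconstruction_loss(fake_logit):
    """Non-saturating generator loss (an assumption, not taken from the paper):
    the loss is small when the discriminator scores the reconstruction as real."""
    return -math.log(sigmoid(fake_logit))


def total_translation_loss(translation_nll, fake_logit, aux_weight=0.5):
    """Translation loss plus the adversarial reconstruction term as an
    auxiliary task. aux_weight is a hypothetical mixing coefficient."""
    return translation_nll + aux_weight * adversarial_reconstruction_loss(fake_logit)


if __name__ == "__main__":
    # Regression feedback: distance to the target features.
    print(mse_reconstruction_loss([1.0, 2.0], [1.0, 4.0]))
    # Adversarial feedback: depends only on how the discriminator scores the output.
    print(adversarial_reconstruction_loss(-2.0), adversarial_reconstruction_loss(2.0))
    print(total_translation_loss(2.0, 0.0))
```

In this sketch the regression loss measures distance to the target features directly, whereas the adversarial term depends only on the discriminator's judgment of the reconstruction, which is the kind of feedback signal the abstract refers to.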
