Imagination improves Multimodal Translation
This paper addresses multimodal translation, but the contribution appears incremental because it builds on existing multitask and attention-based methods.
The paper tackles multimodal translation by decomposing it into translation and visually grounded representation learning within a multitask framework. It improves on the state of the art on the Multi30K dataset, remains equally effective when the image prediction task is trained on the external MS COCO dataset, and yields improvements when the translation model is trained on the external News Commentary parallel text.
We decompose multimodal translation into two sub-tasks: learning to translate and learning visually grounded representations. In a multitask learning framework, translations are learned in an attention-based encoder-decoder, and grounded representations are learned through image representation prediction. Our approach improves translation performance compared to the state of the art on the Multi30K dataset. Furthermore, it is equally effective if we train the image prediction task on the external MS COCO dataset, and we find improvements if we train the translation model on the external News Commentary parallel text.
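The multitask setup described above can be illustrated with a minimal sketch of how two task losses might be combined into one training objective. The abstract does not specify the loss functions or how the tasks are weighted, so everything below is an assumption made for illustration: the cross-entropy translation loss, the cosine-distance image prediction loss, the interpolation `weight`, and all function names are hypothetical and are not taken from the paper.

```python
import math


def cross_entropy(probs, target_index):
    """Negative log-likelihood of the reference token (assumed translation loss)."""
    return -math.log(probs[target_index])


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def image_prediction_loss(predicted, target):
    """Cosine distance between predicted and actual image vectors (assumed loss)."""
    return 1.0 - cosine(predicted, target)


def multitask_loss(translation_loss, image_loss, weight=0.5):
    """Interpolate the two task losses; `weight` is a hypothetical hyperparameter."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * translation_loss + (1.0 - weight) * image_loss


# Toy values: a two-word vocabulary and two-dimensional image vectors.
token_probs = [0.5, 0.5]          # decoder output distribution for one step
predicted_image = [1.0, 0.0]      # image vector predicted from the sentence
actual_image = [1.0, 0.0]         # image vector from a vision model

loss = multitask_loss(
    cross_entropy(token_probs, 0),
    image_prediction_loss(predicted_image, actual_image),
    weight=0.5,
)
```

In a sketch like this, both losses would be backpropagated into an encoder shared by the two tasks, which is what would let the image prediction task shape the representations used for translation. Because the two losses need not come from the same examples, such a setup is also compatible with training each task on a different dataset, as the abstract reports for MS COCO and News Commentary.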