CV · Oct 9, 2018

Image Captioning as Neural Machine Translation Task in SOCKEYE

arXiv:1810.04101v3 · 3 citations · Has Code
Originality Synthesis-oriented
AI Analysis

This work provides an incremental improvement for researchers and practitioners by adapting existing NMT methods to image captioning.

The paper tackles image captioning by exploring neural machine translation decoders and attention models, achieving competitive performance on the COCO dataset with BLEU-4 scores up to 36.2.

Image captioning is an interdisciplinary research problem that sits between computer vision and natural language processing. The task is to generate a textual description of the content of an image. The typical model used for image captioning is an encoder-decoder deep network, where the encoder captures the essence of an image while the decoder is responsible for generating a sentence describing the image. Attention mechanisms can be used to automatically focus the decoder on the parts of the image that are relevant to predicting the next word. In this paper, we explore different decoders and attentional models popular in neural machine translation, namely attentional recurrent neural networks, self-attentional transformers, and fully-convolutional networks, which represent the current state of the art of neural machine translation. The image captioning module is available as part of SOCKEYE at https://github.com/awslabs/sockeye and a tutorial can be found at https://awslabs.github.io/sockeye/image_captioning.html .
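The attention step described in the abstract can be illustrated with a minimal sketch. It assumes scaled dot-product attention, which is only one of several attention variants, and the abstract does not say which scoring function the paper uses. The names `softmax`, `attend`, `regions`, and `state` are hypothetical and are not part of the SOCKEYE API; the toy feature vectors are made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, region_features):
    """Scaled dot-product attention over image regions (an assumed variant).

    Scores each region feature vector against the current decoder state,
    normalizes the scores into weights, and returns the weights together
    with the weighted average of the region features (the context vector).
    """
    scale = math.sqrt(len(decoder_state))
    scores = [
        sum(q * k for q, k in zip(decoder_state, region)) / scale
        for region in region_features
    ]
    weights = softmax(scores)
    dim = len(region_features[0])
    context = [
        sum(w * region[d] for w, region in zip(weights, region_features))
        for d in range(dim)
    ]
    return weights, context


# Hypothetical toy input: 3 image regions with 2-dimensional features.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
state = [2.0, 0.0]  # hypothetical decoder hidden state
weights, context = attend(state, regions)
```

In a captioning model the context vector would be fed to the decoder together with the previous word to predict the next one. Regions whose features align with the decoder state receive larger weights, which is what "focusing" on part of the image means here.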

Code Implementations: 1 repo
