CV · Dec 8, 2018

Attend More Times for Image Captioning

arXiv:1812.03283v2 · 4 citations
Originality: Incremental advance
AI Analysis

The paper addresses the problem of information being missed during caption generation in computer vision applications, but the advance is incremental because it builds on existing attention-based methods.

The paper tackles the problem of rigid attention in image captioning by proposing a model that attends to the image multiple times per word. On the Karpathy test split of the MSCOCO dataset, it reports BLEU-4, METEOR, ROUGE-L, CIDEr, and SPICE scores of 0.381, 0.283, 0.580, 1.261, and 0.220, respectively.

Most attention-based image captioning models attend to the image once per word. However, attending once per word is rigid and can easily miss information. Attending more times lets the model adjust the attention position, recover the missed information, and avoid generating the wrong word. In this paper, we show that attending more times per word improves performance on the image captioning task without increasing the number of parameters. We propose a flexible two-LSTM merge model that makes it convenient to encode more attentions than words. Our captioning model uses two LSTMs to encode the word sequence and the attention sequence, respectively. The information from the two LSTMs and the image feature are combined to predict the next word. Experiments on the MSCOCO caption dataset show that our method outperforms the state of the art. Using bottom-up features and the self-critical training method, our method achieves BLEU-4, METEOR, ROUGE-L, CIDEr, and SPICE scores of 0.381, 0.283, 0.580, 1.261, and 0.220 on the Karpathy test split.
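
The idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the additive attention form, the attention query (both hidden states concatenated), the mean-pooled image feature, the layer sizes, and all names (`decode_step`, `n_attend`, `make_params`, and so on) are assumptions made for illustration. What it does try to show is the structure the abstract describes: one LSTM encodes the word sequence, a second LSTM encodes the attention sequence and may take several steps per word, and the two hidden states are merged with an image feature to predict the next word. Because the same weights are reused on every attention step, raising `n_attend` adds no parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def init(shape):
    # Small random weights; the sketch is untrained.
    return rng.normal(scale=0.1, size=shape)

def lstm_cell(x, h, c, W):
    """One LSTM step (biases omitted); W maps [x; h] to four stacked gates."""
    d = h.shape[0]
    z = W @ np.concatenate([x, h])
    i, f, o = (1.0 / (1.0 + np.exp(-z[k * d:(k + 1) * d])) for k in range(3))
    g = np.tanh(z[3 * d:])
    c_new = f * c + i * g
    return o * np.tanh(c_new), c_new

def attend(regions, query, Wv, Wq, w):
    """Additive soft attention over region features (assumed form)."""
    scores = np.tanh(regions @ Wv.T + Wq @ query) @ w
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    return alpha @ regions, alpha

def make_params(D=8, E=6, H=5, A=7, V=10):
    # D: region feature size, E: embedding size, H: hidden size,
    # A: attention size, V: vocabulary size (all hypothetical).
    return {
        "embed": init((V, E)),
        "W_word": init((4 * H, E + H)),   # word LSTM
        "W_att": init((4 * H, D + H)),    # attention LSTM
        "Wv": init((A, D)), "Wq": init((A, 2 * H)), "w": init(A),
        "W_out": init((V, 2 * H + D)),    # merge layer
    }

def decode_step(p, regions, word_id, state, n_attend):
    """Consume one word, attend n_attend times, predict the next word."""
    h_w, c_w, h_a, c_a = state
    h_w, c_w = lstm_cell(p["embed"][word_id], h_w, c_w, p["W_word"])
    alphas = []
    for _ in range(n_attend):
        # Each pass re-queries the image with the updated attention state,
        # reusing the same weights every time.
        query = np.concatenate([h_w, h_a])
        ctx, alpha = attend(regions, query, p["Wv"], p["Wq"], p["w"])
        h_a, c_a = lstm_cell(ctx, h_a, c_a, p["W_att"])
        alphas.append(alpha)
    # Merge: word state, attention state, and a global image feature.
    merged = np.concatenate([h_w, h_a, regions.mean(axis=0)])
    logits = p["W_out"] @ merged
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return probs, (h_w, c_w, h_a, c_a), alphas

# Usage: 4 image regions, start from a zero state, attend 3 times per word.
params = make_params()
regions = rng.normal(size=(4, 8))
state0 = tuple(np.zeros(5) for _ in range(4))
probs, state1, alphas = decode_step(params, regions, 3, state0, n_attend=3)
```

Keeping the attention sequence in its own LSTM is what decouples its length from the word sequence: the word LSTM advances once per word, while the attention LSTM can advance any number of times before the merge.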

