CV · Jun 15, 2016

Watch What You Just Said: Image Captioning with Text-Conditional Attention

arXiv:1606.04621v3 · 44 citations
Originality: Incremental advance
AI Analysis

The paper addresses the challenge of generating more accurate, context-aware captions for images, which matters for applications in accessibility and content analysis. It is an incremental improvement over existing attention-based methods.

The paper aims to improve attention mechanisms in image captioning by proposing text-conditional attention, which uses previously generated text to focus on relevant image features. The authors report that it outperforms state-of-the-art methods on the MS-COCO dataset in both quantitative metrics and human evaluation.

Attention mechanisms have attracted considerable interest in image captioning due to their powerful performance. However, existing methods use only visual content as attention, and whether textual context can improve attention in image captioning remains unsolved. To explore this problem, we propose a novel attention mechanism, called text-conditional attention, which allows the caption generator to focus on certain image features given previously generated text. To obtain text-related image features for our attention model, we adopt the guiding Long Short-Term Memory (gLSTM) captioning architecture with CNN fine-tuning. Our proposed method allows joint learning of the image embedding, text embedding, text-conditional attention and language model with one network architecture in an end-to-end manner. We perform extensive experiments on the MS-COCO dataset. The experimental results show that our method outperforms state-of-the-art captioning methods on various quantitative metrics as well as in human evaluation, which supports the use of our text-conditional attention in image captioning.
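
The idea of conditioning attention on previously generated text can be illustrated with a small sketch. This is not the authors' implementation: it assumes, purely for illustration, that the text context is the mean of the previous word embeddings, that attention is a sigmoid gate with one weight per image feature dimension, and that the projection matrix `W_att` (a hypothetical name) would be learned jointly with the rest of the network. Random values stand in for learned parameters.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a (rows x cols) list-of-lists matrix by a length-cols vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def text_conditional_attention(image_feat, prev_word_embs, W_att):
    """Gate image features with attention computed from previously generated words.

    image_feat:     list of D floats (e.g. a CNN image feature)
    prev_word_embs: list of word embeddings, each a list of E floats
    W_att:          D x E projection matrix (hypothetical; learned in practice)
    """
    if not prev_word_embs:
        # No text context yet: leave the image feature unchanged.
        return list(image_feat), [1.0] * len(image_feat)
    emb_dim = len(prev_word_embs[0])
    # Assumption: summarize the text context as the mean word embedding.
    context = [sum(e[i] for e in prev_word_embs) / len(prev_word_embs)
               for i in range(emb_dim)]
    # One attention weight in (0, 1) per image feature dimension.
    attention = [sigmoid(a) for a in matvec(W_att, context)]
    # Element-wise gating: the text decides which image features to emphasize.
    attended = [a * f for a, f in zip(attention, image_feat)]
    return attended, attention


# Toy example with random stand-ins for learned parameters.
random.seed(0)
D, E = 6, 4
image_feat = [random.uniform(0, 1) for _ in range(D)]
W_att = [[random.gauss(0, 1) for _ in range(E)] for _ in range(D)]
prev_words = [[random.gauss(0, 1) for _ in range(E)] for _ in range(3)]
attended, attention = text_conditional_attention(image_feat, prev_words, W_att)
```

In a captioning model along the lines the abstract describes, the attended vector would presumably be recomputed at each decoding step and passed to the gLSTM-style language model, so that the image features it sees shift as the caption grows.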

Code Implementations: 1 repo