CV · CL · Sep 5, 2019

Stack-VS: Stacked Visual-Semantic Attention for Image Caption Generation

arXiv:1909.02489v1 · 24 citations
AI Analysis

This work addresses the challenge of generating fine-grained image captions for applications in multimodal translation, though it appears incremental as it builds on existing attention models.

The paper tackles the problem of image caption generation by proposing Stack-VS, a multi-stage architecture that combines bottom-up and top-down attention models to handle both visual and semantic information, resulting in improvements of 0.372 on BLEU-4, 1.226 on CIDEr, and 0.216 on SPICE scores compared to state-of-the-art methods.

Recently, automatic image caption generation has become an important focus of work on the multimodal translation task. Existing approaches can be roughly categorized into two classes, top-down and bottom-up: the former transfers the image information (called the visual-level feature) directly into a caption, and the latter uses extracted words (called semantic-level attributes) to generate a description. However, previous methods are typically either based on a one-stage decoder or use only part of the visual-level or semantic-level information for image caption generation. In this paper, we address this problem and propose an innovative multi-stage architecture (called Stack-VS) for rich, fine-grained image caption generation, which combines bottom-up and top-down attention models to handle both the visual-level and semantic-level information of an input image. Specifically, we propose a novel stack decoder model composed of a sequence of decoder cells, each of which contains two LSTM layers that work interactively to re-optimize attention weights on both visual-level feature vectors and semantic-level attribute embeddings for generating a fine-grained image caption. Extensive experiments on the popular benchmark dataset MSCOCO show significant improvements on different evaluation metrics: the improvements on BLEU-4/CIDEr/SPICE scores are 0.372, 1.226 and 0.216, respectively, compared to the state of the art.
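
To make the idea of re-optimizing attention weights across stacked stages more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The abstract does not give the attention formula or the LSTM wiring, so this sketch assumes dot-product soft attention and replaces the two LSTM layers in each decoder cell with a simple averaging update. The names `attend` and `stacked_attention` and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, items):
    """Dot-product soft attention (an assumption, not the paper's formula).

    Returns the attention weights over `items` and the weighted-sum context.
    """
    scores = [sum(q * x for q, x in zip(query, item)) for item in items]
    weights = softmax(scores)
    dim = len(items[0])
    context = [
        sum(w * item[d] for w, item in zip(weights, items)) for d in range(dim)
    ]
    return weights, context


def stacked_attention(visual, semantic, query, n_stages=3):
    """Recompute visual and semantic attention weights at every stage.

    Each stage stands in for one decoder cell. The averaging updates below
    are placeholders for the two LSTM layers described in the abstract.
    """
    history = []
    for _ in range(n_stages):
        # Attend over visual-level feature vectors with the current state.
        v_weights, v_ctx = attend(query, visual)
        # Let the visual context condition the semantic-level attention.
        s_query = [(q + v) / 2 for q, v in zip(query, v_ctx)]
        s_weights, s_ctx = attend(s_query, semantic)
        # Placeholder state update (an LSTM in the actual model).
        query = [(q + v + s) / 3 for q, v, s in zip(query, v_ctx, s_ctx)]
        history.append((v_weights, s_weights))
    return query, history


if __name__ == "__main__":
    visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy region features
    semantic = [[1.0, 0.0], [0.0, 1.0]]             # toy attribute embeddings
    state, history = stacked_attention(visual, semantic, [1.0, 0.0])
    for stage, (v_w, s_w) in enumerate(history, start=1):
        print(stage, [round(w, 3) for w in v_w], [round(w, 3) for w in s_w])
```

Running the script prints one row of visual and semantic attention weights per stage, which shows how later stages can shift the weights computed by earlier ones. A real decoder cell would also take the previously generated word as input and emit a word distribution, which this sketch omits.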

