CV | Mar 12, 2016

Image Captioning with Semantic Attention

arXiv:1603.03925v1 | 1784 citations
Originality: Highly original
AI Analysis

This paper addresses the problem of generating accurate natural language descriptions of images, a task that connects computer vision and NLP, and reports improvements over existing methods.

The paper tackles image captioning by proposing a new algorithm that combines top-down and bottom-up approaches through semantic attention, which selectively attends to semantic concepts and fuses them into the hidden states and outputs of recurrent neural networks. The authors report that it significantly outperforms state-of-the-art methods on the Microsoft COCO and Flickr30K benchmarks across multiple evaluation metrics.

Automatically generating a natural language description of an image has attracted interest recently, both because of its importance in practical applications and because it connects two major artificial intelligence fields: computer vision and natural language processing. Existing approaches are either top-down, which start from a gist of an image and convert it into words, or bottom-up, which come up with words describing various aspects of an image and then combine them. In this paper, we propose a new algorithm that combines both approaches through a model of semantic attention. Our algorithm learns to selectively attend to semantic concept proposals and fuse them into hidden states and outputs of recurrent neural networks. The selection and fusion form a feedback connecting the top-down and bottom-up computation. We evaluate our algorithm on two public benchmarks: Microsoft COCO and Flickr30K. Experimental results show that our algorithm significantly outperforms the state-of-the-art approaches consistently across different evaluation metrics.
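
The attend-and-fuse idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the dot-product scoring, the `fuse` blend, the `mix` parameter, and the toy concept vectors are all assumptions made for illustration, since this summary does not specify the paper's scoring function or fusion step.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(query, concept_vectors):
    """Score each concept against the query and return (weights, context).

    Assumption: plain dot-product scoring stands in for whatever learned
    scoring function the paper uses.
    """
    weights = softmax([dot(query, c) for c in concept_vectors])
    dim = len(concept_vectors[0])
    context = [
        sum(w * c[d] for w, c in zip(weights, concept_vectors))
        for d in range(dim)
    ]
    return weights, context


def fuse(hidden, context, mix=0.5):
    """Hypothetical fusion: blend the attended context into the hidden state."""
    return [math.tanh(h + mix * c) for h, c in zip(hidden, context)]


# Toy concept proposals (bottom-up) and a toy RNN hidden state (top-down).
concepts = {
    "dog": [1.0, 0.0, 0.0],
    "frisbee": [0.0, 1.0, 0.0],
    "grass": [0.0, 0.0, 1.0],
}
hidden = [2.0, 0.5, 0.0]

weights, context = attend(hidden, list(concepts.values()))
new_hidden = fuse(hidden, context)

for name, weight in zip(concepts, weights):
    print(f"{name}: {weight:.3f}")
```

In this toy setup the hidden state aligns most with the "dog" vector, so that concept receives the largest attention weight. In a trained captioner the updated state would in turn change which concepts are attended to at the next step, which is the feedback the abstract describes.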

Code Implementations: 1 repo
