CVCLLGDec 20, 2014

Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)

arXiv:1412.6632v51279 citations
Originality Incremental advance
AI Analysis

This addresses the problem of generating accurate and novel captions for images, which is important for applications in computer vision and natural language processing, though it is incremental as it builds on existing neural network approaches.

The paper tackles image caption generation by proposing a multimodal Recurrent Neural Network (m-RNN) that models word probabilities based on previous words and images, and it outperforms state-of-the-art methods on four benchmark datasets while also improving retrieval tasks.

In this paper, we present a multimodal Recurrent Neural Network (m-RNN) model for generating novel image captions. It directly models the probability distribution of generating a word given previous words and an image. Image captions are generated by sampling from this distribution. The model consists of two sub-networks: a deep recurrent neural network for sentences and a deep convolutional network for images. These two sub-networks interact with each other in a multimodal layer to form the whole m-RNN model. The effectiveness of our model is validated on four benchmark datasets (IAPR TC-12, Flickr 8K, Flickr 30K and MS COCO). Our model outperforms the state-of-the-art methods. In addition, we apply the m-RNN model to retrieval tasks for retrieving images or sentences, and achieves significant performance improvement over the state-of-the-art methods which directly optimize the ranking objective function for retrieval. The project page of this work is: www.stat.ucla.edu/~junhua.mao/m-RNN.html .

Code Implementations2 repos
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes