CVCLLGOct 4, 2014

Explain Images with Multimodal Recurrent Neural Networks

arXiv:1410.1090v1405 citations
Originality Incremental advance
AI Analysis

This addresses the challenge of automatically explaining image content for applications in computer vision and natural language processing, representing an incremental improvement over existing methods.

The paper tackles the problem of generating novel sentence descriptions for images using a multimodal Recurrent Neural Network (m-RNN) model, which directly models word probabilities based on previous words and the image, and it outperforms state-of-the-art generative methods on benchmark datasets like IAPR TC-12, Flickr 8K, and Flickr 30K.

In this paper, we present a multimodal Recurrent Neural Network (m-RNN) model for generating novel sentence descriptions to explain the content of images. It directly models the probability distribution of generating a word given previous words and the image. Image descriptions are generated by sampling from this distribution. The model consists of two sub-networks: a deep recurrent neural network for sentences and a deep convolutional network for images. These two sub-networks interact with each other in a multimodal layer to form the whole m-RNN model. The effectiveness of our model is validated on three benchmark datasets (IAPR TC-12, Flickr 8K, and Flickr 30K). Our model outperforms the state-of-the-art generative method. In addition, the m-RNN model can be applied to retrieval tasks for retrieving images or sentences, and achieves significant performance improvement over the state-of-the-art methods which directly optimize the ranking objective function for retrieval.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes