CV, Apr 20, 2019

Multi-modal gated recurrent units for image description

arXiv:1904.09421v1, 30 citations
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of creating relevant and grammatically correct image descriptions, which is important for applications like accessibility and image retrieval, but the contribution appears incremental because it builds on existing GRU and CNN methods.

The paper tackles the problem of generating natural language descriptions for images by proposing a multi-modal gated recurrent unit (GRU) model that learns inter-modal relations between image features and sentences, achieving state-of-the-art performance on Flickr8K, Flickr30K, and MS COCO datasets.

Using a natural language sentence to describe the content of an image is a challenging but very important task. It is challenging because a description must not only capture the objects in the image and the relationships among them, but also be relevant and grammatically correct. In this paper, we propose a multi-modal embedding model based on gated recurrent units (GRU) that can generate a variable-length description for a given image. In the training step, we apply a convolutional neural network (CNN) to extract the image feature. The feature is then fed into the multi-modal GRU together with the corresponding sentence representations, and the multi-modal GRU learns the inter-modal relations between image and sentence. In the testing step, when an image is fed into our multi-modal GRU model, it generates a sentence that describes the image content. The experimental results demonstrate that our multi-modal GRU model obtains state-of-the-art performance on the Flickr8K, Flickr30K and MS COCO datasets.
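
The data flow described in the abstract (image feature in, variable-length sentence out) can be illustrated with a minimal sketch. It is not the authors' implementation: it uses a standard GRU cell without bias terms as a stand-in for the multi-modal GRU, random untrained weights, and a random vector in place of a real CNN feature. Feeding the projected image feature as the first GRU input is an assumption made for illustration, since the abstract does not say how the image is injected. All names (`GRUCell`, `generate_caption`, `W_img`, `W_out`) are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rand_matrix(rows, cols, rng):
    """Small random weights; a trained model would learn these instead."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class GRUCell:
    """Standard GRU cell (biases omitted), a stand-in for the multi-modal GRU."""

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        # One input matrix (W*) and one recurrent matrix (U*) per gate.
        self.Wz = rand_matrix(hidden_size, input_size, rng)
        self.Uz = rand_matrix(hidden_size, hidden_size, rng)
        self.Wr = rand_matrix(hidden_size, input_size, rng)
        self.Ur = rand_matrix(hidden_size, hidden_size, rng)
        self.Wh = rand_matrix(hidden_size, input_size, rng)
        self.Uh = rand_matrix(hidden_size, hidden_size, rng)

    def step(self, x, h):
        # Update gate z and reset gate r.
        z = [sigmoid(a + b) for a, b in zip(matvec(self.Wz, x), matvec(self.Uz, h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.Wr, x), matvec(self.Ur, h))]
        # Candidate state computed from the reset-scaled previous state.
        rh = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(a + b) for a, b in zip(matvec(self.Wh, x), matvec(self.Uh, rh))]
        # Interpolate between the previous state and the candidate.
        return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def generate_caption(image_feature, cell, W_img, embeddings, W_out, vocab, max_len=10):
    """Greedy decoding: stop at "<end>" or after max_len words."""
    h = [0.0] * cell.hidden_size
    # Assumption: the image enters as the first input, projected to embedding size.
    h = cell.step(matvec(W_img, image_feature), h)
    token = vocab.index("<start>")
    words = []
    for _ in range(max_len):
        h = cell.step(embeddings[token], h)
        scores = matvec(W_out, h)  # unnormalised score for each vocabulary word
        token = max(range(len(scores)), key=scores.__getitem__)
        if vocab[token] == "<end>":
            break
        words.append(vocab[token])
    return words


# Toy usage with untrained weights, so the generated words are arbitrary.
rng = random.Random(0)
vocab = ["<start>", "<end>", "a", "dog", "runs"]
embed_size, hidden_size, feature_size = 4, 6, 8
cell = GRUCell(embed_size, hidden_size, rng)
W_img = rand_matrix(embed_size, feature_size, rng)
embeddings = rand_matrix(len(vocab), embed_size, rng)
W_out = rand_matrix(len(vocab), hidden_size, rng)
image_feature = [rng.random() for _ in range(feature_size)]  # placeholder for a CNN feature
caption = generate_caption(image_feature, cell, W_img, embeddings, W_out, vocab)
print(caption)
```

Because decoding stops either at the end token or at `max_len`, the output length varies with the input, which is the variable-length property the abstract refers to. In a trained system the weights would be fitted on image and sentence pairs, and beam search is a common alternative to the greedy choice used here.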

