CVCLNEApr 23, 2015

Multimodal Convolutional Neural Networks for Matching Image and Sentence

arXiv:1504.06063v5350 citations
Originality Incremental advance
AI Analysis

This addresses the challenge of cross-modal retrieval between images and text, which is incremental as it builds on existing CNN architectures.

The paper tackles the problem of matching images and sentences by proposing multimodal convolutional neural networks (m-CNNs), achieving state-of-the-art performance on benchmark databases like Flickr30K and Microsoft COCO for bidirectional retrieval.

In this paper, we propose multimodal convolutional neural networks (m-CNNs) for matching image and sentence. Our m-CNN provides an end-to-end framework with convolutional architectures to exploit image representation, word composition, and the matching relations between the two modalities. More specifically, it consists of one image CNN encoding the image content, and one matching CNN learning the joint representation of image and sentence. The matching CNN composes words to different semantic fragments and learns the inter-modal relations between image and the composed fragments at different levels, thus fully exploit the matching relations between image and sentence. Experimental results on benchmark databases of bidirectional image and sentence retrieval demonstrate that the proposed m-CNNs can effectively capture the information necessary for image and sentence matching. Specifically, our proposed m-CNNs for bidirectional image and sentence retrieval on Flickr30K and Microsoft COCO databases achieve the state-of-the-art performances.

Code Implementations3 repos
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes