CVAICLNov 11, 2015

Deep Multimodal Semantic Embeddings for Speech and Images

arXiv:1511.03690v126.6168 citations
Originality Synthesis-oriented
AI Analysis

This work addresses multimodal learning for speech and images, which is incremental as it builds on existing methods for cross-modal alignment.

The paper tackles the problem of learning a joint semantic space between images and spoken captions by using convolutional neural networks and an embedding-alignment model, achieving evaluation on image search and annotation tasks using the Flickr8k dataset augmented with 40,000 spoken captions.

In this paper, we present a model which takes as input a corpus of images with relevant spoken captions and finds a correspondence between the two modalities. We employ a pair of convolutional neural networks to model visual objects and speech signals at the word level, and tie the networks together with an embedding and alignment model which learns a joint semantic space over both modalities. We evaluate our model using image search and annotation tasks on the Flickr8k dataset, which we augmented by collecting a corpus of 40,000 spoken captions using Amazon Mechanical Turk.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes