IR | CV | SD | AS | Jan 7, 2018

Cross-modal Embeddings for Video and Audio Retrieval

arXiv:1801.02200v1 | 78 citations
Originality: Incremental advance
AI Analysis

This work addresses the challenge of linking audio and visual data for retrieval tasks, offering an incremental improvement in unsupervised cross-modal feature learning.

The paper tackles the problem of cross-modal retrieval between audio and visual content by learning joint embeddings from the YouTube-8M dataset, resulting in improved Recall@K scores for retrieving audio samples from silent videos and images from audio queries.

The increasing amount of online videos brings several opportunities for training self-supervised neural networks. The creation of large-scale video datasets such as YouTube-8M allows us to deal with this large amount of data in a manageable way. In this work, we find new ways of exploiting this dataset by taking advantage of the multi-modal information it provides. By means of a neural network, we are able to create links between audio and visual documents by projecting them into a common region of the feature space, obtaining joint audio-visual embeddings. These links are used to retrieve audio samples that fit well to a given silent video, and also to retrieve images that match a given audio query. The results in terms of Recall@K obtained over a subset of YouTube-8M videos show the potential of this unsupervised approach for cross-modal feature learning. We train embeddings for both scales and assess their quality in a retrieval problem, formulated as using the feature extracted from one modality to retrieve the most similar videos based on the features computed in the other modality.
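The retrieval evaluation described above can be illustrated with a minimal sketch. It assumes that each query embedding has exactly one correct match at the same index in the other modality, and that similarity in the joint space is measured with cosine similarity; both are assumptions made for illustration, not details confirmed by the abstract. The function names and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recall_at_k(query_embs, target_embs, k):
    """Fraction of queries whose paired target (same index) ranks in the top k."""
    hits = 0
    for i, query in enumerate(query_embs):
        scores = [cosine_similarity(query, target) for target in target_embs]
        # Indices of targets sorted from most to least similar to the query.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(query_embs)


# Toy joint embeddings (hypothetical values): visual[i] is paired with audio[i].
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio = [[0.9, 0.1], [0.6, 0.8], [0.2, 0.9]]

recall_1 = recall_at_k(visual, audio, 1)  # video-to-audio retrieval
recall_2 = recall_at_k(visual, audio, 2)
```

With these toy vectors only the first video retrieves its own audio at rank 1, while all three find it within the top 2. Swapping the two arguments gives the audio-to-image direction.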

Code Implementations: 1 repo
