CVJun 5, 2018

Mining for meaning: from vision to language through multiple networks consensus

arXiv:1806.01954v21 citations
Originality Incremental advance
AI Analysis

This work addresses the problem of generating accurate and fluent natural language descriptions for videos, which is incremental as it builds on existing encoder-decoder methods.

The authors tackled video captioning by using a consensus among multiple encoder-decoder networks to select the best description, achieving state-of-the-art results on the MSR-VTT dataset.

Describing visual data into natural language is a very challenging task, at the intersection of computer vision, natural language processing and machine learning. Language goes well beyond the description of physical objects and their interactions and can convey the same abstract idea in many ways. It is both about content at the highest semantic level as well as about fluent form. Here we propose an approach to describe videos in natural language by reaching a consensus among multiple encoder-decoder networks. Finding such a consensual linguistic description, which shares common properties with a larger group, has a better chance to convey the correct meaning. We propose and train several network architectures and use different types of image, audio and video features. Each model produces its own description of the input video and the best one is chosen through an efficient, two-phase consensus process. We demonstrate the strength of our approach by obtaining state of the art results on the challenging MSR-VTT dataset.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes