CV | Jun 5, 2018

Mining for meaning: from vision to language through multiple networks consensus

Iulia Duta, Andrei Liviu Nicolicioiu, Simion-Vlad Bogolin, Marius Leordeanu

arXiv:1806.01954v2 | 1.71 citations

Originality: Incremental advance

AI Analysis

This work addresses the problem of generating accurate and fluent natural language descriptions for videos. The advance is incremental, as it builds on existing encoder-decoder methods.

The authors tackled video captioning by using a consensus among multiple encoder-decoder networks to select the best description, achieving state-of-the-art results on the MSR-VTT dataset.

Describing visual data in natural language is a very challenging task, at the intersection of computer vision, natural language processing and machine learning. Language goes well beyond the description of physical objects and their interactions and can convey the same abstract idea in many ways. It is both about content at the highest semantic level and about fluent form. Here we propose an approach to describe videos in natural language by reaching a consensus among multiple encoder-decoder networks. Such a consensual linguistic description, which shares common properties with a larger group, has a better chance of conveying the correct meaning. We propose and train several network architectures and use different types of image, audio and video features. Each model produces its own description of the input video and the best one is chosen through an efficient, two-phase consensus process. We demonstrate the strength of our approach by obtaining state-of-the-art results on the challenging MSR-VTT dataset.
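
The abstract does not spell out the two phases of the consensus process, so the following is only a minimal sketch of the general idea of picking the caption that agrees most with the rest of the group. The function names (`jaccard`, `consensus_caption`), the token-overlap similarity, and the example captions are all assumptions made for illustration and are not taken from the paper; the authors' actual scoring and second phase may differ.

```python
from typing import List


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two captions (a stand-in similarity, assumed here)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def consensus_caption(candidates: List[str]) -> str:
    """Return the candidate with the highest mean similarity to all other candidates."""
    if not candidates:
        raise ValueError("need at least one candidate caption")
    if len(candidates) == 1:
        return candidates[0]

    def mean_sim(i: int) -> float:
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(jaccard(candidates[i], o) for o in others) / len(others)

    best = max(range(len(candidates)), key=mean_sim)
    return candidates[best]


# Hypothetical outputs from four different captioning models for one video.
captions = [
    "a man is playing a guitar",
    "a man plays a guitar on stage",
    "a person is playing a guitar",
    "a dog runs in a field",
]
print(consensus_caption(captions))
```

In this toy example the outlier caption about a dog scores lowest because it shares almost no words with the others, which mirrors the intuition that a description sharing properties with the larger group is more likely to be correct. A realistic system would likely replace the token overlap with a captioning metric or a learned scorer.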

