CV MM | Nov 8, 2015

VideoStory Embeddings Recognize Events when Examples are Scarce

Amirhossein Habibian, Thomas Mensink, Cees G. M. Snoek

arXiv:1511.02492v14.511 citations

Originality: Highly original

AI Analysis

The paper addresses event recognition in videos for applications where labeled data is limited, offering a novel embedding method that improves on existing approaches.

The paper tackles event recognition in videos with scarce or no examples by learning a semantic representation called VideoStory from web videos and descriptions, achieving state-of-the-art accuracy for few- and zero-example recognition.

This paper addresses event recognition when video examples are scarce or even completely absent. The key in such a challenging setting is a semantic video representation. Rather than building the representation from individual attribute detectors and their annotations, we propose to learn the entire representation from freely available web videos and their descriptions using an embedding between video features and term vectors. In our proposed embedding, which we call VideoStory, the correlations between the terms are utilized to learn a more effective representation by optimizing a joint objective balancing descriptiveness and predictability. We show how learning the VideoStory using a multimodal predictability loss, including appearance, motion and audio features, results in a more predictable representation. We also propose a variant of VideoStory to recognize an event in video from just the important terms in a text query by introducing a term-sensitive descriptiveness loss. Our experiments on three challenging collections of web videos from the NIST TRECVID Multimedia Event Detection and Columbia Consumer Videos datasets demonstrate: i) the advantages of VideoStory over representations using attributes or alternative embeddings, ii) the benefit of fusing video modalities by an embedding over common strategies, iii) the complementarity of term-sensitive descriptiveness and multimodal predictability for event recognition without examples. By its ability to improve predictability on top of any underlying video feature while at the same time maximizing semantic descriptiveness, VideoStory leads to state-of-the-art accuracy for both few- and zero-example recognition of events in video.
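The abstract describes a joint objective that balances descriptiveness (the embedding should reconstruct the term vectors) against predictability (the embedding should be predictable from video features), but it does not state the exact formulation. The sketch below is one plausible reading, not the authors' implementation: it assumes squared-error losses with a ridge penalty and fits them by alternating least squares. The names `joint_loss`, `fit_embedding`, `A`, `S`, `W`, and `lam` are hypothetical and chosen only for illustration.

```python
import numpy as np

def joint_loss(Y, X, A, S, W, lam):
    """Assumed objective: descriptiveness + predictability + ridge penalty."""
    descriptiveness = np.sum((Y - A @ S) ** 2)  # terms reconstructed from embedding
    predictability = np.sum((S - W @ X) ** 2)   # embedding predicted from video features
    penalty = lam * (np.sum(A ** 2) + np.sum(S ** 2) + np.sum(W ** 2))
    return descriptiveness + predictability + penalty

def fit_embedding(Y, X, k=4, lam=0.1, iters=20, seed=0):
    """Alternating least squares on the assumed joint objective.

    Y: (terms x videos) term vectors, X: (features x videos) video features.
    Returns the term projection A, the feature projection W, and the loss trace.
    """
    rng = np.random.default_rng(seed)
    m, n = Y.shape
    d = X.shape[0]
    A = rng.normal(size=(m, k))
    S = rng.normal(size=(k, n))
    W = rng.normal(size=(k, d))
    eye_k, eye_d = np.eye(k), np.eye(d)
    losses = [joint_loss(Y, X, A, S, W, lam)]
    for _ in range(iters):
        # Each update is the exact minimizer for one block with the others fixed.
        A = np.linalg.solve(S @ S.T + lam * eye_k, S @ Y.T).T
        W = np.linalg.solve(X @ X.T + lam * eye_d, X @ S.T).T
        S = np.linalg.solve(A.T @ A + (1 + lam) * eye_k, A.T @ Y + W @ X)
        losses.append(joint_loss(Y, X, A, S, W, lam))
    return A, W, losses
```

Under these assumptions, a new video with features `x` would be embedded as `W @ x`, and a text query with term vector `y` could be mapped into the same space as `A.T @ y` for comparison when no video examples are available. Because every block update solves its subproblem exactly, the recorded loss never increases from one sweep to the next.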
