CV · RO · ML · May 11, 2016

Unsupervised Semantic Action Discovery from Video Collections

arXiv:1605.03324v1 · 4 citations
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of automatically understanding and structuring instructional videos, with applications in video analysis and retrieval. The advance is incremental because it builds on existing unsupervised and multimodal methods.

The paper tackles the problem of parsing instructional videos into semantic steps without supervision, using a joint generative model of visual and language cues to produce a storyline and textual descriptions for each step, and demonstrates semantically correct instruction discovery on complex YouTube videos.

Human communication takes many forms, including speech, text and instructional videos. It typically has an underlying structure, with a starting point, an ending, and certain objective steps between them. In this paper, we consider instructional videos, of which there are tens of millions on the Internet. We propose a method for parsing a video into such semantic steps in an unsupervised way. Our method is capable of providing a semantic "storyline" of the video composed of its objective steps. We accomplish this using both visual and language cues in a joint generative model. Our method can also provide a textual description for each of the identified semantic steps and video segments. We evaluate our method on a large number of complex YouTube videos and show that it discovers semantically correct instructions for a variety of tasks.

