CV · May 13, 2019

VideoGraph: Recognizing Minutes-Long Human Activities in Videos

arXiv:1905.05143v2 · 87 citations
Originality: Incremental advance
AI Analysis

The paper addresses the challenge of modeling long-term temporal dependencies in video activity recognition, with applications such as surveillance and human-computer interaction. It is an incremental advance because it builds on existing graph-based and temporal modeling approaches.

The paper tackles the problem of recognizing minutes-long human activities in videos by proposing VideoGraph, a graph-based representation that learns temporal structure directly from video datasets without node-level annotation. It reports improvements over related work on the Epic-Kitchen and Breakfast benchmarks.

Many human activities take minutes to unfold. To represent them, related works opt for statistical pooling, which neglects the temporal structure. Others opt for convolutional methods, as CNN and Non-Local. While successful in learning temporal concepts, they are short of modeling minutes-long temporal dependencies. We propose VideoGraph, a method to achieve the best of two worlds: represent minutes-long human activities and learn their underlying temporal structure. VideoGraph learns a graph-based representation for human activities. The graph, its nodes and edges are learned entirely from video datasets, making VideoGraph applicable to problems without node-level annotation. The result is improvements over related works on benchmarks: Epic-Kitchen and Breakfast. Besides, we demonstrate that VideoGraph is able to learn the temporal structure of human activities in minutes-long videos.
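
The abstract's remark that statistical pooling neglects temporal structure can be illustrated with a small sketch. This is not VideoGraph itself and not code from the paper: the segment features, their dimensions, and the helper names `mean_pool` and `order_aware` are all hypothetical, chosen only to show that averaging over time gives the same vector whatever order the segments arrive in.

```python
import numpy as np

# Hypothetical data: 6 video segments, each a 4-dimensional feature vector.
rng = np.random.default_rng(0)
segments = rng.normal(size=(6, 4))

# The same segments played in reverse temporal order.
shuffled = segments[::-1]

def mean_pool(features):
    """Average over the time axis; segment order is discarded."""
    return features.mean(axis=0)

def order_aware(features):
    """Concatenate segments in sequence, so the result depends on order."""
    return features.reshape(-1)

# Mean pooling cannot tell the two orderings apart...
same_pooled = np.allclose(mean_pool(segments), mean_pool(shuffled))
# ...whereas an order-preserving representation can.
same_ordered = np.allclose(order_aware(segments), order_aware(shuffled))

print(same_pooled, same_ordered)  # True False
```

An activity whose steps happen in reverse order therefore yields the same pooled descriptor as the original, which is the limitation the abstract attributes to pooling-based methods.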

