CV · Jun 1, 2020

Temporal Aggregate Representations for Long-Range Video Understanding

arXiv:2006.00830v2 · 30 citations
AI Analysis

The paper addresses future prediction in videos, which matters for applications such as robotics and surveillance, but its contribution is incremental: it builds on existing techniques such as max-pooling and attention.

The paper tackles long-range video understanding by introducing a multi-granular temporal aggregation framework, achieving state-of-the-art results in next-action and dense anticipation on the Breakfast, 50Salads, and EPIC-Kitchens datasets.

Future prediction, especially in long-range videos, requires reasoning from current and past observations. In this work, we address questions of temporal extent, scaling, and level of semantic abstraction with a flexible multi-granular temporal aggregation framework. We show that it is possible to achieve state of the art in both next action and dense anticipation with simple techniques such as max-pooling and attention. To demonstrate the anticipation capabilities of our model, we conduct experiments on Breakfast, 50Salads, and EPIC-Kitchens datasets, where we achieve state-of-the-art results. With minimal modifications, our model can also be extended for video segmentation and action recognition.
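To make the idea of aggregating at several temporal granularities with max-pooling and attention more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names (`max_pool_segments`, `multi_granular_pool`, `attend`), the choice of scales, and the scaled dot-product form of attention are all assumptions made for illustration. Frame features are assumed to be a list of equal-length vectors, one per frame.

```python
import math


def max_pool_segments(features, num_segments):
    """Split T frame vectors into contiguous chunks and max-pool each chunk.

    Hypothetical helper: a larger num_segments gives a finer temporal granularity.
    """
    t = len(features)
    if not 1 <= num_segments <= t:
        raise ValueError("num_segments must be between 1 and the number of frames")
    pooled = []
    for s in range(num_segments):
        start = s * t // num_segments
        end = (s + 1) * t // num_segments
        chunk = features[start:end]
        # Element-wise maximum over the frames in this chunk.
        pooled.append([max(column) for column in zip(*chunk)])
    return pooled


def multi_granular_pool(features, scales=(1, 2, 4)):
    """Pool the same sequence at several granularities (the scales are an assumption)."""
    return {k: max_pool_segments(features, k) for k in scales}


def attend(query, keys):
    """Scaled dot-product attention of one query vector over a list of key vectors.

    Returns the attention-weighted sum of the keys and the attention weights.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    summary = [
        sum(w * key[d] for w, key in zip(weights, keys))
        for d in range(len(query))
    ]
    return summary, weights


# Toy example: 4 frames with 2-dimensional features.
frames = [[1, 0], [0, 2], [3, 1], [2, 5]]
pooled = multi_granular_pool(frames)
# pooled[1] -> [[3, 5]]          (whole sequence)
# pooled[2] -> [[1, 2], [3, 5]]  (two halves)

# Let a "recent" representation (here simply the last frame) attend
# over the coarser summaries of the past.
summary, weights = attend(frames[-1], pooled[2])
```

In this sketch the coarse scales summarize long stretches of the past cheaply, while attention lets a representation of the recent observation decide which summaries matter for the prediction. How the paper actually combines the scales is not described in the abstract above.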

Code Implementations: 2 repos
