CV · Dec 4, 2018

Timeception for Complex Action Recognition

arXiv:1812.01289v2 · 231 citations
AI Analysis

This work addresses the challenge of temporal modeling in video action recognition for applications like surveillance or human-computer interaction, representing an incremental improvement over existing methods.

The paper tackles the problem of recognizing complex human actions in videos by addressing the limitations of fixed-kernel 3D convolutions in capturing varied temporal extents, resulting in Timeception layers that model minute-long patterns and achieve impressive accuracy on datasets like Charades, Breakfast Actions, and MultiTHUMOS.

This paper focuses on the temporal aspect of recognizing human activities in videos, an important visual cue that has long been undervalued. We revisit the conventional definition of activity and restrict it to Complex Action: a set of one-actions with a weak temporal pattern that serves a specific purpose. Related works use spatiotemporal 3D convolutions with a fixed kernel size, which are too rigid to capture the variety in temporal extents of complex actions and too short for long-range temporal modeling. In contrast, we use multi-scale temporal convolutions, and we reduce the complexity of 3D convolutions. The outcome is Timeception convolution layers, which reason about minute-long temporal patterns, a factor of 8 longer than the best related works. As a result, Timeception achieves impressive accuracy in recognizing the human activities of Charades, Breakfast Actions, and MultiTHUMOS. Further, we demonstrate that Timeception learns long-range temporal dependencies and tolerates the temporal extents of complex actions.
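The two ideas named in the abstract, multi-scale temporal convolutions and cheaper-than-3D convolutions, can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The kernel sizes (3, 5, 7), the averaging kernels standing in for learned weights, the depthwise temporal-only design used for the parameter count, and all function names are assumptions made for illustration.

```python
def temporal_conv_same(signal, kernel):
    """1D correlation along time, zero-padded so output length equals input length.

    Assumes an odd kernel length so the padding is symmetric.
    """
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(signal))
    ]


def multi_scale_temporal(signal, kernel_sizes=(3, 5, 7)):
    """Apply one temporal kernel per scale to the same channel and keep every output.

    Averaging kernels are a stand-in for learned weights (illustrative assumption).
    """
    outputs = {}
    for k in kernel_sizes:
        kernel = [1.0 / k] * k
        outputs[k] = temporal_conv_same(signal, kernel)
    return outputs


def conv3d_params(c_in, c_out, t, h, w):
    """Number of weights in a standard 3D convolution (bias ignored)."""
    return c_in * c_out * t * h * w


def depthwise_temporal_params(channels, t):
    """Number of weights in a depthwise, temporal-only convolution:
    one t-tap kernel per channel (bias ignored)."""
    return channels * t


if __name__ == "__main__":
    # One channel's activation over 8 timesteps: a short burst in the middle.
    x = [0, 0, 0, 1, 1, 0, 0, 0]
    for k, y in multi_scale_temporal(x).items():
        print(f"kernel size {k}:", [round(v, 2) for v in y])

    # Hypothetical layer sizes, chosen only to compare weight counts.
    print("full 3D conv weights:", conv3d_params(256, 256, 3, 3, 3))
    print("depthwise temporal weights:", depthwise_temporal_params(256, 3))
```

Wider kernels spread the same burst over more timesteps, which is the sense in which parallel kernels of different sizes can respond to actions of different temporal extents. The weight count shows why dropping the spatial taps and the cross-channel mixing from a temporal layer makes it far cheaper than a full 3D convolution.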

Code Implementations: 3 repos
