CV · AI · Jan 21, 2021

Activity Graph Transformer for Temporal Action Localization

arXiv:2101.08540v2 · 83 citations
AI Analysis

This work addresses temporal action localization: detecting the action instances in an untrimmed video and determining when each one starts and ends. It proposes a graph-based model that the authors report outperforms existing methods.

The paper introduces the Activity Graph Transformer, which models a video as a graph rather than a sequence so that it can handle non-linear temporal structure such as overlapping or re-occurring actions. The authors report that it outperforms state-of-the-art methods on the THUMOS14, Charades, and EPIC-Kitchens-100 datasets.

We introduce Activity Graph Transformer, an end-to-end learnable model for temporal action localization, that receives a video as input and directly predicts a set of action instances that appear in the video. Detecting and localizing action instances in untrimmed videos requires reasoning over multiple action instances in a video. The dominant paradigms in the literature process videos temporally to either propose action regions or directly produce frame-level detections. However, sequential processing of videos is problematic when the action instances have non-sequential dependencies and/or non-linear temporal ordering, such as overlapping action instances or re-occurrence of action instances over the course of the video. In this work, we capture this non-linear temporal structure by reasoning over the videos as non-sequential entities in the form of graphs. We evaluate our model on challenging datasets: THUMOS14, Charades, and EPIC-Kitchens-100. Our results show that our proposed model outperforms the state-of-the-art by a considerable margin.
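To make the abstract's terms concrete, the following minimal sketch shows what "a set of action instances" with overlapping and re-occurring actions might look like as data. It is not the paper's model or code: the `ActionInstance` type, the helper functions, and the example labels and timestamps are all hypothetical, chosen only for illustration. Temporal intersection-over-union is included because it is a common way to compare two time intervals in this task.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionInstance:
    """Hypothetical record for one action instance: a label and a time span."""
    label: str
    start: float  # seconds from the beginning of the video
    end: float    # seconds from the beginning of the video


def temporal_iou(a: ActionInstance, b: ActionInstance) -> float:
    """Intersection-over-union of two time intervals (0.0 if disjoint)."""
    inter = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - inter
    return inter / union if union > 0 else 0.0


def overlapping_pairs(instances):
    """Return pairs of instances whose time spans intersect."""
    ordered = sorted(instances, key=lambda x: x.start)
    pairs = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b.start >= a.end:
                break  # sorted by start, so no later instance overlaps a
            pairs.append((a, b))
    return pairs


def recurring_labels(instances):
    """Return labels that occur more than once in the video."""
    counts = Counter(x.label for x in instances)
    return sorted(label for label, n in counts.items() if n > 1)


# Made-up example: one action re-occurs, and two actions overlap in time.
instances = [
    ActionInstance("open fridge", 1.0, 4.0),
    ActionInstance("hold cup", 2.0, 10.0),
    ActionInstance("open fridge", 12.0, 15.0),
]

print(overlapping_pairs(instances))
print(recurring_labels(instances))
print(temporal_iou(instances[0], instances[1]))
```

In this toy video, a strictly sequential reading has no natural place for "hold cup", which spans the first "open fridge", nor for the link between the two "open fridge" instances. This is the kind of non-linear structure the abstract refers to.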
