CV | Jul 25, 2024

Harnessing Temporal Causality for Advanced Temporal Action Detection

arXiv:2407.17792v2 | 10 citations | h-index: 14 | Has Code
AI Analysis

This paper addresses action boundary detection in untrimmed videos, with applications in video understanding. The contribution appears incremental, since it builds on existing temporal modeling approaches.

The paper proposes CausalTAD for temporal action detection in videos. CausalTAD exploits temporal causality by restricting the model's access to only past or only future context, and it reports state-of-the-art performance, including 1st-place rankings in several tracks of the EPIC-Kitchens Challenge 2024 and the Ego4D Challenge 2024.

As a fundamental task in long-form video understanding, temporal action detection (TAD) aims to capture inherent temporal relations in untrimmed videos and identify candidate actions with precise boundaries. Over the years, various networks, including convolutions, graphs, and transformers, have been explored for effective temporal modeling for TAD. However, these modules typically treat past and future information equally, overlooking the crucial fact that changes in action boundaries are essentially causal events. Inspired by this insight, we propose leveraging the temporal causality of actions to enhance TAD representation by restricting the model's access to only past or future context. We introduce CausalTAD, which combines causal attention and causal Mamba to achieve state-of-the-art performance on multiple benchmarks. Notably, with CausalTAD, we ranked 1st in the Action Recognition, Action Detection, and Audio-Based Interaction Detection tracks at the EPIC-Kitchens Challenge 2024, as well as 1st in the Moment Queries track at the Ego4D Challenge 2024. Our code is available at https://github.com/sming256/OpenTAD/.
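
To make the idea of "restricting the model's access to only past or future context" concrete, here is a minimal sketch of direction-restricted self-attention over a sequence of frame features. It is an illustration under stated assumptions, not the CausalTAD implementation: the function names `causal_mask` and `masked_attention` are hypothetical, the query/key/value projections are taken to be the identity, and the causal Mamba branch mentioned in the abstract is not modeled at all.

```python
import math


def causal_mask(length, direction="past"):
    """Return mask[i][j] == True where frame i is allowed to attend to frame j.

    "past" lets frame i see frames 0..i; "future" lets it see frames i..length-1.
    (Hypothetical helper for illustration only.)
    """
    if direction == "past":
        return [[j <= i for j in range(length)] for i in range(length)]
    if direction == "future":
        return [[j >= i for j in range(length)] for i in range(length)]
    raise ValueError("direction must be 'past' or 'future'")


def masked_attention(frames, mask):
    """Single-head self-attention with identity projections (an assumption).

    frames: list of equal-length feature vectors, one per time step.
    mask:   boolean matrix from causal_mask; disallowed positions get zero weight.
    """
    dim = len(frames[0])
    outputs = []
    for i, query in enumerate(frames):
        allowed = [j for j in range(len(frames)) if mask[i][j]]
        # Scaled dot-product scores, computed only over the allowed positions.
        scores = [
            sum(a * b for a, b in zip(query, frames[j])) / math.sqrt(dim)
            for j in allowed
        ]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        outputs.append(
            [
                sum(w / total * frames[j][c] for w, j in zip(exps, allowed))
                for c in range(dim)
            ]
        )
    return outputs


# Toy sequence of 5 frames with 2-dimensional features.
features = [[0.1 * t, 1.0 - 0.2 * t] for t in range(5)]
past_out = masked_attention(features, causal_mask(5, "past"))
future_out = masked_attention(features, causal_mask(5, "future"))
```

With the "past" mask, the output at a given frame cannot change when later frames change, and the "future" mask gives the mirrored property. How the two directions are combined in CausalTAD is not described in the abstract, so the sketch leaves them as two separate outputs.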

Code Implementations: 1 repo
