CV · Jun 23, 2022

Learning to Refactor Action and Co-occurrence Features for Temporal Action Localization

arXiv:2206.11493v1 · 47 citations · h-index: 31
Originality: Incremental advance
AI Analysis

This work addresses the challenge of isolating subtle human actions from distracting co-occurring elements in videos, such as context and background. It is an incremental advance for video analysis.

The paper tackles temporal action localization by decoupling the action and co-occurrence features of each video snippet and recombining them into a representation in which the action is more salient. The authors report significant improvements in localization performance on the THUMOS14 and ActivityNet v1.3 benchmarks.

The main challenge of Temporal Action Localization is to retrieve subtle human actions from various co-occurring ingredients, e.g., context and background, in an untrimmed video. While prior approaches have achieved substantial progress through devising advanced action detectors, they still suffer from these co-occurring ingredients which often dominate the actual action content in videos. In this paper, we explore two orthogonal but complementary aspects of a video snippet, i.e., the action features and the co-occurrence features. In particular, we develop a novel auxiliary task by decoupling these two types of features within a video snippet and recombining them to generate a new feature representation with more salient action information for accurate action localization. We term our method RefactorNet, which first explicitly factorizes the action content and regularizes its co-occurrence features, and then synthesizes a new action-dominated video representation. Extensive experimental results and ablation studies on THUMOS14 and ActivityNet v1.3 demonstrate that our new representation, combined with a simple action detector, can significantly improve the action localization performance.
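
To make the decouple-and-recombine idea concrete, here is a minimal Python sketch. It is not the RefactorNet architecture: the abstract does not specify the encoders, the regularizer, or the synthesis step, so everything below is an illustrative assumption. The names `linear`, `refactor_snippet`, `w_action`, `w_cooc`, and `cooc_scale` are hypothetical, the two branches are plain linear projections, and "regularizing" the co-occurrence features is stood in for by simply down-weighting them before recombination.

```python
def linear(x, w):
    """Project feature vector x with weight matrix w (a list of rows)."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def refactor_snippet(x, w_action, w_cooc, cooc_scale=0.5):
    """Split one snippet feature into two components and recombine them.

    Assumption: two linear branches stand in for learned encoders, and
    cooc_scale < 1 stands in for whatever regularization the method applies
    to the co-occurrence component.
    """
    action = linear(x, w_action)   # hypothetical action component
    cooc = linear(x, w_cooc)       # hypothetical co-occurrence component
    # Recombine so that the action component dominates the new feature.
    refactored = [a + cooc_scale * c for a, c in zip(action, cooc)]
    return action, cooc, refactored


# Toy example with a 2-dimensional snippet feature.
x = [1.0, 2.0]
w_action = [[1.0, 0.0], [0.0, 1.0]]   # identity: action branch keeps x
w_cooc = [[1.0, 1.0], [1.0, 1.0]]     # sums the inputs in each output
action, cooc, refactored = refactor_snippet(x, w_action, w_cooc)
print(refactored)  # [2.5, 3.5]
```

In a trained model both branches would be learned, and the recombined feature, rather than the raw snippet feature, would be passed to the action detector.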

