CV
Apr 20, 2018

Rethinking the Faster R-CNN Architecture for Temporal Action Localization

arXiv:1804.07667v1
682 citations
Originality: Incremental advance
AI Analysis

This work addresses temporal action localization in video, offering incremental improvements over existing approaches.

The paper tackled temporal action localization in video by proposing TAL-Net, which improved receptive field alignment, temporal context exploitation, and multi-stream feature fusion, achieving state-of-the-art performance on THUMOS'14 and competitive results on ActivityNet.

We propose TAL-Net, an improved approach to temporal action localization in video that is inspired by the Faster R-CNN object detection framework. TAL-Net addresses three key shortcomings of existing approaches: (1) we improve receptive field alignment using a multi-scale architecture that can accommodate extreme variation in action durations; (2) we better exploit the temporal context of actions for both proposal generation and action classification by appropriately extending receptive fields; and (3) we explicitly consider multi-stream feature fusion and demonstrate that fusing motion late is important. We achieve state-of-the-art performance for both action proposal and localization on the THUMOS'14 detection benchmark and competitive performance on the ActivityNet challenge.
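To make points (2) and (3) of the abstract more concrete, here is a minimal Python sketch of the two ideas: widening a candidate segment so that it also covers surrounding temporal context, and fusing per-stream scores late. It is an illustration only, not the authors' implementation. The function names, the default `context_ratio` of 0.5, and plain averaging as the fusion rule are all assumptions made for this example.

```python
from typing import List, Tuple


def extend_segment(
    start: float,
    end: float,
    context_ratio: float = 0.5,  # assumed value, chosen for illustration
    video_length: float = float("inf"),
) -> Tuple[float, float]:
    """Widen [start, end] by context_ratio * duration on each side.

    The result is clipped to [0, video_length] so it stays inside the video.
    """
    if end < start:
        raise ValueError("end must not be smaller than start")
    pad = context_ratio * (end - start)
    return max(0.0, start - pad), min(video_length, end + pad)


def late_fuse(rgb_scores: List[float], flow_scores: List[float]) -> List[float]:
    """Average per-class scores from two streams (assumed fusion rule).

    "Late" here means each stream is scored separately and only the
    final scores are combined, rather than merging features early.
    """
    if len(rgb_scores) != len(flow_scores):
        raise ValueError("both streams must score the same number of classes")
    return [(r + f) / 2.0 for r, f in zip(rgb_scores, flow_scores)]


if __name__ == "__main__":
    # A 10-second segment gains 5 seconds of context on each side.
    print(extend_segment(10.0, 20.0))
    # Scores from the two hypothetical streams are averaged per class.
    print(late_fuse([1.0, 3.0], [3.0, 5.0]))
```

In this sketch the context window scales with the segment's own duration, so short and long actions receive proportionally similar amounts of context.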

