CVJun 5, 2015

Learning to track for spatio-temporal action localization

arXiv:1506.01929v2345 citations
AI Analysis

This work addresses the problem of accurately localizing actions in time and space in videos for computer vision applications, representing an incremental advance with strong specific gains.

The paper tackles spatio-temporal action localization in videos by combining frame-level detection, tracking, and track-level scoring, achieving state-of-the-art performance with mAP improvements of 15%, 7%, and 12% on UCF-Sports, J-HMDB, and UCF-101 datasets.

We propose an effective approach for spatio-temporal action localization in realistic videos. The approach first detects proposals at the frame-level and scores them with a combination of static and motion CNN features. It then tracks high-scoring proposals throughout the video using a tracking-by-detection approach. Our tracker relies simultaneously on instance-level and class-level detectors. The tracks are scored using a spatio-temporal motion histogram, a descriptor at the track level, in combination with the CNN features. Finally, we perform temporal localization of the action using a sliding-window approach at the track level. We present experimental results for spatio-temporal localization on the UCF-Sports, J-HMDB and UCF-101 action localization datasets, where our approach outperforms the state of the art with a margin of 15%, 7% and 12% respectively in mAP.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes