Learning to track for spatio-temporal action localization
This work addresses accurate localization of actions in both space and time in videos. It is an incremental advance, but one with substantial measured gains on standard benchmarks.
The paper tackles spatio-temporal action localization in videos by combining frame-level detection, tracking, and track-level scoring, and reports state-of-the-art performance, with mAP improvements of 15%, 7%, and 12% on the UCF-Sports, J-HMDB, and UCF-101 datasets, respectively.
We propose an effective approach for spatio-temporal action localization in realistic videos. The approach first detects proposals at the frame level and scores them with a combination of static and motion CNN features. It then tracks high-scoring proposals throughout the video using a tracking-by-detection approach. Our tracker relies simultaneously on instance-level and class-level detectors. The tracks are scored using a spatio-temporal motion histogram, a descriptor at the track level, in combination with the CNN features. Finally, we perform temporal localization of the action using a sliding-window approach at the track level. We present experimental results for spatio-temporal localization on the UCF-Sports, J-HMDB and UCF-101 action localization datasets, where our approach outperforms the state of the art by a margin of 15%, 7% and 12% respectively in mAP.
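The final step, sliding-window temporal localization at the track level, can be illustrated with a minimal sketch. It is not the authors' implementation: the function name, the per-frame scores, the candidate window lengths, and the choice of the mean score as the window criterion are all assumptions made for illustration, since the abstract does not specify them.

```python
from typing import Sequence, Tuple


def best_temporal_window(
    frame_scores: Sequence[float],
    window_lengths: Sequence[int],
    stride: int = 1,
) -> Tuple[int, int, float]:
    """Slide windows of several lengths along one track.

    Returns (start, end_exclusive, mean_score) of the best window.
    Scoring a window by its mean per-frame score is an assumption
    of this sketch, not a detail taken from the paper.
    """
    n = len(frame_scores)
    if n == 0:
        raise ValueError("frame_scores must not be empty")
    if stride < 1:
        raise ValueError("stride must be at least 1")

    # Prefix sums give the sum of any window in constant time.
    prefix = [0.0]
    for score in frame_scores:
        prefix.append(prefix[-1] + score)

    best = None
    for length in window_lengths:
        if length <= 0 or length > n:
            continue  # skip windows that cannot fit on this track
        for start in range(0, n - length + 1, stride):
            end = start + length
            mean = (prefix[end] - prefix[start]) / length
            # Strict comparison keeps the earliest window on ties.
            if best is None or mean > best[2]:
                best = (start, end, mean)

    if best is None:
        raise ValueError("no window length fits the track")
    return best


# Hypothetical per-frame action scores along a single track.
scores = [0.1, 0.1, 0.9, 0.8, 0.9, 0.1]
start, end, mean = best_temporal_window(scores, window_lengths=[2, 3, 4])
```

In this toy track the selected window covers frames 2 to 4 inclusive, where the scores are high. A plain mean tends to favor short windows, so a practical system would likely restrict the candidate lengths or use a different window score.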