Spatio-Temporal Action Detection Under Large Motion
This work addresses a specific bottleneck in video action detection for applications such as sports analysis, but its contribution is incremental, since it builds on existing tracking and feature aggregation techniques.
The paper tackles spatio-temporal action detection under large motion, where existing methods fail because cuboid-based feature pooling does not follow the actor. It proposes track-aware feature aggregation, which achieves state-of-the-art performance on the MultiSports dataset with consistent improvements, especially for high-motion actions.
Current methods for spatio-temporal action tube detection often extend a bounding box proposal at a given keyframe into a 3D temporal cuboid and pool features from nearby frames. However, such pooling fails to accumulate meaningful spatio-temporal features if the position or shape of the actor varies substantially across frames, for example because of large camera motion, large actor shape deformation, or fast actor movement. In this work, we study the performance of cuboid-aware feature aggregation in action detection under large motion. Further, we propose to enhance actor feature representation under large motion by tracking actors and performing temporal feature aggregation along the respective tracks. We define actor motion by the intersection-over-union (IoU) between the boxes of action tubes/tracks at various fixed time scales: an action with large motion yields lower IoU over time, whereas slower actions maintain higher IoU. We find that track-aware feature aggregation consistently achieves a large improvement in action detection performance over the cuboid-aware baseline, especially for actions under large motion. As a result, we also report state-of-the-art results on the large-scale MultiSports dataset. Code is available at https://github.com/gurkirt/ActionTrackDetectron.
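The IoU-based notion of actor motion described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the released code: the function names `box_iou` and `mean_track_iou`, the `(x1, y1, x2, y2)` box format, and the choice to average IoU over all box pairs separated by a fixed frame offset are assumptions made for illustration only.

```python
from typing import Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) in pixels, with x2 >= x1 and y2 >= y1.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_track_iou(track: Sequence[Box], step: int) -> float:
    """Average IoU between boxes `step` frames apart along one track.

    Hypothetical motion measure: a low value suggests large motion,
    a high value suggests a slow or static actor.
    """
    if step <= 0 or len(track) <= step:
        raise ValueError("track must contain more than `step` boxes")
    ious = [box_iou(track[t], track[t + step]) for t in range(len(track) - step)]
    return sum(ious) / len(ious)


# Toy tracks: a static actor, and an actor shifting 5 px per frame.
static_track = [(0.0, 0.0, 10.0, 10.0)] * 4
moving_track = [(5.0 * t, 0.0, 5.0 * t + 10.0, 10.0) for t in range(4)]

static_score = mean_track_iou(static_track, step=1)
slow_score = mean_track_iou(moving_track, step=1)
fast_score = mean_track_iou(moving_track, step=2)
```

In this toy example the same moving track scores lower at the longer time scale (`step=2`) than at the shorter one, which matches the intuition that larger motion yields lower IoU over time.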