TriDet: Temporal Action Detection with Relative Boundary Modeling
This work addresses the challenge of ambiguous action boundaries in video analysis for applications like surveillance and content indexing, representing an incremental improvement over existing methods.
The paper tackles the problem of imprecise boundary predictions in temporal action detection by proposing TriDet, a one-stage framework with a Trident-head for relative boundary modeling and an SGP layer for feature aggregation. TriDet achieves state-of-the-art performance with an average mAP of 69.3% on THUMOS14, outperforming the previous best method by 2.5% while requiring only 74.6% of its latency.
In this paper, we present TriDet, a one-stage framework for temporal action detection. Existing methods often suffer from imprecise boundary predictions because action boundaries in videos are ambiguous. To alleviate this problem, we propose a novel Trident-head that models each action boundary via an estimated relative probability distribution around the boundary. In the feature pyramid of TriDet, we propose an efficient Scalable-Granularity Perception (SGP) layer that mitigates the rank loss problem that self-attention exhibits on video features and aggregates information across different temporal granularities. Benefiting from the Trident-head and the SGP-based feature pyramid, TriDet achieves state-of-the-art performance on three challenging benchmarks, THUMOS14, HACS and EPIC-KITCHENS-100, at a lower computational cost than previous methods. For example, TriDet reaches an average mAP of $69.3\%$ on THUMOS14, outperforming the previous best by $2.5\%$, with only $74.6\%$ of its latency. The code is released at https://github.com/sssste/TriDet.
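As a rough illustration of what modeling a boundary through a relative probability distribution can look like, the following sketch decodes a boundary offset as the expected value of a softmax distribution over discrete bins around a time step. The bin layout, the function names (`softmax`, `expected_boundary_offset`, `decode_segment`), and the way start and end offsets are combined are assumptions made for illustration. They are not taken from the released code, and the actual Trident-head may differ in its details.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_boundary_offset(bin_logits):
    """Expected offset, in bins, under a softmax distribution.

    Assumption: bin b (0-indexed) stands for a boundary lying b steps
    away from the current time step.
    """
    probs = softmax(bin_logits)
    return sum(b * p for b, p in enumerate(probs))


def decode_segment(t, start_logits, end_logits):
    """Turn per-bin logits at time step t into a (start, end) pair.

    Hypothetical helper: the start lies before t and the end after t,
    each at the expected offset of its own distribution.
    """
    d_start = expected_boundary_offset(start_logits)
    d_end = expected_boundary_offset(end_logits)
    return t - d_start, t + d_end


if __name__ == "__main__":
    # Sharply peaked logits place the boundary close to a single bin,
    # while flat logits spread the estimate across all bins.
    start, end = decode_segment(10, [0.0, 0.0, 8.0, 0.0, 0.0], [0.0] * 5)
    print(f"start={start:.3f}, end={end:.3f}")
```

Taking an expectation over bins, rather than regressing a single scalar, lets an uncertain boundary be expressed as a spread of probability mass, which is one plausible reading of why a relative distribution helps with ambiguous boundaries.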