S3D: Single Shot multi-Span Detector via Fully 3D Convolutional Networks
This work addresses efficient and accurate activity detection in videos, with applications such as surveillance and video analysis, and offers a simpler alternative to multi-stage methods.
The paper tackles temporal activity detection in long, untrimmed videos by proposing S3D, a single-shot, end-to-end fully 3D convolutional network that predicts activity categories and durations directly from video streams. It achieves state-of-the-art performance on the THUMOS'14 benchmark while running at 1271 FPS.
In this paper, we present a novel Single Shot multi-Span Detector for temporal activity detection in long, untrimmed videos using a simple end-to-end fully three-dimensional convolutional (Conv3D) network. Our architecture, named S3D, encodes the entire video stream and discretizes the output space of temporal activity spans into a set of default spans over different temporal locations and scales. At prediction time, S3D scores the presence of each activity category in every default span and produces temporal adjustments relative to the span location to predict the precise activity duration. Unlike many state-of-the-art systems that require separate proposal and classification stages, S3D is intrinsically simple and designed specifically for single-shot, end-to-end temporal activity detection. When evaluated on the THUMOS'14 detection benchmark, S3D achieves state-of-the-art performance while operating efficiently at 1271 FPS.
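The idea of default spans and temporal adjustments can be illustrated with a minimal sketch. The abstract does not give the exact parameterization, so everything here is an assumption made for illustration: the function names `make_default_spans` and `decode_span` are hypothetical, time is normalized to [0, 1], and the offset encoding (center shift scaled by span width, log-scale width change) is borrowed from SSD-style box regression rather than confirmed by the paper.

```python
import math

def make_default_spans(num_locations, scales):
    """Tile default spans over temporal locations and scales.

    Returns (center, width) pairs in normalized [0, 1] time.
    Hypothetical helper; the paper's actual layout may differ.
    """
    spans = []
    for i in range(num_locations):
        center = (i + 0.5) / num_locations  # middle of the i-th temporal cell
        for width in scales:
            spans.append((center, width))
    return spans

def decode_span(default_span, offsets):
    """Apply predicted adjustments to one default span.

    offsets = (d_center, d_width), using an assumed SSD-style encoding:
    the center shifts by d_center * width, the width scales by exp(d_width).
    Returns (start, end) clipped to [0, 1].
    """
    center, width = default_span
    d_center, d_width = offsets
    new_center = center + d_center * width
    new_width = width * math.exp(d_width)
    start = max(0.0, new_center - new_width / 2)
    end = min(1.0, new_center + new_width / 2)
    return start, end

# Usage: 4 temporal locations, 2 scales -> 8 default spans.
default_spans = make_default_spans(4, [0.25, 0.5])
# Zero offsets leave the default span unchanged.
start, end = decode_span((0.5, 0.25), (0.0, 0.0))
```

In a full detector, the network would emit one set of class scores and one offset pair per default span, and the decoded spans would then be filtered by score and overlap.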