SC-Transformer++: Structured Context Transformer for Generic Event Boundary Detection
This work addresses the problem of detecting event boundaries in videos for computer vision applications, but it is incremental as it builds directly on an existing method.
The authors improve the Structured Context Transformer method for generic event boundary detection by adding a transformer decoder module, introducing optical flow as a new modality, and ensembling models. The resulting method achieves an 86.49% F1 score on the Kinetics-GEBD test set, a 2.86% F1 improvement over the previous state-of-the-art.
This report presents the algorithm used in the submission to the Generic Event Boundary Detection (GEBD) Challenge at CVPR 2022. In this work, we improve the existing Structured Context Transformer (SC-Transformer) method for GEBD. Specifically, a transformer decoder module is added after the transformer encoders to extract high-quality frame features. The final classification is performed jointly on the outputs of the original binary classifier and a newly introduced multi-class classifier branch. To enrich motion information, optical flow is introduced as a new modality. Finally, model ensembling is used to further boost performance. The proposed method achieves an 86.49% F1 score on the Kinetics-GEBD test set, which improves the F1 score by 2.86% over the previous SOTA method.
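The joint classification and ensembling steps can be illustrated with a minimal sketch. The abstract does not specify how the two classifier outputs are combined or how ensemble members are merged, so everything below is an assumption made for illustration: the function names `fuse_heads` and `ensemble` are hypothetical, class index 0 of the multi-class branch is assumed to mean "no boundary", the two heads are combined by a weighted average, and ensemble members (for example an RGB model and an optical-flow model) are averaged uniformly per frame.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_heads(binary_logit: float,
               class_logits: Sequence[float],
               weight: float = 0.5) -> float:
    """Combine the two classifier branches into one boundary score.

    Assumption (not from the report): class 0 is 'no boundary', so the
    multi-class boundary probability is 1 - P(class 0), and the heads
    are mixed with a fixed weight.
    """
    p_binary = sigmoid(binary_logit)
    p_multi = 1.0 - softmax(class_logits)[0]
    return weight * p_binary + (1.0 - weight) * p_multi


def ensemble(scores_per_model: Sequence[Sequence[float]]) -> List[float]:
    """Average per-frame boundary scores across models (assumed uniform)."""
    n_models = len(scores_per_model)
    return [sum(frame) / n_models for frame in zip(*scores_per_model)]


# Hypothetical per-frame logits for two frames from two models.
rgb_scores = [fuse_heads(2.0, [0.1, 1.5, 0.3]), fuse_heads(-3.0, [4.0, 0.2, 0.1])]
flow_scores = [fuse_heads(1.0, [0.5, 1.0, 0.2]), fuse_heads(-2.0, [3.0, 0.1, 0.4])]
final_scores = ensemble([rgb_scores, flow_scores])
```

In this sketch, a frame would be marked as a boundary when its entry in `final_scores` exceeds a chosen threshold. The threshold and the fusion weight would need to be tuned on validation data.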