Self-Feedback DETR for Temporal Action Detection
This work addresses a specific bottleneck in video analysis for applications like surveillance or content indexing, but it is incremental as it builds on existing DETR methods.
The paper tackles the temporal collapse problem in DETR-based models for Temporal Action Detection, where self-attention modules focus on only a few elements and performance degrades as a result. The proposed Self-DETR framework uses the decoder's cross-attention maps to reactivate self-attention, which resolves the collapse and maintains high attention diversity across layers.
Temporal Action Detection (TAD) is challenging but fundamental for real-world video applications. Recently, DETR-based models have been devised for TAD, but they have not yet performed well. In this paper, we point out a problem in the self-attention of DETR for TAD: the attention modules focus on only a few key elements, which we call the temporal collapse problem. It degrades the capability of the encoder and decoder, since their self-attention modules effectively play no role. To solve the problem, we propose a novel framework, Self-DETR, which utilizes the cross-attention maps of the decoder to reactivate the self-attention modules. We recover the relationships between encoder features by a simple matrix multiplication of the cross-attention map and its transpose. Likewise, we obtain the relationships among decoder queries. By guiding the collapsed self-attention maps with the computed guidance maps, we resolve the temporal collapse of the self-attention modules in both the encoder and the decoder. Our extensive experiments demonstrate that Self-DETR resolves the temporal collapse problem by keeping a high diversity of attention over all layers.
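The guidance-map construction described in the abstract can be illustrated with a minimal sketch. It assumes a cross-attention map A of shape (number of decoder queries, number of encoder features), so that A^T A relates encoder features to each other and A A^T relates decoder queries to each other. The function names, the toy values, and the row normalization step are assumptions made for illustration; the abstract does not specify how the maps are normalized or how they are applied to the self-attention modules.

```python
# Illustrative sketch only: names, toy data, and the row normalization
# are assumptions, not details taken from the paper.

def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    b_t = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_t] for row in a]


def row_normalize(m):
    """Scale each row to sum to 1 (assumed here so rows resemble attention weights)."""
    out = []
    for row in m:
        total = sum(row)
        out.append([v / total for v in row] if total > 0 else list(row))
    return out


def guidance_maps(cross_attn):
    """Build guidance maps from a cross-attention map.

    cross_attn has shape (n_queries, n_features): row i holds the attention
    of decoder query i over the encoder features.
    """
    a_t = transpose(cross_attn)
    encoder_guide = row_normalize(matmul(a_t, cross_attn))  # (n_features, n_features)
    decoder_guide = row_normalize(matmul(cross_attn, a_t))  # (n_queries, n_queries)
    return encoder_guide, decoder_guide


# Toy cross-attention map: 2 decoder queries over 4 encoder features.
cross_attn = [
    [0.7, 0.2, 0.1, 0.0],
    [0.1, 0.1, 0.4, 0.4],
]
encoder_guide, decoder_guide = guidance_maps(cross_attn)
```

In this sketch, two encoder features receive a large entry in `encoder_guide` when the same queries attend to both of them, and two queries receive a large entry in `decoder_guide` when they attend to overlapping features. How strongly the self-attention maps are pulled toward these targets is left open here.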