Knowledge Fusion Transformers for Video Action Recognition
This work addresses video action classification for computer vision applications, offering an incremental improvement by blending self-attention architectures to enhance feature representation with reduced pretraining requirements.
The paper tackles video action recognition by introducing Knowledge Fusion Transformers, which use self-attention to fuse action knowledge in 3D spatio-temporal contexts. The approach achieves performance competitive with state-of-the-art methods on the UCF-101 and Charades datasets while using only one stream and minimal pretraining.
We introduce Knowledge Fusion Transformers for video action classification. We present a self-attention-based feature enhancer that fuses action knowledge in the 3D Inception-based spatio-temporal context of the video clip to be classified. We show how a single-stream network with little or no pretraining can reach performance close to the current state of the art. Additionally, we show how different self-attention architectures, used at different levels of the network, can be blended in to enhance feature representation. Our architecture is trained and evaluated on the UCF-101 and Charades datasets, where it is competitive with the state of the art. It also outperforms single-stream networks with little or no pretraining by a large margin.
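As a rough illustration of what a self-attention feature enhancer over a spatio-temporal feature map can look like, the NumPy sketch below flattens a clip feature map into tokens, applies single-head scaled dot-product self-attention, and adds the result back to the input. It is not the architecture proposed here: the single head, the residual connection, the feature map shape (4, 7, 7, 32), the projection sizes, the random weights, and the function name `self_attention_enhance` are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention_enhance(features, w_q, w_k, w_v):
    """Hypothetical enhancer: single-head self-attention with a residual add.

    features: (T, H, W, C) spatio-temporal feature map of one clip.
    w_q, w_k: (C, D) query and key projections (assumed shapes).
    w_v:      (C, C) value projection, so the output matches the input width.
    Returns the enhanced map (T, H, W, C) and the attention weights (N, N),
    where N = T * H * W.
    """
    t, h, w, c = features.shape
    tokens = features.reshape(t * h * w, c)  # one token per space-time cell

    q = tokens @ w_q
    k = tokens @ w_k
    v = tokens @ w_v

    # Scaled dot-product attention across all space-time positions.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    attended = weights @ v

    # Residual connection (an assumption): keep the original features and
    # add the attended context on top.
    enhanced = tokens + attended
    return enhanced.reshape(t, h, w, c), weights


# Toy input standing in for backbone features; real features would come
# from a 3D convolutional network applied to the clip.
rng = np.random.default_rng(0)
features = rng.standard_normal((4, 7, 7, 32))
w_q = rng.standard_normal((32, 16)) * 0.1
w_k = rng.standard_normal((32, 16)) * 0.1
w_v = rng.standard_normal((32, 32)) * 0.1

enhanced, weights = self_attention_enhance(features, w_q, w_k, w_v)
print(enhanced.shape, weights.shape)
```

Because every space-time cell attends to every other cell, the attention matrix grows quadratically with T * H * W, which is one reason such a module would typically be applied to downsampled feature maps rather than raw frames.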