OadTR: Online Action Detection with Transformers
This work improves online action detection for video analysis by introducing a faster and more accurate method, although the contribution is incremental in that it adapts Transformers to an existing task.
The paper tackles online action detection by proposing OadTR, a Transformer-based encoder-decoder framework that avoids the non-parallelism and vanishing gradients of RNNs, trains and runs faster than RNN-based approaches, and outperforms state-of-the-art methods in mAP and mcAP on the HDD, TVSeries, and THUMOS14 datasets.
Most recent approaches to online action detection apply Recurrent Neural Networks (RNNs) to capture long-range temporal structure. However, RNNs suffer from non-parallelism and vanishing gradients, which makes them hard to optimize. In this paper, we propose a new encoder-decoder framework based on Transformers, named OadTR, to tackle these problems. The encoder, augmented with a task token, captures the relationships and global interactions between historical observations. The decoder extracts auxiliary information by aggregating anticipated future clip representations. OadTR can therefore recognize current actions by encoding historical information and predicting future context simultaneously. We extensively evaluate the proposed OadTR on three challenging datasets: HDD, TVSeries, and THUMOS14. The experimental results show that OadTR achieves higher training and inference speeds than current RNN-based approaches, and significantly outperforms the state-of-the-art methods in terms of both mAP and mcAP. Code is available at https://github.com/wangxiang1230/OadTR.
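The encoder-decoder idea described in the abstract can be illustrated with a toy forward pass. This is only a minimal sketch under stated assumptions, not the authors' implementation: it uses single-head attention with no learned projections, random vectors in place of learned parameters and clip features, and mean pooling plus concatenation as one plausible way to combine the task-token output with the anticipated future representations. The names `attend`, `oadtr_sketch`, `task_token`, and `future_queries` are hypothetical and do not come from the released code.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention (single head, no learned projections)."""
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs


def oadtr_sketch(history, task_token, future_queries):
    """Toy forward pass: encode history plus a task token, then decode future queries.

    history:        list of T feature vectors (past clips, oldest first)
    task_token:     one vector appended to the sequence (learned in a real model)
    future_queries: list of F vectors standing in for learned decoder queries
    """
    # Encoder: self-attention over the history and the task token together,
    # so the task token's output mixes information from every past clip.
    # All positions are processed in one pass, with no step-by-step recurrence.
    tokens = history + [task_token]
    encoded = attend(tokens, tokens, tokens)
    task_out = encoded[-1]

    # Decoder: each future query cross-attends to the encoder output
    # to form an anticipated future clip representation.
    future = attend(future_queries, encoded, encoded)

    # Assumption: aggregate the anticipated features by mean pooling and
    # concatenate them with the task-token output for classification.
    dim = len(task_out)
    pooled = [sum(f[d] for f in future) / len(future) for d in range(dim)]
    return task_out + pooled, future


rng = random.Random(0)
T, F, D = 8, 4, 16  # history length, future steps, feature size (illustrative)
history = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(T)]
task_token = [rng.gauss(0, 1) for _ in range(D)]
future_queries = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(F)]

current_repr, future_repr = oadtr_sketch(history, task_token, future_queries)
print(len(current_repr), len(future_repr))
```

In this sketch `current_repr` has 2 * D entries (task-token output followed by the pooled future features) and would feed a classifier over action classes, while `future_repr` holds one vector per anticipated step. A real model would add learned projections, multiple heads, positional encodings, and feed-forward layers.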