CV · Aug 9, 2023

PAT: Position-Aware Transformer for Dense Multi-Label Action Detection

arXiv:2308.05051v1 · 10 citations · h-index: 46
Originality: Incremental advance
AI Analysis

This work addresses the loss of temporal positional information in transformer-based video action detection, offering an incremental improvement over existing methods.

The paper tackles the problem of losing temporal positional information in transformer-based networks for dense multi-label action detection in videos, and achieves new state-of-the-art mAP scores of 26.5% on Charades and 44.6% on MultiTHUMOS, with improvements of 1.1% and 0.6% respectively.

We present PAT, a transformer-based network that learns complex temporal co-occurrence action dependencies in a video by exploiting multi-scale temporal features. In existing methods, the self-attention mechanism in transformers loses the temporal positional information, which is essential for robust action detection. To address this issue, we (i) embed relative positional encoding in the self-attention mechanism and (ii) exploit multi-scale temporal relationships by designing a novel non-hierarchical network, in contrast to recent transformer-based approaches that use a hierarchical structure. We argue that combining the self-attention mechanism with the multiple sub-sampling processes of hierarchical approaches increases the loss of positional information. We evaluate our approach on two challenging dense multi-label benchmark datasets, Charades and MultiTHUMOS, and show that PAT improves the current state-of-the-art result by 1.1% and 0.6% mAP, respectively, reaching new state-of-the-art mAP of 26.5% and 44.6%. We also perform extensive ablation studies to examine the impact of the different components of our proposed network.
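To make the positional-information argument concrete, the following is a minimal sketch of one common form of relative positional encoding: a scalar bias, indexed by the clipped temporal offset between two frames, added to the attention logits. This is an illustrative assumption and not PAT's actual formulation, which the abstract does not specify. The function names, the `max_dist` clipping, and the bias values are all hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys, rel_bias=None, max_dist=2):
    """Single-head attention weights over T time steps (illustrative only).

    rel_bias maps a clipped temporal offset (j - i) in
    [-max_dist, max_dist] to a scalar added to the logit.
    rel_bias=None gives plain content-only attention.
    """
    d = len(queries[0])
    weights = []
    for i, q in enumerate(queries):
        logits = []
        for j, k in enumerate(keys):
            # Scaled dot-product content score.
            score = sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
            if rel_bias is not None:
                # Relative term depends only on the offset j - i.
                offset = max(-max_dist, min(max_dist, j - i))
                score += rel_bias[offset]
            logits.append(score)
        weights.append(softmax(logits))
    return weights


# Four identical frame features: content alone cannot tell them apart.
frames = [[1.0, 0.5] for _ in range(4)]

# Without a positional term, every frame attends uniformly to all frames.
plain = attention_weights(frames, frames)

# Hypothetical bias values that favour temporally close frames.
bias = {-2: -2.0, -1: -0.5, 0: 0.0, 1: -0.5, 2: -2.0}
biased = attention_weights(frames, frames, rel_bias=bias)
```

With identical frame features, `plain` assigns equal weight to every time step, so temporal order is invisible to the attention layer. In `biased`, the weights fall off with temporal distance even though the content is unchanged, which is the kind of ordering signal a relative positional term provides.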

