CV · Aug 26, 2020

Effective Action Recognition with Embedded Key Point Shifts

arXiv:2008.11378v1 · 2 citations
Originality: Incremental advance
AI Analysis

This paper addresses the need for efficient, annotation-free temporal feature extraction in video action recognition. It is an incremental improvement over existing methods.

The paper tackles the cost of key point annotation in skeleton-based action recognition. It proposes a temporal feature extraction module that adaptively extracts channel-wise key point shifts without annotation, and it reports state-of-the-art accuracy of 82.05% on Mini-Kinetics along with competitive results on other datasets.

Temporal feature extraction is an essential technique in video-based action recognition. Key points have been utilized in skeleton-based action recognition methods but they require costly key point annotation. In this paper, we propose a novel temporal feature extraction module, named Key Point Shifts Embedding Module ($KPSEM$), to adaptively extract channel-wise key point shifts across video frames without key point annotation for temporal feature extraction. Key points are adaptively extracted as feature points with maximum feature values at split regions, while key point shifts are the spatial displacements of corresponding key points. The key point shifts are encoded as the overall temporal features via linear embedding layers in a multi-set manner. Our method achieves competitive performance through embedding key point shifts with trivial computational cost, achieving the state-of-the-art performance of 82.05% on Mini-Kinetics and competitive performance on UCF101, Something-Something-v1, and HMDB51 datasets.
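
The pipeline described in the abstract (per-region maxima as key points, frame-to-frame displacements as shifts, a linear layer as the embedding) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the `grid` parameter, the tensor layout `(T, C, H, W)`, and the single linear layer are all assumptions made for illustration, and the paper's multi-set embedding scheme is not reproduced here.

```python
import numpy as np


def extract_key_points(feat, grid=2):
    """Return the (row, col) of the maximum activation in each split region.

    feat: array of shape (T, C, H, W); H and W must be divisible by grid.
    Output: integer array of shape (T, C, grid * grid, 2).
    Hypothetical helper: the region layout is an assumption, not the paper's.
    """
    T, C, H, W = feat.shape
    assert H % grid == 0 and W % grid == 0, "H and W must be divisible by grid"
    rh, rw = H // grid, W // grid

    # Split each channel map into grid x grid regions and flatten each region.
    regions = feat.reshape(T, C, grid, rh, grid, rw).transpose(0, 1, 2, 4, 3, 5)
    flat = regions.reshape(T, C, grid, grid, rh * rw)

    # Position of the maximum inside each region, then convert to map coordinates.
    idx = flat.argmax(axis=-1)
    local_r, local_c = np.divmod(idx, rw)
    rows = local_r + np.arange(grid)[:, None] * rh
    cols = local_c + np.arange(grid)[None, :] * rw

    coords = np.stack([rows, cols], axis=-1)
    return coords.reshape(T, C, grid * grid, 2)


def key_point_shifts(coords):
    """Spatial displacement of each key point between consecutive frames."""
    return coords[1:] - coords[:-1]


def embed_shifts(shifts, weight, bias):
    """Encode per-channel shifts with one linear layer (an assumed, simplified embedding).

    shifts: (T - 1, C, R, 2); weight: (R * 2, D); bias: (D,).
    Output: (T - 1, C, D).
    """
    t, c, r, _ = shifts.shape
    flat = shifts.reshape(t, c, r * 2).astype(np.float64)
    return flat @ weight + bias


# Usage with random features: 4 frames, 8 channels, 14 x 14 maps.
rng = np.random.default_rng(0)
features = rng.standard_normal((4, 8, 14, 14))
points = extract_key_points(features, grid=2)
shifts = key_point_shifts(points)
embedding = embed_shifts(shifts, rng.standard_normal((8, 16)), np.zeros(16))
```

Because the key points come from an argmax over the feature map itself, no key point annotation is needed, and the only learned parameters in this sketch are the weights of the linear layer, which is consistent with the abstract's claim of trivial computational cost.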
