CV · Sep 3, 2021

Video Pose Distillation for Few-Shot, Fine-Grained Sports Action Recognition

arXiv:2109.01305v1 · 54 citations
Originality: Incremental advance
AI Analysis

This paper addresses fine-grained sports action recognition. It gives researchers and practitioners a weakly-supervised method for improving pose-based features in video domains where pose estimators fail. The advance is incremental, since it builds on existing pose-based and end-to-end learning approaches.

The paper tackles unreliable pose estimation in sports videos, where domain shift, motion blur, and occlusions degrade pose quality and harm downstream tasks such as action recognition. It introduces Video Pose Distillation (VPD), which learns robust pose features from RGB frames. VPD improves performance on few-shot, fine-grained action recognition, retrieval, and detection across four sports datasets, without extra pose annotations.

Human pose is a useful feature for fine-grained sports action understanding. However, pose estimators are often unreliable when run on sports video due to domain shift and factors such as motion blur and occlusions. This leads to poor accuracy when downstream tasks, such as action recognition, depend on pose. End-to-end learning circumvents pose, but requires more labels to generalize. We introduce Video Pose Distillation (VPD), a weakly-supervised technique to learn features for new video domains, such as individual sports that challenge pose estimation. Under VPD, a student network learns to extract robust pose features from RGB frames in the sports video, such that, whenever pose is considered reliable, the features match the output of a pretrained teacher pose detector. Our strategy retains the best of both pose and end-to-end worlds, exploiting the rich visual patterns in raw video frames, while learning features that agree with the athletes' pose and motion in the target video domain to avoid over-fitting to patterns unrelated to athletes' motion. VPD features improve performance on few-shot, fine-grained action recognition, retrieval, and detection tasks in four real-world sports video datasets, without requiring additional ground-truth pose annotations.
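
The core idea in the abstract, that a student matches a teacher's pose output only where pose is considered reliable, can be sketched as a confidence-masked distillation loss. This is a minimal illustration, not the paper's implementation: the function name `masked_distillation_loss`, the per-frame confidence score, the `conf_threshold` value, and the choice of mean squared error are all assumptions made here, since the abstract does not specify the loss or how reliability is decided.

```python
from typing import Sequence


def masked_distillation_loss(
    student_feats: Sequence[Sequence[float]],
    teacher_feats: Sequence[Sequence[float]],
    teacher_conf: Sequence[float],
    conf_threshold: float = 0.5,  # hypothetical reliability cutoff
) -> float:
    """Mean squared error between student and teacher features,
    averaged only over frames whose teacher confidence meets the cutoff."""
    total = 0.0
    count = 0
    for s, t, c in zip(student_feats, teacher_feats, teacher_conf):
        if c < conf_threshold:
            # Pose judged unreliable: this frame gives no supervision.
            continue
        total += sum((si - ti) ** 2 for si, ti in zip(s, t)) / len(t)
        count += 1
    # With no reliable frames there is nothing to distill from.
    return total / count if count else 0.0


# Toy example: three frames with 2-D features; the last frame is unreliable.
student = [[0.0, 1.0], [2.0, 2.0], [5.0, 5.0]]
teacher = [[0.0, 0.0], [2.0, 2.0], [0.0, 0.0]]
conf = [0.9, 0.8, 0.1]
loss = masked_distillation_loss(student, teacher, conf)
```

In this toy example the third frame's large mismatch does not contribute to the loss, because its teacher confidence falls below the cutoff. In a real system the student would be a network operating on RGB frames and the loss would be minimized by gradient descent.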

Code Implementations: 1 repo
