CV · Nov 25, 2021

Learning from Temporal Gradient for Semi-supervised Action Recognition

arXiv:2111.13241v3 · 69 citations
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of limited labeled data in video action recognition, offering a video-specific improvement over methods adapted from image-based semi-supervised learning.

The paper tackles the problem of semi-supervised action recognition in videos by introducing temporal gradient as an additional modality to leverage temporal dynamics, achieving state-of-the-art performance on benchmarks like Kinetics-400, UCF-101, and HMDB-51 without extra inference costs.

Semi-supervised video action recognition tends to enable deep neural networks to achieve remarkable performance even with very limited labeled data. However, existing methods are mainly transferred from current image-based methods (e.g., FixMatch). Without specifically utilizing the temporal dynamics and inherent multimodal attributes, their results could be suboptimal. To better leverage the encoded temporal information in videos, we introduce temporal gradient as an additional modality for more attentive feature extraction in this paper. To be specific, our method explicitly distills the fine-grained motion representations from temporal gradient (TG) and imposes consistency across different modalities (i.e., RGB and TG). The performance of semi-supervised action recognition is significantly improved without additional computation or parameters during inference. Our method achieves the state-of-the-art performance on three video action recognition benchmarks (i.e., Kinetics-400, UCF-101, and HMDB-51) under several typical semi-supervised settings (i.e., different ratios of labeled data).
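
As a rough illustration of the idea in the abstract, the sketch below computes a temporal gradient as the difference between consecutive RGB frames and measures how far apart two feature vectors (one per modality) are. It assumes NumPy and a clip stored as a (T, H, W, C) array. The function names and the mean-squared-error consistency term are hypothetical stand-ins chosen for brevity; they are not the paper's distillation or consistency objectives.

```python
import numpy as np


def temporal_gradient(clip: np.ndarray) -> np.ndarray:
    """Frame-to-frame difference along the time axis.

    clip: array of shape (T, H, W, C), e.g. uint8 RGB frames.
    Returns a float32 array of shape (T - 1, H, W, C).
    """
    # Cast first so that uint8 subtraction cannot wrap around.
    clip = clip.astype(np.float32)
    return clip[1:] - clip[:-1]


def l2_normalize(x: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Scale each feature vector (last axis) to roughly unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def consistency_loss(rgb_feat: np.ndarray, tg_feat: np.ndarray) -> float:
    """Hypothetical cross-modal consistency term (an assumption, not the
    paper's loss): squared distance between L2-normalized RGB and TG
    features, averaged over the batch."""
    diff = l2_normalize(rgb_feat) - l2_normalize(tg_feat)
    return float(np.mean(np.sum(diff ** 2, axis=-1)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clip = rng.integers(0, 256, size=(8, 32, 32, 3), dtype=np.uint8)
    tg = temporal_gradient(clip)
    print(tg.shape)  # one fewer frame than the input clip

    # Stand-in features; in practice these would come from video encoders.
    rgb_feat = rng.normal(size=(4, 16))
    tg_feat = rng.normal(size=(4, 16))
    print(consistency_loss(rgb_feat, tg_feat))
```

In a setup like the one the abstract describes, the temporal-gradient branch would be needed only during training, which is consistent with the claim of no additional computation or parameters at inference.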

Code Implementations: 1 repo
