CV · Jan 2, 2023

STEPs: Self-Supervised Key Step Extraction and Localization from Unlabeled Procedural Videos

arXiv:2301.00794v3 · 17 citations · h-index: 45
Originality: Incremental advance
AI Analysis

This work addresses the need for automated video analysis in job training and AR applications, but it is an incremental advance because it builds on existing self-supervised techniques.

The paper tackles the problem of extracting key steps from unlabeled procedural videos. It proposes a self-supervised method that learns discriminative representations and clusters them, and it reports significant improvements over prior work in key step localization and phase classification.

We address the problem of extracting key steps from unlabeled procedural videos, motivated by the potential of Augmented Reality (AR) headsets to revolutionize job training and performance. We decompose the problem into two steps: representation learning and key step extraction. We propose a training objective, the Bootstrapped Multi-Cue Contrastive (BMC2) loss, to learn discriminative representations for various steps without any labels. Unlike prior work, we develop techniques to train a lightweight temporal module that uses off-the-shelf features for self-supervision. Our approach can seamlessly leverage information from multiple cues such as optical flow, depth, or gaze to learn discriminative features for key steps, making it amenable to AR applications. Finally, we extract key steps via a tunable algorithm that clusters the representations and samples. We show significant improvements over prior work on the tasks of key step localization and phase classification. Qualitative results demonstrate that the extracted key steps are meaningful and succinctly represent the various steps of the procedural tasks.
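The abstract describes the extraction stage only as clustering the learned representations and sampling from the result. A minimal sketch of that general idea follows; it is an illustration under assumptions, not the paper's algorithm. The function name `extract_key_steps`, the choice of plain k-means, the evenly spaced initialization, and the rule of picking the frame nearest each centroid are all hypothetical. The number of clusters `k` stands in for the "tunable" aspect.

```python
from typing import List, Sequence


def _dist2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def extract_key_steps(features: Sequence[Sequence[float]], k: int,
                      iters: int = 20) -> List[int]:
    """Cluster per-frame features and return one frame index per cluster.

    Hypothetical helper: plain k-means plus nearest-to-centroid sampling,
    used here only to illustrate the cluster-then-sample idea.
    """
    n = len(features)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of frames")

    # Assumption: initialise centroids from frames evenly spaced in time,
    # since the steps of a procedure tend to occur in sequence.
    centroids = [list(features[(2 * i + 1) * n // (2 * k)]) for i in range(k)]

    def assign_all() -> List[int]:
        # Assign each frame to its nearest centroid.
        return [min(range(k), key=lambda c: _dist2(f, centroids[c]))
                for f in features]

    for _ in range(iters):
        assignment = assign_all()
        for c in range(k):
            members = [features[i] for i in range(n) if assignment[i] == c]
            if members:  # leave an empty cluster's centroid unchanged
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]

    assignment = assign_all()
    key_frames = []
    for c in range(k):
        members = [i for i in range(n) if assignment[i] == c]
        if members:
            # Sample the frame closest to the cluster centre.
            key_frames.append(
                min(members, key=lambda i: _dist2(features[i], centroids[c])))
    return sorted(key_frames)  # report key steps in temporal order


# Toy usage: 12 frames of 2-D features forming three well-separated groups.
toy_features = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
    (10.0, 0.0), (10.1, 0.0), (10.0, 0.1), (10.1, 0.1),
]
key_steps = extract_key_steps(toy_features, k=3)
```

In this toy input each group of four consecutive frames plays the role of one step, so the sketch returns one representative frame index per group. Raising or lowering `k` trades a finer summary against a more compact one.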

Code Implementations: 1 repo
