CV
Jun 1, 2025

Keystep Recognition using Graph Neural Networks

arXiv:2506.01102v1
h-index: 11
Originality: Incremental advance
AI Analysis

This work addresses fine-grained keystep recognition for egocentric video analysis, offering a computationally efficient solution with incremental advancements in leveraging multimodal data.

The paper tackles keystep recognition in egocentric videos by framing it as a node classification task and proposing GLEVR, a graph-learning framework that leverages long-term dependencies and multimodal data, achieving substantial performance improvements over existing methods on the Ego-Exo4D dataset.

We pose keystep recognition as a node classification task and propose a flexible graph-learning framework for fine-grained keystep recognition that effectively leverages long-term dependencies in egocentric videos. Our approach, termed GLEVR, constructs a graph in which each clip of the egocentric video corresponds to a node. The constructed graphs are sparse and computationally efficient, and the approach substantially outperforms existing larger models. We further leverage alignment between egocentric and exocentric videos during training to improve inference on egocentric videos, and we add automatic captioning as an additional modality. We consider each clip of each exocentric video (if available) or video captions as additional nodes during training. We examine several strategies for defining connections across these nodes. We perform extensive experiments on the Ego-Exo4D dataset and show that our proposed flexible graph-based framework notably outperforms existing methods.
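
To make the graph construction described above concrete, here is a minimal sketch in plain Python. It is not the GLEVR implementation. The function names, the temporal `window` parameter, the one-to-one linking of each auxiliary node (exocentric clip or caption) to its time-aligned egocentric clip, and the mean-aggregation step are all illustrative assumptions. The abstract only states that clips are nodes, that the graphs are sparse, and that several connection strategies were examined.

```python
from typing import List, Tuple

Edge = Tuple[int, int]


def build_temporal_edges(num_clips: int, window: int = 2) -> List[Edge]:
    """Link each clip node to the clips at most `window` steps after it.

    Edges are treated as undirected. The edge count grows linearly with
    the number of clips, so the graph stays sparse. (Assumed strategy.)
    """
    edges = []
    for i in range(num_clips):
        last = min(i + window, num_clips - 1)
        for j in range(i + 1, last + 1):
            edges.append((i, j))
    return edges


def add_aligned_nodes(num_ego: int, edges: List[Edge],
                      num_extra_views: int = 1) -> Tuple[int, List[Edge]]:
    """Append one auxiliary node per ego clip per extra view.

    An extra view stands for an exocentric video or a caption stream.
    Each auxiliary node is linked to its time-aligned ego clip.
    (Hypothetical alignment rule, used at training time only.)
    """
    edges = list(edges)
    for view in range(num_extra_views):
        offset = num_ego * (view + 1)
        for t in range(num_ego):
            edges.append((t, offset + t))
    return num_ego * (1 + num_extra_views), edges


def mean_aggregate(features: List[List[float]],
                   edges: List[Edge]) -> List[List[float]]:
    """One round of neighbor averaging, with each node included in its own mean.

    A stand-in for a single message-passing layer; a real model would
    add learned weights and a per-node classifier on top.
    """
    n, dim = len(features), len(features[0])
    neighbors = [[i] for i in range(n)]
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)
    return [
        [sum(features[j][d] for j in neighbors[i]) / len(neighbors[i])
         for d in range(dim)]
        for i in range(n)
    ]


ego_edges = build_temporal_edges(4, window=2)
num_nodes, train_edges = add_aligned_nodes(4, ego_edges, num_extra_views=1)
smoothed = mean_aggregate([[0.0], [2.0], [4.0]], [(0, 1), (1, 2)])
```

In this sketch the egocentric-only graph (`ego_edges`) is what would be used at inference, while the enlarged graph (`train_edges`) with auxiliary nodes would be used only during training, matching the training-time role of exocentric clips and captions described in the abstract.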

