CV · Jun 1, 2025

Keystep Recognition using Graph Neural Networks

Julia Lee Romero, Kyle Min, Subarna Tripathi, Morteza Karimzadeh

arXiv:2506.01102v1 · 3.6 · h-index: 11

Originality: Incremental advance

AI Analysis

This work addresses fine-grained keystep recognition for egocentric video analysis, offering a computationally efficient solution with incremental advancements in leveraging multimodal data.

The paper tackles keystep recognition in egocentric videos by framing it as a node classification task and proposing GLEVR, a graph-learning framework that leverages long-term dependencies and multimodal data, achieving substantial performance improvements over existing methods on the Ego-Exo4D dataset.

We pose keystep recognition as a node classification task and propose a flexible graph-learning framework for fine-grained keystep recognition that effectively leverages long-term dependencies in egocentric videos. Our approach, termed GLEVR, constructs a graph in which each clip of the egocentric video corresponds to a node. The constructed graphs are sparse and computationally efficient, and our approach substantially outperforms existing larger models. We further leverage alignment between egocentric and exocentric videos during training for improved inference on egocentric videos, and add automatic captioning as an additional modality. We treat each clip of each exocentric video (if available) or each video caption as an additional node during training. We examine several strategies for defining connections across these nodes. We perform extensive experiments on the Ego-Exo4D dataset and show that our proposed graph-based framework notably outperforms existing methods.
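
The graph construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `build_clip_graph`, the `window` parameter, and the rule that each exocentric clip links only to its time-aligned egocentric clip are all assumptions chosen for illustration. The abstract says only that several connection strategies are examined.

```python
def build_clip_graph(num_ego_clips, num_exo_views=0, window=1):
    """Return (num_nodes, edges) for one video (hypothetical helper).

    Nodes 0 .. num_ego_clips-1 are egocentric clips in temporal order.
    Each exocentric view v adds num_ego_clips more nodes; node
    (v + 1) * num_ego_clips + t is view v's clip at time step t.
    """
    edges = set()

    # Temporal edges (assumed strategy): link each ego clip to the clips
    # at most `window` steps away, in both directions. This keeps the
    # graph sparse: O(num_ego_clips * window) edges.
    for t in range(num_ego_clips):
        for dt in range(1, window + 1):
            if t + dt < num_ego_clips:
                edges.add((t, t + dt))
                edges.add((t + dt, t))

    # Cross-view edges (assumed strategy): link each exo clip to the
    # ego clip at the same time step, in both directions.
    for v in range(num_exo_views):
        offset = (v + 1) * num_ego_clips
        for t in range(num_ego_clips):
            edges.add((offset + t, t))
            edges.add((t, offset + t))

    num_nodes = num_ego_clips * (1 + num_exo_views)
    return num_nodes, sorted(edges)


# Ego-only graph, as would be used at inference time.
ego_nodes, ego_edges = build_clip_graph(num_ego_clips=4)

# Training-time graph with two exocentric views as additional nodes.
train_nodes, train_edges = build_clip_graph(num_ego_clips=4, num_exo_views=2)
```

In a node classification setup, each node would carry a clip feature vector and each egocentric node a keystep label. Caption nodes could be attached in the same way as the exocentric nodes here.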

