CV · Jan 7, 2025

Graph-Based Multimodal and Multi-view Alignment for Keystep Recognition

arXiv:2501.04121v1 · h-index: 11
AI Analysis

This paper addresses the challenge of fine-grained keystep recognition in dynamic egocentric videos. The contribution is incremental, as it builds on existing graph-based methods for video analysis.

The paper tackles keystep recognition in egocentric videos by proposing a graph-learning framework that leverages long-term dependencies and alignment with exocentric videos, achieving over 12 points higher accuracy than existing methods on the Ego-Exo4D dataset.

Egocentric videos capture scenes from a wearer's viewpoint, resulting in dynamic backgrounds, frequent motion, and occlusions, all of which pose challenges to accurate keystep recognition. We propose a flexible graph-learning framework for fine-grained keystep recognition that effectively leverages long-term dependencies in egocentric videos, and that leverages alignment between egocentric and exocentric videos during training for improved inference on egocentric videos. Our approach consists of constructing a graph where each video clip of the egocentric video corresponds to a node. During training, we consider each clip of each exocentric video (if available) as an additional node. We examine several strategies to define connections across these nodes and pose keystep recognition as a node classification task on the constructed graphs. We perform extensive experiments on the Ego-Exo4D dataset and show that our proposed flexible graph-based framework notably outperforms existing methods by more than 12 points in accuracy. Furthermore, the constructed graphs are sparse and compute-efficient. We also present a study examining how to harness several multimodal features, including narrations, depth, and object class labels, on a heterogeneous graph, and discuss their respective contributions to keystep recognition performance.
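
The graph construction described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. The abstract says several connection strategies are examined but does not list them, so the two used here (temporal edges between nearby egocentric clips, and edges linking each exocentric clip to the egocentric clip at the same time index) are assumptions. The function names `build_edges`, `aggregate`, and `classify`, the `window` parameter, and the mean-aggregation step standing in for a learned graph network are likewise hypothetical.

```python
def build_edges(num_ego, num_exo_views=0, window=1):
    """Return a sorted, undirected edge list over clip nodes.

    Node ids (an assumed layout): egocentric clips are 0..num_ego-1, and
    clip t of exocentric view v is num_ego * (v + 1) + t. All views are
    assumed to be time-synchronised with the same number of clips.
    """
    edges = set()
    for t in range(num_ego):
        # Temporal edges between egocentric clips at most `window` steps apart.
        for d in range(1, window + 1):
            if t + d < num_ego:
                edges.add((t, t + d))
        # Cross-view edges: each exocentric clip links to the egocentric
        # clip at the same time index (used only when exo views are given).
        for v in range(num_exo_views):
            edges.add((t, num_ego * (v + 1) + t))
    return sorted(edges)


def aggregate(features, edges):
    """One round of mean aggregation over each node and its neighbours."""
    nbrs = [{i} for i in range(len(features))]  # each node includes itself
    for i, j in edges:
        nbrs[i].add(j)
        nbrs[j].add(i)
    dim = len(features[0])
    return [
        [sum(features[j][k] for j in nbrs[i]) / len(nbrs[i]) for k in range(dim)]
        for i in range(len(features))
    ]


def classify(features, edges, class_weights):
    """Node classification: smooth features over the graph, then score
    each node against one weight vector per keystep class."""
    preds = []
    for x in aggregate(features, edges):
        scores = [sum(w * xi for w, xi in zip(ws, x)) for ws in class_weights]
        preds.append(scores.index(max(scores)))
    return preds


# Inference-style usage: egocentric clips only, toy 2-d features.
ego_features = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
ego_edges = build_edges(num_ego=3)
keystep_per_clip = classify(ego_features, ego_edges, [[1.0, 0.0], [0.0, 1.0]])

# Training-style graph: the same clips plus two exocentric views.
train_edges = build_edges(num_ego=3, num_exo_views=2)
```

In this sketch the number of edges grows linearly with the number of clips for a fixed `window` and number of views, which is one way a graph of this kind can remain sparse.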

Code Implementations: 1 repo