CV | AI | May 7, 2025

Object-Shot Enhanced Grounding Network for Egocentric Video

arXiv:2505.04270v1 | 15 citations | h-index: 17 | Has Code | CVPR
Originality: Incremental advance
AI Analysis

This work addresses egocentric video grounding for embodied intelligence applications. It is an incremental improvement that incorporates object features and shot-movement features.

The paper tackles egocentric video grounding. It argues that existing methods neglect key egocentric characteristics and the fine-grained information in queries, and it proposes OSGNet, which reports state-of-the-art performance on three datasets.

Egocentric video grounding is a crucial task for embodied intelligence applications, distinct from exocentric video moment localization. Existing methods primarily focus on the distributional differences between egocentric and exocentric videos but often neglect key characteristics of egocentric videos and the fine-grained information emphasized by question-type queries. To address these limitations, we propose OSGNet, an Object-Shot enhanced Grounding Network for egocentric video. Specifically, we extract object information from videos to enrich video representation, particularly for objects highlighted in the textual query but not directly captured in the video features. Additionally, we analyze the frequent shot movements inherent to egocentric videos, leveraging these features to extract the wearer's attention information, which enhances the model's ability to perform modality alignment. Experiments conducted on three datasets demonstrate that OSGNet achieves state-of-the-art performance, validating the effectiveness of our approach. Our code can be found at https://github.com/Yisen-Feng/OSGNet.
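
To make the idea of enriching video representation with object information concrete, here is a minimal sketch. It is not OSGNet's actual architecture, which the abstract does not specify. It assumes per-frame video features of shape (T, D) and K object features per frame of shape (T, K, D), and fuses them with plain scaled dot-product cross-attention plus a residual connection. The function name `enrich_with_objects`, the shapes, and the random inputs are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def enrich_with_objects(video_feats, object_feats):
    """Hypothetical fusion step (an assumption, not the paper's module).

    video_feats:  (T, D) one feature vector per frame.
    object_feats: (T, K, D) K object feature vectors per frame.
    Returns:      (T, D) video features enriched with object information.
    """
    d = video_feats.shape[-1]
    # scores[t, k] = <video_feats[t], object_feats[t, k]> / sqrt(D)
    scores = np.einsum("td,tkd->tk", video_feats, object_feats) / np.sqrt(d)
    # Attention weights over the K objects of each frame; each row sums to 1.
    weights = softmax(scores, axis=-1)
    # Weighted sum of the object features for each frame.
    attended = np.einsum("tk,tkd->td", weights, object_feats)
    # Residual connection keeps the original video feature intact.
    return video_feats + attended


rng = np.random.default_rng(0)
video = rng.normal(size=(8, 16))       # 8 frames, 16-dim features (toy sizes)
objects = rng.normal(size=(8, 5, 16))  # 5 detected objects per frame (toy sizes)
enriched = enrich_with_objects(video, objects)
print(enriched.shape)  # (8, 16)
```

The residual form means that frames whose object features carry no signal pass through unchanged, so object information can only add to the original representation. For the real design, the released repository linked in the abstract is the authoritative source.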

Code Implementations: 1 repo