CV · Jan 1, 2024

Retrieval-Augmented Egocentric Video Captioning

arXiv:2401.00789v4 · 60 citations · h-index: 13 · CVPR
Originality: Incremental advance
AI Analysis

This work addresses egocentric video captioning, a task relevant to applications such as assistive technology and robotics, by leveraging existing third-person video data. It is an incremental advance over prior methods that learned representations from egocentric videos alone.

The paper tackles the challenge of understanding human actions from first-person videos by introducing EgoInstructor, a retrieval-augmented captioning model that retrieves semantically relevant third-person instructional videos and uses them as references. The authors report superior cross-view retrieval performance across seven benchmarks and significant improvements in egocentric video captioning.
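The retrieval step can be pictured as a nearest-neighbour lookup in a shared embedding space. The following is a minimal sketch under stated assumptions, not the paper's implementation: it assumes egocentric and exocentric clips have already been encoded into vectors of the same dimension, and the names `retrieve_top_k`, `ego_query`, and `exo_bank` are hypothetical.

```python
import numpy as np

def retrieve_top_k(query, candidates, k=2):
    """Return indices and cosine scores of the k candidates closest to the query.

    Assumption: `query` is one egocentric clip embedding of shape (D,), and
    `candidates` is a bank of exocentric clip embeddings of shape (N, D).
    """
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    scores = c @ q                      # cosine similarity per candidate
    top = np.argsort(-scores)[:k]       # highest similarity first
    return top, scores[top]

# Toy embeddings, made up for illustration only.
ego_query = np.array([1.0, 0.0, 0.0])
exo_bank = np.array([
    [0.9, 0.1, 0.0],    # nearly the same direction as the query
    [0.0, 1.0, 0.0],    # orthogonal to the query
    [0.7, 0.7, 0.0],    # partially aligned
    [-1.0, 0.0, 0.0],   # opposite direction
])

top_idx, top_scores = retrieve_top_k(ego_query, exo_bank, k=2)
print(top_idx, top_scores)
```

In a retrieval-augmented captioner, the clips selected this way would be passed to the captioning model as additional context alongside the egocentric query.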

Understanding human actions from videos of first-person view poses significant challenges. Most prior approaches explore representation learning on egocentric videos only, while overlooking the potential benefit of exploiting existing large-scale third-person videos. In this paper, (1) we develop EgoInstructor, a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos to enhance the video captioning of egocentric videos. (2) For training the cross-view retrieval module, we devise an automatic pipeline to discover ego-exo video pairs from distinct large-scale egocentric and exocentric datasets. (3) We train the cross-view retrieval module with a novel EgoExoNCE loss that pulls egocentric and exocentric video features closer by aligning them to shared text features that describe similar actions. (4) Through extensive experiments, our cross-view retrieval module demonstrates superior performance across seven benchmarks. Regarding egocentric video captioning, EgoInstructor exhibits significant improvements by leveraging third-person videos as references. Project page is available at: https://jazzcharles.github.io/Egoinstructor/
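The idea in point (3), aligning egocentric and exocentric video features to shared text features, can be illustrated with a contrastive loss that allows several positives per video. This is a minimal sketch and not the paper's exact EgoExoNCE formulation: it assumes that a video and a text count as a positive pair whenever they carry the same action id, and the function name `multi_positive_nce`, the integer action ids, and the temperature value are all hypothetical.

```python
import numpy as np

def multi_positive_nce(video_feats, text_feats, video_actions, text_actions,
                       temperature=0.07):
    """InfoNCE-style loss where each video may have several positive texts.

    Assumption: every video has at least one text with the same action id,
    otherwise the log below is undefined.
    """
    v = video_feats / np.linalg.norm(video_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = (v @ t.T) / temperature                      # (num_videos, num_texts)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    exp = np.exp(logits)
    pos_mask = video_actions[:, None] == text_actions[None, :]
    pos = (exp * pos_mask).sum(axis=1)   # probability mass on matching texts
    total = exp.sum(axis=1)              # probability mass on all texts
    return float(np.mean(-np.log(pos / total)))

# Toy batch: two ego clips and two exo clips, two narrations (actions 0 and 1).
text_feats = np.eye(2)
text_actions = np.array([0, 1])
video_feats = np.array([
    [1.0, 0.1],   # ego clip, action 0
    [0.9, 0.0],   # exo clip, action 0
    [0.0, 1.0],   # ego clip, action 1
    [0.1, 1.0],   # exo clip, action 1
])

aligned_loss = multi_positive_nce(
    video_feats, text_feats, np.array([0, 0, 1, 1]), text_actions)
misaligned_loss = multi_positive_nce(
    video_feats, text_feats, np.array([1, 1, 0, 0]), text_actions)
print(aligned_loss, misaligned_loss)
```

Because ego and exo clips of the same action are both pulled toward the same text feature, they end up close to each other without requiring paired ego-exo footage of the same scene.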

