Retrieval-Augmented Egocentric Video Captioning
This work addresses video captioning for egocentric videos, a task relevant to applications such as assistive technology and robotics. It leverages existing third-person video data, an incremental advance over prior methods that focused solely on egocentric representations.
The paper tackles the challenge of understanding human actions from first-person videos by introducing EgoInstructor, a retrieval-augmented model that uses third-person instructional videos as references when captioning egocentric videos. Its cross-view retrieval module performs strongly across seven benchmarks, and the retrieved references yield significant improvements in egocentric video captioning.
Understanding human actions from first-person videos poses significant challenges. Most prior approaches explore representation learning on egocentric videos only, overlooking the potential benefit of exploiting existing large-scale third-person videos. In this paper, (1) we develop EgoInstructor, a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos to enhance the captioning of egocentric videos. (2) To train the cross-view retrieval module, we devise an automatic pipeline that discovers ego-exo video pairs from distinct large-scale egocentric and exocentric datasets. (3) We train the cross-view retrieval module with a novel EgoExoNCE loss that pulls egocentric and exocentric video features closer by aligning them to shared text features that describe similar actions. (4) In extensive experiments, our cross-view retrieval module demonstrates superior performance across seven benchmarks. For egocentric video captioning, EgoInstructor exhibits significant improvements by leveraging third-person videos as references. The project page is available at: https://jazzcharles.github.io/Egoinstructor/
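To make contribution (3) more concrete, the sketch below shows one way an InfoNCE-style loss can treat every text that describes a similar action as a positive for a video, whether that video is egocentric or exocentric. It is not the paper's EgoExoNCE implementation. The function names (`multi_positive_nce`, `retrieve_top_k`), the label-based positive mask, the toy features, and the temperature value are all assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_sim_matrix(rows_a, rows_b):
    """Cosine similarity between every vector in rows_a and every vector in rows_b."""
    a = [l2_normalize(v) for v in rows_a]
    b = [l2_normalize(v) for v in rows_b]
    return [[sum(x * y for x, y in zip(u, w)) for w in b] for u in a]


def multi_positive_nce(video_feats, text_feats, positive_mask, temperature=0.05):
    """InfoNCE-style loss with possibly several positive texts per video.

    positive_mask[i][j] is True when text j describes an action similar to
    the one in video i, regardless of the video's viewpoint (assumption).
    """
    sims = cosine_sim_matrix(video_feats, text_feats)
    total = 0.0
    for i, row in enumerate(sims):
        if not any(positive_mask[i]):
            raise ValueError(f"video {i} has no positive text")
        logits = [s / temperature for s in row]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(l - peak) for l in logits]
        pos = sum(e for e, is_pos in zip(exps, positive_mask[i]) if is_pos)
        total += -math.log(pos / sum(exps))
    return total / len(sims)


def retrieve_top_k(query_feat, candidate_feats, k=2):
    """Indices of the k candidates most similar to the query (cosine similarity)."""
    sims = cosine_sim_matrix([query_feat], candidate_feats)[0]
    return sorted(range(len(sims)), key=lambda j: sims[j], reverse=True)[:k]


# Toy batch (hypothetical features): two videos share action 0, one shows action 1.
video_labels = [0, 0, 1]          # e.g. ego "cut onion", exo "cut onion", ego "pour water"
text_labels = [0, 0, 1]
video_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
text_feats = [[1.0, 0.0], [1.0, 0.05], [0.0, 1.0]]
mask = [[lv == lt for lt in text_labels] for lv in video_labels]

aligned_loss = multi_positive_nce(video_feats, text_feats, mask)
shuffled_loss = multi_positive_nce(video_feats[::-1], text_feats, mask)
nearest_exo = retrieve_top_k([1.0, 0.0], [[0.0, 1.0], [1.0, 0.1], [0.7, 0.7]], k=2)
```

In this toy setup the loss is small when videos lie close to the texts for their own action and grows when the video features are reversed against the mask. Because both views are scored against the same text embeddings, ego and exo videos of the same action are pulled toward a common region, which is what allows a nearest-neighbour lookup such as `retrieve_top_k` to return third-person references for a first-person query.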