CV · Apr 16, 2021

Ego-Exo: Transferring Visual Representations from Third-person to First-person Videos

arXiv:2104.07905v1 · 122 citations
Originality: Incremental advance
AI Analysis

Aimed at computer vision researchers and practitioners, this work addresses the domain mismatch that arises when video representations learned from third-person footage are applied to first-person video. It offers an incremental improvement by transferring knowledge from third-person to first-person videos.

The paper tackles the challenge of pre-training egocentric video models by leveraging large-scale third-person video datasets to overcome limitations in scale and diversity of egocentric data, achieving state-of-the-art results on Charades-Ego and EPIC-Kitchens-100 for egocentric activity recognition.

We introduce an approach for pre-training egocentric video models using large-scale third-person video datasets. Learning from purely egocentric data is limited by low dataset scale and diversity, while using purely exocentric (third-person) data introduces a large domain mismatch. Our idea is to discover latent signals in third-person video that are predictive of key egocentric-specific properties. Incorporating these signals as knowledge distillation losses during pre-training results in models that benefit from both the scale and diversity of third-person video data, as well as representations that capture salient egocentric properties. Our experiments show that our Ego-Exo framework can be seamlessly integrated into standard video models; it outperforms all baselines when fine-tuned for egocentric activity recognition, achieving state-of-the-art results on Charades-Ego and EPIC-Kitchens-100.
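The abstract's idea of adding knowledge distillation losses to pre-training can be illustrated with a minimal sketch. This is not the authors' implementation: the auxiliary head names, the use of KL divergence as the distillation term, and the loss weights are all assumptions made for illustration. It only shows the general shape of a pre-training objective that combines a standard action classification loss on third-person video with weighted distillation terms from egocentric-aware teachers.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    # Standard classification loss for the third-person action label.
    return -math.log(softmax(logits)[label])

def kl_divergence(teacher_probs, student_probs, eps=1e-12):
    # KL(teacher || student); used here as a generic distillation term (an assumption).
    return sum(
        t * math.log((t + eps) / (s + eps))
        for t, s in zip(teacher_probs, student_probs)
        if t > 0
    )

def pretraining_loss(action_logits, action_label,
                     aux_student_logits, aux_teacher_probs, aux_weights):
    # Total loss = action classification loss + weighted distillation losses.
    # The keys of aux_weights are hypothetical auxiliary-head names.
    loss = cross_entropy(action_logits, action_label)
    for name, weight in aux_weights.items():
        student = softmax(aux_student_logits[name])
        loss += weight * kl_divergence(aux_teacher_probs[name], student)
    return loss

# Hypothetical example with one auxiliary head named "ego_signal".
example_loss = pretraining_loss(
    action_logits=[0.0, 0.0],
    action_label=0,
    aux_student_logits={"ego_signal": [0.0, 0.0]},
    aux_teacher_probs={"ego_signal": [1.0, 0.0]},
    aux_weights={"ego_signal": 0.5},
)
```

When the student's auxiliary predictions already match the teacher's, the distillation term vanishes and the objective reduces to plain action classification. In a real setup the same structure would be applied per clip inside a deep learning framework's training loop.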

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.