CV · Apr 16, 2021

Ego-Exo: Transferring Visual Representations from Third-person to First-person Videos

Yanghao Li, Tushar Nagarajan, Bo Xiong, Kristen Grauman

arXiv:2104.07905v125.7122 citations · Has Code

Originality: Incremental advance

AI Analysis

Aimed at computer vision researchers and practitioners, this work addresses domain mismatch in video representation learning. It offers an incremental improvement by transferring knowledge from third-person to first-person videos.

The paper pre-trains egocentric video models on large-scale third-person video datasets to overcome the limited scale and diversity of egocentric data. The resulting models achieve state-of-the-art results on Charades-Ego and EPIC-Kitchens-100 for egocentric activity recognition.

We introduce an approach for pre-training egocentric video models using large-scale third-person video datasets. Learning from purely egocentric data is limited by low dataset scale and diversity, while using purely exocentric (third-person) data introduces a large domain mismatch. Our idea is to discover latent signals in third-person video that are predictive of key egocentric-specific properties. Incorporating these signals as knowledge distillation losses during pre-training results in models that benefit from both the scale and diversity of third-person video data, as well as representations that capture salient egocentric properties. Our experiments show that our Ego-Exo framework can be seamlessly integrated into standard video models; it outperforms all baselines when fine-tuned for egocentric activity recognition, achieving state-of-the-art results on Charades-Ego and EPIC-Kitchens-100.
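
The abstract describes adding knowledge distillation losses to third-person pre-training. As a rough illustration only, the sketch below combines a standard action classification loss with weighted auxiliary distillation terms. The function names, the KL-divergence form of the distillation loss, the head name `ego_score`, and the weights are all assumptions made for this example; the abstract does not specify the paper's exact losses or auxiliary tasks.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Classification loss on the third-person action label."""
    return -math.log(softmax(logits)[label])


def kl_divergence(teacher_probs, student_logits):
    """KL(teacher || student), one common form of distillation loss (assumed here)."""
    student_probs = softmax(student_logits)
    return sum(
        t * math.log(t / s)
        for t, s in zip(teacher_probs, student_probs)
        if t > 0
    )


def pretraining_loss(action_logits, action_label, aux_heads, weights):
    """Action loss plus weighted distillation terms.

    aux_heads maps a hypothetical head name to (teacher_probs, student_logits),
    where teacher_probs stands in for a soft target produced by some pretrained
    egocentric model applied to the third-person clip.
    """
    loss = cross_entropy(action_logits, action_label)
    for name, (teacher_probs, student_logits) in aux_heads.items():
        loss += weights[name] * kl_divergence(teacher_probs, student_logits)
    return loss


# Hypothetical usage: one auxiliary head distilling a binary "ego-likeness" signal.
aux = {"ego_score": ([0.9, 0.1], [2.0, 0.0])}
print(pretraining_loss([1.0, 0.0, -1.0], 0, aux, {"ego_score": 0.5}))
```

With all weights set to zero the objective reduces to ordinary third-person pre-training, which is one way to see the distillation terms as an add-on to a standard video model rather than a replacement for its training loss.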
