CV, Jul 4, 2022

Egocentric Video-Language Pretraining @ Ego4D Challenge 2022

Kevin Qinghong Lin, Alex Jinpeng Wang, Mattia Soldan, Michael Wray, Rui Yan, Eric Zhongcong Xu, Difei Gao, Rongcheng Tu, Wenzhe Zhao, Weijie Kong, Chengfei Cai, Hongfa Wang

Microsoft, UW

arXiv:2207.01622v2, 6.58 citations, h-index: 73, Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses video-language understanding challenges for egocentric vision applications, representing an incremental advancement by adapting existing pretraining methods to a new dataset.

The authors tackled four egocentric video-language tasks (NLQ, MQ, OSCC, PNR) by proposing a video-language pretraining solution using the Ego4D dataset, achieving results such as 10.46 R@1&IoU@0.3 on NLQ and 74% accuracy on OSCC.

In this report, we propose a video-language pretraining (VLP) based solution \cite{kevin2022egovlp} for four Ego4D challenge tasks: Natural Language Query (NLQ), Moment Query (MQ), Object State Change Classification (OSCC), and PNR Localization (PNR). Specifically, we exploit the recently released Ego4D dataset \cite{grauman2021ego4d} to pioneer Egocentric VLP along three directions: the pretraining dataset, the pretraining objective, and the development set. Based on these three designs, we develop a pretrained video-language model that can transfer its egocentric video-text representation or video-only representation to several downstream video tasks. Our Egocentric VLP achieves 10.46 R@1&IoU@0.3 on NLQ, 10.33 mAP on MQ, 74% Acc on OSCC, and 0.67 sec error on PNR. The code is available at https://github.com/showlab/EgoVLP.
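
To make the NLQ metric concrete, the following is a minimal sketch of how R@1 at a temporal IoU threshold of 0.3 is commonly computed for moment localization. It is not taken from the challenge's evaluation code; the function names `temporal_iou` and `recall_at_1` and the example intervals are assumptions introduced for illustration only.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds, gts, iou_threshold=0.3):
    """Percentage of queries whose top-1 predicted window reaches the IoU threshold."""
    hits = sum(
        temporal_iou(pred, gt) >= iou_threshold
        for pred, gt in zip(top1_preds, gts)
    )
    return 100.0 * hits / len(gts)


# Hypothetical example: two queries, one localized well enough, one missed.
top1_preds = [(0.0, 10.0), (0.0, 2.0)]
gts = [(5.0, 15.0), (10.0, 12.0)]
score = recall_at_1(top1_preds, gts, iou_threshold=0.3)
```

Under this definition, a reported 10.46 R@1&IoU@0.3 would mean that for roughly one query in ten the highest-ranked predicted window overlaps the ground-truth window with an IoU of at least 0.3.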
