CV | Jul 4, 2022

Egocentric Video-Language Pretraining @ Ego4D Challenge 2022

Microsoft | UW
arXiv:2207.01622v2 | 8 citations | h-index: 73 | Has Code
AI Analysis

This work addresses video-language understanding challenges for egocentric vision applications, representing an incremental advancement by adapting existing pretraining methods to a new dataset.

The authors tackled four egocentric video-language tasks (NLQ, MQ, OSCC, PNR) by proposing a video-language pretraining solution using the Ego4D dataset, achieving results such as 10.46 R@1&IoU@0.3 on NLQ and 74% accuracy on OSCC.

In this report, we propose a video-language pretraining (VLP) based solution \cite{kevin2022egovlp} for four Ego4D challenge tasks: Natural Language Query (NLQ), Moment Query (MQ), Object State Change Classification (OSCC), and PNR Localization (PNR). In particular, we exploit the recently released Ego4D dataset \cite{grauman2021ego4d} to pioneer egocentric VLP along three directions: the pretraining dataset, the pretraining objective, and the development set. Based on these three designs, we develop a pretrained video-language model that can transfer its egocentric video-text representation or video-only representation to several video downstream tasks. Our egocentric VLP achieves 10.46 R@1 & IoU@0.3 on NLQ, 10.33 mAP on MQ, 74% accuracy on OSCC, and 0.67 sec error on PNR. The code is available at https://github.com/showlab/EgoVLP.
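For readers unfamiliar with the NLQ metric, the sketch below shows how R@1 & IoU@0.3 is commonly computed for temporal grounding: a query counts as a hit when the top-ranked predicted segment overlaps the ground-truth segment with a temporal IoU of at least 0.3. This is a minimal illustration under that standard definition, not the challenge's official evaluation script; the names `temporal_iou` and `recall_at_1` and the sample segments are hypothetical.

```python
from typing import List, Tuple

# A temporal segment as (start_sec, end_sec). Hypothetical type alias.
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection-over-union of two 1-D time intervals."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_1(top1_preds: List[Segment], gts: List[Segment],
                iou_threshold: float = 0.3) -> float:
    """Percentage of queries whose top-1 prediction reaches the IoU threshold."""
    hits = sum(
        temporal_iou(pred, gt) >= iou_threshold
        for pred, gt in zip(top1_preds, gts)
    )
    return 100.0 * hits / len(gts)


# Illustrative data only: one overlapping prediction, one miss.
predictions = [(0.0, 10.0), (0.0, 2.0)]
ground_truth = [(5.0, 15.0), (8.0, 10.0)]
score = recall_at_1(predictions, ground_truth)
```

Under this definition, a reported value of 10.46 would mean that roughly one query in ten has a top-ranked segment meeting the 0.3 overlap threshold.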
