CV · Jul 4, 2022

Egocentric Video-Language Pretraining @ EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2022

Kevin Qinghong Lin, Alex Jinpeng Wang, Rui Yan, Eric Zhongcong Xu, Rongcheng Tu, Yanru Zhu, Wenzhe Zhao, Weijie Kong, Chengfei Cai, Hongfa Wang, Wei Liu, Mike Zheng Shou

Microsoft, UW

arXiv:2207.01334v24.82 citations · h-index: 33 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses video-text retrieval for egocentric (first-person) videos. The advance is incremental: it adapts existing video-language pretraining methods to a specific domain.

The authors tackled the EPIC-KITCHENS-100 Multi-Instance Retrieval challenge by developing an egocentric video-language pretraining model, achieving strong results with 47.39% mAP and 61.44% nDCG on the test set.

In this report, we propose a video-language pretraining (VLP) based solution \cite{kevin2022egovlp} for the EPIC-KITCHENS-100 Multi-Instance Retrieval (MIR) challenge. In particular, we exploit the recently released Ego4D dataset \cite{grauman2021ego4d} to pioneer Egocentric VLP along three directions: the pretraining dataset, the pretraining objective, and the development set. Based on these three designs, we develop a pretrained video-language model that can transfer its egocentric video-text representation to the MIR benchmark. Furthermore, we devise an adaptive multi-instance max-margin loss to fine-tune the model effectively and apply the dual-softmax technique for reliable inference. Our best single model obtains strong performance on the challenge test set, with 47.39% mAP and 61.44% nDCG. The code is available at https://github.com/showlab/EgoVLP.
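The abstract names dual-softmax as the inference-time step but does not spell out the formulation. As a rough illustration only, the sketch below follows one common variant: each query-to-candidate similarity is multiplied by a prior obtained from a softmax over all queries for that candidate, which dampens candidates that score highly for every query. The function names (`softmax`, `dual_softmax`, `rank`), the temperature value, and the toy similarity matrix are all assumptions made for this example, not taken from the report or the EgoVLP code.

```python
import math

def softmax(values, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def dual_softmax(sim, temperature=1.0):
    """Re-weight a query-by-candidate similarity matrix (hypothetical helper).

    sim[i][j] is the similarity of query i to candidate j. Each entry is
    multiplied by a prior: the softmax of column j taken across all queries.
    """
    n_queries, n_candidates = len(sim), len(sim[0])
    # priors[j][i] = softmax over queries of column j, evaluated at query i
    priors = [softmax([sim[i][j] for i in range(n_queries)], temperature)
              for j in range(n_candidates)]
    return [[sim[i][j] * priors[j][i] for j in range(n_candidates)]
            for i in range(n_queries)]

def rank(scores):
    """Candidate indices sorted from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda j: -scores[j])

# Toy matrix (assumed values): candidate 0 scores highly for every query.
sim = [
    [0.90, 0.80, 0.10],
    [0.95, 0.30, 0.20],
    [0.85, 0.20, 0.10],
]
reweighted = dual_softmax(sim, temperature=0.1)
print(rank(sim[0]), rank(reweighted[0]))
```

In this toy case, query 0 initially prefers candidate 0, but after re-weighting it prefers candidate 1, because candidate 1 is distinctive to query 0 while candidate 0 is claimed more strongly by query 1. The technique needs the full set of test queries at inference time, so it suits batch evaluation rather than single-query search.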
