CV CL Jun 27, 2023

GroundNLQ @ Ego4D Natural Language Queries Challenge 2023

Zhijian Hou, Lei Ji, Difei Gao, Wanjun Zhong, Kun Yan, Chao Li, Wing-Kwong Chan, Chong-Wah Ngo, Nan Duan, Mike Zheng Shou

arXiv:2306.15255v1 | 14.922 citations | h-index: 108 | Has Code

Originality Highly original

AI Analysis

This work addresses the challenge of accurately localizing text queries in long egocentric videos. The contribution is incremental: it builds upon existing methods with specific improvements for the Ego4D NLQ benchmark.

The paper tackles the problem of grounding natural language queries in egocentric videos with a two-stage pre-training strategy and a novel grounding model called GroundNLQ. It reports state-of-the-art results of 25.67 for R1@IoU=0.3 and 18.18 for R1@IoU=0.5 on a blind test set.

In this report, we present our champion solution for the Ego4D Natural Language Queries (NLQ) Challenge at CVPR 2023. Accurate grounding in a video requires both an effective egocentric feature extractor and a powerful grounding model. Motivated by this, we leverage a two-stage pre-training strategy to train egocentric feature extractors and the grounding model on video narrations, and then fine-tune the model on annotated data. In addition, we introduce a novel grounding model, GroundNLQ, which employs a multi-modal multi-scale grounding module to fuse video and text effectively and to handle temporal intervals of varying length, especially in long videos. On the blind test set, GroundNLQ achieves 25.67 and 18.18 for R1@IoU=0.3 and R1@IoU=0.5, respectively, and surpasses all other teams by a noticeable margin. Our code will be released at https://github.com/houzhijian/GroundNLQ.
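
For readers unfamiliar with the reported metric, the following is a minimal sketch of how a score such as R1@IoU=0.3 is commonly computed. It assumes that R1 means Recall@1 (only the top-ranked predicted segment per query is scored), that segments are (start, end) pairs in seconds, and that a prediction counts as a hit when its temporal IoU with the ground truth is at least the threshold. The function names are hypothetical and this is not the official Ego4D evaluation code, which may differ in details.

```python
from typing import List, Tuple

# A temporal segment as (start_sec, end_sec); this representation is an assumption.
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection-over-union of two 1-D temporal segments."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_1(top1_preds: List[Segment], gts: List[Segment],
                iou_threshold: float) -> float:
    """Percentage of queries whose top-1 prediction reaches the IoU threshold."""
    if len(top1_preds) != len(gts):
        raise ValueError("need exactly one top-1 prediction per query")
    if not gts:
        return 0.0
    hits = sum(
        temporal_iou(pred, gt) >= iou_threshold
        for pred, gt in zip(top1_preds, gts)
    )
    return 100.0 * hits / len(gts)
```

As a usage example, with top-1 predictions `[(2.0, 5.0), (10.0, 12.0), (0.0, 1.0)]` and ground truths `[(3.0, 7.0), (10.0, 13.0), (5.0, 8.0)]`, the per-query IoUs are 0.4, about 0.67, and 0. Two of three queries pass at threshold 0.3 and one of three at threshold 0.5, so a stricter threshold can only lower the score, which matches the ordering of the two numbers reported above.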
