GroundNLQ @ Ego4D Natural Language Queries Challenge 2023
This work addresses the challenge of accurately localizing text queries in long egocentric videos. The contribution is incremental: it builds upon existing methods with specific improvements for the Ego4D NLQ benchmark.
The paper tackles the problem of grounding natural language queries in egocentric videos by developing a two-stage pre-training strategy and a novel grounding model called GroundNLQ, achieving state-of-the-art results of 25.67 and 18.18 for R1@IoU=0.3 and R1@IoU=0.5, respectively, on a blind test set.
In this report, we present our champion solution for the Ego4D Natural Language Queries (NLQ) Challenge at CVPR 2023. Accurately grounding a query in a video requires both an effective egocentric feature extractor and a powerful grounding model. Motivated by this, we leverage a two-stage pre-training strategy to train egocentric feature extractors and the grounding model on video narrations, and then fine-tune the model on annotated data. In addition, we introduce a novel grounding model, GroundNLQ, which employs a multi-modal multi-scale grounding module to fuse video and text effectively and to handle temporal intervals of various lengths, especially in long videos. On the blind test set, GroundNLQ achieves 25.67 and 18.18 for R1@IoU=0.3 and R1@IoU=0.5, respectively, and surpasses all other teams by a noticeable margin. Our code will be released at \url{https://github.com/houzhijian/GroundNLQ}.
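For readers unfamiliar with the reported metric, the following is a minimal sketch of how R1@IoU=m can be computed: the percentage of queries whose top-ranked predicted interval overlaps the ground-truth interval with a temporal IoU of at least m. The function names `temporal_iou` and `recall_at_1`, the `>=` threshold convention, and the toy intervals are illustrative assumptions; this is not the official Ego4D evaluation script or the GroundNLQ code.

```python
from typing import List, Tuple

# An interval is (start_time, end_time) in seconds. Hypothetical type alias.
Interval = Tuple[float, float]


def temporal_iou(pred: Interval, gt: Interval) -> float:
    """Intersection over union of two 1-D time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds: List[Interval], gts: List[Interval],
                iou_threshold: float) -> float:
    """Percentage of queries whose top-1 prediction reaches the IoU threshold.

    Assumption: a prediction counts as a hit when IoU >= iou_threshold.
    """
    if len(top1_preds) != len(gts):
        raise ValueError("need exactly one top-1 prediction per query")
    hits = sum(temporal_iou(p, g) >= iou_threshold
               for p, g in zip(top1_preds, gts))
    return 100.0 * hits / len(gts)


# Toy data (made up for illustration): three queries, one top-1 prediction each.
preds = [(2.0, 6.0), (10.0, 12.0), (0.0, 1.0)]
gts = [(3.0, 7.0), (10.0, 15.0), (5.0, 6.0)]

r1_at_03 = recall_at_1(preds, gts, 0.3)
r1_at_05 = recall_at_1(preds, gts, 0.5)
```

On this toy data the three IoU values are 0.6, 0.4, and 0.0, so two of three queries count as hits at threshold 0.3 and one of three at threshold 0.5. The stricter threshold can only keep or lower the score, which matches the ordering of the two numbers reported above.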