CV | LG | Aug 5, 2024

Infusing Environmental Captions for Long-Form Video Language Grounding

arXiv:2408.02336v2 | 1 citation | h-index: 4
Originality: Incremental advance
AI Analysis

This work addresses the challenge of temporal localization in long videos for AI systems, representing an incremental improvement over existing methods by leveraging MLLMs as a proxy for human-like knowledge.

The paper tackles the problem of long-form video-language grounding by proposing EI-VLG, a method that uses a Multi-modal Large Language Model to provide richer textual information to exclude irrelevant frames, achieving improved performance on the EgoNLQ benchmark.

In this work, we tackle the problem of long-form video-language grounding (VLG). Given a long-form video and a natural language query, a model should temporally localize the precise moment that answers the query. Humans can easily solve VLG tasks, even with arbitrarily long videos, by discarding irrelevant moments using extensive and robust knowledge gained from experience. Unlike humans, existing VLG methods tend to rely on superficial cues learned from small-scale datasets, even when those cues appear in irrelevant frames. To overcome this challenge, we propose EI-VLG, a VLG method that leverages richer textual information provided by a Multi-modal Large Language Model (MLLM) as a proxy for human experiences, helping to effectively exclude irrelevant frames. We validate the effectiveness of the proposed method via extensive experiments on the challenging EgoNLQ benchmark.
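
To make the idea of excluding irrelevant frames with textual information more concrete, here is a minimal sketch. It is not the EI-VLG implementation: it assumes each video segment already has a caption (in the paper these would come from an MLLM), and it swaps in a toy bag-of-words cosine similarity in place of any learned text encoder. The function names, the example captions, and the `threshold` value are all hypothetical.

```python
import math
from collections import Counter


def bow(text):
    """Toy text representation: lowercase bag-of-words counts (an assumption, not the paper's encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def keep_relevant_segments(query, captions, threshold=0.2):
    """Return indices of segments whose caption is similar enough to the query.

    `threshold` is a hypothetical cutoff chosen for this toy example.
    """
    q = bow(query)
    return [i for i, c in enumerate(captions) if cosine(q, bow(c)) >= threshold]


# Hypothetical per-segment captions standing in for MLLM output.
captions = [
    "a person walks down a hallway",
    "a person opens the fridge and takes out milk",
    "a person washes dishes in the sink",
]
query = "where did I put the milk"

kept = keep_relevant_segments(query, captions)
print(kept)
```

A grounding model would then only need to localize the answer within the kept segments. In practice the similarity would likely come from a learned text encoder rather than word overlap, which is why this is only a sketch of the filtering step.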

