CV | Aug 16, 2025

OVG-HQ: Online Video Grounding with Hybrid-modal Queries

arXiv:2508.11903v1 | h-index: 12 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses the challenge of locating moments in streaming videos using diverse query types for applications like video analysis, though it is incremental in extending existing video grounding methods.

The paper tackles online video grounding with hybrid-modal queries by introducing a new task and dataset. It proposes a unified framework with a parametric memory block to address limited context and cross-modal distillation to address modality imbalance, and reports improved performance over existing models.

The video grounding (VG) task focuses on locating specific moments in a video based on a query, usually in text form. However, traditional VG struggles with scenarios such as streaming video or queries that use visual cues. To fill this gap, we present a new task named Online Video Grounding with Hybrid-modal Queries (OVG-HQ), which enables online segment localization using text, images, video segments, and their combinations. This task poses two new challenges: limited context in online settings and modality imbalance during training, where dominant modalities overshadow weaker ones. To address these, we propose OVG-HQ-Unify, a unified framework featuring a Parametric Memory Block (PMB) that retains previously learned knowledge to enhance current decisions and a cross-modal distillation strategy that guides the learning of non-dominant modalities. This design enables a single model to effectively handle hybrid-modal queries. Due to the lack of suitable datasets, we construct QVHighlights-Unify, an expanded dataset with multi-modal queries. In addition, since offline metrics overlook prediction timeliness, we adapt them to the online setting, introducing oR@n, IoU=m, and online mean Average Precision (omAP) to evaluate both accuracy and efficiency. Experiments show that our OVG-HQ-Unify outperforms existing models, offering a robust solution for online, hybrid-modal video grounding. Source code and datasets are available at https://github.com/maojiaqi2324/OVG-HQ.
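For readers unfamiliar with the offline metric that oR@n, IoU=m adapts, the following is a minimal sketch of standard temporal IoU and R@n at IoU threshold m. The names `temporal_iou` and `recall_at_n` are hypothetical and not taken from the paper's code. The abstract does not describe how the online variant accounts for timeliness, so that part is not modeled here.

```python
from typing import List, Tuple

# A segment is (start_sec, end_sec); this representation is an assumption.
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two 1-D time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_n(ranked_preds: List[Segment], gt: Segment, n: int, m: float) -> bool:
    """True if any of the top-n ranked predictions reaches IoU >= m with gt.

    This is the conventional offline check only; the online metric's
    timeliness handling is not specified in the abstract and is omitted.
    """
    return any(temporal_iou(p, gt) >= m for p in ranked_preds[:n])


if __name__ == "__main__":
    gt = (5.0, 15.0)
    preds = [(0.0, 10.0), (4.0, 14.0)]  # ranked by (hypothetical) confidence
    print(temporal_iou(preds[0], gt))
    print(recall_at_n(preds, gt, n=1, m=0.5))
    print(recall_at_n(preds, gt, n=2, m=0.5))
```

Averaging `recall_at_n` over all queries in a dataset gives the usual R@n, IoU=m score.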
