Localizing Events in Videos with Multimodal Queries
This work addresses a gap in video understanding for user-oriented applications such as video search, where multimodal queries can make querying more flexible. Its contribution is incremental, however, because it adapts existing models rather than proposing a fundamentally new approach.
The paper tackles the problem of localizing events in videos using multimodal queries that combine images and text, rather than natural language alone, to better represent non-verbal concepts. It introduces a new benchmark, ICQ, with an accompanying evaluation dataset, ICQ-Highlight, benchmarks 12 state-of-the-art models, and reports that multimodal queries show high potential for real-world applications.
Localizing events in videos based on semantic queries is a pivotal task in video understanding, given the growing significance of user-oriented applications like video search. Yet current research predominantly relies on natural language queries (NLQs), overlooking the potential of multimodal queries (MQs) that integrate images to represent semantic queries more flexibly, especially when non-verbal or unfamiliar concepts are difficult to express in words. To bridge this gap, we introduce ICQ, a new benchmark designed for localizing events in videos with MQs, alongside an evaluation dataset, ICQ-Highlight. To accommodate and evaluate existing video localization models on this new task, we propose three Multimodal Query Adaptation methods and a novel Surrogate Fine-tuning strategy on pseudo-MQs. ICQ systematically benchmarks 12 state-of-the-art backbone models, ranging from specialized video localization models to Video LLMs, across diverse application domains. Our experiments highlight the high potential of MQs in real-world applications. We believe this benchmark is a first step toward advancing MQs in video event localization.
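
To make the task setup concrete, the sketch below shows one way the inputs and outputs could be represented: a query that pairs a reference image with optional text, and a predicted time span scored against a ground-truth span. This is an illustration only. The names MultimodalQuery, Span, and temporal_iou are hypothetical and do not come from the paper, and temporal intersection-over-union is used here because it is a common way to score localized spans, not because the text above states which metrics ICQ reports.

```python
from dataclasses import dataclass


@dataclass
class MultimodalQuery:
    """Hypothetical query container: a reference image plus optional text."""
    image_path: str
    text: str = ""


@dataclass
class Span:
    """A temporal segment of a video, in seconds."""
    start: float
    end: float


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection-over-union of two time spans (0.0 if they do not overlap)."""
    inter = max(0.0, min(pred.end, gt.end) - max(pred.start, gt.start))
    union = (pred.end - pred.start) + (gt.end - gt.start) - inter
    return inter / union if union > 0 else 0.0


# Illustrative values only; not data from the benchmark.
query = MultimodalQuery(image_path="sketch_of_dog.png", text="but on a beach")
predicted = Span(start=10.0, end=20.0)
ground_truth = Span(start=15.0, end=25.0)
score = temporal_iou(predicted, ground_truth)
```

In this example the two spans overlap for 5 seconds out of a combined 15 seconds, so the score is one third. A localization model under this framing would map a video and a MultimodalQuery to a Span.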