CVDec 1, 2025

IVCR-200K: A Large-Scale Multi-turn Dialogue Benchmark for Interactive Video Corpus Retrieval

arXiv:2512.01312v13 citationsh-index: 11SIGIR
Originality Synthesis-oriented
AI Analysis

It addresses the need for personalized and dynamic video retrieval for at least 80.8% of users, though it is incremental as it builds on existing retrieval tasks.

The paper tackles the lack of interactive, multi-turn dialogue in video retrieval systems by introducing the IVCR task and a large-scale dataset (IVCR-200K), showing effectiveness through experiments.

In recent years, significant developments have been made in both video retrieval and video moment retrieval tasks, which respectively retrieve complete videos or moments for a given text query. These advancements have greatly improved user satisfaction during the search process. However, previous work has failed to establish meaningful "interaction" between the retrieval system and the user, and its one-way retrieval paradigm can no longer fully meet the personalization and dynamic needs of at least 80.8\% of users. In this paper, we introduce the Interactive Video Corpus Retrieval (IVCR) task, a more realistic setting that enables multi-turn, conversational, and realistic interactions between the user and the retrieval system. To facilitate research on this challenging task, we introduce IVCR-200K, a high-quality, bilingual, multi-turn, conversational, and abstract semantic dataset that supports video retrieval and even moment retrieval. Furthermore, we propose a comprehensive framework based on multi-modal large language models (MLLMs) to help users interact in several modes with more explainable solutions. The extensive experiments demonstrate the effectiveness of our dataset and framework.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes