CV · AI · Jan 30

ShotFinder: Imagination-Driven Open-Domain Video Shot Retrieval via Web Search

arXiv:2601.23232v3 · 2 citations · h-index: 12 · Has Code
Originality: Incremental advance
AI Analysis

The paper addresses the lack of systematic benchmarks for open-domain video shot retrieval, an incremental step for researchers in multimodal AI.

The paper tackles the problem of open-domain video shot retrieval by introducing ShotFinder, a benchmark with 1,210 samples and five constraints, and finds that current models significantly lag behind human performance, with color and visual style being major challenges.

In recent years, large language models (LLMs) have made rapid progress in information retrieval, yet existing research has mainly focused on text or static multimodal settings. Open-domain video shot retrieval, which involves richer temporal structure and more complex semantics, still lacks systematic benchmarks and analysis. To fill this gap, we introduce ShotFinder, a benchmark that formalizes editing requirements as keyframe-oriented shot descriptions and introduces five types of controllable single-factor constraints: Temporal order, Color, Visual style, Audio, and Resolution. We curate 1,210 high-quality samples from YouTube across 20 thematic categories, using large models for generation with human verification. Based on the benchmark, we propose ShotFinder, a text-driven three-stage retrieval and localization pipeline: (1) query expansion via video imagination, (2) candidate video retrieval with a search engine, and (3) description-guided temporal localization. Experiments on multiple closed-source and open-source models reveal a significant gap to human performance, with clear imbalance across constraints: temporal localization is relatively tractable, while color and visual style remain major challenges. These results show that open-domain video shot retrieval is still a critical capability that multimodal large models have yet to master.
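
The three-stage pipeline described in the abstract could be wired together roughly as follows. This is only a minimal sketch, not the authors' implementation: the names `find_shot`, `ShotResult`, `imagine`, `search`, and `localize` are hypothetical, and each stage is passed in as a plain callable so that any model or search engine can be substituted.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class ShotResult:
    """A localized shot: which video, where in it, and how confident (hypothetical type)."""
    video_id: str
    start_s: float
    end_s: float
    score: float


def find_shot(
    description: str,
    imagine: Callable[[str], List[str]],
    search: Callable[[str], List[str]],
    localize: Callable[[str, str], Optional[Tuple[float, float, float]]],
    top_k: int = 5,
) -> Optional[ShotResult]:
    """Sketch of a three-stage retrieval loop; all stage functions are assumptions."""
    # Stage 1: expand the shot description into several search queries
    # (the abstract calls this "query expansion via video imagination").
    queries = imagine(description)

    # Stage 2: collect candidate videos from a search engine,
    # de-duplicating while preserving first-seen order.
    candidates: List[str] = []
    seen = set()
    for query in queries:
        for video_id in search(query)[:top_k]:
            if video_id not in seen:
                seen.add(video_id)
                candidates.append(video_id)

    # Stage 3: description-guided temporal localization inside each candidate;
    # keep the highest-scoring (start, end) span across all candidates.
    best: Optional[ShotResult] = None
    for video_id in candidates:
        hit = localize(video_id, description)
        if hit is None:
            continue  # no matching shot found in this video
        start_s, end_s, score = hit
        if best is None or score > best.score:
            best = ShotResult(video_id, start_s, end_s, score)
    return best
```

Keeping the stages as injected callables is a design choice of this sketch: it lets one stage be swapped (for example, a different search backend) without touching the others. The function returns `None` when no candidate yields a localized shot.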

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
