CV · Nov 1, 2023

An Empirical Study of Frame Selection for Text-to-Video Retrieval

arXiv:2311.00298v1132 citations · h-index: 11
Originality: Synthesis-oriented
AI Analysis

This work addresses the efficiency and performance challenges of text-to-video retrieval. It is an incremental study that focuses on optimizing frame selection, and it is aimed at researchers and practitioners who build video retrieval systems.

The paper conducted the first empirical study of frame selection methods for text-to-video retrieval, analyzing six methods including two newly developed ones, and found that proper frame selection can significantly improve retrieval efficiency without sacrificing performance.

Text-to-video retrieval (TVR) aims to find the most relevant video in a large video gallery given a query text. The intricate and abundant context of a video challenges both the performance and the efficiency of TVR. To handle the sequential video context, existing methods typically select a subset of frames within a video to represent its content. How to select the most representative frames is a crucial issue: the selected frames must retain the semantic information of the video while also improving retrieval efficiency by excluding temporally redundant frames. In this paper, we present the first empirical study of frame selection for TVR. We systematically classify existing frame selection methods into text-free and text-guided ones, and under this taxonomy we analyze six different frame selection methods in detail in terms of effectiveness and efficiency. Two of these methods are newly developed in this paper. Based on a comprehensive analysis on multiple TVR benchmarks, we empirically conclude that TVR with proper frame selection can significantly improve retrieval efficiency without sacrificing retrieval performance.
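
The distinction between text-free and text-guided selection can be illustrated with a minimal sketch. The abstract does not name the six methods it studies, so the two functions below are illustrative assumptions rather than the paper's methods: `uniform_select` stands in for a text-free strategy (evenly spaced frames), and `text_guided_select` stands in for a text-guided strategy (the k frames whose embeddings are most similar to the query embedding). The function names, the use of cosine similarity, and the toy embeddings are all hypothetical.

```python
import math


def uniform_select(num_frames, k):
    """Text-free (assumed example): pick k frame indices evenly spaced over the video."""
    if k >= num_frames:
        return list(range(num_frames))
    # Take the centre of each of k equal-length temporal segments.
    return [int((i + 0.5) * num_frames / k) for i in range(k)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def text_guided_select(frame_embs, text_emb, k):
    """Text-guided (assumed example): keep the k frames most similar to the query.

    Returns the chosen frame indices in temporal order.
    """
    scores = [cosine(f, text_emb) for f in frame_embs]
    ranked = sorted(range(len(frame_embs)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy usage: 4 frames with 2-dimensional embeddings and one query embedding.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.9, 0.1]]
query = [1.0, 0.0]
text_free_idx = uniform_select(len(frames), 2)
text_guided_idx = text_guided_select(frames, query, 2)
```

One consequence of this split is that a text-free selection depends only on the video, so it could be computed once per video ahead of time, whereas a text-guided selection depends on the query and must be computed for each query-video pair. This is the kind of effectiveness versus efficiency trade-off the abstract refers to.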

