On Semantic Similarity in Video Retrieval
The paper addresses a methodological flaw in how video retrieval is evaluated. Its contribution is incremental in that it improves assessment rather than introducing new retrieval models.
The paper argues that the standard instance-based assumption, under which only a single caption is relevant to each video and vice versa, leads to misleading performance comparisons between video retrieval models. It proposes moving to semantic similarity retrieval, where multiple items can be equally relevant and retrieved items are ranked by their similarity to the query. The analysis covers three datasets (MSR-VTT, YouCook2, EPIC-KITCHENS) and relies on proxy similarity estimates that need no extra annotations.
Current video retrieval efforts all base their evaluation on an instance-based assumption: that only a single caption is relevant to a query video and vice versa. We demonstrate that this assumption results in performance comparisons often not indicative of models' retrieval capabilities. We propose a move to semantic similarity video retrieval, where (i) multiple videos/captions can be deemed equally relevant, and their relative ranking does not affect a method's reported performance and (ii) retrieved videos/captions are ranked by their similarity to a query. We propose several proxies to estimate semantic similarities in large-scale retrieval datasets, without additional annotations. Our analysis is performed on three commonly used video retrieval datasets (MSR-VTT, YouCook2 and EPIC-KITCHENS).
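To make the contrast between the two evaluation assumptions concrete, the following is a minimal sketch rather than the paper's implementation. It assumes a simple word-overlap (Jaccard) score between captions as the similarity proxy and uses nDCG as the graded ranking metric. Both choices, and all function names (`bow_similarity`, `instance_recall_at_k`, `ndcg`), are illustrative assumptions not taken from the text above.

```python
import math


def bow_similarity(caption_a: str, caption_b: str) -> float:
    """Jaccard overlap of the two captions' word sets (assumed proxy)."""
    a = set(caption_a.lower().split())
    b = set(caption_b.lower().split())
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def instance_recall_at_k(ranking: list[int], correct_index: int, k: int) -> float:
    """Instance-based metric: only the single paired item counts as relevant."""
    return 1.0 if correct_index in ranking[:k] else 0.0


def ndcg(ranking: list[int], relevance: list[float]) -> float:
    """Graded metric: every gallery item contributes by its relevance score."""
    dcg = sum(relevance[i] / math.log2(pos + 2) for pos, i in enumerate(ranking))
    ideal = sorted(relevance, reverse=True)
    idcg = sum(r / math.log2(pos + 2) for pos, r in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical toy gallery: items 0 and 1 carry the same caption,
# but only item 0 is the "ground-truth" pair of the query.
query_caption = "a man slices an onion"
gallery_captions = [
    "a man slices an onion",
    "a man slices an onion",
    "a dog runs on grass",
]
relevance = [bow_similarity(query_caption, c) for c in gallery_captions]

ranking_a = [0, 1, 2]  # paired item first
ranking_b = [1, 0, 2]  # equally relevant item first

for name, ranking in [("A", ranking_a), ("B", ranking_b)]:
    r1 = instance_recall_at_k(ranking, correct_index=0, k=1)
    score = ndcg(ranking, relevance)
    print(f"ranking {name}: Recall@1={r1:.1f}, nDCG={score:.3f}")
```

In this toy case the two rankings differ only in the order of two identically captioned items. Recall@1 rewards ranking A and penalizes ranking B, while the graded metric gives both the same score, which is the behaviour point (i) of the proposal asks for.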