CoVA: Text-Guided Composed Video Retrieval for Audio-Visual Content
This work addresses a gap in video retrieval for audio-visual content and enables more comprehensive search, though the contribution is incremental: it extends an existing task to include audio.
The paper tackles a limitation of existing composed video retrieval benchmarks, which ignore audio variations. It introduces CoVA, a new task that accounts for both visual and auditory changes, and constructs the AV-Comp benchmark of video pairs with textual queries describing their differences. The proposed AVT method outperforms traditional unimodal fusion and serves as a strong baseline for the task.
Composed Video Retrieval (CoVR) aims to retrieve a target video from a large gallery using a reference video and a textual query specifying visual modifications. However, existing benchmarks consider only visual changes, ignoring videos that differ in audio despite visual similarity. To address this limitation, we introduce Composed retrieval for Video with its Audio (CoVA), a new retrieval task that accounts for both visual and auditory variations. To support this task, we construct AV-Comp, a benchmark consisting of video pairs with cross-modal changes and corresponding textual queries that describe the differences. We also propose AVT Compositional Fusion (AVT), which integrates video, audio, and text features by selectively aligning the query to the most relevant modality. AVT outperforms traditional unimodal fusion and serves as a strong baseline for CoVA. Examples from the proposed dataset, including both visual and auditory information, are available at https://perceptualai-lab.github.io/CoVA/.
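
The idea of aligning the text query to the most relevant modality can be illustrated with a small sketch. This is not the AVT architecture, which the abstract does not specify. It is a toy stand-in that assumes precomputed embeddings of equal dimension, weights the reference video and audio embeddings by a softmax over their cosine similarity to the text, and ranks a gallery by cosine similarity to the fused query. All function names, the temperature value, and the additive fusion rule are hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return dot(normalize(a), normalize(b))


def modality_weights(text, video, audio, temperature=0.1):
    """Softmax over text-to-modality similarities.

    Assumption: a softmax gate stands in for 'selectively aligning the
    query to the most relevant modality'; the paper's mechanism may differ.
    """
    scores = [cosine(text, video) / temperature,
              cosine(text, audio) / temperature]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_query(text, video, audio, temperature=0.1):
    """Hypothetical fusion: text plus gated video and audio embeddings."""
    w_video, w_audio = modality_weights(text, video, audio, temperature)
    fused = [t + w_video * v + w_audio * a
             for t, v, a in zip(text, video, audio)]
    return normalize(fused)


def rank_gallery(query, gallery):
    """Return (name, score) pairs sorted by descending cosine similarity."""
    scored = [(name, cosine(query, emb)) for name, emb in gallery.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy usage: the text embedding points along the same axis as the audio
# embedding, so the gate should favour audio over video.
text = [0.0, 1.0, 0.0]
video = [1.0, 0.0, 0.0]
audio = [0.0, 1.0, 0.0]
query = fuse_query(text, video, audio)
gallery = {"audio_changed": [0.0, 1.0, 0.0], "visual_changed": [1.0, 0.0, 0.0]}
ranking = rank_gallery(query, gallery)
```

In this toy setup the gate places most of its weight on the audio embedding because the text is most similar to it, which mirrors the case CoVA targets: a query describing an auditory change between visually similar videos.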