Training-Free Composed Video Retrieval via Visual Representation-Guided Video-LLM Reasoning
This work addresses the need for flexible video retrieval that combines visual and textual queries, but the approach is incremental because it combines existing frozen models.
The authors tackle composed video retrieval, where a target video is retrieved using a reference video and a modification instruction. Their training-free framework achieves 48.78 Recall@1 and 51.48 Recall@5 on the test set.
Recent advances in large vision-language models have expanded video retrieval from simple text-based search to more flexible scenarios in which users specify the desired result through both visual examples and textual instructions. In the CVPR 2026 Reason-Aware Composed Video Retrieval Challenge, a system must retrieve a target video given a reference video and a modification instruction. To address this task, we develop Visual Representation-Guided Video-LLM Reasoning for Training-Free Composed Video Retrieval. Our framework first uses frozen DINOv3 models to obtain a compact set of visually relevant candidates, and then applies large vision-language models to evaluate whether each candidate satisfies the modification instruction. A final reasoning-based refinement is then performed on the top candidates to improve the first-ranked prediction. Without any training, our system achieves 48.78 Recall@1 and 51.48 Recall@5 on the test set. Future work may further improve retrieval accuracy through stronger video-LLMs and tighter integration between visual representations and language reasoning.
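The two-stage flow described above (visual shortlisting followed by instruction-aware re-ranking) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that video embeddings have already been computed by a frozen visual encoder, uses cosine similarity for the shortlist (the abstract does not state the similarity measure), and replaces the Video-LLM with a hypothetical `scorer` callable. The function names, the toy vectors, and the toy scores are all illustrative assumptions, and the final reasoning-based refinement step is not modeled.

```python
from typing import Callable, List

import numpy as np


def shortlist_by_cosine(ref_emb: np.ndarray, gallery_embs: np.ndarray, k: int) -> List[int]:
    """Return indices of the k gallery videos most similar to the reference.

    Assumption: embeddings come from a frozen visual encoder and cosine
    similarity is a reasonable proxy for visual relevance.
    """
    ref = ref_emb / np.linalg.norm(ref_emb)
    gal = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    sims = gal @ ref  # cosine similarity of each gallery video to the reference
    order = np.argsort(-sims, kind="stable")  # descending similarity
    return order[:k].tolist()


def rerank_with_scorer(
    candidates: List[int],
    instruction: str,
    scorer: Callable[[int, str], float],
) -> List[int]:
    """Re-rank shortlisted candidates by how well each satisfies the instruction.

    `scorer` is a hypothetical stand-in for a vision-language model that
    returns a higher value when the candidate matches the instruction.
    """
    scored = [(scorer(idx, instruction), idx) for idx in candidates]
    # Python's sort is stable, so ties keep their visual-similarity order.
    scored.sort(key=lambda pair: -pair[0])
    return [idx for _, idx in scored]


# Toy data: four gallery videos with 2-D embeddings (illustrative only).
gallery = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.6]])
reference = np.array([1.0, 0.0])

# Hypothetical per-candidate scores that a model might assign.
toy_scores = {0: 0.1, 1: 0.7, 2: 0.9, 3: 0.4}


def toy_scorer(idx: int, instruction: str) -> float:
    return toy_scores[idx]


shortlist = shortlist_by_cosine(reference, gallery, k=3)
ranked = rerank_with_scorer(shortlist, "make the scene take place at night", toy_scorer)
print(shortlist, ranked)
```

In this toy run, video 2 has the highest instruction score but never reaches the re-ranker because it is visually dissimilar to the reference. This illustrates the trade-off of a shortlist-then-rerank design: the expensive model only sees a few candidates, but it cannot recover a target that the visual stage filtered out.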