CV · Mar 13, 2025

Continual Text-to-Video Retrieval with Frame Fusion and Task-Aware Routing

Zecheng Zhao, Zhi Chen, Zi Huang, Shazia Sadiq, Tong Chen

arXiv:2503.10111v2 · 13 citations · h-index: 14 · Has Code · SIGIR

Originality: Highly original

AI Analysis

The work addresses catastrophic forgetting and limited model plasticity in continual learning for video retrieval; the contribution is incremental but important for real-world applications.

The paper tackles the problem of adapting text-to-video retrieval systems to continuously evolving video content by introducing a benchmark for continual text-to-video retrieval and proposing FrameFusionMoE, a framework that achieves superior retrieval performance with minimal degradation on earlier tasks.

Text-to-Video Retrieval (TVR) aims to retrieve relevant videos based on textual queries. However, as video content evolves continuously, adapting TVR systems to new data remains a critical yet under-explored challenge. In this paper, we introduce the first benchmark for Continual Text-to-Video Retrieval (CTVR) to address the limitations of existing approaches. Current Pre-Trained Model (PTM)-based TVR methods struggle with maintaining model plasticity when adapting to new tasks, while existing Continual Learning (CL) methods suffer from catastrophic forgetting, leading to semantic misalignment between historical queries and stored video features. To address these two challenges, we propose FrameFusionMoE, a novel CTVR framework that comprises two key components: (1) the Frame Fusion Adapter (FFA), which captures temporal video dynamics while preserving model plasticity, and (2) the Task-Aware Mixture-of-Experts (TAME), which ensures consistent semantic alignment between queries across tasks and the stored video features. Thus, FrameFusionMoE enables effective adaptation to new video content while preserving historical text-video relevance to mitigate catastrophic forgetting. We comprehensively evaluate FrameFusionMoE on two benchmark datasets under various task settings. Results demonstrate that FrameFusionMoE outperforms existing CL and TVR methods, achieving superior retrieval performance with minimal degradation on earlier tasks when handling continuous video streams. Our code is available at: https://github.com/JasonCodeMaker/CTVR.
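The misalignment between historical queries and stored video features can be illustrated with a toy example. The following is a minimal sketch, not the authors' implementation or benchmark protocol: the embeddings are synthetic, the random `drift` matrix is a hypothetical stand-in for how a text encoder might change after training on later tasks, and `recall_at_k` is an illustrative helper rather than a function from the CTVR repository.

```python
import numpy as np

def recall_at_k(query_emb, video_emb, k=1):
    """Fraction of queries whose paired video (same row index) ranks in the top-k
    by cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    sim = q @ v.T                                  # (num_queries, num_videos)
    topk = np.argsort(-sim, axis=1)[:, :k]         # indices of the k most similar videos
    targets = np.arange(len(q))[:, None]           # ground truth: query i pairs with video i
    return float((topk == targets).any(axis=1).mean())

rng = np.random.default_rng(0)
n, d = 50, 32

# Video features cached when the first task was learned (synthetic, assumed).
stored_video = rng.normal(size=(n, d))

# Query embeddings from the text encoder right after the first task:
# well aligned with the stored video features.
queries_task1 = stored_video + 0.1 * rng.normal(size=(n, d))

# Hypothetical drift: the same queries re-encoded after later tasks,
# modelled here as an arbitrary linear distortion of the embedding space.
drift = rng.normal(size=(d, d))
queries_later = queries_task1 @ drift

r_before = recall_at_k(queries_task1, stored_video, k=1)
r_after = recall_at_k(queries_later, stored_video, k=1)
forgetting = r_before - r_after                    # drop in Recall@1 on the earlier task

print(f"Recall@1 before drift: {r_before:.2f}")
print(f"Recall@1 after drift:  {r_after:.2f}")
print(f"Forgetting:            {forgetting:.2f}")
```

Because the stored video features are never re-encoded in this sketch, any change in the query embedding space directly lowers retrieval quality on earlier tasks, which is the kind of degradation the abstract attributes to catastrophic forgetting.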

