CV · Aug 20, 2024

TDS-CLIP: Temporal Difference Side Network for Efficient Video Action Recognition

arXiv:2408.10688v2 · 2 citations · h-index: 8
Originality: Incremental advance
AI Analysis

This work addresses the challenge of balancing knowledge transfer and temporal modeling in video action recognition, which is important for researchers in computer vision, though it appears incremental as it builds on existing parameter-efficient fine-tuning approaches.

The paper tackles the problem of efficiently transferring knowledge from large pre-trained vision-language models to video action recognition models while enhancing temporal modeling, achieving competitive performance on benchmark datasets like Something-Something V1&V2 and Kinetics-400.

Recently, large-scale pre-trained vision-language models (e.g., CLIP) have garnered significant attention thanks to their powerful representational capabilities. This has inspired researchers to transfer the knowledge from these large pre-trained models to task-specific models, e.g., Video Action Recognition (VAR) models, in particular by leveraging side networks to make parameter-efficient fine-tuning (PEFT) more efficient. However, current transfer approaches in VAR tend to directly transfer the frozen knowledge from large pre-trained models to action recognition networks at minimal cost, instead of exploiting the temporal modeling capabilities of the action recognition models themselves. Therefore, in this paper, we propose a novel memory-efficient Temporal Difference Side Network (TDS-CLIP) that balances knowledge transfer and temporal modeling while avoiding backpropagation through the frozen model's parameters. Specifically, we introduce a Temporal Difference Adapter (TD-Adapter), which effectively captures local temporal differences in motion features to strengthen the model's global temporal modeling capabilities. Furthermore, we design a Side Motion Enhancement Adapter (SME-Adapter) that guides the proposed side network in efficiently learning the rich motion information in videos, thereby improving the side network's ability to capture motion. Extensive experiments are conducted on three benchmark datasets: Something-Something V1, Something-Something V2, and Kinetics-400. Experimental results show that our method achieves competitive performance in video action recognition tasks.
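
To make the phrase "local temporal differences" concrete, here is a minimal, dependency-free Python sketch of frame-to-frame feature differencing. It is not the paper's TD-Adapter: the function names, the zero-padding of the last frame, and the `scale` factor are all assumptions made for illustration, and a real adapter would operate on learned tensor features with trainable projections.

```python
# Illustrative sketch only: NOT the paper's TD-Adapter, just the basic
# idea of differencing features between adjacent frames.
from typing import List

Frame = List[float]  # one feature vector per frame (hypothetical layout)


def temporal_differences(frames: List[Frame]) -> List[Frame]:
    """Return d_t = x_{t+1} - x_t for adjacent frames; the last frame gets zeros."""
    if not frames:
        return []
    diffs = [
        [b - a for a, b in zip(cur, nxt)]
        for cur, nxt in zip(frames, frames[1:])
    ]
    # Assumption: pad with zeros so the output has one entry per input frame.
    diffs.append([0.0] * len(frames[-1]))
    return diffs


def add_motion_residual(frames: List[Frame], scale: float = 0.5) -> List[Frame]:
    """Add scaled temporal differences back onto each frame's features."""
    diffs = temporal_differences(frames)
    return [
        [x + scale * d for x, d in zip(frame, diff)]
        for frame, diff in zip(frames, diffs)
    ]


# Three frames with two feature channels each (made-up numbers).
frames = [[0.0, 1.0], [1.0, 1.0], [3.0, 0.0]]
diffs = temporal_differences(frames)
enhanced = add_motion_residual(frames, scale=0.5)
print(diffs)     # [[1.0, 0.0], [2.0, -1.0], [0.0, 0.0]]
print(enhanced)  # [[0.5, 1.0], [2.0, 0.5], [3.0, 0.0]]
```

The difference vectors are large where features change quickly between frames and zero where they are static, which is the kind of motion cue the abstract describes the adapter as capturing.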

Code Implementations: 1 repo