CV | LG | Jul 28, 2024

Ego-VPA: Egocentric Video Understanding with Parameter-efficient Adaptation

arXiv:2407.19520v2 | h-index: 11
Originality: Incremental advance
AI Analysis

This work addresses the challenge of efficient adaptation for egocentric video understanding tasks, offering a lightweight solution that reduces computational costs while maintaining performance.

The paper addresses the problem of adapting large video foundation models to new domains. It proposes Ego-VPA, a parameter-efficient method whose basis prompts are shared across frames and modalities to model context fusion and cross-modal transfer, and it reports performance comparable to full fine-tuning with only 0.84% learnable parameters.

Video understanding typically requires fine-tuning the large backbone when adapting to new domains. In this paper, we leverage the egocentric video foundation models (Ego-VFMs) based on video-language pre-training and propose a parameter-efficient adaptation for egocentric video tasks, namely Ego-VPA. It employs a local sparse approximation for each video frame/text feature using the basis prompts, and the selected basis prompts are used to synthesize video/text prompts. Since the basis prompts are shared across frames and modalities, it models context fusion and cross-modal transfer in an efficient fashion. Experiments show that Ego-VPA excels in lightweight adaptation (with only 0.84% learnable parameters), largely improving over baselines and reaching the performance of full fine-tuning.
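
The selection-and-synthesis idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how basis prompts are selected or combined. The sketch assumes each basis prompt has a matching key vector, that selection is top-k cosine similarity between a feature and the keys, and that the selected prompts are simply averaged. All names (`select_basis`, `synthesize_prompt`, `basis_keys`, `basis_prompts`) and sizes are hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_basis(feature, keys, k):
    """Indices of the k keys most similar to the feature (assumed selection rule)."""
    scores = [cosine(feature, key) for key in keys]
    return sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]


def synthesize_prompt(feature, keys, prompts, k):
    """Average the k selected basis prompts (assumed combination rule)."""
    idx = select_basis(feature, keys, k)
    dim = len(prompts[0])
    prompt = [sum(prompts[i][d] for i in idx) / k for d in range(dim)]
    return prompt, idx


def rand_vec(dim):
    return [random.gauss(0.0, 1.0) for _ in range(dim)]


random.seed(0)
DIM, NUM_BASIS, K = 8, 6, 2  # toy sizes, not taken from the paper

# One pool of basis keys and prompts, shared by every frame and by the text.
basis_keys = [rand_vec(DIM) for _ in range(NUM_BASIS)]
basis_prompts = [rand_vec(DIM) for _ in range(NUM_BASIS)]

# Stand-ins for per-frame video features and a text feature.
frame_features = [rand_vec(DIM) for _ in range(4)]
text_feature = rand_vec(DIM)

frame_prompts = [
    synthesize_prompt(f, basis_keys, basis_prompts, K)[0] for f in frame_features
]
text_prompt, text_idx = synthesize_prompt(text_feature, basis_keys, basis_prompts, K)
```

Because both modalities draw from the same pool, only `basis_keys` and `basis_prompts` would need to be trained in this toy setup, which is the intuition behind the small learnable-parameter count; the actual parameterization in the paper may differ.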

