CV · Jun 2

Where Do We (Not) Need Temporal Context in Low-Resource Video Task Adaptation?

arXiv:2606.03837 · 42.2 · h-index: 10

Predicted impact: top 70% in CV · last 90 days
Originality: Incremental advance

AI Analysis

For researchers adapting foundation models to video tasks with limited data, this work provides practical guidelines on where to place temporal context for optimal performance.

This paper systematically studies parameter-efficient fine-tuning (PEFT) and probing for video understanding, focusing on how temporal context should be distributed across backbone, PEFT, and probe. It finds that allocating temporal context to the probe is most effective in low-resource settings, achieving up to 5% improvement over standard methods.

Parameter-efficient fine-tuning (PEFT) and probing enable adaptation of foundation models using only a small number of trainable parameters, making them attractive for video understanding, where annotation and computation are expensive. However, video PEFT has focused on adapting image-pretrained models, while standard PEFT methods can also be applied to video representations. These settings are rarely compared, and both confine temporal reasoning to a single component of the model, leaving open how temporal context should be distributed across backbone, PEFT, and probe. In this work, we provide a systematic study of model adaptation strategies for video understanding. We evaluate methods across appearance-focused, motion-focused, and spatially dense settings, with a particular focus on scenarios with limited data, where parameter efficiency is most beneficial. Our results provide new insights into PEFT and probing across settings and demonstrate the importance of temporal context allocation for effective video adaptation.
