CV · Apr 10

Learnable Motion-Focused Tokenization for Effective and Efficient Video Unsupervised Domain Adaptation

arXiv:2604.09955 · 34.5 · h-index: 18
AI Analysis

For video action recognition, this work addresses the inefficiency and domain shift caused by static backgrounds in unsupervised domain adaptation, offering a practical solution for real-world deployment.

The paper proposes Learnable Motion-Focused Tokenization (LMFT) for video unsupervised domain adaptation, which discards low-motion background tokens to reduce domain shift and computational cost. LMFT achieves state-of-the-art performance on 21 domain adaptation settings while significantly reducing computational overhead.

Video Unsupervised Domain Adaptation (VUDA) poses a significant challenge in action recognition, requiring the adaptation of a model from a labeled source domain to an unlabeled target domain. Despite recent advances, existing VUDA methods often fall short of fully supervised performance, a key reason being the prevalence of static and uninformative backgrounds that exacerbate domain shifts. Additionally, prior approaches largely overlook computational efficiency, limiting real-world adoption. To address these issues, we propose Learnable Motion-Focused Tokenization (LMFT) for VUDA. LMFT tokenizes video frames into patch tokens and learns to discard low-motion, redundant tokens, primarily corresponding to background regions, while retaining motion-rich, action-relevant tokens for adaptation. Extensive experiments on three standard VUDA benchmarks across 21 domain adaptation settings show that our VUDA framework with LMFT achieves state-of-the-art performance while significantly reducing computational overhead. LMFT thus enables VUDA that is both effective and computationally efficient.
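To make the token-dropping idea concrete, here is a minimal sketch in plain Python. It is not the paper's method: LMFT learns which tokens to discard, whereas this sketch substitutes a fixed heuristic (mean absolute frame difference per patch) and keeps the top fraction of patches. The names `patch_motion_scores`, `keep_motion_tokens`, `keep_ratio`, and the toy clip are all hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Frame = List[List[float]]  # one grayscale frame, H rows by W columns


def patch_motion_scores(prev: Frame, curr: Frame, patch: int) -> List[Tuple[float, int, int]]:
    """Return (score, patch_row, patch_col) for each non-overlapping patch.

    The score is the mean absolute difference between consecutive frames,
    a hand-written stand-in for a learned motion score (assumption).
    """
    height, width = len(curr), len(curr[0])
    scores = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            diff = 0.0
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    diff += abs(curr[y][x] - prev[y][x])
            scores.append((diff / (patch * patch), top // patch, left // patch))
    return scores


def keep_motion_tokens(video: List[Frame], patch: int, keep_ratio: float) -> List[List[Tuple[int, int]]]:
    """For each frame after the first, list the patch indices with the most motion."""
    kept = []
    for prev, curr in zip(video, video[1:]):
        scores = patch_motion_scores(prev, curr, patch)
        k = max(1, round(keep_ratio * len(scores)))
        # Highest score first; ties are broken by patch position for determinism.
        ranked = sorted(scores, key=lambda s: (-s[0], s[1], s[2]))
        kept.append(sorted((row, col) for _, row, col in ranked[:k]))
    return kept


def make_frame(pos: int) -> Frame:
    """Toy 4x4 frame: flat background with one bright pixel on the top row."""
    frame = [[0.1] * 4 for _ in range(4)]
    frame[0][pos] = 1.0
    return frame


# The bright pixel moves along the top row; the bottom half never changes.
video = [make_frame(0), make_frame(2), make_frame(3)]
kept = keep_motion_tokens(video, patch=2, keep_ratio=0.5)
print(kept)
```

In this toy clip the two bottom patches never change, so they are never selected over a patch that contains the moving pixel. A learned selector, as described in the abstract, would replace the fixed frame-difference score with a trainable one.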

