Return of Frustratingly Easy Unsupervised Video Domain Adaptation
For researchers in video domain adaptation, this work provides a simple yet effective method that outperforms existing approaches, though it is incremental in nature.
The paper tackles unsupervised video domain adaptation (UVDA) and proposes MetaTrans, a simple method with only two loss terms that separates spatial and temporal divergence handling. It achieves substantial absolute adaptation improvements and significantly outperforms state-of-the-art UVDA baselines on cross-domain action recognition tasks.
Unsupervised video domain adaptation (UVDA) is a practical but under-explored problem. In this paper, we propose a frustratingly easy UVDA method called MetaTrans. MetaTrans adopts a concise learning objective that contains only two fundamental loss terms. Despite this simple objective, MetaTrans embodies an advanced UVDA idea: handling the spatial and temporal divergence of cross-domain videos separately, through a subtle model architecture design. By implementing a temporal-static subtraction module, MetaTrans effectively removes spatial and temporal divergence. Extensive empirical evaluations, particularly on various cross-domain action recognition tasks, show a substantial absolute improvement in adaptation performance and a significantly larger relative performance gain than state-of-the-art UVDA baselines.
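The abstract does not say how the temporal-static subtraction module is built. As a rough illustration of the general idea of separating a static (appearance) part from a temporal (motion) part, the sketch below assumes the static part is the time-average of per-frame features and the temporal part is the per-frame residual after subtracting it. The function name `split_static_temporal`, the list-of-lists feature layout, and the averaging rule are all hypothetical choices made for this example and are not taken from the paper.

```python
from typing import List, Tuple

# Hypothetical representation: one feature vector (list of floats) per frame.
Frame = List[float]


def split_static_temporal(frames: List[Frame]) -> Tuple[Frame, List[Frame]]:
    """Split per-frame features into a time-averaged part and per-frame residuals.

    Assumption for illustration only: "static" means the mean over time,
    and "temporal" means each frame minus that mean.
    """
    if not frames:
        raise ValueError("need at least one frame")
    num_frames = len(frames)
    dim = len(frames[0])
    # Static component: mean of each feature dimension over time.
    static = [sum(f[d] for f in frames) / num_frames for d in range(dim)]
    # Temporal component: what remains in each frame after removing the static part.
    temporal = [[f[d] - static[d] for d in range(dim)] for f in frames]
    return static, temporal


# Toy clip: the first feature changes over time, the second stays constant.
clip = [[1.0, 2.0], [3.0, 2.0], [5.0, 2.0]]
static, temporal = split_static_temporal(clip)
```

In this toy clip the constant second feature ends up entirely in the static part, while the temporal residuals carry only the variation of the first feature. A split of this kind is what would allow the two kinds of divergence to be handled by separate components.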