CV · Aug 10, 2022

Leveraging Endo- and Exo-Temporal Regularization for Black-box Video Domain Adaptation

arXiv:2208.05187v3 · 9 citations · h-index: 48
Originality: Incremental advance
AI Analysis

This paper addresses privacy and portability issues in video domain adaptation for action recognition, though it is an incremental improvement over existing image-based black-box methods.

The paper tackles black-box video domain adaptation, where only a black-box source model is available. It proposes the EXTERN method with endo- and exo-temporal regularizations, which achieves state-of-the-art performance on cross-domain action recognition benchmarks and surpasses many methods that require source data access.

To enable video models to be applied seamlessly across video tasks in different environments, various Video Unsupervised Domain Adaptation (VUDA) methods have been proposed to improve the robustness and transferability of video models. Despite the improvements in model robustness, these VUDA methods require access to both source data and source model parameters for adaptation, raising serious data privacy and model portability issues. To address these concerns, this paper first formulates Black-box Video Domain Adaptation (BVDA) as a more realistic yet challenging scenario in which the source video model is provided only as a black-box predictor. While a few methods for Black-box Domain Adaptation (BDA) have been proposed in the image domain, they cannot be applied directly to the video domain, since the video modality has more complicated temporal features that are harder to align. To address BVDA, we propose a novel Endo and eXo-TEmporal Regularized Network (EXTERN), which applies mask-to-mix strategies and two video-tailored regularizations, endo-temporal regularization and exo-temporal regularization, across both clip and temporal features, while distilling knowledge from the predictions obtained from the black-box predictor. Empirical results demonstrate the state-of-the-art performance of EXTERN across various cross-domain closed-set and partial-set action recognition benchmarks, even surpassing most existing video domain adaptation methods that have access to source data.
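The black-box setting described above can be illustrated with a minimal sketch of prediction-level distillation: the target model never sees source data or source weights, only the class probabilities the source predictor returns for each target clip. This is not the authors' implementation. The KL-divergence loss is a common choice for distillation and is assumed here, and `black_box_predict`, its lookup table, and the three-class setup are hypothetical stand-ins. The mask-to-mix strategy and the endo- and exo-temporal regularizations are not shown.

```python
import math
from typing import Callable, List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one vector of class logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_probs: Sequence[float],
                      student_logits: Sequence[float],
                      eps: float = 1e-12) -> float:
    """KL(teacher || student) for a single clip.

    Assumption: KL divergence is used as the distillation objective;
    the summary above does not state which loss the paper uses.
    """
    student_probs = softmax(student_logits)
    return sum(
        t * (math.log(t + eps) - math.log(s + eps))
        for t, s in zip(teacher_probs, student_probs)
        if t > 0.0
    )


def black_box_predict(clip_id: int) -> List[float]:
    """Hypothetical black-box source predictor.

    It exposes only class probabilities, never parameters or source data.
    The table below is made-up illustrative output for two target clips.
    """
    table = {0: [0.7, 0.2, 0.1], 1: [0.1, 0.8, 0.1]}
    return table[clip_id]


def batch_distillation_loss(
    clip_ids: Sequence[int],
    student_logits_batch: Sequence[Sequence[float]],
    predictor: Callable[[int], List[float]] = black_box_predict,
) -> float:
    """Mean distillation loss over a batch of target clips."""
    losses = [
        distillation_loss(predictor(clip_id), logits)
        for clip_id, logits in zip(clip_ids, student_logits_batch)
    ]
    return sum(losses) / len(losses)
```

In a real pipeline the student logits would come from a trainable video network and this loss would be minimized by gradient descent alongside the regularization terms. The sketch only shows that the loss vanishes when the student reproduces the black-box predictions and grows as the two diverge.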

