CV · Aug 26, 2021

Learning Cross-modal Contrastive Features for Video Domain Adaptation

arXiv:2108.11974v1 · 89 citations
Originality: Incremental advance
AI Analysis

This paper addresses domain adaptation for video tasks such as action recognition. It is an incremental advance: it extends existing adversarial alignment methods by incorporating multi-modal information.

The paper tackles video domain adaptation by proposing a unified framework that uses contrastive learning to regularize both cross-modal and cross-domain feature representations, and it reports improvements over state-of-the-art methods on the UCF, HMDB, and EPIC-Kitchens benchmarks.

Learning transferable and domain adaptive feature representations from videos is important for video-relevant tasks such as action recognition. Existing video domain adaptation methods mainly rely on adversarial feature alignment, which has been derived from the RGB image space. However, video data is usually associated with multi-modal information, e.g., RGB and optical flow, and thus it remains a challenge to design a better method that considers the cross-modal inputs under the cross-domain adaptation setting. To this end, we propose a unified framework for video domain adaptation, which simultaneously regularizes cross-modal and cross-domain feature representations. Specifically, we treat each modality in a domain as a view and leverage the contrastive learning technique with properly designed sampling strategies. As a result, our objectives regularize feature spaces, which originally lack the connection across modalities or have less alignment across domains. We conduct experiments on domain adaptive action recognition benchmark datasets, i.e., UCF, HMDB, and EPIC-Kitchens, and demonstrate the effectiveness of our components against state-of-the-art algorithms.
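To make the "each modality is a view" idea concrete, here is a minimal sketch of a cross-modal contrastive term, assuming a generic InfoNCE loss in which the RGB and optical-flow features of the same clip form a positive pair and the flow features of the other clips in the batch serve as negatives. The abstract does not specify the paper's actual objective or its sampling strategies, so the function names, the temperature value, and the choice of negatives are illustrative assumptions, not the authors' method. Plain Python is used only to keep the example dependency-free.

```python
import math


def l2_normalize(v):
    """Scale a feature vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    return sum(a * b for a, b in zip(l2_normalize(u), l2_normalize(v)))


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    positives[i] is the positive for anchors[i]; every other row of
    `positives` is treated as a negative (an assumption of this sketch).
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


# Hypothetical 2-D features for two clips, one vector per modality.
rgb = [[1.0, 0.0], [0.0, 1.0]]
flow_aligned = [[0.9, 0.1], [0.1, 0.9]]   # flow agrees with RGB per clip
flow_swapped = [[0.1, 0.9], [0.9, 0.1]]   # flow matches the wrong clip

loss_aligned = info_nce(rgb, flow_aligned)
loss_swapped = info_nce(rgb, flow_swapped)
print(loss_aligned < loss_swapped)
```

Minimizing a loss of this shape pulls the two modalities of a clip together in feature space, which is one way to supply the cross-modal connection the abstract says is originally missing. A cross-domain term would need a rule for choosing positives across source and target videos, which the abstract leaves unspecified.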

