CV · Jan 27, 2020

Multi-Modal Domain Adaptation for Fine-Grained Action Recognition

arXiv:2001.09691v2 · 238 citations
AI Analysis

This paper addresses domain adaptation for fine-grained action recognition in video. It is an incremental improvement over existing methods, achieved by incorporating multi-modal self-supervision.

The paper tackles domain shift in fine-grained action recognition by proposing multi-modal self-supervised alignment for unsupervised domain adaptation (UDA). Self-supervision alone improves performance over source-only training by 2.4% on average; combined with adversarial training, the approach outperforms other UDA methods by 3%.

Fine-grained action recognition datasets exhibit environmental bias, where multiple video sequences are captured from a limited number of environments. Training a model in one environment and deploying in another results in a drop in performance due to an unavoidable domain shift. Unsupervised Domain Adaptation (UDA) approaches have frequently utilised adversarial training between the source and target domains. However, these approaches have not explored the multi-modal nature of video within each domain. In this work we exploit the correspondence of modalities as a self-supervised alignment approach for UDA in addition to adversarial alignment. We test our approach on three kitchens from our large-scale dataset, EPIC-Kitchens, using two modalities commonly employed for action recognition: RGB and Optical Flow. We show that multi-modal self-supervision alone improves the performance over source-only training by 2.4% on average. We then combine adversarial training with multi-modal self-supervision, showing that our approach outperforms other UDA methods by 3%.
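The abstract's "correspondence of modalities" idea can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 50/50 sampling of matched and mismatched pairs, and the use of a binary cross-entropy objective are all assumptions made for illustration. The point it shows is that the correspondence label comes from the pairing itself, so the task needs no action labels and could in principle be trained on clips from both the source and the target domain.

```python
import math
import random


def make_correspondence_pairs(rgb_feats, flow_feats, rng):
    """Build (rgb, flow, label) triples for a correspondence task.

    rgb_feats[i] and flow_feats[i] are assumed to come from the same clip.
    label 1 = RGB and Flow from the same clip, 0 = from different clips.
    The 50/50 sampling ratio is an illustrative assumption.
    """
    n = len(rgb_feats)
    if n != len(flow_feats) or n < 2:
        raise ValueError("need two equally sized lists with at least 2 clips")
    pairs = []
    for i in range(n):
        if rng.random() < 0.5:
            # Positive pair: both modalities from clip i.
            pairs.append((rgb_feats[i], flow_feats[i], 1))
        else:
            # Negative pair: Flow taken from a different clip j != i.
            j = rng.choice([k for k in range(n) if k != i])
            pairs.append((rgb_feats[i], flow_feats[j], 0))
    return pairs


def binary_cross_entropy(prob, label, eps=1e-7):
    """Cross-entropy for one predicted probability and a 0/1 label."""
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def correspondence_loss(pairs, predict):
    """Mean loss of a hypothetical classifier predict(rgb, flow) -> P(match)."""
    total = sum(binary_cross_entropy(predict(r, f), y) for r, f, y in pairs)
    return total / len(pairs)


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy stand-ins for per-clip features: (modality, clip_id) tuples.
    rgb = [("rgb", i) for i in range(6)]
    flow = [("flow", i) for i in range(6)]
    pairs = make_correspondence_pairs(rgb, flow, rng)
    always_unsure = lambda r, f: 0.5
    print(correspondence_loss(pairs, always_unsure))
```

In a real model, `predict` would be a small network head over the RGB and Flow feature streams, and this loss would be added to the supervised classification loss on labelled source clips.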

Code Implementations · 1 repo
