CV | AI | LG | Apr 15, 2021

Action Segmentation with Mixed Temporal Domain Adaptation

arXiv:2104.07461v2 | 39 citations
Originality: Highly original
AI Analysis

This paper addresses the time-consuming manual annotation that action segmentation in videos requires, offering a domain adaptation solution that improves performance on benchmark datasets.

The paper tackles the problem of action segmentation by exploiting unlabeled videos through domain adaptation, proposing Mixed Temporal Domain Adaptation (MTDA) to align frame- and video-level features, and achieves state-of-the-art results with gains such as 6.4% on F1@50 and 6.8% on edit score for GTEA.
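The summary above reports the edit score without defining it. A minimal sketch follows, assuming the definition commonly used in the action segmentation literature: collapse each per-frame label sequence into its ordered list of segments, then compute a Levenshtein distance between the two segment lists, normalized by the longer one. The function names are illustrative and are not taken from the paper.

```python
def to_segments(frame_labels):
    """Collapse per-frame labels into the ordered list of segment labels."""
    segments = []
    for label in frame_labels:
        if not segments or segments[-1] != label:
            segments.append(label)
    return segments


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_score(predicted, ground_truth):
    """Segmental edit score in [0, 100]; higher is better (assumed definition)."""
    p, g = to_segments(predicted), to_segments(ground_truth)
    longest = max(len(p), len(g))
    if longest == 0:
        return 100.0
    return 100.0 * (1.0 - levenshtein(p, g) / longest)
```

Because the score compares segment order rather than frame counts, a prediction whose boundaries are slightly shifted is not penalized, while one that over-segments (many spurious short segments) is.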

The main progress for action segmentation comes from densely-annotated data for fully-supervised learning. Since manual annotation for frame-level actions is time-consuming and challenging, we propose to exploit auxiliary unlabeled videos, which are much easier to obtain, by shaping this problem as a domain adaptation (DA) problem. Although various DA techniques have been proposed in recent years, most of them have been developed only for the spatial direction. Therefore, we propose Mixed Temporal Domain Adaptation (MTDA) to jointly align frame- and video-level embedded feature spaces across domains, and further integrate with the domain attention mechanism to focus on aligning the frame-level features with higher domain discrepancy, leading to more effective domain adaptation. Finally, we evaluate our proposed methods on three challenging datasets (GTEA, 50Salads, and Breakfast), and validate that MTDA outperforms the current state-of-the-art methods on all three datasets by large margins (e.g. 6.4% gain on F1@50 and 6.8% gain on the edit score for GTEA).
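The abstract describes the domain attention idea only in words, so the following is a hedged sketch rather than the authors' formulation. It assumes each frame comes with a domain classifier's probability of belonging to the source domain, and that a confident prediction (low entropy) signals high domain discrepancy. Under that assumption, frames are weighted by one minus the binary entropy before being pooled into a video-level feature. All names here are hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) domain prediction; 0 when certain."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def domain_attention_weights(domain_probs):
    """Assumed weighting: confident domain predictions get weights near 1."""
    return [1.0 - binary_entropy(p) for p in domain_probs]


def attentive_pool(frame_features, domain_probs):
    """Weighted average of frame features into one video-level feature."""
    weights = domain_attention_weights(domain_probs)
    total = sum(weights)
    if total == 0.0:
        # Every frame is maximally ambiguous: fall back to a plain mean.
        weights = [1.0] * len(frame_features)
        total = float(len(frame_features))
    dim = len(frame_features[0])
    return [
        sum(w * f[d] for w, f in zip(weights, frame_features)) / total
        for d in range(dim)
    ]
```

In this sketch a frame whose domain is ambiguous (probability 0.5) contributes nothing to the pooled feature, while a frame the classifier separates easily dominates it. In a full model the pooled feature would then feed a video-level alignment objective.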

