CV AI Nov 17, 2021

Learning to Align Sequential Actions in the Wild

arXiv:2111.09301v1 34 citations
Originality Incremental advance
AI Analysis

This paper addresses the challenge of robust action alignment in videos, with applications in video analysis and understanding. It represents an incremental improvement over existing methods.

The paper tackles the problem of aligning sequential actions in videos under real-world conditions with diverse temporal variations, such as non-monotonic action orders and background frames, and demonstrates that its approach consistently outperforms state-of-the-art methods on four benchmark datasets.

State-of-the-art methods for self-supervised sequential action alignment rely on deep networks that find correspondences across videos in time. They either learn a frame-to-frame mapping across sequences, which does not leverage temporal information, or assume monotonic alignment between each video pair, which ignores variations in the order of actions. As such, these methods are not able to deal with common real-world scenarios that involve background frames or videos that contain non-monotonic sequences of actions. In this paper, we propose an approach to align sequential actions in the wild that involve diverse temporal variations. To this end, we enforce temporal priors on the optimal transport matrix, which leverages temporal consistency while allowing for variations in the order of actions. Our model accounts for both monotonic and non-monotonic sequences and handles background frames that should not be aligned. We demonstrate that our approach consistently outperforms the state-of-the-art in self-supervised sequential action representation learning on four different benchmark datasets.
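To make the idea of a temporal prior on the optimal transport matrix concrete, here is a minimal sketch in plain Python. It is not the authors' formulation: it assumes a simple Gaussian prior centered on the diagonal, uniform frame marginals, and a standard Sinkhorn iteration. It models neither the paper's non-monotonic prior nor its handling of background frames. The names `temporal_prior` and `sinkhorn_with_prior`, and the parameters `sigma`, `eps`, and `n_iters`, are hypothetical.

```python
import math


def temporal_prior(n, m, sigma=0.5):
    """Gaussian prior that peaks where frames share a relative position in time.

    Assumption for illustration only: the paper's actual priors may differ.
    """
    prior = []
    for i in range(n):
        row = []
        for j in range(m):
            # Gap between the normalized temporal positions of frame i and frame j.
            d = abs(i / n - j / m)
            row.append(math.exp(-(d * d) / (2.0 * sigma * sigma)))
        prior.append(row)
    return prior


def sinkhorn_with_prior(cost, prior, eps=0.1, n_iters=200):
    """Entropic optimal transport regularized toward `prior`.

    Solves min <T, cost> + eps * KL(T || prior) with uniform marginals,
    which gives the kernel K = prior * exp(-cost / eps).
    """
    n, m = len(cost), len(cost[0])
    K = [[prior[i][j] * math.exp(-cost[i][j] / eps) for j in range(m)]
         for i in range(n)]
    a = [1.0 / n] * n  # uniform mass over frames of video 1
    b = [1.0 / m] * m  # uniform mass over frames of video 2
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


if __name__ == "__main__":
    # Toy 1-D frame embeddings for two short videos (made-up values).
    x = [0.0, 0.5, 1.0]
    y = [0.0, 0.3, 0.6, 1.0]
    cost = [[(xi - yj) ** 2 for yj in y] for xi in x]
    T = sinkhorn_with_prior(cost, temporal_prior(len(x), len(y)))
    for row in T:
        print(["%.3f" % t for t in row])
```

Because the prior is soft, matches away from the diagonal are down-weighted rather than forbidden, so the transport plan can still reflect some reordering when the embedding cost strongly supports it. A larger `sigma` loosens the temporal constraint, and a smaller one pushes the plan toward a strictly diagonal alignment.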

