CV · AI · LG · Jul 24, 2023

MC-JEPA: A Joint-Embedding Predictive Architecture for Self-Supervised Learning of Motion and Content Features

arXiv:2307.12698v1 · 46 citations · h-index: 137
Originality: Incremental advance
AI Analysis

This work addresses the need for visual representations that incorporate motion information, which is important for video analysis tasks. The advance is incremental because it builds on existing joint-embedding architectures.

The paper approaches self-supervised learning by unifying optical flow estimation and content feature learning within a shared encoder. It reports performance on par with existing approaches on unsupervised optical flow benchmarks, and with common self-supervised learning approaches on downstream tasks such as semantic segmentation.

Self-supervised learning of visual representations has focused on learning content features, which do not capture object motion or location and instead serve to identify and differentiate objects in images and videos. On the other hand, optical flow estimation is a task that does not involve understanding the content of the images on which it is estimated. We unify the two approaches and introduce MC-JEPA, a joint-embedding predictive architecture and self-supervised learning approach to jointly learn optical flow and content features within a shared encoder, demonstrating that the two associated objectives (the optical flow estimation objective and the self-supervised learning objective) benefit from each other and thus learn content features that incorporate motion information. The proposed approach achieves performance on par with existing unsupervised optical flow benchmarks, as well as with common self-supervised learning approaches on downstream tasks such as semantic segmentation of images and videos.
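As a rough illustration of what "two objectives within a shared encoder" can mean, the toy sketch below sums a motion term and a content term that both pass through the same encoder. It is not the paper's method: the one-parameter `encode`, the circular integer `warp`, the pooled content term, and the names `joint_loss`, `motion_weight`, and `content_weight` are all assumptions made for illustration, and the signals are 1-D lists rather than images.

```python
# Toy sketch: all names and both loss terms are hypothetical stand-ins,
# not the losses or architecture used in the MC-JEPA paper.

def encode(signal, scale):
    """Shared 'encoder': a single scale applied at every position."""
    return [scale * x for x in signal]

def warp(features, shift):
    """Move each feature `shift` positions to the right (circular)."""
    n = len(features)
    return [features[(i - shift) % n] for i in range(n)]

def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def joint_loss(frame_t, frame_t1, view_a, view_b, scale, shift,
               motion_weight=1.0, content_weight=1.0):
    """Weighted sum of a motion term and a content term from one encoder."""
    # Motion term: warp frame-t features by the candidate flow (`shift`)
    # and compare them with the features of the next frame.
    motion = mse(warp(encode(frame_t, scale), shift), encode(frame_t1, scale))
    # Content term: globally pooled features of two views should agree.
    pooled_a = sum(encode(view_a, scale)) / len(view_a)
    pooled_b = sum(encode(view_b, scale)) / len(view_b)
    content = (pooled_a - pooled_b) ** 2
    return motion_weight * motion + content_weight * content
```

Because both terms depend on the same `scale`, a gradient step on the sum would update the encoder from both signals at once, which is the sense in which the encoder is shared here. A real content objective would also need some mechanism to rule out trivial solutions (in this toy, `scale = 0` makes both terms vanish).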

