CV · Jun 22, 2025

CDG-MAE: Learning Correspondences from Diffusion Generated Views

arXiv:2506.18164v1
Originality: Incremental advance
AI Analysis

This work addresses the need for scalable training data in computer vision tasks such as video label propagation. It is an incremental advance that extends existing MAE frameworks with synthetic data.

The paper tackles dense correspondence learning with CDG-MAE, a self-supervised method that uses synthetic views from diffusion models to sidestep data-acquisition challenges. It reports significant improvements over image-based MAE methods and narrows the gap to video-based approaches.

Learning dense correspondences, critical for applications such as video label propagation, is hindered by tedious and unscalable manual annotation. Self-supervised methods address this by using a cross-view pretext task, often modeled with a masked autoencoder, where a masked target view is reconstructed from an anchor view. However, acquiring effective training data remains a challenge: collecting diverse video datasets is difficult and costly, while simple image crops lack the necessary pose variations. This paper introduces CDG-MAE, a novel MAE-based self-supervised method that uses diverse synthetic views generated from static images via an image-conditioned diffusion model. These generated views exhibit substantial changes in pose and perspective, providing a rich training signal that overcomes the limitations of video and crop-based anchors. We present a quantitative method to evaluate local and global consistency of generated images, discussing their use for cross-view self-supervised pretraining. Furthermore, we extend the standard single-anchor MAE setting to a multi-anchor strategy to effectively modulate the difficulty of the pretext task. CDG-MAE significantly outperforms state-of-the-art MAE methods reliant only on images and substantially narrows the performance gap to video-based approaches.
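
The cross-view pretext task described above can be illustrated with a small data-preparation sketch. It is not the paper's implementation: the function names, patch size, and mask ratio are assumptions chosen for illustration, random arrays stand in for the diffusion-generated anchor views, and the encoder and decoder are omitted entirely. The sketch only shows how a heavily masked target view and one or more anchor views might be turned into token sets, with the masked target patches serving as the reconstruction target.

```python
import numpy as np

def patchify(img, patch):
    """Split an (H, W, C) image into (N, patch*patch*C) flattened patches."""
    h, w, c = img.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch"
    gh, gw = h // patch, w // patch
    x = img.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)

def make_cross_view_sample(target, anchors, patch=16, mask_ratio=0.9, rng=None):
    """Build one training sample for a cross-view masked-reconstruction task.

    target  : (H, W, C) array, the view to be masked and reconstructed
    anchors : list of (H, W, C) arrays; one entry gives the single-anchor
              setting, several entries give a multi-anchor setting
    patch, mask_ratio : hypothetical values, not taken from the paper
    """
    rng = np.random.default_rng() if rng is None else rng
    tgt = patchify(target, patch)
    n = tgt.shape[0]
    n_masked = int(round(mask_ratio * n))

    # Randomly choose which target patches are hidden from the model.
    perm = rng.permutation(n)
    masked_idx = np.sort(perm[:n_masked])
    visible_idx = np.sort(perm[n_masked:])

    # Anchor views are left unmasked and concatenated along the token axis.
    anchor_tokens = np.concatenate([patchify(a, patch) for a in anchors], axis=0)

    return {
        "visible_target": tgt[visible_idx],
        "visible_idx": visible_idx,
        "masked_idx": masked_idx,
        "anchor_tokens": anchor_tokens,
        "reconstruction_target": tgt[masked_idx],
    }
```

In this sketch, passing more anchors gives the model more context from which to fill in the masked target patches, which is one plausible way that the number of anchors could modulate task difficulty.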

