CV AI · Dec 2, 2025

Video Diffusion Models Excel at Tracking Similar-Looking Objects Without Supervision

Chenshuang Zhang, Kang Zhang, Joon Son Chung, In So Kweon, Junmo Kim, Chengzhi Mao

arXiv:2512.02339v1 · 3.6 · h-index: 12

Originality: Highly original

AI Analysis

This work addresses a critical failure point of self-supervised tracking in computer vision, the confusion of visually similar objects, and improves scalability without labeled data.

The paper tackles the challenge of tracking visually similar objects without supervision by leveraging pre-trained video diffusion models, achieving up to a 6-point improvement over recent self-supervised methods on established benchmarks and on the authors' newly introduced tests.

Distinguishing visually similar objects by their motion remains a critical challenge in computer vision. Although supervised trackers show promise, contemporary self-supervised trackers struggle when visual cues become ambiguous, limiting their scalability and generalization without extensive labeled data. We find that pre-trained video diffusion models inherently learn motion representations suitable for tracking without task-specific training. This ability arises because their denoising process isolates motion in early, high-noise stages, distinct from later appearance refinement. Capitalizing on this discovery, our self-supervised tracker significantly improves performance in distinguishing visually similar objects, an underexplored failure point for existing methods. Our method achieves up to a 6-point improvement over recent self-supervised approaches on established benchmarks and our newly introduced tests focused on tracking visually similar items. Visualizations confirm that these diffusion-derived motion representations enable robust tracking of even identical objects across challenging viewpoint changes and deformations.
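
The abstract does not spell out how the diffusion features are turned into tracks. As an illustration only, the sketch below shows the affinity-based label propagation that self-supervised trackers commonly use at evaluation time; it is an assumption here, not the authors' confirmed pipeline. The function name `propagate_labels` and the parameters `temperature` and `top_k` are hypothetical, and the feature arrays stand in for per-location activations that would, under this assumption, be read from a pre-trained video diffusion model at an early, high-noise denoising step.

```python
import numpy as np


def propagate_labels(feat_prev, feat_next, labels_prev, temperature=0.1, top_k=5):
    """Carry soft object masks from one frame to the next via feature affinity.

    feat_prev, feat_next: (N, C) arrays of per-location features (assumed to
        come from a video diffusion model at a high-noise step).
    labels_prev: (N, K) array of soft masks for K objects in the previous frame.
    Returns an (N, K) array of soft masks for the next frame.
    """
    # L2-normalise so that the dot product is cosine similarity.
    a = feat_prev / (np.linalg.norm(feat_prev, axis=1, keepdims=True) + 1e-8)
    b = feat_next / (np.linalg.norm(feat_next, axis=1, keepdims=True) + 1e-8)
    sim = b @ a.T  # (N_next, N_prev)

    # For each next-frame location keep only its top_k most similar sources.
    k = min(top_k, sim.shape[1])
    idx = np.argpartition(-sim, k - 1, axis=1)[:, :k]
    top = np.take_along_axis(sim, idx, axis=1) / temperature

    # Numerically stable softmax over the retained neighbours.
    top = top - top.max(axis=1, keepdims=True)
    weights = np.exp(top)
    weights /= weights.sum(axis=1, keepdims=True)

    # Weighted average of the neighbours' labels.
    return np.einsum("nk,nkc->nc", weights, labels_prev[idx])


if __name__ == "__main__":
    # Toy stand-in features: two locations, each belonging to a different object.
    prev = np.array([[1.0, 0.0], [0.0, 1.0]])
    nxt = np.array([[0.9, 0.1], [0.1, 0.9]])
    masks = propagate_labels(prev, nxt, np.eye(2), top_k=2)
    print(masks.argmax(axis=1))
```

In this kind of scheme, two objects that look identical can only be separated if the features differ between them, which is why motion-sensitive features from the high-noise stages would matter more than appearance features.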
