CV · Jul 25, 2025

Back to the Features: DINO as a Foundation for Video World Models

Federico Baldassarre, Marc Szafraniec, Basile Terver, Vasil Khalidov, Francisco Massa, Yann LeCun, Patrick Labatut, Maximilian Seitzer, Piotr Bojanowski

arXiv:2507.19468v1 · 45 citations · h-index: 37

Originality: Incremental advance

AI Analysis

This work addresses video understanding and planning for AI systems, offering a scalable approach with broad applications, though it builds incrementally on existing pre-trained encoders.

The authors address video prediction by training DINO-world, a generalist world model that predicts future frames in the latent space of DINOv2. It outperforms previous models on video prediction benchmarks such as segmentation and depth forecasting, and shows a strong understanding of intuitive physics.

We present DINO-world, a powerful generalist video world model trained to predict future frames in the latent space of DINOv2. By leveraging a pre-trained image encoder and training a future predictor on a large-scale uncurated video dataset, DINO-world learns the temporal dynamics of diverse scenes, from driving and indoor scenes to simulated environments. We show that DINO-world outperforms previous models on a variety of video prediction benchmarks, e.g. segmentation and depth forecasting, and demonstrates strong understanding of intuitive physics. Furthermore, we show that it is possible to fine-tune the predictor on observation-action trajectories. The resulting action-conditioned world model can be used for planning by simulating candidate trajectories in latent space.
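The planning idea in the last sentence of the abstract can be illustrated with a minimal random-shooting sketch. This is not the authors' implementation: `toy_predictor` is a hypothetical stand-in for an action-conditioned predictor (here the next latent is simply the current latent plus the action), the 2-dimensional latent is a placeholder for real DINOv2 features, and `plan_random_shooting` is an illustrative name. The planner samples candidate action sequences, rolls each one out in latent space, and keeps the sequence whose final latent is closest to a goal latent.

```python
import numpy as np

def toy_predictor(latent, action):
    """Hypothetical stand-in for a learned action-conditioned predictor.

    Assumption for illustration only: the next latent is latent + action.
    """
    return latent + action

def plan_random_shooting(predictor, start, goal, horizon=5,
                         num_candidates=256, seed=0):
    """Pick the sampled action sequence whose rollout ends nearest the goal."""
    rng = np.random.default_rng(seed)
    # Candidate action sequences, shape (num_candidates, horizon, latent_dim).
    candidates = rng.uniform(-1.0, 1.0,
                             size=(num_candidates, horizon, start.shape[0]))
    costs = np.empty(num_candidates)
    for i, actions in enumerate(candidates):
        latent = start.copy()
        # Simulate the trajectory entirely in latent space.
        for action in actions:
            latent = predictor(latent, action)
        # Cost: Euclidean distance between final latent and goal latent.
        costs[i] = np.linalg.norm(latent - goal)
    best = int(np.argmin(costs))
    return candidates[best], float(costs[best])

start = np.zeros(2)            # placeholder for an encoded current frame
goal = np.array([2.0, -1.0])   # placeholder for an encoded goal frame
best_actions, best_cost = plan_random_shooting(toy_predictor, start, goal)
print(best_actions.shape, best_cost)
```

Random shooting is used here only because it is the simplest sampling-based planner; the abstract does not say which planner the paper uses.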

