CV · Jul 25, 2025

Back to the Features: DINO as a Foundation for Video World Models

arXiv:2507.19468v1 · 45 citations · h-index: 37
Originality: Incremental advance
AI Analysis

This work addresses video understanding and planning for AI systems, offering a scalable approach with broad applications, though it builds incrementally on existing pre-trained encoders.

The authors tackle video prediction by training DINO-world, a generalist world model that predicts future frames in DINOv2's latent space. The model outperforms previous models on video prediction benchmarks such as segmentation and depth forecasting, and demonstrates a strong understanding of intuitive physics.

We present DINO-world, a powerful generalist video world model trained to predict future frames in the latent space of DINOv2. By leveraging a pre-trained image encoder and training a future predictor on a large-scale uncurated video dataset, DINO-world learns the temporal dynamics of diverse scenes, from driving and indoor scenes to simulated environments. We show that DINO-world outperforms previous models on a variety of video prediction benchmarks, e.g. segmentation and depth forecasting, and demonstrates strong understanding of intuitive physics. Furthermore, we show that it is possible to fine-tune the predictor on observation-action trajectories. The resulting action-conditioned world model can be used for planning by simulating candidate trajectories in latent space.
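The planning idea in the last sentence of the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_predictor` is a hypothetical stand-in for the fine-tuned action-conditioned predictor (here the next latent is simply the current latent plus the action), and random shooting is one common way to pick among candidate trajectories, assumed here for illustration. The latents are short lists of floats rather than DINOv2 features.

```python
import math
import random


def toy_predictor(z, a):
    """Hypothetical stand-in for a learned action-conditioned predictor:
    the next latent is the current latent shifted by the action."""
    return [zi + ai for zi, ai in zip(z, a)]


def rollout(z0, actions, predictor):
    """Simulate a candidate trajectory entirely in latent space."""
    z = list(z0)
    for a in actions:
        z = predictor(z, a)
    return z


def plan_random_shooting(z0, z_goal, predictor, horizon=5, n_candidates=256, seed=0):
    """Sample random action sequences, roll each one out in latent space,
    and keep the sequence whose final latent lands closest to the goal."""
    rng = random.Random(seed)
    best_actions, best_cost = None, math.inf
    for _ in range(n_candidates):
        # One candidate: `horizon` actions, each with one component per latent dim.
        actions = [[rng.uniform(-1.0, 1.0) for _ in z0] for _ in range(horizon)]
        cost = math.dist(rollout(z0, actions, predictor), z_goal)
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost


z0 = [0.0, 0.0, 0.0, 0.0]        # latent of the current observation (toy values)
z_goal = [1.0, -1.0, 0.5, 0.0]   # latent of the goal observation (toy values)
best_actions, best_cost = plan_random_shooting(z0, z_goal, toy_predictor)
```

In a real setting the predictor would be the trained model and the cost would compare predicted and goal features; the loop structure (sample, simulate, score, select) is the part this sketch is meant to show.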

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
