CVOct 28, 2025

Uniform Discrete Diffusion with Metric Path for Video Generation

arXiv:2510.24717v113 citationsh-index: 9Has Code
Originality Incremental advance
AI Analysis

It addresses the problem of scalable video generation for AI researchers and practitioners, offering a discrete approach that reduces error accumulation and inconsistency, though it is incremental as it builds on existing diffusion methods.

The paper tackles the lag of discrete generative modeling in video generation by proposing URSA, a framework that formulates video generation as iterative refinement of discrete tokens, achieving performance comparable to state-of-the-art continuous diffusion methods on benchmarks.

Continuous-space video generation has advanced rapidly, while discrete approaches lag behind due to error accumulation and long-context inconsistency. In this work, we revisit discrete generative modeling and present Uniform discRete diffuSion with metric pAth (URSA), a simple yet powerful framework that bridges the gap with continuous approaches for the scalable video generation. At its core, URSA formulates the video generation task as an iterative global refinement of discrete spatiotemporal tokens. It integrates two key designs: a Linearized Metric Path and a Resolution-dependent Timestep Shifting mechanism. These designs enable URSA to scale efficiently to high-resolution image synthesis and long-duration video generation, while requiring significantly fewer inference steps. Additionally, we introduce an asynchronous temporal fine-tuning strategy that unifies versatile tasks within a single model, including interpolation and image-to-video generation. Extensive experiments on challenging video and image generation benchmarks demonstrate that URSA consistently outperforms existing discrete methods and achieves performance comparable to state-of-the-art continuous diffusion methods. Code and models are available at https://github.com/baaivision/URSA

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes