CV · Apr 30

Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

arXiv:2604.28185 · 81.63 citations
AI Analysis

This is a position/roadmap paper that provides a capability-centered framework for researchers and practitioners to understand, evaluate, and advance visual generation systems toward more intelligent and world-aware capabilities.

The paper argues that visual generation models should evolve from appearance synthesis to intelligent generation grounded in structure, dynamics, and causality. It proposes a five-level taxonomy (Atomic to World-Modeling Generation) and analyzes the key technical drivers. It also shows that current evaluations often overestimate progress by focusing on perceptual quality while missing structural, temporal, and causal failures.

Recent visual generation models have made major progress in photorealism, typography, instruction following, and interactive editing, yet they still struggle with spatial reasoning, persistent state, long-horizon consistency, and causal understanding. We argue that the field should move beyond appearance synthesis toward intelligent visual generation: plausible visuals grounded in structure, dynamics, domain knowledge, and causal relations. To frame this shift, we introduce a five-level taxonomy: Atomic Generation, Conditional Generation, In-Context Generation, Agentic Generation, and World-Modeling Generation, progressing from passive renderers to interactive, agentic, world-aware generators. We analyze key technical drivers, including flow matching, unified understanding-and-generation models, improved visual representations, post-training, reward modeling, data curation, synthetic data distillation, and sampling acceleration. We further show that current evaluations often overestimate progress by emphasizing perceptual quality while missing structural, temporal, and causal failures. By combining benchmark review, in-the-wild stress tests, and expert-constrained case studies, this roadmap offers a capability-centered lens for understanding, evaluating, and advancing the next generation of intelligent visual generation systems.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
