MORPHOS: Autoregressive 4D Generation with Temporal Structured Latents
This work addresses the challenge of generating temporally consistent dynamic 3D assets from videos across diverse representations, benefiting computer graphics and vision applications.
MORPHOS introduces an autoregressive framework for generating dynamic 3D assets from videos, using Temporal Structured Latents to handle multiple representations and maintain temporal consistency. It achieves state-of-the-art appearance and competitive geometry results across benchmarks.
We present MORPHOS, a novel autoregressive framework that generates dynamic 3D assets from videos across diverse representations, including meshes, 3D Gaussians, and radiance fields. Existing methods are typically limited to a single representation, struggle to model topological changes, or fail to maintain temporal consistency over long videos. To address these limitations, we introduce Temporal Structured Latents (T-SLAT), a unified 4D representation that jointly encodes geometry and appearance along the temporal dimension. Leveraging T-SLAT, MORPHOS generates dynamic 3D assets autoregressively via causal attention, conditioning each frame on its preceding history to ensure temporal consistency while handling evolving topologies. We also propose a temporal-structural augmentation to mitigate error accumulation in autoregressive generation. MORPHOS achieves state-of-the-art performance in appearance and competitive results in geometry across multiple benchmarks, demonstrating superior generalization across various representations and robustness in long-horizon generation.
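As an illustration of the frame-level causal conditioning described above, the following is a minimal NumPy sketch, not the MORPHOS implementation. It assumes each frame contributes a fixed number of latent tokens (the abstract does not specify this and the actual T-SLAT layout may differ), and the names `frame_causal_mask` and `masked_attention` are hypothetical. Tokens attend to every token in their own frame and in earlier frames, but never to later frames.

```python
import numpy as np


def frame_causal_mask(num_frames: int, tokens_per_frame: int) -> np.ndarray:
    """Boolean mask: entry [i, j] is True when token i may attend to token j.

    Assumption for illustration: every frame has the same token count.
    """
    # Frame index of each token, e.g. [0, 0, 1, 1, 2, 2] for 3 frames x 2 tokens.
    frame_ids = np.repeat(np.arange(num_frames), tokens_per_frame)
    # A query token sees keys from its own frame and from any earlier frame.
    return frame_ids[:, None] >= frame_ids[None, :]


def masked_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray,
                     mask: np.ndarray) -> np.ndarray:
    """Single-head scaled dot-product attention restricted by a boolean mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Disallowed positions get -inf so they receive zero weight after softmax.
    scores = np.where(mask, scores, -np.inf)
    # Subtract the row maximum for numerical stability; each row has at least
    # one allowed entry (the token itself), so the maximum is finite.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    latents = rng.normal(size=(6, 4))  # 3 frames x 2 tokens, feature dim 4
    mask = frame_causal_mask(num_frames=3, tokens_per_frame=2)
    out = masked_attention(latents, latents, latents, mask)
    print(out.shape)
```

With this mask, changing the latents of a later frame leaves the attention output for earlier frames unchanged, which is the property that allows frames to be generated one after another from their history.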