CV · Apr 25, 2025

STP4D: Spatio-Temporal-Prompt Consistent Modeling for Text-to-4D Gaussian Splatting

arXiv:2504.18318v1 · 4 citations · h-index: 11 · ICME
Originality: Incremental advance
AI Analysis

This work improves text-to-4D generation for applications that need efficient, consistent 4D content creation, though it appears incremental because it builds on existing 4D Gaussian splatting and diffusion models.

The paper addresses spatio-temporal inconsistency and prompt misalignment in text-to-4D generation. The proposed method produces high-fidelity 4D assets in approximately 4.6 seconds per asset and is reported to outperform existing approaches in both quality and speed.

Text-to-4D generation is rapidly developing and widely applied in various scenarios. However, existing methods often fail to incorporate adequate spatio-temporal modeling and prompt alignment within a unified framework, resulting in temporal inconsistencies, geometric distortions, or low-quality 4D content that deviates from the provided texts. Therefore, we propose STP4D, a novel approach that aims to integrate comprehensive spatio-temporal-prompt consistency modeling for high-quality text-to-4D generation. Specifically, STP4D employs three carefully designed modules: Time-varying Prompt Embedding, Geometric Information Enhancement, and Temporal Extension Deformation, which collaborate to accomplish this goal. Furthermore, STP4D is among the first methods to exploit the Diffusion model to generate 4D Gaussians, combining the fine-grained modeling capabilities and the real-time rendering process of 4DGS with the rapid inference speed of the Diffusion model. Extensive experiments demonstrate that STP4D excels in generating high-fidelity 4D content with exceptional efficiency (approximately 4.6s per asset), surpassing existing methods in both quality and speed.
