Provably Learning Diffusion Models under the Manifold Hypothesis: Collapse and Refine
This work provides a theoretical foundation for why diffusion models avoid the curse of dimensionality when data lies on low-dimensional manifolds, offering a principled alternative to heuristic VAE-based latent diffusion models.
The paper identifies a collapse-and-refine mechanism that explains how diffusion models learn score functions efficiently on low-dimensional manifolds, and proposes SiLD, a two-stage framework whose sample complexity depends on the intrinsic dimension rather than the ambient dimension. Experiments show that SiLD matches or outperforms VAE-based latent diffusion models in generation quality and improves reconstruction.
Diffusion models generate high-dimensional data with remarkable quality, yet it remains theoretically unexplained how their training learns the score function efficiently, bypassing the curse of dimensionality, when data is supported on a low-dimensional manifold. We identify a collapse-and-refine mechanism driven by the geometry of the score function itself: at small noise scales, the diverging singularity of the score drives a rapid dimensional collapse of the induced denoising map to the projection onto the data manifold; at moderate noise scales, training refines the intrinsic density on the learned manifold. We instantiate this principle as Score-induced Latent Diffusion (SiLD), a two-stage framework in which both manifold learning and density estimation emerge from a single denoising score matching objective, replacing the heuristic KL regularization of VAE-based latent diffusion models. We prove that the resulting sample complexity depends on the intrinsic dimension rather than the ambient dimension. Experiments on Stacked MNIST, CelebA variants, and molecular generation benchmarks show that SiLD matches or outperforms VAE-based LDMs in generation quality and consistently improves reconstruction, validating our theoretical predictions.
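The small-noise collapse described above can be illustrated with a toy case that has a closed form. The sketch below is not the paper's construction or the SiLD algorithm; it assumes, purely for illustration, that the data lies on a one-dimensional linear subspace of R^3 with a standard Gaussian density along it, and that noise is added as x = x0 + sigma * eps. Under those assumptions the posterior mean E[x0 | x] and the score (via Tweedie's formula) are exact, and all function names (`project`, `denoiser`, `score`) are hypothetical helpers.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def norm(v):
    return math.sqrt(dot(v, v))

def project(x, u):
    """Orthogonal projection of x onto the line spanned by the unit vector u."""
    c = dot(x, u)
    return [c * ui for ui in u]

def denoiser(x, u, sigma):
    """Posterior mean E[x0 | x], assuming x0 = z*u, z ~ N(0, 1), x = x0 + sigma*eps."""
    c = dot(x, u) / (1.0 + sigma ** 2)
    return [c * ui for ui in u]

def score(x, u, sigma):
    """Score of the noisy marginal via Tweedie's formula: (E[x0 | x] - x) / sigma^2."""
    d = denoiser(x, u, sigma)
    return [(di - xi) / sigma ** 2 for di, xi in zip(d, x)]

if __name__ == "__main__":
    u = [0.6, 0.8, 0.0]   # unit vector spanning the toy "manifold" (assumed)
    x = [1.0, 2.0, 3.0]   # an arbitrary off-manifold query point (assumed)
    for sigma in (1.0, 0.1, 0.01, 0.001):
        gap = norm([a - b for a, b in zip(denoiser(x, u, sigma), project(x, u))])
        s = norm(score(x, u, sigma))
        print(f"sigma={sigma:g}  gap to projection={gap:.2e}  |score|={s:.2e}")
```

In this toy setting, the gap between the denoiser and the projection shrinks as sigma decreases, while the score norm grows roughly like 1/sigma^2 because its component normal to the subspace is (projection - x) / sigma^2. This mirrors, in the simplest linear case, the diverging score and dimensional collapse that the abstract attributes to small noise scales; the curved-manifold and finite-sample analysis is the subject of the paper itself.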