Understanding Representation Dynamics of Diffusion Models via Low-Dimensional Modeling
This study is relevant to researchers and practitioners working with diffusion models: it examines the representation learning capabilities of these models and how those capabilities relate to generalization.
This work investigates unimodal representation dynamics in diffusion models, a phenomenon in which the quality of learned features peaks at an intermediate noise level, and finds that it reliably reflects how well the model generalizes. Unimodal dynamics appear when the model generates novel images; the curve transitions to a monotonically decreasing one as the model begins to memorize the training data.
Diffusion models, though originally designed for generative tasks, have demonstrated impressive self-supervised representation learning capabilities. A particularly intriguing phenomenon in these models is the emergence of unimodal representation dynamics, where the quality of learned features peaks at an intermediate noise level. In this work, we conduct a comprehensive theoretical and empirical investigation of this phenomenon. Leveraging the inherent low-dimensional structure of image data, we theoretically demonstrate that unimodal dynamics emerge when the diffusion model successfully captures the underlying data distribution. The unimodality arises from an interplay between denoising strength and class confidence across noise scales. Empirically, we further show that, in classification tasks, the presence of unimodal dynamics reliably reflects the generalization of the diffusion model: it emerges when the model generates novel images and gradually transitions to a monotonically decreasing curve as the model begins to memorize the training data.
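To make the interplay between denoising strength and class confidence concrete, the following is a minimal sketch under assumptions that are ours rather than the paper's stated setup: data drawn from a mixture of K mutually orthogonal low-rank Gaussians, additive noising x_t = x_0 + sigma * eps, and the closed-form posterior-mean denoiser for that toy model. All function names are hypothetical. The shrinkage factor 1 / (1 + sigma^2) and the mean posterior weight of the true class are used only as illustrative proxies for "denoising strength" and "class confidence"; the sketch does not reproduce the paper's representation-quality metric or its unimodal curve.

```python
import numpy as np

def make_subspaces(n, d, K, rng):
    # Orthonormal columns split into K mutually orthogonal d-dim bases (needs n >= K*d).
    Q, _ = np.linalg.qr(rng.standard_normal((n, K * d)))
    return [Q[:, k * d:(k + 1) * d] for k in range(K)]

def sample_clean(bases, num, rng):
    # x0 = U_k a with a ~ N(0, I_d) and k drawn uniformly (toy assumption).
    K, d = len(bases), bases[0].shape[1]
    labels = rng.integers(0, K, size=num)
    coeffs = rng.standard_normal((num, d))
    x0 = np.stack([bases[k] @ a for k, a in zip(labels, coeffs)])
    return x0, labels

def class_posterior(xt, bases, sigma):
    # p(k | x_t) for x_t | k ~ N(0, U_k U_k^T + sigma^2 I), equal priors.
    s = sigma ** 2
    energy = np.stack([np.sum((xt @ U) ** 2, axis=1) for U in bases], axis=1)
    logits = energy / (2.0 * s * (1.0 + s))
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

def posterior_mean(xt, bases, sigma):
    # E[x0 | x_t] = sum_k p(k | x_t) * U_k U_k^T x_t / (1 + sigma^2)
    w = class_posterior(xt, bases, sigma)
    proj = np.stack([(xt @ U) @ U.T for U in bases], axis=1)  # (num, K, n)
    return np.einsum("nk,nkd->nd", w, proj) / (1.0 + sigma ** 2)

def sweep(sigmas, n=30, d=3, K=3, num=2000, seed=0):
    # For each noise scale, report (sigma, shrinkage factor, mean true-class weight).
    rng = np.random.default_rng(seed)
    bases = make_subspaces(n, d, K, rng)
    x0, labels = sample_clean(bases, num, rng)
    rows = []
    for sigma in sigmas:
        xt = x0 + sigma * rng.standard_normal(x0.shape)
        w = class_posterior(xt, bases, sigma)
        confidence = float(w[np.arange(num), labels].mean())
        rows.append((sigma, 1.0 / (1.0 + sigma ** 2), confidence))
    return rows

if __name__ == "__main__":
    for sigma, shrink, conf in sweep([0.05, 0.5, 1.0, 5.0, 20.0]):
        print(f"sigma={sigma:6.2f}  shrink={shrink:.4f}  true-class weight={conf:.3f}")
```

In this toy model both proxies fall as sigma grows: the denoiser shrinks its output more aggressively, and the posterior over classes flattens toward 1/K. How the two combine into a feature-quality curve depends on the metric and data model, which this sketch leaves out.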