Why Heuristic Weighting Works: A Theoretical Analysis of Denoising Score Matching
This work provides theoretical grounding for a widely used heuristic in diffusion models. The contribution is incremental but important for improving training stability in generative AI.
The paper tackles the lack of formal justification for heuristic weighting in denoising score matching by showing that heteroskedasticity is inherent to the objective, which leads to a principled derivation of optimal weighting functions. It shows that the heuristic weighting is a first-order Taylor approximation to the trace of the expected optimal weighting, and that it can nevertheless achieve lower variance in parameter gradients than the optimal weighting, facilitating more stable and efficient training.
Score matching enables the estimation of the gradient of the log-density of a data distribution, a key component in denoising diffusion models used to recover clean data from corrupted inputs. In prior work, a heuristic weighting function has been used for the denoising score matching loss without formal justification. In this work, we demonstrate that heteroskedasticity is an inherent property of the denoising score matching objective. This insight leads to a principled derivation of optimal weighting functions for generalized, arbitrary-order denoising score matching losses, without requiring assumptions about the noise distribution. Among these, the first-order formulation is especially relevant to diffusion models. We show that the widely used heuristic weighting function arises as a first-order Taylor approximation to the trace of the expected optimal weighting. We further provide theoretical and empirical comparisons, revealing that the heuristic weighting, despite its simplicity, can achieve lower variance than the optimal weighting with respect to parameter gradients, which can facilitate more stable and efficient training.
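To make the heteroskedasticity claim concrete, the following minimal sketch estimates the scale of the denoising score matching regression target at several noise levels. It rests on assumptions that the abstract does not state: Gaussian perturbation noise, one-dimensional toy data, and the commonly used heuristic weighting sigma^2. The function names (`dsm_target`, `mean_sq_target`) are hypothetical, and the sketch does not implement the paper's optimal weighting.

```python
import random

def dsm_target(x, x_noisy, sigma):
    """Score of the Gaussian kernel N(x_noisy; x, sigma^2) with respect to x_noisy.
    Gaussian noise is an assumption made for this illustration."""
    return -(x_noisy - x) / sigma**2

def mean_sq_target(sigma, weight, n=20000, seed=0):
    """Monte Carlo estimate of E[weight(sigma) * target^2] on 1-D toy data."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)                     # toy "clean" sample
        x_noisy = x + sigma * rng.gauss(0.0, 1.0)   # perturbed sample
        total += weight(sigma) * dsm_target(x, x_noisy, sigma) ** 2
    return total / n

def unweighted(sigma):
    return 1.0

def heuristic(sigma):
    # Commonly used heuristic weighting (assumed form: sigma squared).
    return sigma**2

for sigma in (0.1, 1.0, 10.0):
    print(sigma, mean_sq_target(sigma, unweighted), mean_sq_target(sigma, heuristic))
```

Under these assumptions the unweighted target scale grows roughly like 1/sigma^2 as the noise level shrinks, while the sigma^2-weighted scale stays close to 1 at every noise level. That noise-dependent scale is one simple way to see why some weighting is needed.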