Nonparametric Data Attribution for Diffusion Models
This work addresses the need for scalable and interpretable data attribution in generative models, particularly in proprietary or large-scale settings, though the contribution is incremental because it builds on existing attribution concepts.
The paper tackles the problem of attributing diffusion model outputs to individual training examples without requiring model gradients or retraining. The result is a nonparametric method that closely matches gradient-based approaches and substantially outperforms other nonparametric baselines.
Data attribution for generative models seeks to quantify the influence of individual training examples on model outputs. Existing methods for diffusion models typically require access to model gradients or retraining, limiting their applicability in proprietary or large-scale settings. We propose a nonparametric attribution method that operates entirely on data, measuring influence via patch-level similarity between generated and training images. Our approach is grounded in the analytical form of the optimal score function and naturally extends to multiscale representations, while remaining computationally efficient through convolution-based acceleration. In addition to producing spatially interpretable attributions, our framework uncovers patterns that reflect intrinsic relationships between training data and outputs, independent of any specific model. Experiments demonstrate that our method achieves strong attribution performance, closely matching gradient-based approaches and substantially outperforming existing nonparametric baselines. Code is available at https://github.com/sail-sg/NDA.
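To make the idea of patch-level, data-only attribution concrete, the following is a minimal sketch and not the authors' implementation. It assumes the standard result that, for an empirical training distribution, the optimal denoiser weights each training example by a softmax over negative squared distances to the query. The sketch localizes that softmax to patches and computes all patch distances at once with a box filter (an all-ones convolution, implemented here with an integral image). The function names `box_filter_sum` and `patch_attribution`, the parameters `patch` and `sigma`, and the choice to average weights over locations are all hypothetical; noise-schedule scaling and multiscale representations are omitted.

```python
import numpy as np


def box_filter_sum(maps, patch):
    """Sum every (patch x patch) window of each map, in 'valid' mode.

    maps: array of shape (N, H, W).
    Returns an array of shape (N, H - patch + 1, W - patch + 1).
    Equivalent to convolving with an all-ones kernel, computed via an
    integral image so the cost does not grow with the patch size.
    """
    n, h, w = maps.shape
    integral = np.zeros((n, h + 1, w + 1), dtype=np.float64)
    integral[:, 1:, 1:] = maps.cumsum(axis=1).cumsum(axis=2)
    return (
        integral[:, patch:, patch:]
        - integral[:, :-patch, patch:]
        - integral[:, patch:, :-patch]
        + integral[:, :-patch, :-patch]
    )


def patch_attribution(generated, train, patch=3, sigma=0.5):
    """Hypothetical patch-level attribution of one image to a training set.

    generated: (H, W) array; train: (N, H, W) array.
    Returns (weights, scores), where weights[i, y, x] is the softmax weight
    of training image i for the patch whose top-left corner is (y, x), and
    scores[i] is that weight averaged over all patch locations.
    """
    sq_diff = (train - generated[None]) ** 2       # (N, H, W) per-pixel error
    patch_dist = box_filter_sum(sq_diff, patch)    # (N, H', W') patch SSD
    logits = -patch_dist / (2.0 * sigma ** 2)
    logits -= logits.max(axis=0, keepdims=True)    # stabilize the softmax
    weights = np.exp(logits)
    weights /= weights.sum(axis=0, keepdims=True)  # softmax over training images
    scores = weights.mean(axis=(1, 2))             # aggregate over locations
    return weights, scores


# Toy usage: the "generated" image is a lightly perturbed copy of train[2].
rng = np.random.default_rng(0)
train = rng.random((4, 8, 8))
generated = train[2] + 0.01 * rng.standard_normal((8, 8))
weights, scores = patch_attribution(generated, train)
print(scores)
```

In this toy setup the training image at index 2 should receive the highest score, and the `weights` array gives a spatial map showing where each training image contributes, which is the kind of spatially interpretable output the abstract describes. Averaging the weights over locations is only one possible aggregation and is chosen here for simplicity.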