CV | AI | LG | Dec 11, 2024

DMin: Scalable Training Data Influence Estimation for Diffusion Models

arXiv:2412.08637v3 | 5 citations | h-index: 18
Originality: Incremental advance
AI Analysis

This work addresses a computational bottleneck in understanding diffusion models, which matters to both researchers and practitioners. The advance is incremental because it builds on existing influence estimation methods.

The paper tackles the problem of identifying influential training data samples for generated images in diffusion models, proposing DMin as a scalable framework that reduces storage from hundreds of TBs to MBs or KBs and retrieves top-k samples in under 1 second.

Identifying the training data samples that most influence a generated image is a critical task in understanding diffusion models (DMs), yet existing influence estimation methods are constrained to small-scale or LoRA-tuned models due to computational limitations. To address this challenge, we propose DMin (Diffusion Model influence), a scalable framework for estimating the influence of each training data sample on a given generated image. To the best of our knowledge, it is the first method capable of influence estimation for DMs with billions of parameters. Leveraging efficient gradient compression, DMin reduces storage requirements from hundreds of TBs to mere MBs or even KBs, and retrieves the top-k most influential training samples in under 1 second, all while maintaining performance. Our empirical results demonstrate DMin is both effective in identifying influential training samples and efficient in terms of computational and storage requirements.
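The abstract describes two steps: compressing per-sample gradients so they are cheap to store, then retrieving the top-k most influential training samples for a generated image. The sketch below illustrates that general pattern on toy data. It is not the authors' implementation: the random projection is a stand-in assumption for DMin's gradient compression, the inner-product score is a common influence proxy rather than the paper's exact estimator, and every name and size (`compress`, `top_k_influential`, `n_params`, `dim`) is hypothetical.

```python
import numpy as np


def compress(grads, proj):
    """Project flattened gradients to `dim` values each (random projection).

    Assumption: a stand-in for DMin's compression, which may differ.
    """
    return grads @ proj.T


def top_k_influential(train_compressed, query_compressed, k):
    """Return indices of the k highest inner-product scores, best first."""
    scores = train_compressed @ query_compressed
    # argpartition finds the k best without sorting the whole array
    top = np.argpartition(-scores, k - 1)[:k]
    return top[np.argsort(-scores[top])], scores


rng = np.random.default_rng(0)
n_params, n_train, dim = 2_000, 500, 256  # toy sizes, not from the paper

# Scaled Gaussian projection roughly preserves inner products
proj = rng.standard_normal((dim, n_params)) / np.sqrt(dim)

# Toy "per-sample gradients"; real ones would come from the diffusion loss
train_grads = rng.standard_normal((n_train, n_params))

# Toy query gradient built to resemble training sample 42
query_grad = train_grads[42] + 0.1 * rng.standard_normal(n_params)

train_compressed = compress(train_grads, proj)        # shape (n_train, dim)
query_compressed = compress(query_grad[None, :], proj)[0]

top_idx, scores = top_k_influential(train_compressed, query_compressed, k=5)
print("top-5 training indices:", top_idx)
print("stored values per sample:", dim, "instead of", n_params)
```

Only the compressed matrix needs to be kept after the projection, so storage per sample shrinks by a factor of `n_params / dim` in this toy setup, and retrieval is a single matrix-vector product followed by a partial sort.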

Code Implementations: 1 repo