LGAICVITJul 7, 2025

Information-Guided Diffusion Sampling for Dataset Distillation

arXiv:2507.04619v18 citationsh-index: 76
Originality Incremental advance
AI Analysis

This work addresses dataset distillation for machine learning practitioners by improving sample diversity and model performance in resource-constrained scenarios, representing an incremental advancement over prior diffusion-based approaches.

The paper tackles the problem of dataset distillation in low images-per-class settings by proposing an information-guided diffusion sampling method that preserves both prototype and contextual information, resulting in significant performance improvements over existing methods, especially with few images per class.

Dataset distillation aims to create a compact dataset that retains essential information while maintaining model performance. Diffusion models (DMs) have shown promise for this task but struggle in low images-per-class (IPC) settings, where generated samples lack diversity. In this paper, we address this issue from an information-theoretic perspective by identifying two key types of information that a distilled dataset must preserve: ($i$) prototype information $\mathrm{I}(X;Y)$, which captures label-relevant features; and ($ii$) contextual information $\mathrm{H}(X | Y)$, which preserves intra-class variability. Here, $(X,Y)$ represents the pair of random variables corresponding to the input data and its ground truth label, respectively. Observing that the required contextual information scales with IPC, we propose maximizing $\mathrm{I}(X;Y) + β\mathrm{H}(X | Y)$ during the DM sampling process, where $β$ is IPC-dependent. Since directly computing $\mathrm{I}(X;Y)$ and $\mathrm{H}(X | Y)$ is intractable, we develop variational estimations to tightly lower-bound these quantities via a data-driven approach. Our approach, information-guided diffusion sampling (IGDS), seamlessly integrates with diffusion models and improves dataset distillation across all IPC settings. Experiments on Tiny ImageNet and ImageNet subsets show that IGDS significantly outperforms existing methods, particularly in low-IPC regimes. The code will be released upon acceptance.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes