CV · AI · LG · Dec 6, 2023

On the Diversity and Realism of Distilled Dataset: An Efficient Dataset Distillation Paradigm

arXiv:2312.03526v2 · 126 citations · h-index: 3 · CVPR
Originality: Highly original
AI Analysis

This work addresses the high computational demands of machine learning by enabling efficient training on compressed datasets, which benefits researchers and practitioners. It is nonetheless incremental, as it builds on existing dataset distillation methods.

The paper tackles dataset distillation for large-scale, high-resolution datasets by proposing RDED, a method that distills ImageNet-1K to 10 images per class in 7 minutes and reaches 42% top-1 accuracy with ResNet-18, compared with 21% after 6 hours for the previous state of the art.

Contemporary machine learning requires training large neural networks on massive datasets and thus faces the challenge of high computational demands. Dataset distillation, a recently emerging strategy, aims to compress real-world datasets for efficient training. However, this line of research currently struggles with large-scale and high-resolution datasets, hindering its practicality and feasibility. To this end, we re-examine the existing dataset distillation methods and identify three properties required for large-scale real-world applications, namely, realism, diversity, and efficiency. As a remedy, we propose RDED, a novel computationally efficient yet effective data distillation paradigm, to enable both diversity and realism of the distilled data. Extensive empirical results over various neural architectures and datasets demonstrate the advancement of RDED: we can distill the full ImageNet-1K to a small dataset comprising 10 images per class within 7 minutes, achieving a notable 42% top-1 accuracy with ResNet-18 on a single RTX-4090 GPU (while the SOTA only achieves 21% but requires 6 hours).
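To put the headline figures in perspective, the short sketch below works through the arithmetic they imply. It is only a back-of-the-envelope calculation, not part of RDED itself. The class count (1,000) and training-set size (1,281,167 images) are the standard ImageNet-1K statistics and are assumptions here, since the abstract does not state them; the function name `summarize` is likewise just illustrative.

```python
# Figures quoted in the abstract.
IMAGES_PER_CLASS = 10
RDED_MINUTES = 7
RDED_TOP1 = 0.42
SOTA_HOURS = 6
SOTA_TOP1 = 0.21

# Assumptions (not stated in the abstract): standard ImageNet-1K statistics.
NUM_CLASSES = 1000
FULL_TRAIN_IMAGES = 1_281_167


def summarize():
    """Derive size, time, and accuracy ratios from the quoted figures."""
    distilled_images = IMAGES_PER_CLASS * NUM_CLASSES
    return {
        # Total number of images in the distilled dataset.
        "distilled_images": distilled_images,
        # Fraction of the original training set that is kept.
        "kept_fraction": distilled_images / FULL_TRAIN_IMAGES,
        # How many times faster the distillation step is than the quoted SOTA.
        "distillation_speedup": SOTA_HOURS * 60 / RDED_MINUTES,
        # Ratio of the two quoted top-1 accuracies.
        "accuracy_ratio": RDED_TOP1 / SOTA_TOP1,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:.4g}")
```

Under these assumptions the distilled set holds 10,000 images, under 1% of the original training data. The distillation step is roughly 51 times faster than the quoted baseline, and the reported accuracy is twice as high.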

Code Implementations (4 repos)
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
