CVAILGNov 30, 2023

Dataset Distillation via Curriculum Data Synthesis in Large Data Era

arXiv:2311.18838v225 citationsh-index: 7Has Code
Originality Incremental advance
AI Analysis

This work addresses the problem of efficient model training in the large data era for machine learning practitioners, representing an incremental improvement with novel method refinements.

The paper tackles dataset distillation by proposing a global-to-local gradient refinement approach with curriculum data augmentation, achieving state-of-the-art accuracy of 63.2% on ImageNet-1K and 36.1% on ImageNet-21K with reduced synthetic time and faster convergence.

Dataset distillation or condensation aims to generate a smaller but representative subset from a large dataset, which allows a model to be trained more efficiently, meanwhile evaluating on the original testing data distribution to achieve decent performance. Previous decoupled methods like SRe$^2$L simply use a unified gradient update scheme for synthesizing data from Gaussian noise, while, we notice that the initial several update iterations will determine the final outline of synthesis, thus an improper gradient update strategy may dramatically affect the final generation quality. To address this, we introduce a simple yet effective global-to-local gradient refinement approach enabled by curriculum data augmentation ($\texttt{CDA}$) during data synthesis. The proposed framework achieves the current published highest accuracy on both large-scale ImageNet-1K and 21K with 63.2% under IPC (Images Per Class) 50 and 36.1% under IPC 20, using a regular input resolution of 224$\times$224 with faster convergence speed and less synthetic time. The proposed model outperforms the current state-of-the-art methods like SRe$^2$L, TESLA, and MTT by more than 4% Top-1 accuracy on ImageNet-1K/21K and for the first time, reduces the gap to its full-data training counterparts to less than absolute 15%. Moreover, this work represents the inaugural success in dataset distillation on the larger-scale ImageNet-21K dataset under the standard 224$\times$224 resolution. Our code and distilled ImageNet-21K dataset of 20 IPC, 2K recovery budget are available at https://github.com/VILA-Lab/SRe2L/tree/main/CDA.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes