LG · Jun 6, 2024

What is Dataset Distillation Learning?

arXiv:2406.04284v2 · 14 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses researchers' limited understanding of dataset distillation, offering insights into its limitations and mechanisms. It is incremental, however: it builds on existing methods without introducing new paradigms.

The paper examines how information is stored in distilled data. It finds that distilled data cannot substitute for real data outside the standard evaluation setting, that distillation retains task performance by compressing early training dynamics, and that individual distilled data points contain meaningful semantic information.

Dataset distillation has emerged as a strategy to overcome the hurdles associated with large datasets by learning a compact set of synthetic data that retains essential information from the original dataset. While distilled data can be used to train high-performing models, little is understood about how the information is stored. In this study, we posit and answer three questions about the behavior, representativeness, and point-wise information content of distilled data. We reveal that distilled data cannot serve as a substitute for real data during training outside the standard evaluation setting for dataset distillation. Additionally, the distillation process retains high task performance by compressing information related to the early training dynamics of real models. Finally, we provide a framework for interpreting distilled data and reveal that individual distilled data points contain meaningful semantic information. This investigation sheds light on the intricate nature of distilled data, providing a better understanding of how they can be effectively utilized.
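
To make the idea of "a compact set of synthetic data" concrete, here is a minimal toy sketch. It is not one of the distillation methods the paper analyzes. It assumes a much simpler objective, matching each class's synthetic mean to the real class mean by gradient descent, loosely in the spirit of distribution-matching approaches. The function names (`distill_mean_matching`, `nearest_centroid_accuracy`), the two-Gaussian dataset, and all hyperparameters are hypothetical and chosen only for illustration. It uses pure Python so it runs without dependencies.

```python
import random


def class_mean(points):
    """Coordinate-wise mean of a list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[j] for p in points) / len(points) for j in range(dim)]


def distill_mean_matching(X, y, per_class=5, steps=200, lr=0.5, seed=0):
    """Learn `per_class` synthetic points per class whose mean matches the real class mean.

    Toy objective (an assumption, not the paper's method):
    minimize ||mean(S_c) - mean(X_c)||^2 for each class c.
    """
    rng = random.Random(seed)
    dim = len(X[0])
    syn_X, syn_y = [], []
    for c in sorted(set(y)):
        target = class_mean([x for x, label in zip(X, y) if label == c])
        # Synthetic points start as random noise, not as copies of real samples.
        points = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(per_class)]
        for _ in range(steps):
            mean = class_mean(points)
            for p in points:
                for j in range(dim):
                    # Gradient of the squared mean gap w.r.t. one synthetic coordinate.
                    p[j] -= lr * (2.0 / per_class) * (mean[j] - target[j])
        syn_X.extend(points)
        syn_y.extend([c] * per_class)
    return syn_X, syn_y


def nearest_centroid_accuracy(train_X, train_y, test_X, test_y):
    """Fit a nearest-centroid classifier on the training set and score it on the test set."""
    centroids = {
        c: class_mean([x for x, label in zip(train_X, train_y) if label == c])
        for c in set(train_y)
    }

    def predict(x):
        return min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
        )

    correct = sum(predict(x) == label for x, label in zip(test_X, test_y))
    return correct / len(test_y)


# Hypothetical "real" dataset: two well-separated 2-D Gaussian classes.
data_rng = random.Random(42)
X = [[data_rng.gauss(-2.0, 1.0), data_rng.gauss(-2.0, 1.0)] for _ in range(500)]
X += [[data_rng.gauss(2.0, 1.0), data_rng.gauss(2.0, 1.0)] for _ in range(500)]
y = [0] * 500 + [1] * 500

# Compress 1000 real points into 10 synthetic ones, then train on the synthetic set only.
syn_X, syn_y = distill_mean_matching(X, y)
acc = nearest_centroid_accuracy(syn_X, syn_y, X, y)
```

In this toy, the ten synthetic points are enough for a nearest-centroid classifier because that classifier only uses class means. This says nothing about whether distilled data can stand in for real data in other training settings, which is the question the abstract above raises.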

Code Implementations: 1 repo