LG | Jan 23, 2025

On Learning Representations for Tabular Data Distillation

arXiv:2501.13905v1 | 2 citations | h-index: 30 | Trans. Mach. Learn. Res.
Originality: Incremental advance
AI Analysis

This work addresses dataset distillation for tabular data, a domain-specific problem, and offers incremental improvements over existing methods.

The paper tackles dataset distillation for tabular data, which poses challenges such as feature heterogeneity and non-differentiable models, by introducing TDColER, a framework built on column embeddings-based representation learning. The results show that TDColER boosts the distilled data quality of off-the-shelf distillation schemes by 0.5-143% across 7 tabular learning models, as evaluated on a new benchmark, TDBench, comprising 226,890 distilled datasets and 548,880 trained models.

Dataset distillation generates a small set of information-rich instances from a large dataset, resulting in reduced storage requirements, privacy or copyright risks, and computational costs for downstream modeling, though much of the research has focused on the image data modality. We study tabular data distillation, which brings in novel challenges such as the inherent feature heterogeneity and the common use of non-differentiable learning models (such as decision tree ensembles and nearest-neighbor predictors). To mitigate these challenges, we present TDColER, a tabular data distillation framework via column embeddings-based representation learning. To evaluate this framework, we also present a tabular data distillation benchmark, TDBench. Based on an elaborate evaluation on TDBench, resulting in 226,890 distilled datasets and 548,880 models trained on them, we demonstrate that TDColER is able to boost the distilled data quality of off-the-shelf distillation schemes by 0.5-143% across 7 different tabular learning models.
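
To make the idea of distillation concrete, here is a minimal sketch of one simple clustering-based scheme: each class is replaced by a handful of k-means centroids, and a nearest-neighbor predictor (a non-differentiable model) is then fit on the small set. This is an illustration only, not TDColER or any scheme confirmed by the abstract. It assumes purely numeric features, and all names in it (`kmeans`, `distill`, `predict_1nn`, `make_blobs`) and the synthetic two-class data are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two numeric rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns k centroids as lists of floats."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def distill(X, y, per_class):
    """Replace each class by at most `per_class` k-means centroids."""
    X_small, y_small = [], []
    for label in sorted(set(y)):
        rows = [x for x, t in zip(X, y) if t == label]
        for centroid in kmeans(rows, min(per_class, len(rows))):
            X_small.append(centroid)
            y_small.append(label)
    return X_small, y_small


def predict_1nn(X_ref, y_ref, x):
    """1-nearest-neighbor prediction, a non-differentiable model."""
    i = min(range(len(X_ref)), key=lambda idx: sq_dist(x, X_ref[idx]))
    return y_ref[i]


def make_blobs(n_per_class, seed):
    """Synthetic two-class numeric data (an assumption for illustration)."""
    rng = random.Random(seed)
    centers = {0: (0.0, 0.0), 1: (5.0, 5.0)}
    X, y = [], []
    for label, (cx, cy) in centers.items():
        for _ in range(n_per_class):
            X.append([rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)])
            y.append(label)
    return X, y


X_train, y_train = make_blobs(200, seed=1)
X_test, y_test = make_blobs(50, seed=2)

# Distill 400 training rows down to 3 synthetic rows per class.
X_small, y_small = distill(X_train, y_train, per_class=3)

correct = sum(predict_1nn(X_small, y_small, x) == t for x, t in zip(X_test, y_test))
accuracy = correct / len(y_test)
print(len(X_train), "->", len(X_small), "rows; held-out 1-NN accuracy:", accuracy)
```

The sketch averages raw feature values, which only makes sense for numeric columns. Mixed categorical and numeric columns are the feature heterogeneity the abstract names as a challenge, and are what motivates distilling in a learned representation space instead.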

