LG | Sep 18, 2023

Generating and Imputing Tabular Data via Diffusion and Flow-based Gradient-Boosted Trees

arXiv:2309.09968v3 | 77 citations | h-index: 14 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses data scarcity and missing values in tabular data, a common problem for researchers and practitioners, and offers a more efficient, GPU-free approach. The advance is incremental: it combines existing techniques in a new way.

The paper tackles the generation and imputation of mixed-type tabular data, which is hard to acquire and often has missing values. It introduces a method that trains diffusion and flow-based models with Gradient-Boosted Trees instead of neural networks. On a benchmark of 27 datasets and 9 metrics, it outperforms deep-learning methods in generation tasks and remains competitive in imputation.

Tabular data is hard to acquire and is subject to missing values. This paper introduces a novel approach for generating and imputing mixed-type (continuous and categorical) tabular data utilizing score-based diffusion and conditional flow matching. In contrast to prior methods that rely on neural networks to learn the score function or the vector field, we adopt XGBoost, a widely used Gradient-Boosted Tree (GBT) technique. To test our method, we build one of the most extensive benchmarks for tabular data generation and imputation, containing 27 diverse datasets and 9 metrics. Through empirical evaluation across the benchmark, we demonstrate that our approach outperforms deep-learning generation methods in data generation tasks and remains competitive in data imputation. Notably, it can be trained in parallel using CPUs without requiring a GPU. Our Python and R code is available at https://github.com/SamsungSAILMontreal/ForestDiffusion.
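
The core idea of the abstract, regressing a flow-matching vector field with boosted trees rather than a neural network, can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a 1-D toy dataset, fits one model per discrete time level, and uses a small hand-written boosted-stump regressor as a dependency-free stand-in for XGBoost. All function names (`fit_stump`, `fit_boosted`, `train_flow`, `generate`) and all hyperparameters are hypothetical choices made for illustration.

```python
import random

def fit_stump(xs, residuals):
    """Best single-split regression stump on 1-D inputs (squared error)."""
    n = len(xs)
    order = sorted(range(n), key=lambda i: xs[i])
    total = sum(residuals)
    best = (float("inf"), None, total / n, total / n)  # fallback: no split
    left_sum = 0.0
    for k in range(1, n):
        i = order[k - 1]
        left_sum += residuals[i]
        if xs[order[k]] == xs[i]:
            continue  # cannot split between identical inputs
        right_sum = total - left_sum
        # Minimising squared error is equivalent to minimising this score.
        score = -(left_sum ** 2 / k + right_sum ** 2 / (n - k))
        if score < best[0]:
            thr = 0.5 * (xs[i] + xs[order[k]])
            best = (score, thr, left_sum / k, right_sum / (n - k))
    return best[1:]  # (threshold, left_value, right_value)

def stump_value(stump, x):
    thr, left, right = stump
    return left if thr is None or x <= thr else right

def fit_boosted(xs, ys, n_trees=30, lr=0.3):
    """Gradient boosting with stumps; stand-in for an XGBoost regressor."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + lr * stump_value(stump, x) for x, p in zip(xs, preds)]
    return base, lr, stumps

def predict(model, x):
    base, lr, stumps = model
    return base + lr * sum(stump_value(s, x) for s in stumps)

def train_flow(data, n_levels=10, n_noise=10, seed=0):
    """Fit one regressor per time level t on pairs (x_t, x1 - x0)."""
    rng = random.Random(seed)
    models = []
    for j in range(n_levels):
        t = j / n_levels
        x_t, velocity = [], []
        for x1 in data:
            for _ in range(n_noise):          # several noise draws per data point
                x0 = rng.gauss(0.0, 1.0)
                x_t.append((1 - t) * x0 + t * x1)  # point on the straight path
                velocity.append(x1 - x0)           # flow-matching target
        models.append(fit_boosted(x_t, velocity))
    return models

def generate(models, n, seed=1):
    """Euler integration of the learned vector field from noise to data."""
    rng = random.Random(seed)
    h = 1.0 / len(models)
    samples = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        for model in models:
            x += h * predict(model, x)
        samples.append(x)
    return samples

# Toy bimodal data: clusters near -2 and +2 (illustrative only).
rng = random.Random(42)
data = [rng.gauss(-2.0 if i % 2 else 2.0, 0.3) for i in range(200)]
models = train_flow(data)
samples = generate(models, 200)
```

Because each time level is fitted independently, the per-level models in a sketch like this could be trained in parallel on CPU cores, which matches the abstract's remark that no GPU is required. Handling multiple columns, categorical features, and missing values is outside the scope of this toy example.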

Code Implementations: 5 repos
