Unmasking Trees for Tabular Data
This work addresses tabular data imputation and generation. Its contribution may appear incremental, since it builds on existing tree-based methods.
The paper addresses tabular data imputation and generation, where traditional methods often outperform advanced deep learning techniques. It introduces UnmaskingTrees, a gradient-boosted decision tree method that achieves leading imputation performance and state-of-the-art generation given data with missingness on a benchmark of 27 small tabular datasets.
Despite much work on advanced deep learning and generative modeling techniques for tabular data generation and imputation, traditional methods have continued to win on imputation benchmarks. We present UnmaskingTrees, a simple method for tabular imputation (and generation) that employs gradient-boosted decision trees to incrementally unmask individual features. On a benchmark for out-of-the-box performance on 27 small tabular datasets, UnmaskingTrees offers leading performance on imputation; state-of-the-art performance on generation given data with missingness; and competitive performance on vanilla generation given data without missingness. To solve the conditional generation subproblem, we propose a tabular probabilistic prediction method, BaltoBot, which fits a balanced tree of boosted tree classifiers. Unlike older methods, it requires no parametric assumption on the conditional distribution, accommodating features with multimodal distributions; unlike newer diffusion methods, it offers fast sampling, closed-form density estimation, and flexible handling of discrete variables. We finally consider our two approaches as meta-algorithms, demonstrating in-context learning-based generative modeling with TabPFN.
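The balanced-tree idea behind the probabilistic predictor can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a single scalar feature, median splits of the target, and a uniform density inside each leaf interval. To stay dependency-free, it swaps the boosted tree classifier for a hypothetical one-split `StumpClassifier`. All class and function names here are illustrative.

```python
import math
import random


class StumpClassifier:
    """Hypothetical stand-in for a boosted tree classifier: one split on scalar x."""

    def fit(self, xs, labels):
        best = None
        for t in sorted(set(xs)):
            left = [l for x, l in zip(xs, labels) if x <= t]
            right = [l for x, l in zip(xs, labels) if x > t]
            # Laplace-smoothed P(label is True) on each side of the threshold.
            p_left = (sum(left) + 1) / (len(left) + 2)
            p_right = (sum(right) + 1) / (len(right) + 2)
            # Negative log-likelihood of the labels under this split.
            nll = -sum(math.log(p_left if l else 1 - p_left) for l in left)
            nll -= sum(math.log(p_right if l else 1 - p_right) for l in right)
            if best is None or nll < best[0]:
                best = (nll, t, p_left, p_right)
        _, self.t, self.p_left, self.p_right = best
        return self

    def prob_upper(self, x):
        return self.p_left if x <= self.t else self.p_right


class BalancedNode:
    """Covers the target interval [lo, hi] and splits it at the median target."""

    def __init__(self, xs, ys, lo, hi, depth):
        self.lo, self.hi, self.clf = lo, hi, None
        split = sorted(ys)[len(ys) // 2]
        if depth == 0 or not (min(ys) < split < hi):
            return  # leaf: uniform density over [lo, hi]
        self.split = split
        upper = [y >= split for y in ys]
        # Binary classifier for "is the target in the upper half?" given x.
        self.clf = StumpClassifier().fit(xs, upper)
        lower_rows = [(x, y) for x, y, u in zip(xs, ys, upper) if not u]
        upper_rows = [(x, y) for x, y, u in zip(xs, ys, upper) if u]
        self.left = BalancedNode(*zip(*lower_rows), lo, split, depth - 1)
        self.right = BalancedNode(*zip(*upper_rows), split, hi, depth - 1)

    def sample(self, x, rng):
        # Descend by sampling one branch per level, then draw within the leaf.
        if self.clf is None:
            return rng.uniform(self.lo, self.hi)
        go_up = rng.random() < self.clf.prob_upper(x)
        return (self.right if go_up else self.left).sample(x, rng)

    def density(self, x, y):
        # Product of branch probabilities along the path, divided by leaf width.
        if not (self.lo <= y <= self.hi):
            return 0.0
        if self.clf is None:
            return 1.0 / (self.hi - self.lo)
        p = self.clf.prob_upper(x)
        if y >= self.split:
            return p * self.right.density(x, y)
        return (1 - p) * self.left.density(x, y)


def fit_balanced_tree(xs, ys, depth=3):
    lo, hi = min(ys), max(ys)
    assert hi > lo, "target needs at least two distinct values"
    return BalancedNode(xs, ys, lo, hi, depth)


# Toy data (an assumption for illustration): bimodal target when x > 0.
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(400)]
ys = [rng.gauss(rng.choice([-2, 2]) if x > 0 else 0, 0.3) for x in xs]
tree = fit_balanced_tree(xs, ys, depth=3)
draws = [tree.sample(0.5, rng) for _ in range(1000)]
```

In this sketch, sampling needs one classifier call per tree level, and the density is available in closed form. No parametric shape is imposed on the conditional distribution, so the two modes at `x > 0` can be represented by separate leaves.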