LG | Mar 1, 2024

Tree-Regularized Tabular Embeddings

arXiv:2403.00963v1 | 3 citations | h-index: 2 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses the challenge of enhancing neural network performance on tabular data for machine learning practitioners, though it is incremental as it builds on prior methods like DeepTLF.

The paper tackled the performance gap between tabular neural networks and tree-based models by introducing tree-regularized embeddings derived from pretrained tree ensembles, which improved robustness and achieved competitive or better performance on 88 OpenML datasets.

Tabular neural networks (NNs) have attracted remarkable attention, and recent advances have gradually narrowed the performance gap with tree-based models on many public datasets. While mainstream approaches focus on calibrating NNs to fit tabular data, we emphasize the importance of homogeneous embeddings and instead concentrate on regularizing tabular inputs through supervised pretraining. Specifically, we extend a recent work (DeepTLF) and utilize the structure of pretrained tree ensembles to transform raw variables into a single vector (T2V) or an array of tokens (T2T). Without loss of space efficiency, these binarized embeddings can be consumed by canonical tabular NNs with fully-connected or attention-based building blocks. Through quantitative experiments on 88 OpenML datasets with binary classification tasks, we validate that the proposed tree-regularized representation not only narrows the gap with tree-based models, but also achieves on-par or better performance compared with advanced NN models. Most importantly, it is more robust and can be easily scaled and generalized as a standalone encoder for the tabular modality. Code: https://github.com/milanlx/tree-regularized-embedding.
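To make the T2V and T2T idea more concrete, here is a minimal pure-Python sketch of threshold-based binarization. It is an illustration under stated assumptions, not the authors' implementation: the `THRESHOLDS` table, the feature names, the `value > threshold` comparison, and the zero-padding used for tokens are all hypothetical choices. In practice the thresholds would be collected from the split nodes of a pretrained tree ensemble.

```python
# Hypothetical split thresholds per feature, standing in for values that
# would be gathered from the internal nodes of a pretrained tree ensemble.
THRESHOLDS = {
    "age": [25.0, 40.0, 60.0],
    "income": [30_000.0, 80_000.0],
}


def binarize(value, thresholds):
    """Return one bit per threshold: 1 if value > threshold, else 0."""
    return [1 if value > t else 0 for t in thresholds]


def t2v(row, thresholds=THRESHOLDS):
    """Single vector: the per-feature bits concatenated in feature order."""
    vec = []
    for name, ts in thresholds.items():
        vec.extend(binarize(row[name], ts))
    return vec


def t2t(row, thresholds=THRESHOLDS):
    """Array of tokens: one binary token per feature.

    Tokens are zero-padded to a common width (an assumption made here so
    that an attention-based model could consume them as a regular array).
    """
    width = max(len(ts) for ts in thresholds.values())
    tokens = []
    for name, ts in thresholds.items():
        bits = binarize(row[name], ts)
        tokens.append(bits + [0] * (width - len(bits)))
    return tokens


row = {"age": 42.0, "income": 50_000.0}
vector = t2v(row)   # [1, 1, 0, 1, 0]
tokens = t2t(row)   # [[1, 1, 0], [1, 0, 0]]
```

In this sketch the single vector would suit a fully-connected network, while the per-feature tokens would suit an attention-based one; both carry the same binary comparisons against the tree-derived thresholds.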
