Rethinking Pre-Training in Tabular Data: A Neighborhood Embedding Perspective
This addresses the problem of leveraging pre-training for tabular data across diverse datasets, offering a novel approach to handle heterogeneity, though it is incremental in adapting pre-training concepts from vision/text to tabular domains.
The paper tackles the challenge of pre-training for heterogeneous tabular data by proposing TabPTM, which embeds instances into a shared feature space using neighborhood distances and labels, enabling direct application to new datasets without fine-tuning. Experiments on 101 datasets show effectiveness in classification and regression tasks.
Pre-training is prevalent in deep learning for vision and text data, leveraging knowledge from other datasets to enhance downstream tasks. However, for tabular data, the inherent heterogeneity in attribute and label spaces across datasets complicates the learning of shareable knowledge. We propose Tabular data Pre-Training via Meta-representation (TabPTM), aiming to pre-train a general tabular model over diverse datasets. The core idea is to embed data instances into a shared feature space, where each instance is represented by its distance to a fixed number of nearest neighbors and their labels. This ''meta-representation'' transforms heterogeneous tasks into homogeneous local prediction problems, enabling the model to infer labels (or scores for each label) based on neighborhood information. As a result, the pre-trained TabPTM can be applied directly to new datasets, regardless of their diverse attributes and labels, without further fine-tuning. Extensive experiments on 101 datasets confirm TabPTM's effectiveness in both classification and regression tasks, with and without fine-tuning.