Comparing Task-Agnostic Embedding Models for Tabular Data
This work addresses the trade-off between efficiency and performance in tabular representation learning, which matters to practitioners, but its contribution is incremental because it focuses on benchmarking existing methods.
The paper evaluates task-agnostic embedding models for tabular data and finds that simple feature engineering (TableVectorizer) performs comparably to or better than complex foundation models while being up to 1000 times faster.
Recent foundation models for tabular data achieve strong task-specific performance via in-context learning. However, they focus on direct prediction, encapsulating both representation learning and task-specific inference inside a single, resource-intensive network. This work focuses specifically on representation learning, i.e., on transferable, task-agnostic embeddings. We systematically evaluate task-agnostic representations from tabular foundation models (TabPFN and TabICL) alongside classical feature engineering (TableVectorizer) across a variety of application tasks, such as outlier detection (ADBench) and supervised learning (TabArena Lite). We find that simple TableVectorizer features achieve comparable or superior performance while being up to three orders of magnitude faster than tabular foundation models. The code is available at https://github.com/ContactSoftwareAI/TabEmbedBench.
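To make the idea of a task-agnostic embedding concrete, the following is a minimal sketch in plain Python of the general protocol the abstract describes: an embedder is fit on features only, and the same vectors are then reused for a supervised task and for outlier detection. It is not the benchmark's code and does not use TableVectorizer, TabPFN, or TabICL. The toy data, the helper names (`fit_embedder`, `knn_predict`, `knn_outlier_score`), and the choice of k-nearest-neighbour downstream models are all assumptions made for illustration.

```python
import math
from statistics import mean, pstdev


def fit_embedder(rows, numeric_cols, categorical_cols):
    """Learn encoding statistics from features only; no labels are involved."""
    stats = {}
    for c in numeric_cols:
        values = [r[c] for r in rows]
        # Fall back to 1.0 when a column is constant to avoid dividing by zero.
        stats[c] = (mean(values), pstdev(values) or 1.0)
    categories = {c: sorted({r[c] for r in rows}) for c in categorical_cols}

    def embed(row):
        # Standardize numeric columns, one-hot encode categorical columns.
        vec = [(row[c] - stats[c][0]) / stats[c][1] for c in numeric_cols]
        for c in categorical_cols:
            vec.extend(1.0 if row[c] == cat else 0.0 for cat in categories[c])
        return vec

    return embed


def knn_predict(train_vecs, train_labels, query, k=3):
    """Supervised downstream task: majority vote among the k nearest embeddings."""
    order = sorted(range(len(train_vecs)),
                   key=lambda i: math.dist(train_vecs[i], query))
    votes = [train_labels[i] for i in order[:k]]
    return max(set(votes), key=votes.count)


def knn_outlier_score(train_vecs, query, k=3):
    """Unsupervised downstream task: mean distance to the k nearest embeddings."""
    dists = sorted(math.dist(v, query) for v in train_vecs)
    return mean(dists[:k])


# Hypothetical toy table (illustrative values only).
rows = [
    {"age": 25, "income": 30.0, "city": "A"},
    {"age": 27, "income": 32.0, "city": "A"},
    {"age": 26, "income": 31.0, "city": "A"},
    {"age": 52, "income": 80.0, "city": "B"},
    {"age": 55, "income": 85.0, "city": "B"},
    {"age": 53, "income": 82.0, "city": "B"},
]
labels = ["low", "low", "low", "high", "high", "high"]

# The embedder is fit once, without labels, and shared by both tasks.
embed = fit_embedder(rows, ["age", "income"], ["city"])
train_vecs = [embed(r) for r in rows]

prediction = knn_predict(
    train_vecs, labels, embed({"age": 54, "income": 83.0, "city": "B"}))
inlier_score = knn_outlier_score(
    train_vecs, embed({"age": 26, "income": 31.0, "city": "A"}))
outlier_score = knn_outlier_score(
    train_vecs, embed({"age": 95, "income": 300.0, "city": "A"}))

print(prediction, inlier_score < outlier_score)
```

The point of the sketch is the separation of concerns: `fit_embedder` never sees `labels`, so the resulting vectors are task-agnostic, and any downstream model can be swapped in without refitting the representation.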