LG AI

Jun 20, 2025

TabArena: A Living Benchmark for Machine Learning on Tabular Data

Nick Erickson, Lennart Purucker, Andrej Tschalzev, David Holzmüller, Prateek Mutalik Desai, David Salinas, Frank Hutter

arXiv:2506.16791v440.3108 citations

h-index: 13

Originality: Incremental advance

AI Analysis

This paper addresses the problem of outdated, static benchmarks facing researchers and practitioners in tabular machine learning. The advance is incremental because it builds on existing benchmarking concepts.

The authors address the lack of standardized, up-to-date benchmarks for machine learning on tabular data by introducing TabArena, a continuously maintained living benchmarking system. Their results show that gradient-boosted trees remain strong, deep learning methods catch up under larger time budgets with ensembling, and foundation models excel on smaller datasets. Ensembles across models advance the state of the art.

With the growing popularity of deep learning and foundation models for tabular data, the need for standardized and reliable benchmarks is higher than ever. However, current benchmarks are static. Their design is not updated even if flaws are discovered, model versions are updated, or new models are released. To address this, we introduce TabArena, the first continuously maintained living tabular benchmarking system. To launch TabArena, we manually curate a representative collection of datasets and well-implemented models, conduct a large-scale benchmarking study to initialize a public leaderboard, and assemble a team of experienced maintainers. Our results highlight the influence of validation method and ensembling of hyperparameter configurations to benchmark models at their full potential. While gradient-boosted trees are still strong contenders on practical tabular datasets, we observe that deep learning methods have caught up under larger time budgets with ensembling. At the same time, foundation models excel on smaller datasets. Finally, we show that ensembles across models advance the state-of-the-art in tabular machine learning. We observe that some deep learning models are overrepresented in cross-model ensembles due to validation set overfitting, and we encourage model developers to address this issue. We launch TabArena with a public leaderboard, reproducible code, and maintenance protocols to create a living benchmark available at https://tabarena.ai.
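The abstract mentions ensembling hyperparameter configurations and cross-model ensembles without describing the procedure. As a rough illustration only, the sketch below uses greedy forward selection with replacement on validation predictions, a common post-hoc ensembling technique. It is an assumption made for illustration, not the procedure confirmed by the abstract. The function name `greedy_ensemble_selection`, the synthetic data, and the noise levels are all hypothetical.

```python
import numpy as np


def greedy_ensemble_selection(val_preds, y_val, n_rounds=10):
    """Greedy forward selection with replacement over configurations.

    val_preds: array of shape (n_configs, n_samples), validation predictions.
    y_val: array of shape (n_samples,), validation targets.
    Returns per-configuration weights that sum to 1.
    """
    val_preds = np.asarray(val_preds, dtype=float)
    y_val = np.asarray(y_val, dtype=float)
    n_configs = val_preds.shape[0]
    counts = np.zeros(n_configs, dtype=int)
    running_sum = np.zeros_like(y_val)
    for step in range(1, n_rounds + 1):
        # Validation MSE of the ensemble if each candidate were added next.
        candidate_means = (running_sum + val_preds) / step
        errors = ((candidate_means - y_val) ** 2).mean(axis=1)
        best = int(np.argmin(errors))
        counts[best] += 1
        running_sum += val_preds[best]
    return counts / counts.sum()


# Synthetic stand-in data (hypothetical): three configurations whose
# validation predictions differ only in noise level.
rng = np.random.default_rng(0)
y_val = rng.normal(size=200)
val_preds = np.stack(
    [y_val + rng.normal(scale=s, size=200) for s in (0.5, 0.7, 2.0)]
)

weights = greedy_ensemble_selection(val_preds, y_val, n_rounds=20)
ensemble_mse = float(((weights @ val_preds - y_val) ** 2).mean())
best_single_mse = float(((val_preds - y_val) ** 2).mean(axis=1).min())
```

Because the weights are fitted on validation data, a configuration that overfits the validation set can receive more weight than its test performance justifies. This is one way the overrepresentation noted in the abstract could arise.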
