LG, AI | Sep 3, 2024

PMLBmini: A Tabular Classification Benchmark Suite for Data-Scarce Applications

arXiv:2409.01635v1 | 4 citations | h-index: 3 | Has Code
Originality: Synthesis-oriented
AI Analysis

This work addresses a problem for researchers and practitioners in data-scarce applications by providing a benchmark for evaluating the data efficiency of their methods. The contribution is incremental, since it adapts existing datasets to a specific regime.

The authors address the lack of benchmarks for small tabular data by introducing PMLBmini, a suite of 44 binary classification datasets with sample sizes ≤ 500. They find that state-of-the-art AutoML and deep learning methods often fail to appreciably outperform a simple logistic regression baseline in the low-data regime, while also identifying scenarios where these methods are reasonable to apply.

In practice, we are often faced with small-sized tabular data. However, current tabular benchmarks are not geared towards data-scarce applications, making it very difficult to derive meaningful conclusions from empirical comparisons. We introduce PMLBmini, a tabular benchmark suite of 44 binary classification datasets with sample sizes $\leq$ 500. We use our suite to thoroughly evaluate current automated machine learning (AutoML) frameworks, off-the-shelf tabular deep neural networks, as well as classical linear models in the low-data regime. Our analysis reveals that state-of-the-art AutoML and deep learning approaches often fail to appreciably outperform even a simple logistic regression baseline, but we also identify scenarios where AutoML and deep learning methods are indeed reasonable to apply. Our benchmark suite, available at https://github.com/RicardoKnauer/TabMini, allows researchers and practitioners to analyze their own methods and challenge their data efficiency.
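The selection rule stated in the abstract (binary classification, at most 500 samples) can be illustrated with a minimal sketch. This is not the authors' code: the `DatasetMeta` record, the `select_small_binary` helper, and the toy catalog entries are hypothetical and exist only to show how such a filter might look.

```python
from dataclasses import dataclass

MAX_SAMPLES = 500  # size cap stated in the abstract (sample sizes <= 500)


@dataclass(frozen=True)
class DatasetMeta:
    """Hypothetical metadata record; not part of the PMLBmini code base."""
    name: str
    n_samples: int
    n_classes: int


def select_small_binary(datasets, max_samples=MAX_SAMPLES):
    """Keep binary classification datasets with at most max_samples rows."""
    return [
        d for d in datasets
        if d.n_classes == 2 and d.n_samples <= max_samples
    ]


# Made-up catalog entries, for illustration only.
catalog = [
    DatasetMeta("toy_a", 120, 2),   # small and binary: kept
    DatasetMeta("toy_b", 500, 2),   # exactly at the cap: kept
    DatasetMeta("toy_c", 501, 2),   # one sample over the cap: dropped
    DatasetMeta("toy_d", 300, 3),   # small but multiclass: dropped
]

selected = select_small_binary(catalog)
print([d.name for d in selected])
```

The cap is inclusive here because the abstract writes the bound as $\leq$ 500.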

Code Implementations: 1 repo
