LGAIMar 9

Distributional Regression with Tabular Foundation Models: Evaluating Probabilistic Predictions via Proper Scoring Rules

arXiv:2603.08206v117.62 citations
Predicted impact top 77% in LG · last 90 daysOriginality Incremental advance
AI Analysis

This paper addresses a critical gap in evaluating probabilistic predictions for tabular foundation models, which is important for researchers and practitioners who need more comprehensive assessment beyond point estimates.

The authors highlight a weakness in current tabular regression benchmarks, which primarily focus on point estimates like mean squared error, and propose using proper scoring rules, specifically the continuous ranked probability score (CRPS), to evaluate probabilistic predictions. They illustrate that the choice of scoring rule influences the inductive bias of trained models and advocate for finetuning or promptable tabular foundation models.

Prior-Data Fitted Networks (PFNs), such as TabPFN and TabICL, have revolutionized tabular deep learning by leveraging in-context learning for tabular data. These models are meant as foundation models for classification and regression settings and promise to greatly simplify deployment in practical settings because their performance is unprecedented (in terms of mean squared error or $R^2$, when measured on common benchmarks like TabArena or TALENT). However, we see an important weakness of current benchmarks for the regression setting: the current benchmarks focus on evaluating win rates and performance using metrics like (root) mean squared error or $R^2$. Therefore, these leaderboards (implicitly and explicitly) push researchers to optimize for machine learning pipelines which elicit a good mean value estimate. The main problem is that this approach only evaluates a point estimate (namely the mean estimator which is the Bayes estimator associated with the mean squared error loss). In this article we discuss the application of proper scoring rules for evaluating the goodness of probabilistic forecasts in distributional regression. We also propose to enhance common machine learning benchmarks with metrics for probabilistic regression. To improve the status quo and make the machine learning community aware of scoring rules for probabilistic regression, we advocate to use the continuous ranked probability score (CRPS) in benchmarks for probabilistic regression. However, we also illustrate that the choice of the scoring rule changes the inductive bias of the trained model. We, therefore, advocate for finetuning or promptable tabular foundation models.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes