ML | LG | Jan 29

Statsformer: Validated Ensemble Learning with LLM-Derived Semantic Priors

Erica Zhang, Naomi Sagan, Danny Tse, Fangzhao Zhang, Mert Pilanci, Jose Blanchet

arXiv:2601.21410v2 | 1.7 | h-index: 25 | Has Code

Originality: Incremental advance

AI Analysis

This work is aimed at researchers and practitioners who need to mitigate LLM hallucinations in statistical learning. It is an incremental advance: it builds on existing ensemble methods and adds a novel calibration approach.

Statsformer tackles the problem of integrating LLM-derived knowledge into supervised learning by introducing a guardrailed ensemble architecture that adaptively calibrates LLM-derived feature priors via cross-validation, ensuring it performs no worse than any convex combination of its base learners up to statistical error.
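Read schematically, the guarantee compares the risk of the fitted ensemble with the best convex combination of the base learners. The notation below is assumed here for illustration and is not taken from the paper: $R$ is the prediction risk, $f_1, \dots, f_K$ are the base learners in the library, $\Delta_K$ is the probability simplex, and $\varepsilon_n$ stands for the unspecified statistical error term.

```latex
R(\hat{f}) \;\le\; \min_{w \in \Delta_K} R\!\left(\sum_{k=1}^{K} w_k f_k\right) + \varepsilon_n,
\qquad
\Delta_K = \Bigl\{ w \in \mathbb{R}^K : w_k \ge 0,\ \sum_{k=1}^{K} w_k = 1 \Bigr\}
```

Because a learner that ignores the LLM prior is itself a point in $\Delta_K$ whenever it is in the library, a bound of this shape implies the ensemble cannot do much worse than the prior-free baseline.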

We introduce Statsformer, a principled framework for integrating large language model (LLM)-derived knowledge into supervised statistical learning. Existing approaches are limited in adaptability and scope: they either inject LLM guidance as an unvalidated heuristic, which is sensitive to LLM hallucination, or embed semantic information within a single fixed learner. Statsformer overcomes both limitations through a guardrailed ensemble architecture. We embed LLM-derived feature priors within an ensemble of linear and nonlinear learners, adaptively calibrating their influence via cross-validation. This design yields a flexible system with an oracle-style guarantee that it performs no worse than any convex combination of its in-library base learners, up to statistical error. Empirically, informative priors yield consistent performance improvements, while uninformative or misspecified LLM guidance is automatically downweighted, mitigating the impact of hallucinations across a diverse range of prediction tasks. An open-source implementation of Statsformer is available at https://github.com/pilancilab/statsformer.
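The calibration idea in the abstract can be illustrated with a small, self-contained sketch. This is not the Statsformer implementation or its API. It assumes a toy setup in which an LLM prior is a vector of per-feature relevance scores in (0, 1], the base learners are ridge regressions whose per-feature penalty is `lam * score ** (-alpha)`, and the prior strength `alpha` ranges over a small grid that includes 0 (no prior). Convex ensemble weights are then fitted on out-of-fold predictions with exponentiated-gradient descent. All names (`prior`, `alphas`, `lam`, `convex_weights`, and so on) and the synthetic data are hypothetical.

```python
import math
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def fit_ridge(X, y, penalties):
    """Minimize ||y - X b||^2 + sum_j penalties[j] * b_j^2 (no intercept)."""
    p = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) + (penalties[i] if i == j else 0.0)
          for j in range(p)] for i in range(p)]
    b = [sum(row[i] * t for row, t in zip(X, y)) for i in range(p)]
    return solve(A, b)

def out_of_fold(X, y, penalties, folds=5):
    """Return one held-out prediction per sample from K-fold cross-validation."""
    n = len(y)
    preds = [0.0] * n
    for f in range(folds):
        train = [i for i in range(n) if i % folds != f]
        beta = fit_ridge([X[i] for i in train], [y[i] for i in train], penalties)
        for i in range(f, n, folds):
            preds[i] = sum(v * c for v, c in zip(X[i], beta))
    return preds

def mse(pred, y):
    return sum((a - b) ** 2 for a, b in zip(pred, y)) / len(y)

def convex_weights(oof, y, steps=2000):
    """Exponentiated-gradient descent on the simplex for the blended squared loss."""
    k, n = len(oof), len(y)
    # Step size 1/L, where L is the largest entry of the loss Hessian (2/n) P^T P.
    L = max(2.0 / n * sum(v * v for v in col) for col in oof)
    w = [1.0 / k] * k
    for _ in range(steps):
        blend = [sum(w[j] * oof[j][i] for j in range(k)) for i in range(n)]
        grad = [2.0 / n * sum((blend[i] - y[i]) * oof[j][i] for i in range(n))
                for j in range(k)]
        w = [w[j] * math.exp(-grad[j] / L) for j in range(k)]
        total = sum(w)
        w = [v / total for v in w]
    return w

# Synthetic data (assumption): only the first two of six features carry signal.
rng = random.Random(0)
n, p = 80, 6
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [2.0 * row[0] - 1.5 * row[1] + rng.gauss(0, 0.5) for row in X]

# Hypothetical misleading prior: it scores the two true signal features as least relevant.
prior = [0.2, 0.2, 1.0, 1.0, 1.0, 1.0]
alphas = [0.0, 1.0, 2.0]   # alpha = 0 ignores the prior entirely
lam = 5.0

oof = [out_of_fold(X, y, [lam * s ** (-a) for s in prior]) for a in alphas]
losses = [mse(col, y) for col in oof]
weights = convex_weights(oof, y)
blend = [sum(w * col[i] for w, col in zip(weights, oof)) for i in range(n)]
blend_loss = mse(blend, y)

for a, loss, w in zip(alphas, losses, weights):
    print(f"alpha={a}: CV MSE={loss:.3f}, ensemble weight={w:.3f}")
print(f"blended CV MSE={blend_loss:.3f}")
```

Keeping `alpha = 0` in the grid is what provides the guardrail in this sketch: the prior-free learner is always a candidate, so the fitted convex combination can fall back to it when the prior hurts. Replacing `prior` with scores that favor the first two features shows the opposite case. Note that the blended loss here is measured on the same out-of-fold predictions used to fit the weights, so it is an optimistic estimate; a held-out test set would be needed for an honest comparison.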
