LG AI AP ML Aug 27, 2025

Robustness is Important: Limitations of LLMs for Data Fitting

Hejia Liu, Mochen Yang, Gediminas Adomavicius

arXiv:2508.19563v37.12 citations h-index: 47

Originality: Incremental advance

AI Analysis

The paper's findings expose a fundamental limitation for practitioners who rely on LLMs for data analysis. The work is an incremental but important critique of current methods.

The paper identifies a critical vulnerability in using Large Language Models (LLMs) for data fitting, showing that task-irrelevant changes like altering variable names can cause prediction errors to vary by up to 82%, and finds that even specialized models like TabPFN lack robustness.

Large Language Models (LLMs) are being applied in a wide array of settings, well beyond the typical language-oriented use cases. In particular, LLMs are increasingly used as a plug-and-play method for fitting data and generating predictions. Prior work has shown that LLMs, via in-context learning or supervised fine-tuning, can perform competitively with many tabular supervised learning techniques in terms of predictive performance. However, we identify a critical vulnerability of using LLMs for data fitting: changes to data representation that are completely irrelevant to the underlying learning task can drastically alter LLMs' predictions on the same data. For example, simply changing variable names can sway the size of prediction error by as much as 82% in certain settings. Such prediction sensitivity with respect to task-irrelevant variations manifests under both in-context learning and supervised fine-tuning, for both closed-weight and open-weight general-purpose LLMs. Moreover, by examining the attention scores of an open-weight LLM, we discover a non-uniform attention pattern: training examples and variable names/values that happen to occupy certain positions in the prompt receive more attention when output tokens are generated, even though different positions are expected to receive roughly the same attention. This partially explains the sensitivity in the presence of task-irrelevant variations. We also consider a state-of-the-art tabular foundation model (TabPFN) trained specifically for data fitting. Despite being explicitly designed to achieve prediction robustness, TabPFN is still not immune to task-irrelevant variations. Overall, despite LLMs' impressive predictive capabilities, they currently lack even the basic level of robustness needed to be used as a principled data-fitting tool.
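The variable-name sensitivity check described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the prompt layout, the function names (`serialize_prompt`, `error_by_naming`, `relative_spread`), and the spread metric `(max - min) / min` are all assumptions made for illustration, and `toy_predict` is a deliberately name-sensitive stand-in for a real LLM call.

```python
from typing import Callable, Dict, List, Sequence


def serialize_prompt(train_X: Sequence[Sequence[float]], train_y: Sequence[float],
                     test_x: Sequence[float], names: Sequence[str],
                     target_name: str = "y") -> str:
    """Render in-context examples plus one query row as plain text (assumed layout)."""
    lines: List[str] = []
    for x, y in zip(train_X, train_y):
        feats = ", ".join(f"{n} = {v}" for n, v in zip(names, x))
        lines.append(f"{feats}, {target_name} = {y}")
    feats = ", ".join(f"{n} = {v}" for n, v in zip(names, test_x))
    lines.append(f"{feats}, {target_name} =")  # the model is asked to complete this line
    return "\n".join(lines)


def mean_abs_error(preds: Sequence[float], targets: Sequence[float]) -> float:
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(targets)


def error_by_naming(predict_fn: Callable[[str], float],
                    train_X, train_y, test_X, test_y,
                    naming_schemes: Dict[str, Sequence[str]]) -> Dict[str, float]:
    """Score the same data under each naming scheme; only the variable names differ."""
    errors = {}
    for label, names in naming_schemes.items():
        preds = [predict_fn(serialize_prompt(train_X, train_y, x, names)) for x in test_X]
        errors[label] = mean_abs_error(preds, test_y)
    return errors


def relative_spread(errors: Dict[str, float]) -> float:
    """Assumed metric: (largest error - smallest error) / smallest error."""
    lo, hi = min(errors.values()), max(errors.values())
    if lo == 0:
        return 0.0 if hi == 0 else float("inf")
    return (hi - lo) / lo


def toy_predict(prompt: str) -> float:
    """Hypothetical stand-in for an LLM: sums the query features, plus a bias
    that depends on the first variable name, to mimic name sensitivity."""
    query = prompt.splitlines()[-1]
    pairs = [p.split(" = ") for p in query.split(", ")[:-1]]  # drop the "y =" stub
    total = sum(float(v) for _, v in pairs)
    bias = 0.5 if pairs[0][0].startswith("x") else 0.0
    return total + bias


if __name__ == "__main__":
    schemes = {"generic": ["x1", "x2"], "semantic": ["age", "income"]}
    errs = error_by_naming(toy_predict,
                           [[1.0, 2.0], [2.0, 3.0]], [3.1, 5.1],
                           [[3.0, 4.0], [0.0, 1.0]], [7.1, 1.1],
                           schemes)
    print(errs, relative_spread(errs))
```

In practice `predict_fn` would wrap a real model call and parse a number from its completion. A robust data-fitting method should yield a relative spread near zero, because the naming schemes carry no task-relevant information.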

