CL | Mar 12

Prompting Underestimates LLM Capability for Time Series Classification

Dan Schumacher, Erfan Nourbakhsh, Rocky Slavin, Anthony Rios

arXiv:2601.03464 | 84.3 | h-index: 10

AI Analysis

This work addresses a critical evaluation gap for AI researchers and practitioners: it reveals that LLMs encode meaningful temporal structure despite poor prompt-based performance. Its contribution is incremental, chiefly refining assessment methods.

The paper shows that prompt-based evaluations underestimate the capabilities of large language models (LLMs) in time series classification: zero-shot prompting yields near-chance F1 scores (0.15-0.26), whereas linear probes on internal representations raise average F1 to 0.61-0.67, often matching specialized models.

Prompt-based evaluations suggest that large language models (LLMs) perform poorly on time series classification, raising doubts about whether they encode meaningful temporal structure. We show that this conclusion reflects limitations of prompt-based generation rather than the model's representational capacity by directly comparing prompt outputs with linear probes over the same internal representations. While zero-shot prompting performs near chance, linear probes improve average F1 from 0.15-0.26 to 0.61-0.67, often matching or exceeding specialized time series models. Layer-wise analyses further show that class-discriminative time series information emerges in early transformer layers and is amplified by visual and multimodal inputs. Together, these results demonstrate a systematic mismatch between what LLMs internally represent and what prompt-based evaluation reveals, leading current evaluations to underestimate their time series understanding.
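
To make the idea of a linear probe concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not say which layers, pooling, or classifier they used, so the sketch assumes a softmax-regression probe, which is one common choice. Synthetic Gaussian vectors stand in for frozen LLM hidden states, and every name (make_synthetic_features, train_linear_probe, predict, macro_f1) is hypothetical. F1 is averaged per class (macro) here, which is also an assumption.

```python
import math
import random


def make_synthetic_features(rng, n_per_class, num_classes=3, dim=4, noise=0.5):
    """Stand-in for frozen hidden states: one noisy cluster per class (assumption)."""
    pairs = []
    for label in range(num_classes):
        for _ in range(n_per_class):
            # Class `label` has mean 3.0 on coordinate `label`, 0.0 elsewhere.
            vec = [rng.gauss(3.0 if j == label else 0.0, noise) for j in range(dim)]
            pairs.append((vec, label))
    rng.shuffle(pairs)
    return [p[0] for p in pairs], [p[1] for p in pairs]


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_scores(weights, biases, x):
    return [sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
            for w, b in zip(weights, biases)]


def train_linear_probe(features, labels, num_classes, epochs=50, lr=0.1):
    """Fit a softmax-regression probe with SGD; the input features stay fixed."""
    dim = len(features[0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    biases = [0.0] * num_classes
    for _ in range(epochs):
        for x, y in zip(features, labels):
            probs = softmax(class_scores(weights, biases, x))
            for c in range(num_classes):
                # Cross-entropy gradient w.r.t. the class-c logit.
                grad = probs[c] - (1.0 if c == y else 0.0)
                biases[c] -= lr * grad
                for j in range(dim):
                    weights[c][j] -= lr * grad * x[j]
    return weights, biases


def predict(weights, biases, x):
    scores = class_scores(weights, biases, x)
    return max(range(len(scores)), key=scores.__getitem__)


def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1 scores."""
    f1_scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / num_classes
```

In a real experiment, the synthetic vectors would be replaced by representations extracted from one layer of the model, and the probe would be trained and scored separately per layer to support a layer-wise comparison. Only the probe's weights are learned, so its score reflects what is linearly readable from the representations rather than what the model produces when prompted.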
