LG · Feb 9, 2025

Investigating Compositional Reasoning in Time Series Foundation Models

CMU
arXiv:2502.06037v2 · 6 citations · h-index: 12 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses a fundamental gap in understanding the reasoning capabilities of time series models, which matters for reliable forecasting in domains like finance and healthcare. It is an incremental advance, as it builds on existing TSFM and LLM literature.

The study investigates whether time series foundation models (TSFMs) can reason compositionally rather than just memorize patterns, by evaluating 16 forecasting models on synthetic and real-world datasets. It finds that patch-based Transformers achieve the best reasoning performance, closely followed by residualized MLP-based architectures, which are 97% less computationally complex in terms of FLOPs and have 86% fewer trainable parameters. In some zero-shot out-of-distribution scenarios, these models can outperform moving average and exponential smoothing baselines trained on in-distribution data.

Large pre-trained time series foundation models (TSFMs) have demonstrated promising zero-shot performance across a wide range of domains. However, a question remains: Do TSFMs succeed by memorizing patterns in training data, or do they possess the ability to reason about such patterns? While reasoning is a topic of great interest in the study of Large Language Models (LLMs), it is undefined and largely unexplored in the context of TSFMs. In this work, inspired by language modeling literature, we formally define compositional reasoning in forecasting and distinguish it from in-distribution generalization. We evaluate the reasoning and generalization capabilities of 16 popular deep learning forecasting models on multiple synthetic and real-world datasets. Additionally, through controlled studies, we systematically examine which design choices in 7 popular open-source TSFMs contribute to improved reasoning capabilities. Our study yields key insights into the impact of TSFM architecture design on compositional reasoning and generalization. We find that patch-based Transformers have the best reasoning performance, closely followed by residualized MLP-based architectures, which are 97% less computationally complex in terms of FLOPs and 86% smaller in terms of the number of trainable parameters. Interestingly, in some zero-shot out-of-distribution scenarios, these models can outperform moving average and exponential smoothing statistical baselines trained on in-distribution data. Only a few design choices, such as the tokenization method, had a significant (negative) impact on Transformer model performance.
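To make the evaluation setup in the abstract more concrete, here is a minimal sketch of a compositional test series scored with the two statistical baselines the abstract names. It is an illustration under assumptions, not the authors' code: the abstract does not say how patterns are composed, so the sketch assumes additive composition of two sinusoids, and every function name, period, window size, and smoothing factor is hypothetical.

```python
import math


def sinusoid(n, period, amplitude=1.0):
    """One basis pattern (assumed): a sine wave with the given period."""
    return [amplitude * math.sin(2 * math.pi * t / period) for t in range(n)]


def compose(*series):
    """Additive composition (an assumption): element-wise sum of the series."""
    return [sum(values) for values in zip(*series)]


def moving_average_forecast(history, horizon, window=8):
    """Flat forecast at the mean of the last `window` observations."""
    recent = history[-window:]
    level = sum(recent) / len(recent)
    return [level] * horizon


def exp_smoothing_forecast(history, horizon, alpha=0.3):
    """Simple exponential smoothing: flat forecast at the final smoothed level."""
    level = history[0]
    for y in history[1:]:
        level = alpha * y + (1 - alpha) * level
    return [level] * horizon


def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


n, horizon = 256, 32

# Basis patterns a model might see individually during training.
basis_a = sinusoid(n + horizon, period=16)
basis_b = sinusoid(n + horizon, period=64, amplitude=0.5)

# Held-out composition: the model would never see this sum during training.
composed = compose(basis_a, basis_b)
context, target = composed[:n], composed[n:]

scores = {
    "moving_average": mae(target, moving_average_forecast(context, horizon)),
    "exp_smoothing": mae(target, exp_smoothing_forecast(context, horizon)),
}
print(scores)
```

A forecasting model trained only on `basis_a`-like and `basis_b`-like series could be scored on `target` in the same way, and its error compared against the two entries in `scores`.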

Code Implementations: 1 repo