LGFeb 2
Universal Redundancies in Time Series Foundation ModelsAnthony Bao, Venkata Hasith Vattikuti, Jeffrey Lai et al.
Time Series Foundation Models (TSFMs) leverage extensive pretraining to accurately predict unseen time series during inference, without the need for task-specific fine-tuning. Through large-scale evaluations on standard benchmarks, we find that leading transformer-based TSFMs exhibit redundant components in their intermediate layers. We introduce a set of tools for mechanistic interpretability of TSFMs, including ablations of specific components and direct logit attribution on the residual stream. Our findings are consistent across several leading TSFMs with diverse architectures, and across a diverse set of real-world and synthetic time-series datasets. We discover that all models in our study are robust to ablations of entire layers. Furthermore, we develop a theoretical framework framing transformers as kernel regressors, motivating a purely intrinsic strategy for ablating heads based on the stable rank of the per-head projection matrices. Using this approach, we uncover the specific heads responsible for degenerate phenomena widely observed in TSFMs, such as parroting of motifs from the context and seasonality bias. Our study sheds light on the universal properties of this emerging class of architectures for continuous-time sequence modeling.
LGMay 19, 2025
Panda: A pretrained forecast model for chaotic dynamicsJeffrey Lai, Anthony Bao, William Gilpin
Chaotic systems are intrinsically sensitive to small errors, challenging efforts to construct predictive data-driven models of real-world dynamical systems such as fluid flows or neuronal activity. Prior efforts comprise either specialized models trained separately on individual time series, or foundation models trained on vast time series databases with little underlying dynamical structure. Motivated by dynamical systems theory, we present Panda, Patched Attention for Nonlinear Dynamics. We train Panda on a novel synthetic, extensible dataset of $2 \times 10^4$ chaotic dynamical systems that we discover using an evolutionary algorithm. Trained purely on simulated data, Panda exhibits emergent properties: zero-shot forecasting of unseen chaotic systems preserving both short-term accuracy and long-term statistics. Despite having been trained only on low-dimensional ordinary differential equations, Panda spontaneously develops the ability to predict partial differential equations without retraining. We also demonstrate a neural scaling law for differential equations, underscoring the potential of pretrained models for probing abstract mathematical domains like nonlinear dynamics.
LGFeb 21
Transformers for dynamical systems learn transfer operators in-contextAnthony Bao, Jeffrey Lai, William Gilpin
Large-scale foundation models for scientific machine learning adapt to physical settings unseen during training, such as zero-shot transfer between turbulent scales. This phenomenon, in-context learning, challenges conventional understanding of learning and adaptation in physical systems. Here, we study in-context learning of dynamical systems in a minimal setting: we train a small two-layer, single-head transformer to forecast one dynamical system, and then evaluate its ability to forecast a different dynamical system without retraining. We discover an early tradeoff in training between in-distribution and out-of-distribution performance, which manifests as a secondary double descent phenomenon. We discover that attention-based models apply a transfer-operator forecasting strategy in-context. They (1) lift low-dimensional time series using delay embedding, to detect the system's higher-dimensional dynamical manifold, and (2) identify and forecast long-lived invariant sets that characterize the global flow on this manifold. Our results clarify the mechanism enabling large pretrained models to forecast unseen physical systems at test without retraining, and they illustrate the unique ability of attention-based models to leverage global attractor information in service of short-term forecasts.
MLJan 13, 2025
Gaussian Universality for Diffusion ModelsReza Ghane, Anthony Bao, Danil Akhtiamov et al.
We investigate Gaussian Universality for data distributions generated via diffusion models. By Gaussian Universality we mean that the test error of a generalized linear model $f(\mathbf{W})$ trained for a classification task on the diffusion data matches the test error of $f(\mathbf{W})$ trained on the Gaussian Mixture with matching means and covariances per class.In other words, the test error depends only on the first and second order statistics of the diffusion-generated data in the linear setting. As a corollary, the analysis of the test error for linear classifiers can be reduced to Gaussian data from diffusion-generated data. Analysing the performance of models trained on synthetic data is a pertinent problem due to the surge of methods such as \cite{sehwag2024stretchingdollardiffusiontraining}. Moreover, we show that, for any $1$- Lipschitz scalar function $φ$, $φ(\mathbf{x})$ is close to $\mathbb{E} φ(\mathbf{x})$ with high probability for $\mathbf{x}$ sampled from the conditional diffusion model corresponding to each class. Finally, we note that current approaches for proving universality do not apply to diffusion-generated data as the covariance matrices of the data tend to have vanishing minimum singular values, contrary to the assumption made in the literature. This leaves extending previous mathematical universality results as an intriguing open question.