Beyond Similarity: Temporal Operator Attention for Time Series Analysis
For time-series practitioners, this work targets a structural limitation of softmax attention, its restriction to convex combinations of inputs, and reports consistent gains across multiple tasks and backbones.
The paper identifies a mismatch between standard attention's convex combination and the signed, oscillatory transformations needed for time-series modeling, and proposes Temporal Operator Attention (TOA) with Stochastic Operator Regularization to enable direct signed mixing. TOA consistently improves performance over backbones like PatchTST and iTransformer across forecasting, anomaly detection, and classification benchmarks.
A persistent paradox in time-series forecasting is that structurally simple MLP and linear models often outperform high-capacity Transformers. We argue that this gap arises from a mismatch in the sequence-modeling primitive: while many time-series dynamics are governed by global temporal operators (e.g., filtering and harmonic structure), standard attention forms each output as a convex combination of inputs. This restricts its ability to represent signed and oscillatory transformations that are fundamental to temporal signal processing. We formalize this limitation as a simplex-constrained mixing bottleneck in softmax attention, which becomes especially restrictive for operator-driven time-series tasks. To address this, we propose Temporal Operator Attention (TOA), a framework that augments attention with explicit, learnable sequence-space operators, enabling direct signed mixing across time while preserving input-dependent adaptivity. To make dense $N \times N$ operators practical, we introduce Stochastic Operator Regularization, a high-variance dropout mechanism that stabilizes training and prevents trivial memorization. Across forecasting, anomaly detection, and classification benchmarks, TOA consistently improves performance when integrated into standard backbones such as PatchTST and iTransformer, with particularly strong gains in reconstruction-heavy tasks. These results suggest that explicit operator learning is a key ingredient for effective time-series modeling.
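The simplex-constrained mixing bottleneck can be illustrated with a small NumPy sketch. It is not the paper's implementation: the toy signal, the random attention scores, and the helper names `softmax` and `first_difference_operator` are assumptions chosen for illustration. The sketch shows that any row-stochastic softmax mixing keeps each output inside the range of the input values, whereas a signed sequence-space operator (here a first-difference filter, one simple example of a global temporal operator) produces outputs outside that range.

```python
import numpy as np


def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=axis, keepdims=True)


def first_difference_operator(n):
    """Signed N x N operator D with (D x)[t] = x[t] - x[t-1] and (D x)[0] = x[0].

    Hypothetical helper used only for this illustration.
    """
    return np.eye(n) - np.eye(n, k=-1)


rng = np.random.default_rng(0)
n = 8

# Toy oscillating signal, shifted so every value is strictly positive.
x = np.sin(2 * np.pi * np.arange(n) / n) + 2.0

# Softmax mixing: every row is non-negative and sums to one,
# so each output is a convex combination of the inputs.
attn = softmax(rng.normal(size=(n, n)))
convex_out = attn @ x

# Signed mixing: rows contain negative entries and need not sum to one.
diff_op = first_difference_operator(n)
signed_out = diff_op @ x

print("input range:        ", x.min(), x.max())
print("convex output range:", convex_out.min(), convex_out.max())
print("signed output range:", signed_out.min(), signed_out.max())
```

Whatever scores are fed to the softmax, the convex outputs stay between the smallest and largest input value, so no choice of attention weights in a single mixing step can reproduce the negative values of the differenced signal. Subsequent value and output projections can rescale or flip signs per feature, but they act on the feature dimension, not on the mixing across time.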