Why Attention Fails: The Degeneration of Transformers into MLPs in Time Series Forecasting
This work addresses a critical bottleneck for researchers and practitioners in time series analysis by revealing fundamental issues with transformer architectures in this domain; the contribution is incremental but insightful.
The paper investigates why transformers underperform in time series forecasting, finding that transformer blocks often degenerate into simple MLPs because flawed embedding methods prevent transformers from operating in a well-structured latent space.
Transformer-based architectures have achieved high performance in natural language processing and computer vision, yet many studies have shown that they offer no clear advantage in time series forecasting and, in some cases, even underperform simple linear baselines. However, most of these studies have not thoroughly explored why transformers fail. To better understand time-series transformers (TST), we designed a series of experiments that progressively modify transformers into MLPs to investigate the impact of the attention mechanism. Surprisingly, transformer blocks often degenerate into simple MLPs in existing time-series transformers. We designed an interpretable dataset to investigate why the attention mechanism fails and found that it does not work in the expected way. We theoretically analyzed the reasons behind this phenomenon, demonstrating that current embedding methods fail to allow transformers to function in a well-structured latent space, and further analyzed the deeper underlying causes of the embedding failure.
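To make the idea of "modifying a transformer into an MLP" concrete, the following is a minimal sketch of one possible ablation, not the authors' actual experimental protocol. It assumes a single attention head with no output projection, and it assumes that the ablation replaces the softmax attention weights with the identity matrix, so that each token only sees itself. Under those assumptions the attention layer reduces to a token-wise linear map, which is the first layer of an MLP applied independently to each token. All names (`self_attention`, `tokenwise_linear`, the weight matrices and the toy input) are hypothetical and chosen for illustration.

```python
import math


def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(X, Wq, Wk, Wv, ablate=False):
    """Single-head self-attention over tokens (the rows of X).

    With ablate=True the softmax weights are replaced by the identity
    matrix (an assumed ablation), so no information moves between tokens.
    Returns the output and the attention weights that were used.
    """
    V = matmul(X, Wv)
    n = len(X)
    if ablate:
        weights = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    else:
        Q, K = matmul(X, Wq), matmul(X, Wk)
        scale = math.sqrt(len(Wq[0]))
        scores = matmul(Q, transpose(K))
        weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, V), weights


def tokenwise_linear(X, W):
    """Apply the same linear map to every token independently (MLP-style)."""
    return [matmul([x], W)[0] for x in X]


# Toy input: 3 tokens with 2 features each (illustrative values only).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Wq = [[0.5, -0.2], [0.1, 0.3]]
Wk = [[0.4, 0.1], [-0.3, 0.2]]
Wv = [[1.0, 2.0], [3.0, 4.0]]

full_out, full_w = self_attention(X, Wq, Wk, Wv)
abl_out, _ = self_attention(X, Wq, Wk, Wv, ablate=True)
mlp_out = tokenwise_linear(X, Wv)

# Largest elementwise gap between real attention and the token-wise map.
# A gap near zero on real data would suggest attention adds little mixing.
gap = max(abs(a - b) for ra, rb in zip(full_out, mlp_out) for a, b in zip(ra, rb))
```

In this sketch `abl_out` coincides with `mlp_out`, while `full_out` differs because the softmax weights mix values across tokens. Comparing forecasting accuracy before and after such a substitution is one way an ablation could indicate whether attention contributes anything beyond what a token-wise MLP already provides.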