Continuous-Time Attention: PDE-Guided Mechanisms for Long-Sequence Transformers
This work addresses a key bottleneck in scaling Transformers to long sequences, a capability that is crucial for applications such as natural language processing and time-series analysis. It proposes a new method rather than an incremental improvement to existing ones.
The paper tackles the challenge of extremely long input sequences in Transformers by proposing Continuous-Time Attention, a framework that infuses partial differential equations (PDEs) into the attention mechanism. The theoretical analysis indicates better optimization landscapes and polynomial, rather than exponential, decay of distant interactions. Empirical benchmarks show consistent gains over both standard and specialized long-sequence Transformer variants.
We propose a novel framework, Continuous-Time Attention, which infuses partial differential equations (PDEs) into the Transformer's attention mechanism to address the challenges of extremely long input sequences. Instead of relying solely on a static attention matrix, we allow attention weights to evolve over a pseudo-time dimension via diffusion, wave, or reaction-diffusion dynamics. This mechanism systematically smooths local noise, enhances long-range dependencies, and stabilizes gradient flow. Theoretically, our analysis shows that PDE-based attention leads to better optimization landscapes and polynomial rather than exponential decay of distant interactions. Empirically, we benchmark our method on diverse experiments, demonstrating consistent gains over both standard and specialized long-sequence Transformer variants. Our findings highlight the potential of PDE-based formulations to enrich attention mechanisms with continuous-time dynamics and global coherence.
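
To make the idea of evolving attention weights over pseudo-time concrete, the following is a minimal sketch of the diffusion variant only. The abstract does not specify a discretization, so everything here is an assumption for illustration: an explicit Euler step, a 1D finite-difference Laplacian along the key axis, reflecting boundaries, and the hypothetical names `diffuse_attention`, `steps`, and `dt`. It is not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def diffuse_attention(attn, steps=4, dt=0.2):
    """Evolve attention rows under a discrete heat equation (assumed scheme).

    attn  : (num_queries, num_keys) row-stochastic attention matrix
    steps : number of pseudo-time steps (hypothetical parameter)
    dt    : pseudo-time step size; the explicit scheme needs dt <= 0.5
    """
    a = attn.copy()
    for _ in range(steps):
        # Edge padding gives reflecting (zero-flux) boundaries,
        # so each row keeps its total mass.
        padded = np.pad(a, ((0, 0), (1, 1)), mode="edge")
        laplacian = padded[:, :-2] - 2.0 * a + padded[:, 2:]
        a = a + dt * laplacian
    return a

# Usage: smooth a standard scaled dot-product attention matrix.
rng = np.random.default_rng(0)
queries = rng.normal(size=(6, 8))
keys = rng.normal(size=(10, 8))
static_attn = softmax(queries @ keys.T / np.sqrt(8))
smoothed_attn = diffuse_attention(static_attn, steps=5, dt=0.2)
```

With `dt <= 0.5`, each update is a convex combination of neighboring weights, so in this sketch the rows remain non-negative and still sum to one while sharp peaks are spread across adjacent keys. Wave or reaction-diffusion dynamics would need a different update rule, which is not shown here.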