Flash STU: Fast Spectral Transform Units
This work addresses the trade-off between computational efficiency and model expressiveness in sequence modeling, with applications to language modeling, robotics control, and linear dynamical systems. It represents an incremental improvement over existing state-space models.
The paper tackles the challenge of balancing computational efficiency with model expressiveness in sequence modeling. It proposes the Flash STU architecture, a hybrid model that interleaves spectral state-space model layers with sliding window attention, and finds that, given a fixed parameter budget, Flash STU consistently outperforms the Transformer and other state-space models such as S4 and Mamba-2.
Recent advances in state-space model architectures have shown great promise for efficient sequence modeling, but challenges remain in balancing computational efficiency with model expressiveness. We propose the Flash STU architecture, a hybrid model that interleaves spectral state space model layers with sliding window attention, enabling scalability to billions of parameters for language modeling while maintaining a near-linear time complexity. We evaluate the Flash STU and its variants on diverse sequence prediction tasks, including linear dynamical systems, robotics control, and language modeling. We find that, given a fixed parameter budget, the Flash STU architecture consistently outperforms the Transformer and other leading state-space models such as S4 and Mamba-2.
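The interleaving and the sliding window can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the "stu"/"swa" labels, and the one-attention-layer-in-every-two schedule are all hypothetical, since the abstract does not state the interleaving ratio or the window size. The sketch only shows why causal sliding window attention scores a number of query-key pairs that grows linearly with sequence length once the window is fixed.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query i may attend to key j.

    Causal (j <= i) and local (only the `window` most recent positions,
    including i itself). `window` is an illustrative parameter.
    """
    return [[0 <= i - j < window for j in range(seq_len)]
            for i in range(seq_len)]


def attention_cost(seq_len, window):
    """Count the (query, key) pairs that the mask allows to be scored."""
    return sum(sum(row) for row in sliding_window_mask(seq_len, window))


def layer_schedule(num_layers, attention_every=2):
    """Hypothetical interleaving: every `attention_every`-th layer is
    sliding window attention ("swa"), the rest are spectral layers ("stu").
    The real ratio is an assumption, not taken from the paper.
    """
    return ["swa" if (k + 1) % attention_every == 0 else "stu"
            for k in range(num_layers)]


if __name__ == "__main__":
    print(layer_schedule(4))
    # Windowed cost versus full causal cost for a sequence of length 8.
    print(attention_cost(8, 3), attention_cost(8, 8))
```

With a fixed window w, each query scores at most w keys, so the attention layers cost on the order of L·w pairs for a sequence of length L, compared with roughly L²/2 pairs for full causal attention.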