Uncovering the Spectral Bias in Diagonal State Space Models
This work addresses efficient and effective initialization of diagonal state space models, a problem relevant to researchers and practitioners in sequence modeling, although the contribution appears incremental because it builds on existing diagonal SSM frameworks.
The paper investigates initialization schemes for diagonal state space models from a frequency perspective, uncovers their learning biases, and proposes a new diagonal initialization method called S4D-DFouT. The approach achieves state-of-the-art results on the Long Range Arena benchmark and makes it possible to train from scratch on very large datasets such as PathX-256.
Current methods for initializing state space model (SSM) parameters mainly rely on the \textit{HiPPO framework}, which is based on an online approximation of orthogonal polynomials. Recently, diagonal alternatives have been shown to reach a similar level of performance while being significantly more efficient, owing to the simpler kernel computation. However, the \textit{HiPPO framework} does not explicitly study the role of its diagonal variants. In this paper, we take a further step and investigate the role of diagonal SSM initialization schemes from the frequency perspective. Our work seeks to systematically understand how to parameterize these models and to uncover the learning biases inherent in diagonal state space models. Based on our observations, we propose a diagonal initialization in the discrete Fourier domain, \textit{S4D-DFouT}. The insights into the role of pole placement in the initialization enable us to scale these models further and achieve state-of-the-art results on the Long Range Arena benchmark, allowing us to train from scratch on very large datasets such as PathX-256.
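To make the "simplification in the kernel computation" concrete, the following is a minimal sketch of how a diagonal SSM turns a set of discrete-time poles into a convolution kernel, K[l] = Re(sum_n c_n b_n p_n^l). It is an illustration only, not the S4D-DFouT definition, which the abstract does not spell out. The function names `dft_poles` and `diagonal_ssm_kernel`, the `decay` parameter, and the choice of placing poles at evenly spaced DFT frequencies just inside the unit circle are all assumptions made for this example.

```python
import cmath
import math


def dft_poles(n_states, seq_len, decay=0.5):
    """Hypothetical pole placement (an assumption, not the paper's scheme).

    Pole k sits at the k-th DFT frequency of a length-`seq_len` signal,
    with radius exp(-decay / seq_len) so every pole is strictly inside
    the unit circle and the kernel decays slowly over the sequence.
    """
    radius = math.exp(-decay / seq_len)
    return [radius * cmath.exp(2j * math.pi * k / seq_len)
            for k in range(n_states)]


def diagonal_ssm_kernel(poles, b, c, seq_len):
    """Convolution kernel of a diagonal SSM.

    With a diagonal state matrix, the matrix power A^l reduces to
    elementwise powers of the poles, so each kernel tap is a plain sum:
        K[l] = Re( sum_n c_n * b_n * pole_n ** l )
    """
    kernel = []
    for step in range(seq_len):
        tap = sum(cn * bn * p ** step for p, bn, cn in zip(poles, b, c))
        kernel.append(complex(tap).real)
    return kernel


if __name__ == "__main__":
    n_states, seq_len = 4, 16
    poles = dft_poles(n_states, seq_len)
    kernel = diagonal_ssm_kernel(poles, [1.0] * n_states, [1.0] * n_states, seq_len)
    print([round(tap, 3) for tap in kernel])
```

Each state contributes one damped complex sinusoid, so the angle of a pole sets the frequency that state responds to and its radius sets how quickly that response fades. This is the sense in which pole placement at initialization shapes which frequencies a diagonal SSM can represent from the start.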