LG | Dec 31, 2024

Understanding and Mitigating Bottlenecks of State Space Models through the Lens of Recency and Over-smoothing

arXiv:2501.00658v2 | 14 citations | h-index: 23 | Has Code | ICLR
Originality: Incremental advance
AI Analysis

This paper addresses scalability issues in SSMs for long-sequence modeling, though it is an incremental improvement over existing methods.

The paper identifies that Structured State Space Models (SSMs) suffer from recency bias and over-smoothing, which limit their ability to handle long sequences and scale in depth, and proposes a polarization technique that improves associative recall accuracy and enables deeper architectures.

Structured State Space Models (SSMs) have emerged as alternatives to transformers. While SSMs are often regarded as effective in capturing long-sequence dependencies, we rigorously demonstrate that they are inherently limited by strong recency bias. Our empirical studies also reveal that this bias impairs the models' ability to recall distant information and introduces robustness issues. Our scaling experiments then show that deeper structures in SSMs can facilitate the learning of long contexts. However, subsequent theoretical analysis reveals that as SSMs increase in depth, they exhibit another inevitable tendency toward over-smoothing, i.e., token representations become increasingly indistinguishable. This fundamental dilemma between recency and over-smoothing hinders the scalability of existing SSMs. Inspired by our theoretical findings, we propose to polarize two channels of the state transition matrices in SSMs, setting them to zero and one, respectively, which simultaneously addresses recency bias and over-smoothing. Experiments demonstrate that our polarization technique consistently enhances the associative recall accuracy of long-range tokens and allows SSMs to benefit further from deeper architectures. The source code is released at https://github.com/VITA-Group/SSM-Bottleneck.
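To make the recency argument and the polarization idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a simplified scalar recurrence per channel, `h_t = a_t * h_{t-1} + x_t`, with decay values `a_t` strictly between 0 and 1. The function names `influence_weights` and `polarize`, the choice of which two channels to overwrite, and the random decay values are all hypothetical and chosen for illustration only.

```python
import random


def influence_weights(decays):
    """Weight of each input x_s on the final state of the scalar recurrence
    h_t = a_t * h_{t-1} + x_t (assumed simplification of one SSM channel).

    The weight of x_s is the product a_{s+1} * ... * a_{T-1}, so it shrinks
    the further s lies from the end of the sequence whenever every a_t < 1.
    """
    weights = [1.0] * len(decays)
    running = 1.0
    for s in range(len(decays) - 1, -1, -1):
        weights[s] = running
        running *= decays[s]
    return weights


def polarize(decays_per_channel):
    """Return a copy in which one channel is fixed to 0 and another to 1.

    Which channels are overwritten (first and last here) is an assumption
    made for this sketch; the abstract only states that two channels are
    set to zero and one, respectively.
    """
    out = [list(channel) for channel in decays_per_channel]
    length = len(out[0])
    out[0] = [0.0] * length   # keeps only the current token
    out[-1] = [1.0] * length  # keeps every past token with equal weight
    return out


random.seed(0)
T = 32
# Hypothetical decay values for four channels.
decays = [[random.uniform(0.8, 0.95) for _ in range(T)] for _ in range(4)]

weights_plain = influence_weights(decays[1])
polarized = polarize(decays)
weights_zero = influence_weights(polarized[0])
weights_one = influence_weights(polarized[-1])

print("plain channel, first vs last token:", weights_plain[0], weights_plain[-1])
print("zero channel, first vs last token:", weights_zero[0], weights_zero[-1])
print("one channel, first vs last token:", weights_one[0], weights_one[-1])
```

In this toy setting the unmodified channel weights the earliest token far less than the latest one, which mirrors the recency bias described above. The channel fixed at one gives all tokens equal weight, while the channel fixed at zero passes only the current token.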

Code Implementations: 1 repo
