LG CL · Dec 1, 2022

Simplifying and Understanding State Space Models with Diagonal Linear RNNs

Ankit Gupta, Harsh Mehta, Jonathan Berant

DeepMind, IBM

arXiv:2212.00768v3 · 19.223 citations · h-index: 59 · Has Code

Originality: Incremental advance

AI Analysis

This work simplifies sequence modeling for researchers and practitioners by removing the discretization step in SSMs, though it is incremental as it builds on existing SSM frameworks.

The paper tackles the complexity of state space models (SSMs) by proposing a simpler alternative based on Diagonal Linear RNNs (DLR). DLR performs comparably to previously proposed SSMs on benchmarks such as Long Range Arena and raw speech classification, and reaches high performance on the long-sequence tasks ListOpsSubTrees (8K tokens) and PathfinderSegmentation-256 (65K tokens).

Sequence models based on linear state spaces (SSMs) have recently emerged as a promising choice of architecture for modeling long range dependencies across various modalities. However, they invariably rely on discretization of a continuous state space, which complicates their presentation and understanding. In this work, we dispose of the discretization step, and propose a model based on vanilla Diagonal Linear RNNs ($\mathrm{DLR}$). We empirically show that, despite being conceptually much simpler, $\mathrm{DLR}$ is as performant as previously-proposed SSMs on a variety of tasks and benchmarks including Long Range Arena and raw speech classification. Moreover, we characterize the expressivity of SSMs (including $\mathrm{DLR}$) and attention-based models via a suite of $13$ synthetic sequence-to-sequence tasks involving interactions over tens of thousands of tokens, ranging from simple operations, such as shifting an input sequence, to detecting co-dependent visual features over long spatial ranges in flattened images. We find that while SSMs report near-perfect performance on tasks that can be modeled via $\textit{few}$ convolutional kernels, they struggle on tasks requiring $\textit{many}$ such kernels and especially when the desired sequence manipulation is $\textit{context-dependent}$. Despite these limitations, $\mathrm{DLR}$ reaches high performance on two higher-order reasoning tasks $\mathrm{ListOpsSubTrees}$ and $\mathrm{PathfinderSegmentation}\text{-}\mathrm{256}$ with input lengths $8K$ and $65K$ respectively, and gives encouraging performance on $\mathrm{PathfinderSegmentation}\text{-}\mathrm{512}$ with input length $262K$ for which attention is not a viable choice.
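As a rough illustration of why a diagonal linear RNN can be viewed as a single convolution kernel, the sketch below assumes a recurrence of the form $x_k = \lambda \odot x_{k-1} + u_k$ with output $y_k = \mathrm{Re}\langle w, x_k \rangle$. This form is an assumption made for the example and may differ from the exact parameterization in the paper. The function names (`dlr_recurrent`, `dlr_kernel`, `dlr_convolutional`) and all parameter values are hypothetical and are not taken from the authors' code. Under that assumption, unrolling the recurrence gives the kernel $K_j = \sum_i w_i \lambda_i^j$, and both routes should produce the same outputs.

```python
import cmath
import random


def dlr_recurrent(lam, w, u):
    """Run the assumed diagonal linear RNN one step at a time."""
    state = [0j] * len(lam)
    outputs = []
    for u_k in u:
        # Diagonal transition: every state coordinate evolves independently.
        state = [lam_i * x_i + u_k for lam_i, x_i in zip(lam, state)]
        # Read out the real part of the weighted sum of the state.
        outputs.append(sum(w_i * x_i for w_i, x_i in zip(w, state)).real)
    return outputs


def dlr_kernel(lam, w, length):
    """Kernel obtained by unrolling the recurrence: K[j] = sum_i w_i * lam_i**j."""
    return [sum(w_i * lam_i ** j for w_i, lam_i in zip(w, lam)) for j in range(length)]


def dlr_convolutional(lam, w, u):
    """Compute the same outputs as a causal convolution of u with the kernel."""
    kernel = dlr_kernel(lam, w, len(u))
    return [
        sum(kernel[j] * u[k - j] for j in range(k + 1)).real
        for k in range(len(u))
    ]


# Hypothetical toy setup: 4 complex state coordinates, 32 real-valued inputs.
random.seed(0)
n_states, seq_len = 4, 32
# Eigenvalues inside the unit disk so that powers of lam stay bounded.
lam = [
    cmath.rect(random.uniform(0.5, 0.99), random.uniform(0.0, cmath.pi))
    for _ in range(n_states)
]
w = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(n_states)]
u = [random.gauss(0, 1) for _ in range(seq_len)]

y_rec = dlr_recurrent(lam, w, u)
y_conv = dlr_convolutional(lam, w, u)
max_gap = max(abs(a - b) for a, b in zip(y_rec, y_conv))
print("largest difference between the two computations:", max_gap)
```

The naive convolution here costs quadratic time in the sequence length and is only meant to make the equivalence visible. A layer with one such kernel per channel applies a fixed, input-independent filter, which is consistent with the abstract's observation that tasks needing many kernels or context-dependent manipulation are harder for these models.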
