LG · Oct 30, 2023

Convolutional State Space Models for Long-Range Spatiotemporal Modeling

Jimmy T. H. Smith, Shalini De Mello, Jan Kautz, Scott W. Linderman, Wonmin Byeon

arXiv:2310.19694v118.031 citations · h-index: 31

Originality Highly original

AI Analysis

This work addresses efficient long-range spatiotemporal modeling for applications such as video prediction. It proposes a method that trains faster than ConvLSTM and generates samples faster than Transformers, while outperforming both on a long-horizon Moving-MNIST experiment.

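The paper tackles the challenge of modeling long spatiotemporal sequences by introducing convolutional state space models (ConvSSMs) and a specific variant, ConvS5. ConvSSMs combine the tensor modeling ideas of ConvLSTM with the long-sequence modeling approaches of state space methods such as S4 and S5, which allows subquadratic parallelization via parallel scans and fast autoregressive generation. On a long-horizon Moving-MNIST experiment, ConvS5 outperforms both Transformers and ConvLSTM while training 3X faster than ConvLSTM and generating samples 400X faster than Transformers. It also matches or exceeds state-of-the-art methods on the DMLab, Minecraft, and Habitat prediction benchmarks.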

Effectively modeling long spatiotemporal sequences is challenging due to the need to model complex spatial correlations and long-range temporal dependencies simultaneously. ConvLSTMs attempt to address this by updating tensor-valued states with recurrent neural networks, but their sequential computation makes them slow to train. In contrast, Transformers can process an entire spatiotemporal sequence, compressed into tokens, in parallel. However, the cost of attention scales quadratically in length, limiting their scalability to longer sequences. Here, we address the challenges of prior methods and introduce convolutional state space models (ConvSSM) that combine the tensor modeling ideas of ConvLSTM with the long sequence modeling approaches of state space methods such as S4 and S5. First, we demonstrate how parallel scans can be applied to convolutional recurrences to achieve subquadratic parallelization and fast autoregressive generation. We then establish an equivalence between the dynamics of ConvSSMs and SSMs, which motivates parameterization and initialization strategies for modeling long-range dependencies. The result is ConvS5, an efficient ConvSSM variant for long-range spatiotemporal modeling. ConvS5 significantly outperforms Transformers and ConvLSTM on a long horizon Moving-MNIST experiment while training 3X faster than ConvLSTM and generating samples 400X faster than Transformers. In addition, ConvS5 matches or exceeds the performance of state-of-the-art methods on challenging DMLab, Minecraft and Habitat prediction benchmarks and enables new directions for modeling long spatiotemporal sequences.
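The parallel-scan idea mentioned in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a simplified recurrence x_k = a_k * x_{k-1} + b_k in which the state transition a_k acts pointwise per channel (standing in for a state convolution) and b_k is an already-computed input term, for example a convolved input frame. The function names `combine`, `sequential_scan`, and `parallel_scan` are hypothetical. The point it shows is that composing affine updates is associative, so all T states can be obtained in about log2(T) vectorized rounds instead of T sequential steps.

```python
import numpy as np


def combine(earlier, later):
    """Compose two affine maps x -> a*x + b, applying `earlier` first."""
    a1, b1 = earlier
    a2, b2 = later
    # later(earlier(x)) = a2*(a1*x + b1) + b2
    return a2 * a1, a2 * b1 + b2


def sequential_scan(a, b):
    """Reference recurrence x_k = a_k * x_{k-1} + b_k with x_0 = 0, one step at a time."""
    x = np.zeros_like(b[0])
    states = []
    for a_k, b_k in zip(a, b):
        x = a_k * x + b_k
        states.append(x)
    return np.stack(states)


def parallel_scan(a, b):
    """Hillis-Steele inclusive scan over the time axis.

    Assumption: `a` has shape (T, 1, 1, C) (per-channel, pointwise transition)
    and `b` has shape (T, H, W, C) (tensor-valued input terms).
    Each round is vectorized over time; there are ceil(log2(T)) rounds.
    """
    a, b = a.copy(), b.copy()
    T = a.shape[0]
    shift = 1
    while shift < T:
        # Position k (k >= shift) absorbs the partial composition ending at k - shift.
        new_a, new_b = combine((a[:-shift], b[:-shift]), (a[shift:], b[shift:]))
        a[shift:], b[shift:] = new_a, new_b
        shift *= 2
    # With x_0 = 0, the offset of the composed map at position k equals x_k.
    return b
```

Calling `parallel_scan(a, b)` on arrays of the shapes noted in the docstring should return the same states as `sequential_scan(a, b)`. This doubling scheme performs more total work than a work-efficient scan, and it was chosen here only because it is short to write down.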
