LG · Oct 2, 2025

Poolformer: Recurrent Networks with Pooling for Long-Sequence Modeling

arXiv:2510.02206v1
Originality: Incremental advance
AI Analysis

The paper addresses the computational bottleneck of long-sequence modeling, with potential applications in audio, text, and vision. The contribution appears incremental, since it builds on existing recurrent and pooling techniques.

The authors address the quadratic scaling of self-attention on long sequences by introducing Poolformer, which replaces self-attention with recurrent layers and adds pooling operations to shorten the sequence. They report faster training, improved perceptual metrics (FID and IS), and better results than state-of-the-art models such as SaShiMi and Mamba on raw audio.
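To make the scaling argument concrete, here is a minimal counting sketch. It is an illustration only, not taken from the paper: the helper names `attention_pairs` and `recurrent_steps` are hypothetical, and the counts ignore constant factors such as hidden size and number of heads.

```python
def attention_pairs(n: int) -> int:
    """Query-key score entries computed by full self-attention over n positions."""
    return n * n


def recurrent_steps(n: int) -> int:
    """State updates performed by a single recurrent layer over n positions."""
    return n


if __name__ == "__main__":
    # Sequence lengths chosen for illustration only.
    for n in (1_000, 10_000, 100_000):
        print(f"n={n:>7}  attention={attention_pairs(n):>15,}  recurrent={recurrent_steps(n):>8,}")
```

Doubling the sequence length quadruples the attention count but only doubles the recurrent count, which is why long inputs such as raw audio favour the recurrent formulation.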

Sequence-to-sequence models have become central in Artificial Intelligence, particularly following the introduction of the transformer architecture. While initially developed for Natural Language Processing, these models have demonstrated utility across domains, including Computer Vision. Such models require mechanisms to exchange information along the time dimension, typically using recurrent or self-attention layers. However, self-attention scales quadratically with sequence length, limiting its practicality for very long sequences. We introduce Poolformer, a sequence-to-sequence model that replaces self-attention with recurrent layers and incorporates pooling operations to reduce sequence length. Poolformer is defined recursively using SkipBlocks, which contain residual blocks, a down-pooling layer, a nested SkipBlock, an up-pooling layer, and additional residual blocks. We conduct extensive experiments to support our architectural choices. Our results show that pooling greatly accelerates training, improves perceptual metrics (FID and IS), and prevents overfitting. Our experiments also suggest that long-range dependencies are handled by deep layers, while shallow layers take care of short-term features. Evaluated on raw audio, which naturally features long sequence lengths, Poolformer outperforms state-of-the-art models such as SaShiMi and Mamba. Future directions include applications to text and vision, as well as multi-modal scenarios, where a Poolformer-based LLM could effectively process dense representations of images and videos.
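The recursive SkipBlock definition in the abstract can be sketched structurally as follows. This is a toy sketch under stated assumptions, not the authors' implementation: sequences are plain lists of floats, `residual_block` uses a simple exponential-moving-average recurrence as a stand-in for the paper's recurrent layers, pooling is average-down and repeat-up, and the additive skip connection around the pooled path is assumed from the block's name. All function names and parameters are hypothetical.

```python
from typing import List

Seq = List[float]


def residual_block(x: Seq, decay: float = 0.5) -> Seq:
    """Toy residual block: input plus a linear recurrence over time (assumed stand-in)."""
    h = 0.0
    out = []
    for v in x:
        h = decay * h + (1.0 - decay) * v  # recurrent state update
        out.append(v + h)                  # residual connection
    return out


def down_pool(x: Seq, factor: int) -> Seq:
    """Average over non-overlapping windows, shortening the sequence by `factor`."""
    assert len(x) % factor == 0, "length must be divisible by the pooling factor"
    return [sum(x[i:i + factor]) / factor for i in range(0, len(x), factor)]


def up_pool(x: Seq, factor: int) -> Seq:
    """Repeat each element `factor` times, restoring the original length."""
    return [v for v in x for _ in range(factor)]


def skip_block(x: Seq, factors: List[int], n_res: int = 1) -> Seq:
    """Residual blocks, down-pool, nested SkipBlock, up-pool, residual blocks."""
    for _ in range(n_res):
        x = residual_block(x)
    if not factors:
        return x  # innermost level: no further pooling (assumed base case)
    inner = down_pool(x, factors[0])
    inner = skip_block(inner, factors[1:], n_res)  # recursion on the shorter sequence
    inner = up_pool(inner, factors[0])
    x = [a + b for a, b in zip(x, inner)]  # assumed additive skip connection
    for _ in range(n_res):
        x = residual_block(x)
    return x


if __name__ == "__main__":
    signal = [float(i) for i in range(16)]
    print(len(skip_block(signal, [2, 4])))
```

With pooling factors `[2, 4]`, a 16-sample input is processed at lengths 16, 8, and 2, so the deepest layers see the shortest sequence. This matches the abstract's observation that deep layers handle long-range dependencies while shallow layers handle short-term features.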

