CV | LG | Jan 21, 2025

Parallel Sequence Modeling via Generalized Spatial Propagation Network

arXiv:2501.12381v1 | 4 citations | h-index: 32 | CVPR
Originality: Highly original
AI Analysis

The paper addresses the inefficiency and spatial incoherence of existing attention models in computer vision, and proposes a new mechanism for tasks such as image classification and image generation.

The paper introduces the Generalized Spatial Propagation Network (GSPN), an attention mechanism that processes 2D spatial data directly instead of flattening it into a 1D sequence, which improves both spatial coherence and efficiency. It reports state-of-the-art performance on vision tasks and accelerates SD-XL by over 84x for 16K image generation.

We present the Generalized Spatial Propagation Network (GSPN), a new attention mechanism optimized for vision tasks that inherently captures 2D spatial structures. Existing attention models, including transformers, linear attention, and state-space models like Mamba, process multi-dimensional data as 1D sequences, compromising spatial coherence and efficiency. GSPN overcomes these limitations by directly operating on spatially coherent image data and forming dense pairwise connections through a line-scan approach. Central to GSPN is the Stability-Context Condition, which ensures stable, context-aware propagation across 2D sequences and reduces the effective sequence length to $\sqrt{N}$ for a square map with $N$ elements, significantly enhancing computational efficiency. With learnable, input-dependent weights and no reliance on positional embeddings, GSPN achieves superior spatial fidelity and state-of-the-art performance in vision tasks, including ImageNet classification, class-guided image generation, and text-to-image generation. Notably, GSPN accelerates SD-XL with softmax-attention by over $84\times$ when generating 16K images.
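To make the line-scan idea concrete, here is a minimal NumPy sketch of a single top-to-bottom scan. The abstract does not spell out the propagation rule, so several things here are assumptions made for illustration: each pixel draws on three neighbors in the previous row, the weights are normalized with a softmax so they sum to 1 (one plausible way to keep propagation stable), and the names `line_scan`, `logits`, and `lam` are hypothetical. It is not the authors' implementation.

```python
import numpy as np


def line_scan(x, logits, lam):
    """Illustrative top-to-bottom line scan over an (H, W) map.

    x      : (H, W) input values
    logits : (H, W, 3) unnormalized weights toward the left, center and
             right neighbors in the previous row (assumed connectivity)
    lam    : (H, W) per-pixel gate on the current input (assumed)
    """
    H, W = x.shape

    # Softmax over the three neighbors (assumption): each pixel's
    # propagation weights are non-negative and sum to 1.
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w[:, 0, 0] = 0.0   # no left neighbor at the left border
    w[:, -1, 2] = 0.0  # no right neighbor at the right border
    w /= w.sum(axis=-1, keepdims=True)

    h = np.zeros((H, W), dtype=float)
    h[0] = lam[0] * x[0]

    # Only H sequential steps are needed; every pixel within a row is
    # updated at once. For a square map, H = sqrt(N).
    for i in range(1, H):
        prev = np.pad(h[i - 1], 1)  # zero-pad so border slices line up
        left, center, right = prev[:-2], prev[1:-1], prev[2:]
        h[i] = (w[i, :, 0] * left
                + w[i, :, 1] * center
                + w[i, :, 2] * right
                + lam[i] * x[i])
    return h
```

Under these assumptions, a 128 x 128 map ($N = 16384$) needs 128 sequential row updates rather than 16384 sequential token updates. Because the weights for each pixel sum to 1, a constant signal in the first row is carried down the map without growing or decaying, which is the kind of behavior a stability condition is meant to guarantee.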

