Serpent: Scalable and Efficient Image Restoration via Multi-scale Structured State Space Models
Serpent addresses the computational inefficiency of existing image restoration methods, especially at high resolutions, offering significant compute and memory savings for applications such as photography and medical imaging.
The paper tackles efficient high-resolution image restoration by proposing Serpent, an architecture that combines state space models with multi-scale processing. Serpent achieves reconstruction quality on par with state-of-the-art methods while reducing FLOPs by up to 150-fold and GPU memory by up to 5-fold.
Efficient image restoration architectures are built predominantly from a combination of convolutional processing and various attention mechanisms. However, convolutional filters, while efficient, are inherently local and therefore struggle to model long-range dependencies in images. In contrast, attention excels at capturing global interactions between arbitrary image regions, but its cost scales quadratically with image size. In this work, we propose Serpent, an efficient architecture for high-resolution image restoration that combines recent advances in state space models (SSMs) with multi-scale signal processing in its core computational block. SSMs, originally introduced for sequence modeling, can maintain a global receptive field while scaling linearly in input size. We propose a novel hierarchical architecture, inspired by traditional signal processing principles, that converts the input image into a collection of sequences and processes them in a multi-scale fashion. Our experimental results demonstrate that Serpent can achieve reconstruction quality on par with state-of-the-art techniques while requiring orders of magnitude less compute (up to a $150\times$ reduction in FLOPs) and up to $5\times$ less GPU memory, all while maintaining a compact model size. The efficiency gains achieved by Serpent are especially notable at high image resolutions.
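To make the idea of "converting the image into sequences and processing them at multiple scales" concrete, the following is a minimal toy sketch, not the Serpent block itself. It assumes a row-major flattening, 2x2 average pooling between scales, and a scalar linear recurrence standing in for an SSM; all function names and parameters (`flatten_image`, `downsample`, `ssm_scan`, `multiscale_scan`, `a`, `b`, `c`) are hypothetical and chosen only for illustration.

```python
def flatten_image(image):
    """Row-major flattening of a 2D list of pixels into a 1D sequence (assumed ordering)."""
    return [pixel for row in image for pixel in row]


def downsample(image, factor=2):
    """Average-pool a 2D list by `factor`; assumes both dimensions are divisible by it."""
    height, width = len(image), len(image[0])
    return [
        [
            sum(
                image[i + di][j + dj]
                for di in range(factor)
                for dj in range(factor)
            ) / factor ** 2
            for j in range(0, width, factor)
        ]
        for i in range(0, height, factor)
    ]


def ssm_scan(sequence, a=0.9, b=1.0, c=1.0):
    """Toy scalar state space recurrence: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.

    A single pass over the sequence, so the cost is linear in its length,
    yet every output depends on all inputs that came before it.
    """
    state = 0.0
    outputs = []
    for x in sequence:
        state = a * state + b * x
        outputs.append(c * state)
    return outputs


def multiscale_scan(image, num_scales=2):
    """Run the toy scan on the image at successively coarser scales (illustrative only)."""
    results = []
    current = image
    for _ in range(num_scales):
        results.append(ssm_scan(flatten_image(current)))
        current = downsample(current)
    return results


if __name__ == "__main__":
    toy_image = [[1.0] * 4 for _ in range(4)]
    per_scale = multiscale_scan(toy_image, num_scales=2)
    print([len(seq) for seq in per_scale])
```

For a 4x4 input and two scales, the scan visits 16 pixels and then 4 pixels, one step each. By contrast, full self-attention over the same 16 pixels would form 16 x 16 = 256 pairwise interactions, which is the quadratic growth the abstract refers to.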