Efficient Visual State Space Model for Image Deblurring
This work addresses the quadratic computational cost that limits the use of Vision Transformers in high-resolution image restoration, offering an incremental efficiency improvement for image deblurring.
The authors address the high computational complexity of Vision Transformers in image deblurring by proposing an efficient visual state space model (EVSSM) that combines an efficient visual scan block with a frequency domain-based feedforward network. The model performs favorably against state-of-the-art methods on benchmark datasets and real-world images.
Convolutional neural networks (CNNs) and Vision Transformers (ViTs) have achieved excellent performance in image restoration. While ViTs generally outperform CNNs by effectively capturing long-range dependencies and input-specific characteristics, their computational complexity increases quadratically with image resolution. This limitation hampers their practical application in high-resolution image restoration. In this paper, we propose a simple yet effective visual state space model (EVSSM) for image deblurring, leveraging the benefits of state space models (SSMs) for visual data. Existing methods scan the features in several fixed directions for feature extraction, which significantly increases the computational cost. In contrast, we develop an efficient visual scan block that applies various geometric transformations before each SSM-based module, capturing useful non-local information while maintaining high efficiency. In addition, to capture and represent local information more effectively, we propose an efficient discriminative frequency domain-based feedforward network (EDFFN), which estimates useful frequency information for latent clear image restoration. Extensive experimental results show that the proposed EVSSM performs favorably against state-of-the-art methods on benchmark datasets and real-world images. The code is available at https://github.com/kkkls/EVSSM.
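The scanning idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: a scalar linear recurrence stands in for the SSM, and the function names (`scan`, `scan_block`, `run_blocks`) and the transform schedule (identity, transpose, flip) are assumptions chosen for illustration. The point it shows is that each block runs a single linear-time scan after one geometric transformation, rather than several scans in fixed directions, while successive blocks still let information travel across the image in different directions.

```python
# Toy illustration only: a scalar recurrence stands in for the SSM, and the
# transform schedule below is an assumption, not the authors' design.

def flatten(grid):
    """Row-major flattening of a 2D grid into a 1D sequence."""
    return [v for row in grid for v in row]

def unflatten(seq, height, width):
    """Inverse of flatten for a grid of the given size."""
    return [seq[r * width:(r + 1) * width] for r in range(height)]

def identity(grid):
    return grid

def transpose(grid):
    """Swap rows and columns, so a row-major scan walks down the columns."""
    return [list(col) for col in zip(*grid)]

def flip(grid):
    """Flip both axes, so a row-major scan runs from the last pixel to the first."""
    return [row[::-1] for row in grid[::-1]]

def scan(seq, a=0.5, b=1.0):
    """Causal recurrence h_t = a * h_{t-1} + b * x_t; one pass, O(len(seq))."""
    out, h = [], 0.0
    for x in seq:
        h = a * h + b * x
        out.append(h)
    return out

def scan_block(grid, transform, inverse):
    """Apply a geometric transform, run ONE scan, then undo the transform."""
    t = transform(grid)
    height, width = len(t), len(t[0])
    y = unflatten(scan(flatten(t)), height, width)
    return inverse(y)

# Hypothetical schedule: one (transform, inverse) pair per successive block.
SCHEDULE = [(identity, identity), (transpose, transpose), (flip, flip)]

def run_blocks(grid):
    """Chain the blocks; each one costs a single scan over all pixels."""
    for transform, inverse in SCHEDULE:
        grid = scan_block(grid, transform, inverse)
    return grid
```

For example, with an impulse in the bottom-right pixel, `scan_block([[0, 0], [0, 1]], identity, identity)` leaves the other pixels at zero because the scan is causal, whereas `scan_block([[0, 0], [0, 1]], flip, flip)` spreads the impulse back to the top-left pixel. Varying the transformation from block to block is what lets a single scan per block cover directions that would otherwise need separate scans.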