Dual-former: Hybrid Self-attention Transformer for Efficient Image Restoration
This work addresses the high computational complexity of image restoration transformers, offering researchers and practitioners an incremental improvement in both efficiency and performance.
The paper tackles efficient image restoration by proposing Dual-former, a hybrid self-attention transformer that combines global and local modeling. It achieves a 1.91 dB gain over SOTA on dehazing with only 4.2% of the GFLOPs, exceeds SOTA in deraining with only 21.5% of the GFLOPs, and surpasses the latest desnowing method with fewer parameters.
Recently, image restoration transformers have achieved performance comparable to that of previous state-of-the-art CNNs. However, how to leverage such architectures efficiently remains an open problem. In this work, we present Dual-former, whose key insight is to combine the powerful global modeling ability of self-attention modules with the local modeling ability of convolutions in a single overall architecture. With convolution-based Local Feature Extraction modules placed in the encoder and the decoder, we adopt a novel Hybrid Transformer Block only in the latent layer, where it models long-range dependencies in the spatial dimensions and handles the uneven distribution between channels. This design avoids the substantial computational complexity of previous image restoration transformers and achieves superior performance on multiple image restoration tasks. Experiments demonstrate that Dual-former achieves a 1.91 dB gain over the state-of-the-art MAXIM method on the Indoor dataset for single image dehazing while consuming only 4.2% of MAXIM's GFLOPs. For single image deraining, it exceeds the SOTA method by 0.1 dB PSNR on average across five datasets with only 21.5% of the GFLOPs. Dual-former also substantially surpasses the latest desnowing method on various datasets, with fewer parameters.
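
To give a rough sense of why restricting self-attention to the latent layer reduces cost, the following back-of-the-envelope sketch compares the multiply-accumulate count of one global self-attention layer at full resolution with the same layer applied after downsampling. The image size, downsampling factor, and channel widths are hypothetical values chosen for illustration, not taken from the paper. The cost model counts only the two N x N matrix products of standard attention, and is not meant to reproduce the reported GFLOPs figures.

```python
def attention_macs(height, width, channels):
    """Approximate multiply-accumulates for one global self-attention layer.

    Counts only the two N x N matrix products (Q @ K^T and attention @ V),
    where N = height * width is the number of tokens. Projections, softmax
    and multi-head bookkeeping are ignored.
    """
    tokens = height * width
    return 2 * tokens * tokens * channels


# Hypothetical sizes for illustration only (assumptions, not from the paper).
H, W = 256, 256             # input resolution
DOWNSAMPLE = 8              # assumed total encoder stride before the latent layer
C_FULL, C_LATENT = 32, 256  # assumed channel widths at full and latent resolution

full_res = attention_macs(H, W, C_FULL)
latent = attention_macs(H // DOWNSAMPLE, W // DOWNSAMPLE, C_LATENT)
ratio = latent / full_res

print(f"full-resolution attention: {full_res:.3e} MACs")
print(f"latent-only attention:     {latent:.3e} MACs")
print(f"latent / full ratio:       {ratio:.6f}")
```

Under this simplified model, downsampling by a factor s in each spatial dimension cuts the token count by s^2 and the attention cost by s^4, which more than offsets the wider channel dimension typically used in the latent layer.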