Guangwei Gao

CV
h-index: 98
49 papers
1,690 citations
Novelty: 51%
AI Score: 60

49 Papers

CV | Apr 28, 2022 | Code
Lightweight Bimodal Network for Single-Image Super-Resolution via Symmetric CNN and Recursive Transformer

Guangwei Gao, Zhengxue Wang, Juncheng Li et al.

Single-image super-resolution (SISR) has achieved significant breakthroughs with the development of deep learning. However, these methods are difficult to apply in real-world scenarios because their complex operations incur high computational and memory costs. To solve this issue, we propose a Lightweight Bimodal Network (LBNet) for SISR. Specifically, an effective Symmetric CNN is designed for local feature extraction and coarse image reconstruction. Meanwhile, we propose a Recursive Transformer to fully learn the long-term dependence of images, so that global information can be fully used to further refine texture details. Studies show that the hybrid of CNN and Transformer can build a more efficient model. Extensive experiments show that our LBNet achieves better performance than other state-of-the-art methods with relatively low computational cost and memory consumption. The code is available at https://github.com/IVIPLab/LBNet.

CV | Apr 19, 2022 | Code
CTCNet: A CNN-Transformer Cooperation Network for Face Image Super-Resolution

Guangwei Gao, Zixiang Xu, Juncheng Li et al.

Recently, face super-resolution methods driven by deep convolutional neural networks (CNNs) have achieved great progress in restoring degraded facial details by jointly training with facial priors. However, these methods have some obvious limitations. On the one hand, multi-task joint learning requires additional annotation of the dataset, and the introduced prior network significantly increases the computational cost of the model. On the other hand, the limited receptive field of CNNs reduces the fidelity and naturalness of the reconstructed facial images, resulting in suboptimal reconstructions. In this work, we propose an efficient CNN-Transformer Cooperation Network (CTCNet) for face super-resolution tasks, which uses a multi-scale connected encoder-decoder architecture as the backbone. Specifically, we first devise a novel Local-Global Feature Cooperation Module (LGCM), which is composed of a Facial Structure Attention Unit (FSAU) and a Transformer block, to promote the consistency of local facial detail and global facial structure restoration simultaneously. Then, we design an efficient Feature Refinement Module (FRM) to enhance the encoded features. Finally, to further improve the restoration of fine facial details, we present a Multi-scale Feature Fusion Unit (MFFU) to adaptively fuse the features from different stages of the encoder. Extensive evaluations on various datasets show that the proposed CTCNet can outperform other state-of-the-art methods significantly. Source code will be available at https://github.com/IVIPLab/CTCNet.

CV | Dec 29, 2022 | Code
Efficient Image Super-Resolution with Feature Interaction Weighted Hybrid Network

Wenjie Li, Juncheng Li, Guangwei Gao et al.

Lightweight image super-resolution aims to reconstruct high-resolution images from low-resolution images at low computational cost. However, existing methods lose middle-layer features due to activation functions. To minimize the impact of intermediate feature loss on reconstruction quality, we propose a Feature Interaction Weighted Hybrid Network (FIWHN), which comprises a series of Wide-residual Distillation Interaction Blocks (WDIBs) as the backbone. Every third WDIB forms a Feature Shuffle Weighted Group (FSWG) by applying mutual information shuffle and fusion. Moreover, to mitigate the negative effects of intermediate feature loss, we introduce Wide Residual Weighting units within each WDIB. These units effectively fuse features of varying levels of detail through a Wide-residual Distillation Connection (WRDC) and a Self-Calibrating Fusion (SCF). To compensate for global feature deficiencies, we incorporate a Transformer and explore a novel architecture to combine CNN and Transformer. We show that our FIWHN achieves a favorable balance between performance and efficiency through extensive experiments on low-level and high-level tasks. Code will be available at https://github.com/IVIPLab/FIWHN.

CV | Feb 21, 2023 | Code
Lightweight Real-time Semantic Segmentation Network with Efficient Transformer and CNN

Guoan Xu, Juncheng Li, Guangwei Gao et al.

In the past decade, convolutional neural networks (CNNs) have shown prominence for semantic segmentation. Although CNN models have very impressive performance, their ability to capture global representations is still insufficient, which leads to suboptimal results. Transformers have recently achieved huge success in NLP tasks, demonstrating their advantages in modeling long-range dependencies. They have also attracted tremendous attention from computer vision researchers, who reformulate image processing tasks as sequence-to-sequence prediction, but at the cost of degraded local feature details. In this work, we propose a lightweight real-time semantic segmentation network called LETNet. LETNet combines a U-shaped CNN with a Transformer effectively in a capsule embedding style to compensate for their respective deficiencies. Meanwhile, the elaborately designed Lightweight Dilated Bottleneck (LDB) module and Feature Enhancement (FE) module together have a positive impact on training from scratch. Extensive experiments performed on challenging datasets demonstrate that LETNet achieves a superior balance between accuracy and efficiency. Specifically, it contains only 0.95M parameters and 13.6G FLOPs but yields 72.8% mIoU at 120 FPS on the Cityscapes test set and 70.5% mIoU at 250 FPS on the CamVid test dataset using a single RTX 3090 GPU. The source code will be available at https://github.com/IVIPLab/LETNet.
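
The mIoU figures quoted in this abstract follow the usual definition: intersection-over-union computed per class and then averaged over classes. A minimal sketch in plain Python, assuming flat lists of integer class labels; the function name `mean_iou` and the toy labels are illustrative and are not taken from the LETNet code. Classes that appear in neither the prediction nor the ground truth are skipped, which is one common convention among several.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for c in range(num_classes):
        # Pixels where both agree on class c, and pixels where either says c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union == 0:
            continue  # class absent everywhere: skip instead of counting 0 or 1
        ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example (hypothetical labels): per-class IoUs are 1/3, 2/3 and 1/2.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
```

Benchmark toolkits usually accumulate a confusion matrix over the whole test set before dividing, so per-image averages computed this way can differ from reported dataset-level numbers.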

CV | Jul 6, 2022 | Code
Cross-receptive Focused Inference Network for Lightweight Image Super-Resolution

Wenjie Li, Juncheng Li, Guangwei Gao et al.

Recently, Transformer-based methods have shown impressive performance in single image super-resolution (SISR) tasks due to their ability to extract global features. However, the need for Transformers to incorporate contextual information in order to extract features dynamically has been neglected. To address this issue, we propose a lightweight Cross-receptive Focused Inference Network (CFIN) that consists of a cascade of CT Blocks that mix CNN and Transformer. Specifically, in the CT Block, we first propose a CNN-based Cross-Scale Information Aggregation Module (CIAM) to enable the model to better focus on potentially helpful information and so improve the efficiency of the Transformer phase. Then, we design a novel Cross-receptive Field Guided Transformer (CFGT) to enable the selection of contextual information required for reconstruction by using a modulated convolutional kernel that understands the current semantic information and exploits the information interaction within different self-attention. Extensive experiments show that our proposed CFIN can effectively reconstruct images using contextual information and that, as an efficient model, it strikes a good balance between computational cost and model performance. Source code will be available at https://github.com/IVIPLab/CFIN.

CV | Mar 24, 2023 | Code
PFT-SSR: Parallax Fusion Transformer for Stereo Image Super-Resolution

Hansheng Guo, Juncheng Li, Guangwei Gao et al.

Stereo image super-resolution aims to boost the performance of image super-resolution by exploiting the supplementary information provided by binocular systems. Although previous methods have achieved promising results, they did not fully utilize cross-view and intra-view information. To further unleash the potential of binocular images, in this letter, we propose a novel Transformer-based parallax fusion module called Parallax Fusion Transformer (PFT). PFT employs a Cross-view Fusion Transformer (CVFT) to utilize cross-view information and an Intra-view Refinement Transformer (IVRT) for intra-view feature refinement. Meanwhile, we adopt the Swin Transformer as the backbone for feature extraction and SR reconstruction to form a pure Transformer architecture called PFT-SSR. Extensive experiments and ablation studies show that PFT-SSR achieves competitive results and outperforms most SOTA methods. Source code is available at https://github.com/MIVRC/PFT-PyTorch.

CV | Apr 13, 2023
EWT: Efficient Wavelet-Transformer for Single Image Denoising

Juncheng Li, Bodong Cheng, Ying Chen et al.

Transformer-based image denoising methods have achieved encouraging results in the past year. However, they must use linear operations to model long-range dependencies, which greatly increases model inference time and consumes GPU memory. Compared with convolutional neural network-based methods, current Transformer-based image denoising methods cannot achieve a balance between performance improvement and resource consumption. In this paper, we propose an Efficient Wavelet Transformer (EWT) for image denoising. Specifically, we use Discrete Wavelet Transform (DWT) and Inverse Wavelet Transform (IWT) for downsampling and upsampling, respectively. This method can fully preserve the image features while reducing the image resolution, thereby greatly reducing the device resource consumption of the Transformer model. Furthermore, we propose a novel Dual-stream Feature Extraction Block (DFEB) to extract image features at different levels, which can further reduce model inference time and GPU memory usage. Experiments show that our method speeds up the original Transformer by more than 80%, reduces GPU memory usage by more than 60%, and achieves excellent denoising results. All code will be public.
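
The claim that DWT/IWT can halve resolution while fully preserving image features can be illustrated with a single-level 2D Haar transform, which is exactly invertible. This is a minimal sketch in plain Python: the abstract does not say which wavelet EWT uses, so Haar is an assumption, and the names `haar_dwt2` and `haar_idwt2` are illustrative.

```python
def haar_dwt2(img):
    """Single-level 2D Haar transform: H x W image -> four (H/2) x (W/2) sub-bands."""
    h, w = len(img), len(img[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    ll, lh, hl, hh = ([[0.0] * (w // 2) for _ in range(h // 2)] for _ in range(4))
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            r, s = i // 2, j // 2
            ll[r][s] = (a + b + c + d) / 2  # low-pass average (the downsampled image)
            lh[r][s] = (a - b + c - d) / 2  # difference across columns
            hl[r][s] = (a + b - c - d) / 2  # difference across rows
            hh[r][s] = (a - b - c + d) / 2  # diagonal difference
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuild the full-resolution image from the sub-bands."""
    h, w = len(ll), len(ll[0])
    img = [[0.0] * (2 * w) for _ in range(2 * h)]
    for r in range(h):
        for s in range(w):
            p, q, u, v = ll[r][s], lh[r][s], hl[r][s], hh[r][s]
            img[2 * r][2 * s] = (p + q + u + v) / 2
            img[2 * r][2 * s + 1] = (p - q + u - v) / 2
            img[2 * r + 1][2 * s] = (p + q - u - v) / 2
            img[2 * r + 1][2 * s + 1] = (p - q - u + v) / 2
    return img


image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
bands = haar_dwt2(image)        # four 2x2 sub-bands from a 4x4 image
restored = haar_idwt2(*bands)   # matches the original 4x4 image
```

Unlike strided convolution or pooling, no information is discarded: the spatial size halves but the four sub-bands together carry the same number of values as the input, which is why a network can process the smaller maps and still recover full resolution.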

CV | Sep 27, 2023 | Code
Survey on Deep Face Restoration: From Non-blind to Blind and Beyond

Wenjie Li, Mei Wang, Kai Zhang et al.

Face restoration (FR) is a specialized field within image restoration that aims to recover low-quality (LQ) face images into high-quality (HQ) face images. Recent advances in deep learning technology have led to significant progress in FR methods. In this paper, we begin by examining the prevalent factors responsible for real-world LQ images and introduce degradation techniques used to synthesize LQ images. We also discuss notable benchmarks commonly utilized in the field. Next, we categorize FR methods based on different tasks and explain their evolution over time. Furthermore, we explore the various facial priors commonly utilized in the restoration process and discuss strategies to enhance their effectiveness. In the experimental section, we thoroughly evaluate the performance of state-of-the-art FR methods across various tasks using a unified benchmark. We analyze their performance from different perspectives. Finally, we discuss the challenges faced in the field of FR and propose potential directions for future advancements. The open-source repository corresponding to this work can be found at https://github.com/24wenjie-li/Awesome-Face-Restoration.

CV | Jul 10, 2024 | Code
HAFormer: Unleashing the Power of Hierarchy-Aware Features for Lightweight Semantic Segmentation

Guoan Xu, Wenjing Jia, Tao Wu et al.

Both Convolutional Neural Networks (CNNs) and Transformers have shown great success in semantic segmentation tasks. Efforts have been made to integrate CNNs with Transformer models to capture both local and global context interactions. However, there is still room for enhancement, particularly when considering constraints on computational resources. In this paper, we introduce HAFormer, a model that combines the hierarchical feature extraction ability of CNNs with the global dependency modeling capability of Transformers to tackle lightweight semantic segmentation challenges. Specifically, we design a Hierarchy-Aware Pixel-Excitation (HAPE) module for adaptive multi-scale local feature extraction. For global perception modeling, we devise an Efficient Transformer (ET) module that streamlines the quadratic calculations associated with traditional Transformers. Moreover, a correlation-weighted Fusion (cwF) module selectively merges diverse feature representations, significantly enhancing predictive accuracy. HAFormer achieves high performance with minimal computational overhead and compact model size, reaching 74.2% mIoU on Cityscapes and 71.1% mIoU on CamVid test datasets, with frame rates of 105 FPS and 118 FPS on a single 2080Ti GPU. The source code is available at https://github.com/XU-GITHUB-curry/HAFormer.

CV | Apr 29, 2022
Multiple Degradation and Reconstruction Network for Single Image Denoising via Knowledge Distillation

Juncheng Li, Hanhui Yang, Qiaosi Yi et al.

Single image denoising (SID) has achieved significant breakthroughs with the development of deep learning. However, the proposed methods often come with a large number of parameters, which greatly limits their application scenarios. Unlike previous works that blindly increase the depth of the network, we explore the degradation mechanism of the noisy image and propose a lightweight Multiple Degradation and Reconstruction Network (MDRN) to progressively remove noise. Meanwhile, we propose two novel Heterogeneous Knowledge Distillation Strategies (HMDS) to enable MDRN to learn richer and more accurate features from heterogeneous models, which makes it possible to reconstruct higher-quality denoised images under extreme conditions. Extensive experiments show that our MDRN achieves favorable performance against other SID models with fewer parameters. Meanwhile, extensive ablation studies demonstrate that the introduced HMDS can improve the performance of tiny models or of the model under high noise levels, which is extremely useful for related applications.

IV | Sep 12, 2024 | Code
OCTAMamba: A State-Space Model Approach for Precision OCTA Vasculature Segmentation

Shun Zou, Zhuo Zhang, Guangwei Gao

Optical Coherence Tomography Angiography (OCTA) is a crucial imaging technique for visualizing retinal vasculature and diagnosing eye diseases such as diabetic retinopathy and glaucoma. However, precise segmentation of OCTA vasculature remains challenging due to the multi-scale vessel structures and noise from poor image quality and eye lesions. In this study, we propose OCTAMamba, a novel U-shaped network based on the Mamba architecture, designed to segment vasculature in OCTA accurately. OCTAMamba integrates a Quad Stream Efficient Mining Embedding Module for local feature extraction, a Multi-Scale Dilated Asymmetric Convolution Module to capture multi-scale vasculature, and a Focused Feature Recalibration Module to filter noise and highlight target areas. Our method achieves efficient global modeling and local feature extraction while maintaining linear complexity, making it suitable for low-computation medical applications. Extensive experiments on the OCTA 3M, OCTA 6M, and ROSSA datasets demonstrate that OCTAMamba outperforms state-of-the-art methods, providing a new reference for efficient OCTA segmentation. Code is available at https://github.com/zs1314/OCTAMamba.

CV | Sep 12, 2024 | Code
MambaMIC: An Efficient Baseline for Microscopic Image Classification with State Space Models

Shun Zou, Zhuo Zhang, Yi Zou et al.

In recent years, CNN and Transformer-based methods have made significant progress in Microscopic Image Classification (MIC). However, existing approaches still face a dilemma between global modeling and efficient computation. While the Selective State Space Model (SSM) can model long-range dependencies with linear complexity, it still encounters challenges in MIC, such as local pixel forgetting, channel redundancy, and lack of local perception. To address these issues, we propose a simple yet efficient vision backbone for MIC tasks, named MambaMIC. Specifically, we introduce a Local-Global dual-branch aggregation module, the MambaMIC Block, designed to effectively capture and fuse local connectivity and global dependencies. In the local branch, we use local convolutions to capture pixel similarity, mitigating local pixel forgetting and enhancing perception. In the global branch, the SSM extracts global dependencies, while a Locally Aware Enhanced Filter reduces channel redundancy and local pixel forgetting. Additionally, we design a Feature Modulation Interaction Aggregation Module for deep feature interaction and key feature re-localization. Extensive benchmarking shows that MambaMIC achieves state-of-the-art performance across five datasets. Code is available at https://zs1314.github.io/MambaMIC.
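
The "linear complexity" attributed to state space models comes from their recurrent form: each output depends on a running state that is updated once per input element. A minimal sketch in plain Python of a scalar, time-invariant linear recurrence; this is a deliberate simplification and not the selective scan used in Mamba, where the parameters depend on the input. The name `ssm_scan` and the parameters `a`, `b`, `c` are illustrative.

```python
def ssm_scan(x, a, b, c):
    """Run h_t = a*h_{t-1} + b*x_t, y_t = c*h_t over a sequence in O(len(x)) time."""
    h = 0.0
    ys = []
    for x_t in x:
        h = a * h + b * x_t   # state update: a decaying memory of past inputs
        ys.append(c * h)      # readout from the current state
    return ys


# Impulse response: one input at t=0 keeps influencing later outputs.
impulse_response = ssm_scan([1.0, 0.0, 0.0, 0.0], a=0.5, b=1.0, c=2.0)
# -> [2.0, 1.0, 0.5, 0.25]
```

The state carries information from arbitrarily far back at constant cost per step, whereas full self-attention compares every pair of positions and so scales quadratically with sequence length.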

CV | Nov 5, 2025 | Code
Transformer-Progressive Mamba Network for Lightweight Image Super-Resolution

Sichen Guo, Wenjie Li, Yuanyang Liu et al.

Recently, Mamba-based super-resolution (SR) methods have demonstrated the ability to capture global receptive fields with linear complexity, addressing the quadratic computational cost of Transformer-based SR approaches. However, existing Mamba-based methods lack fine-grained transitions across different modeling scales, which limits the efficiency of feature representation. In this paper, we propose T-PMambaSR, a lightweight SR framework that integrates window-based self-attention with Progressive Mamba. By enabling interactions among receptive fields of different scales, our method establishes a fine-grained modeling paradigm that progressively enhances feature representation with linear complexity. Furthermore, we introduce an Adaptive High-Frequency Refinement Module (AHFRM) to recover high-frequency details lost during Transformer and Mamba processing. Extensive experiments demonstrate that T-PMambaSR progressively enhances the model's receptive field and expressiveness, yielding better performance than recent Transformer- or Mamba-based methods while incurring lower computational cost. Our codes will be released after acceptance.

CV | Oct 15, 2024 | Code
DORNet: A Degradation Oriented and Regularized Network for Blind Depth Super-Resolution

Zhengxue Wang, Zhiqiang Yan, Jinshan Pan et al.

Recent RGB-guided depth super-resolution methods have achieved impressive performance under the assumption of fixed and known degradation (e.g., bicubic downsampling). However, in real-world scenarios, captured depth data often suffer from unconventional and unknown degradation due to sensor limitations and complex imaging environments (e.g., low reflective surfaces, varying illumination). Consequently, the performance of these methods declines significantly when real-world degradations deviate from their assumptions. In this paper, we propose the Degradation Oriented and Regularized Network (DORNet), a novel framework designed to adaptively address unknown degradation in real-world scenes through implicit degradation representations. Our approach begins with the development of a self-supervised degradation learning strategy, which models the degradation representations of low-resolution depth data using routing selection-based degradation regularization. To facilitate effective RGB-D fusion, we further introduce a degradation-oriented feature transformation module that selectively propagates RGB content into the depth data based on the learned degradation priors. Extensive experimental results on both real and synthetic datasets demonstrate the superiority of our DORNet in handling unknown degradation, outperforming existing methods. The code is available at https://github.com/yanzq95/DORNet.

CV | Aug 11, 2024
MacFormer: Semantic Segmentation with Fine Object Boundaries

Guoan Xu, Wenfeng Huang, Tao Wu et al.

Semantic segmentation involves assigning a specific category to each pixel in an image. While Vision Transformer-based models have made significant progress, current semantic segmentation methods often struggle with precise predictions in localized areas like object boundaries. To tackle this challenge, we introduce a new semantic segmentation architecture, "MacFormer", which features two key components. Firstly, using learnable agent tokens, a Mutual Agent Cross-Attention (MACA) mechanism effectively facilitates the bidirectional integration of features across encoder and decoder layers. This enables better preservation of low-level features, such as elementary edges, during decoding. Secondly, a Frequency Enhancement Module (FEM) in the decoder leverages high-frequency and low-frequency components to boost features in the frequency domain, benefiting object boundaries with minimal computational complexity increase. MacFormer is demonstrated to be compatible with various network architectures and outperforms existing methods in both accuracy and efficiency on benchmark datasets ADE20K and Cityscapes under different computational constraints.

CV | Sep 1, 2024
Attention-Guided Multi-scale Interaction Network for Face Super-Resolution

Xujie Wan, Wenjie Li, Guangwei Gao et al.

Recently, CNN and Transformer hybrid networks have demonstrated excellent performance in face super-resolution (FSR) tasks. Since hybrid networks contain numerous features at different scales, how to fuse these multiscale features and promote their complementarity is crucial for enhancing FSR. However, existing hybrid network-based FSR methods ignore this and simply combine the Transformer and CNN. To address this issue, we propose an Attention-guided Multiscale Interaction Network (AMINet), which incorporates local and global feature interactions, as well as encoder-decoder phase feature interactions. Specifically, we propose a Local and Global Feature Interaction Module (LGFI) to promote the fusion of global features and the local features extracted from different receptive fields by our Residual Depth Feature Extraction Module (RDFE). Additionally, we propose a Selective Kernel Attention Fusion Module (SKAF) to adaptively select fusions of different features within the LGFI and encoder-decoder phases. This design allows the free flow of multiscale features both within modules and between the encoder and decoder, which promotes the complementarity of features at different scales to enhance FSR. Comprehensive experiments confirm that our method consistently performs well with less computational consumption and faster inference.

CV | Sep 27, 2024
ReviveDiff: A Universal Diffusion Model for Restoring Images in Adverse Weather Conditions

Wenfeng Huang, Guoan Xu, Wenjing Jia et al.

Images captured in challenging environments, such as nighttime, smoke, rainy weather, and underwater, often suffer from significant degradation, resulting in a substantial loss of visual quality. The effective restoration of these degraded images is critical for subsequent vision tasks. While many existing approaches have successfully incorporated specific priors for individual tasks, these tailored solutions limit their applicability to other degradations. In this work, we propose a universal network architecture, dubbed "ReviveDiff", which can address various degradations and bring images back to life by enhancing and restoring their quality. Our approach is inspired by the observation that, unlike degradation caused by movement or electronic issues, quality degradation under adverse conditions primarily stems from natural media (such as fog, water, and low luminance), which generally preserves the original structures of objects. To restore the quality of such images, we leveraged the latest advancements in diffusion models and developed ReviveDiff to restore image quality at both macro and micro levels across key factors determining image quality, such as sharpness, distortion, noise level, dynamic range, and color accuracy. We rigorously evaluated ReviveDiff on seven benchmark datasets covering five types of degrading conditions: Rainy, Underwater, Low-light, Smoke, and Nighttime Hazy. Our experimental results demonstrate that ReviveDiff outperforms the state-of-the-art methods both quantitatively and visually.

CV | Apr 15
RSGMamba: Reliability-Aware Self-Gated State Space Model for Multimodal Semantic Segmentation

Guoan Xu, Yang Xiao, Guangwei Gao et al.

Multimodal semantic segmentation has emerged as a powerful paradigm for enhancing scene understanding by leveraging complementary information from multiple sensing modalities (e.g., RGB, depth, and thermal). However, existing cross-modal fusion methods often implicitly assume that all modalities are equally reliable, which can lead to feature degradation when auxiliary modalities are noisy, misaligned, or incomplete. In this paper, we revisit cross-modal fusion from the perspective of modality reliability and propose a novel framework termed the Reliability-aware Self-Gated State Space Model (RSGMamba). At the core of our method is the Reliability-aware Self-Gated Mamba Block (RSGMB), which explicitly models modality reliability and dynamically regulates cross-modal interactions through a self-gating mechanism. Unlike conventional fusion strategies that indiscriminately exchange information across modalities, RSGMB enables reliability-aware feature selection and enhances informative feature aggregation. In addition, a lightweight Local Cross-Gated Modulation (LCGM) is incorporated to refine fine-grained spatial details, complementing the global modeling capability of RSGMB. Extensive experiments demonstrate that RSGMamba achieves state-of-the-art performance on both RGB-D and RGB-T semantic segmentation benchmarks, reaching 58.8% / 54.0% mIoU on NYUDepth V2 and SUN-RGBD (+0.4% / +0.7% over prior best), and 61.1% / 88.9% mIoU on MFNet and PST900 (up to +1.6%), with only 48.6M parameters, thereby validating the effectiveness and superiority of the proposed approach.

IV | Mar 13, 2025 | Code
Dual-domain Modulation Network for Lightweight Image Super-Resolution

Wenjie Li, Heng Guo, Yuefeng Hou et al.

Lightweight image super-resolution (SR) aims to reconstruct high-resolution images from low-resolution images under limited computational costs. We find that existing frequency-based SR methods cannot balance the reconstruction of overall structures and high-frequency parts. Meanwhile, these methods are inefficient at handling frequency features and unsuitable for lightweight SR. In this paper, we show that introducing both wavelet and Fourier information allows our model to consider both high-frequency features and overall SR structure reconstruction while reducing costs. Specifically, we propose a Dual-domain Modulation Network that integrates both wavelet and Fourier information for enhanced frequency modeling. Unlike existing methods that rely on a single frequency representation, our design combines wavelet-domain modulation via a Wavelet-domain Modulation Transformer (WMT) with global Fourier supervision, enabling complementary spectral learning well-suited for lightweight SR. Experimental results show that our method achieves a PSNR comparable to SRFormer and MambaIR with less than 50% and 60% of their FLOPs, respectively, and with inference speeds 15.4x and 5.4x faster, respectively, demonstrating the effectiveness of our method in terms of both SR quality and lightweight design. Code link: https://github.com/24wenjie-li/DMNet

CV | Mar 26
PASDiff: Physics-Aware Semantic Guidance for Joint Real-world Low-Light Face Enhancement and Restoration

Yilin Ni, Wenjie Li, Zhengxue Wang et al.

Face images captured in real-world low light suffer from multiple degradations, including low illumination, blur, noise, and low visibility. Existing cascaded solutions often suffer from severe error accumulation, while generic joint models lack explicit facial priors and struggle to resolve clear face structures. In this paper, we propose PASDiff, a training-free Physics-Aware Semantic Diffusion approach. To achieve a plausible illumination and color distribution, we leverage inverse intensity weighting and Retinex theory to introduce photometric constraints, thereby reliably recovering visibility and natural chromaticity. To faithfully reconstruct facial details, our Style-Agnostic Structural Injection (SASI) extracts structures from an off-the-shelf facial prior while filtering out its intrinsic photometric biases, seamlessly harmonizing identity features with physical constraints. Furthermore, we construct WildDark-Face, a real-world benchmark of 700 low-light facial images with complex degradations. Extensive experiments demonstrate that PASDiff significantly outperforms existing methods, achieving a superior balance among natural illumination, color recovery, and identity consistency.

NA | Apr 20
Dual formulations of geometric curvature flows and their discretizations

Guangwei Gao, Buyang Li, Rong Tang

We propose new formulations of geometric curvature flows, referred to as dual formulations, that are equivalent to the original formulations but provide a novel framework for constructing linearly implicit and energy-stable schemes for curvature-driven surface evolution, including mean curvature flow, surface diffusion, and solid-state dewetting on a substrate with a moving contact line. The dual formulations are derived by introducing, at the continuous level, an additional unknown in the form of a dual multiplier. This augmentation does not alter the continuous dynamics but makes the underlying energy-dissipation structure explicit and, in turn, enables a systematic design of linearly implicit discretizations that inherit energy stability. A key feature of this framework is that it accommodates a broad class of artificial tangential motions which can be used to maintain good mesh quality of the computed surfaces. As an illustration, we combine the framework with the minimal-deformation-rate (MDR) tangential motion, leading to what we call the dual-MDR scheme. The resulting method is linearly implicit and energy-stable, while retaining the MDR tangential motion to maintain good mesh quality. Extensive numerical experiments demonstrate the convergence of the proposed schemes, their structure-preserving properties, and advantages on representative benchmark problems.

CV | Dec 3, 2025
Traffic Image Restoration under Adverse Weather via Frequency-Aware Mamba

Liwen Pan, Longguang Wang, Guangwei Gao et al.

Traffic image restoration under adverse weather conditions remains a critical challenge for intelligent transportation systems. Existing methods primarily focus on spatial-domain modeling but neglect frequency-domain priors. Although the emerging Mamba architecture excels at long-range dependency modeling through patch-wise correlation analysis, its potential for frequency-domain feature extraction remains unexplored. To address this, we propose Frequency-Aware Mamba (FAMamba), a novel framework that integrates frequency guidance with sequence modeling for efficient image restoration. Our architecture consists of two key components: (1) a Dual-Branch Feature Extraction Block (DFEB) that enhances local-global interaction via bidirectional 2D frequency-adaptive scanning, dynamically adjusting traversal paths based on sub-band texture distributions; and (2) a Prior-Guided Block (PGB) that refines texture details through wavelet-based high-frequency residual learning, enabling high-quality image reconstruction with precise details. Meanwhile, we design a novel Adaptive Frequency Scanning Mechanism (AFSM) for the Mamba architecture, which enables the Mamba to achieve frequency-domain scanning across distinct subgraphs, thereby fully leveraging the texture distribution characteristics inherent in subgraph structures. Extensive experiments demonstrate the efficiency and effectiveness of FAMamba.

CV | Oct 14, 2025 | Code
Self-Supervised Selective-Guided Diffusion Model for Old-Photo Face Restoration

Wenjie Li, Xiangyi Wang, Heng Guo et al.

Old-photo face restoration poses significant challenges due to compounded degradations such as breakage, fading, and severe blur. Existing pre-trained diffusion-guided methods either rely on explicit degradation priors or global statistical guidance, which struggle with localized artifacts or face color. We propose Self-Supervised Selective-Guided Diffusion (SSDiff), which leverages pseudo-reference faces generated by a pre-trained diffusion model under weak guidance. These pseudo-labels exhibit structurally aligned contours and natural colors, enabling region-specific restoration via staged supervision: structural guidance applied throughout the denoising process and color refinement in later steps, aligned with the coarse-to-fine nature of diffusion. By incorporating face parsing maps and scratch masks, our method selectively restores breakage regions while avoiding identity mismatch. We further construct VintageFace, a 300-image benchmark of real old face photos with varying degradation levels. SSDiff outperforms existing GAN-based and diffusion-based methods in perceptual quality, fidelity, and regional controllability. Code link: https://github.com/PRIS-CV/SSDiff.

CVMay 12
LiBrA-Net: Lie-Algebraic Bilateral Affine Fields for Real-Time 4K Video Dehazing

Yongcong Wang, Chengchao Shen, Guangwei Gao et al.

Currently, there is a gap in the field of ultra-high-definition (UHD) video dehazing due to the lack of a benchmark for evaluation. Furthermore, existing video dehazing methods cannot run on consumer-grade GPUs when processing continuous UHD sequences of 3-5 frames at a time. In this paper, we address both issues with a new benchmark and an efficient method. Our key observation is that atmospheric dehazing reduces to a per-pixel affine transform governed by the low-frequency depth field, which can be compactly encoded in bilateral grids whose prediction cost is decoupled from the output resolution. Building on this, we propose LiBrA-Net, which factorizes the spatiotemporal affine field into a spatial-color and a temporal bilateral sub-grid predicted at a fixed low resolution, fuses their coefficients in the $\mathfrak{gl}(3)$ Lie algebra under group-theoretic regularization, maps the result to invertible GL(3) transforms via a Cayley parameterization, and restores high-frequency detail through a lightweight input-guided branch. We further release UHV-4K, the first paired 4K video dehazing benchmark with depth, transmission, and optical-flow annotations on every frame. Across UHV-4K, REVIDE, and HazeWorld, LiBrA-Net sets a new state of the art among compared video dehazing methods while running native 4K at 25 FPS on a single GPU with only 6.12 M parameters. Code and data are available at https://anonymous.4open.science/r/LiBrA-Net-42B8.
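
The Cayley step mentioned in this abstract can be illustrated with a minimal NumPy sketch. It assumes the common form $(I - A/2)^{-1}(I + A/2)$ of the Cayley map; the abstract does not state which variant LiBrA-Net uses, and the names `cayley`, `a`, and `s` are hypothetical.

```python
import numpy as np

def cayley(a: np.ndarray) -> np.ndarray:
    """Cayley map A -> (I - A/2)^{-1} (I + A/2) for a square matrix A."""
    eye = np.eye(a.shape[0])
    # solve() avoids forming the explicit inverse of (I - A/2).
    return np.linalg.solve(eye - 0.5 * a, eye + 0.5 * a)

rng = np.random.default_rng(0)
# Hypothetical fused coefficients: a small element of gl(3), i.e. any real 3x3 matrix.
a = 0.1 * rng.standard_normal((3, 3))
g = cayley(a)

# For a small A, both factors are invertible, so the result lies in GL(3).
det_g = np.linalg.det(g)

# Special case: a skew-symmetric input yields a rotation (orthogonal, det = +1).
s = a - a.T
r = cayley(s)
```

The map is only defined when $I - A/2$ is invertible, and its output is invertible only when $I + A/2$ is as well, which is one reason a method might regularize the size of the coefficients.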

CVMar 15, 2025Code
Fraesormer: Learning Adaptive Sparse Transformer for Efficient Food Recognition

Shun Zou, Yi Zou, Mingya Zhang et al.

In recent years, Transformers have driven significant progress in food recognition. However, most existing approaches still face two critical challenges in lightweight food recognition: (1) the quadratic complexity and redundant feature representation from interactions with irrelevant tokens; (2) static feature recognition and single-scale representation, which overlook the unstructured, non-fixed nature of food images and the need for multi-scale features. To address these, we propose an adaptive and efficient sparse Transformer architecture (Fraesormer) with two core designs: Adaptive Top-k Sparse Partial Attention (ATK-SPA) and Hierarchical Scale-Sensitive Feature Gating Network (HSSFGN). ATK-SPA uses a learnable Gated Dynamic Top-K Operator (GDTKO) to retain critical attention scores, filtering low query-key matches that hinder feature aggregation. It also introduces a partial channel mechanism to reduce redundancy and promote expert information flow, enabling local-global collaborative modeling. HSSFGN employs a gating mechanism to achieve multi-scale feature representation, enhancing contextual semantic information. Extensive experiments show that Fraesormer outperforms state-of-the-art methods. Code is available at https://zs1314.github.io/Fraesormer.
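
The idea of retaining only the strongest query-key matches can be sketched as follows. This is a generic top-k sparse attention with a fixed `top_k`, not the learnable gated operator (GDTKO) or the partial channel mechanism described above; the function name and shapes are assumptions for illustration.

```python
import numpy as np

def topk_sparse_attention(q, k, v, top_k):
    """Single-head attention that keeps only the top_k scores per query."""
    scores = q @ k.T / np.sqrt(q.shape[-1])            # (n_q, n_k)
    # Per-row threshold: the top_k-th largest score of each query.
    kth = np.partition(scores, -top_k, axis=-1)[:, -top_k][:, None]
    masked = np.where(scores >= kth, scores, -np.inf)  # drop weak matches
    masked = masked - masked.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(masked)                           # exp(-inf) == 0
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))   # 4 queries, 8 channels
k = rng.standard_normal((6, 8))   # 6 keys
v = rng.standard_normal((6, 8))
out, weights = topk_sparse_attention(q, k, v, top_k=2)
```

Each query then aggregates values from only `top_k` tokens. Note that this sketch still computes the full score matrix, so it shows the filtering effect rather than any reduction in the quadratic cost.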

CVDec 16, 2021Code
Feature Distillation Interaction Weighting Network for Lightweight Image Super-Resolution

Guangwei Gao, Wenjie Li, Juncheng Li et al.

Convolutional neural network-based single-image super-resolution (SISR) has made great progress in recent years. However, it is difficult to apply these methods to real-world scenarios due to the computational and memory cost. Meanwhile, how to take full advantage of the intermediate features under the constraints of limited parameters and calculations is also a huge challenge. To alleviate these issues, we propose a lightweight yet efficient Feature Distillation Interaction Weighted Network (FDIWN). Specifically, FDIWN utilizes a series of specially designed Feature Shuffle Weighted Groups (FSWG) as the backbone, and several novel mutual Wide-residual Distillation Interaction Blocks (WDIB) form an FSWG. In addition, Wide Identical Residual Weighting (WIRW) units and Wide Convolutional Residual Weighting (WCRW) units are introduced into WDIB for better feature distillation. Moreover, a Wide-Residual Distillation Connection (WRDC) framework and a Self-Calibration Fusion (SCF) unit are proposed to let features at different scales interact more flexibly and efficiently. Extensive experiments show that our FDIWN outperforms other models in striking a good balance between model performance and efficiency. The code is available at https://github.com/IVIPLab/FDIWN.

IVSep 29, 2021Code
A Systematic Survey of Deep Learning-based Single-Image Super-Resolution

Juncheng Li, Zehua Pei, Wenjie Li et al.

Single-image super-resolution (SISR) is an important task in image processing, which aims to enhance the resolution of imaging systems. Recently, SISR has made a huge leap and has achieved promising results with the help of deep learning (DL). In this survey, we give an overview of DL-based SISR methods and group them according to their design targets. Specifically, we first introduce the problem definition, research background, and the significance of SISR. Secondly, we introduce some related works, including benchmark datasets, upsampling methods, optimization objectives, and image quality assessment methods. Thirdly, we provide a detailed investigation of SISR and give some domain-specific applications of it. Fourthly, we present the reconstruction results of some classic SISR methods to intuitively know their performance. Finally, we discuss some issues that still exist in SISR and summarize some new trends and future directions. This is an exhaustive survey of SISR, which can help researchers better understand SISR and inspire more exciting research in this field. An investigation project for SISR is provided at https://github.com/CV-JunchengLi/SISR-Survey.
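
As one concrete instance of an image quality assessment method for SISR, peak signal-to-noise ratio (PSNR) is a widely used choice. The abstract does not name specific metrics, so this is an illustrative assumption; the sketch takes images scaled to [0, 1].

```python
import numpy as np

def psnr(reference, restored, max_value=1.0):
    """PSNR in dB: 10 * log10(max_value^2 / MSE)."""
    diff = reference.astype(np.float64) - restored.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))

reference = np.zeros((4, 4))
restored = np.full((4, 4), 0.1)     # constant error of 0.1, so MSE = 0.01
value = psnr(reference, restored)   # approximately 20 dB
```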

CVSep 2, 2021Code
FBSNet: A Fast Bilateral Symmetrical Network for Real-Time Semantic Segmentation

Guangwei Gao, Guoan Xu, Juncheng Li et al.

Real-time semantic segmentation, which can be visually understood as the pixel-level classification task on the input image, currently has broad application prospects, especially in the fast-developing fields of autonomous driving and drone navigation. However, the huge computational burden and redundant parameters remain obstacles to its technological development. In this paper, we propose a Fast Bilateral Symmetrical Network (FBSNet) to alleviate the above challenges. Specifically, FBSNet employs a symmetrical encoder-decoder structure with two branches: a semantic information branch and a spatial detail branch. The Semantic Information Branch (SIB) is the main branch, with a semantic architecture that acquires the contextual information of the input image while also obtaining a sufficient receptive field. The Spatial Detail Branch (SDB), in contrast, is a shallow and simple network used to establish local dependencies of each pixel for preserving details, which is essential for restoring the original resolution during the decoding phase. Meanwhile, a Feature Aggregation Module (FAM) is designed to effectively combine the output of these two branches. Experimental results on Cityscapes and CamVid show that the proposed FBSNet can strike a good balance between accuracy and efficiency. Specifically, it obtains 70.9% and 68.9% mIoU along with the inference speed of 90 fps and 120 fps on these two test datasets, respectively, with only 0.62 million parameters on a single RTX 2080Ti GPU. The code is available at https://github.com/IVIPLab/FBSNet.
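
The mIoU figure reported above is the mean, over classes, of intersection over union between predicted and ground-truth label maps. A minimal sketch of that computation follows; it assumes integer label maps and omits the ignore-label handling that benchmark evaluation protocols typically include.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union computed from a confusion matrix."""
    index = target.ravel() * num_classes + pred.ravel()
    confusion = np.bincount(index, minlength=num_classes ** 2)
    confusion = confusion.reshape(num_classes, num_classes)  # rows: target, cols: pred
    intersection = np.diag(confusion)
    union = confusion.sum(axis=0) + confusion.sum(axis=1) - intersection
    present = union > 0  # skip classes absent from both maps
    return float(np.mean(intersection[present] / union[present]))

pred = np.array([[0, 0], [1, 1]])
target = np.array([[0, 1], [1, 1]])
# Class 0: IoU = 1/2, class 1: IoU = 2/3, so the mean is 7/12.
score = mean_iou(pred, target, num_classes=2)
```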

CVMar 26, 2021Code
Leaning Compact and Representative Features for Cross-Modality Person Re-Identification

Guangwei Gao, Hao Shao, Fei Wu et al.

This paper focuses on the cross-modality visible-infrared person re-identification (VI Re-ID) task, which aims to match pedestrian samples between the visible and infrared modalities. To reduce the modality discrepancy between samples from different cameras, most existing works use constraints based on the Euclidean metric. Because a Euclidean distance metric cannot effectively measure the internal angles between embedded vectors, the existing solutions cannot learn an angularly discriminative feature embedding. Since the most important factor affecting an embedding-based classification task is whether there is an angularly discriminative feature space, we present a new loss function called Enumerate Angular Triplet (EAT) loss. Also, motivated by knowledge distillation, to narrow the gap between features of different modalities before feature embedding, we further present a novel Cross-Modality Knowledge Distillation (CMKD) loss. Benefiting from these two considerations, the embedded features are discriminative enough to tackle the modality-discrepancy problem. The experimental results on the RegDB and SYSU-MM01 datasets demonstrate that the proposed method outperforms other state-of-the-art methods. Code is available at https://github.com/IVIPLab/LCCRF.

CVMay 5
RPBA-Net: An Interpretable Residual Pyramid Bilateral Affine Network for RAW-Domain ISP Enhancement

Yucheng Xin, Wu Chen, Xiang Chen et al.

To address module fragmentation, uninterpretable mappings, and deployment constraints in RAW-domain demosaicing, color correction, and detail enhancement, this paper proposes RPBA-Net, an interpretable residual pyramid bilateral affine network for RAW-domain ISP enhancement. Given packed RAW as input, the method performs residual affine base reconstruction by estimating a base RGB representation and learning identity-guided residual affine corrections, thereby unifying demosaicing and enhancement. It further builds pyramid bilateral affine grids and combines guide-driven autoregressive adaptive slicing with adaptive cross-layer fusion to hierarchically model global tone restoration and local texture enhancement. In addition, smoothness, cross-scale consistency, and magnitude regularization terms are introduced to improve model stability, controllability, and structural interpretability. Extensive experiments demonstrate that RPBA-Net surpasses representative RAW-to-sRGB methods and achieves state-of-the-art performance in reconstruction fidelity and perceptual quality, while maintaining low model complexity and strong deployment potential for mobile and embedded platforms.

CVApr 17, 2025
NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images: Methods and Results

Xin Li, Yeying Jin, Xin Jin et al.

This paper reviews the NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images. This challenge received a wide range of impressive solutions, which are developed and evaluated using our collected real-world Raindrop Clarity dataset. Unlike existing deraining datasets, our Raindrop Clarity dataset is more diverse and challenging in degradation types and contents, which includes day raindrop-focused, day background-focused, night raindrop-focused, and night background-focused degradations. This dataset is divided into three subsets for competition: 14,139 images for training, 240 images for validation, and 731 images for testing. The primary objective of this challenge is to establish a new and powerful benchmark for the task of removing raindrops under varying lighting and focus conditions. There are a total of 361 participants in the competition, with 32 teams submitting valid solutions and fact sheets for the final testing phase. These submissions achieved state-of-the-art (SOTA) performance on the Raindrop Clarity dataset. The project can be found at https://lixinustc.github.io/CVPR-NTIRE2025-RainDrop-Competition.github.io/.

CVFeb 21, 2024
Scene Prior Filtering for Depth Super-Resolution

Zhengxue Wang, Zhiqiang Yan, Ming-Hsuan Yang et al.

Multi-modal fusion is vital to the success of super-resolution of depth maps. However, commonly used fusion strategies, such as addition and concatenation, fall short of effectively bridging the modal gap. As a result, guided image filtering methods have been introduced to mitigate this issue. Nevertheless, it is observed that their filter kernels usually encounter significant texture interference and edge inaccuracy. To tackle these two challenges, we introduce a Scene Prior Filtering network, SPFNet, which utilizes the priors surface normal and semantic map from large-scale models. Specifically, we design an All-in-one Prior Propagation that computes the similarity between multi-modal scene priors, i.e., RGB, normal, semantic, and depth, to reduce the texture interference. In addition, we present a One-to-one Prior Embedding that continuously embeds each single-modal prior into depth using Mutual Guided Filtering, further alleviating the texture interference while enhancing edges. Our SPFNet has been extensively evaluated on both real and synthetic datasets, achieving state-of-the-art performance.

IVApr 10
UHD Low-Light Image Enhancement via Real-Time Enhancement Methods with Clifford Information Fusion

Xiaohan Wang, Chen Wu, Dawei Zhao et al.

Considering efficiency, ultra-high-definition (UHD) low-light image restoration is extremely challenging. Existing methods based on Transformer architectures or high-dimensional complex convolutional neural networks often suffer from the "memory wall" bottleneck, failing to achieve millisecond-level inference on edge devices. To address this issue, we propose a novel real-time UHD low-light enhancement network based on geometric feature fusion using Clifford algebra in 2D Euclidean space. First, we construct a four-layer feature pyramid with gradually increasing resolution, which decomposes input images into low-frequency and high-frequency structural components via a Gaussian blur kernel, and adopts a lightweight U-Net based on depthwise separable convolution for dual-branch feature extraction. Second, to resolve structural information loss and artifacts from traditional high-low frequency feature fusion, we introduce spatially aware Clifford algebra, which maps feature tensors to a multivector space (scalars, vectors, bivectors) and uses Clifford similarity to aggregate features while suppressing noise and preserving textures. In the reconstruction stage, the network outputs adaptive Gamma and Gain maps, which perform physically constrained non-linear brightness adjustment via Retinex theory. Integrated with FP16 mixed-precision computation and dynamic operator fusion, our method achieves millisecond-level inference for 4K/8K images on a single consumer-grade device, while outperforming state-of-the-art (SOTA) models on several restoration metrics.
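
The split into low-frequency and high-frequency components via a Gaussian blur kernel can be sketched as below. The kernel size, sigma, and function names are assumptions; the abstract does not give the paper's settings, and the feature pyramid and Clifford fusion are not reproduced here.

```python
import numpy as np

def gaussian_kernel(sigma, radius):
    """Normalized 1D Gaussian kernel of length 2 * radius + 1."""
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()

def gaussian_blur(image, sigma=1.0, radius=2):
    """Separable Gaussian blur of a 2D array with reflect padding."""
    kernel = gaussian_kernel(sigma, radius)
    padded = np.pad(image, radius, mode="reflect")
    # Filter along rows, then along columns; "valid" trims the padding again.
    rows = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)

rng = np.random.default_rng(0)
image = rng.random((8, 8))
low = gaussian_blur(image)   # smooth, low-frequency component
high = image - low           # residual high-frequency component
```

By construction `low + high` reproduces the input exactly, so the two branches together lose no information before fusion.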

CVMar 15, 2025
Learning Dual-Domain Multi-Scale Representations for Single Image Deraining

Shun Zou, Yi Zou, Mingya Zhang et al.

Existing image deraining methods typically rely on single-input, single-output, and single-scale architectures, which overlook the joint multi-scale information between external and internal features. Furthermore, single-domain representations are often too restrictive, limiting their ability to handle the complexities of real-world rain scenarios. To address these challenges, we propose a novel Dual-Domain Multi-Scale Representation Network (DMSR). The key idea is to exploit joint multi-scale representations from both external and internal domains in parallel while leveraging the strengths of both spatial and frequency domains to capture more comprehensive properties. Specifically, our method consists of two main components: the Multi-Scale Progressive Spatial Refinement Module (MPSRM) and the Frequency Domain Scale Mixer (FDSM). The MPSRM enables the interaction and coupling of multi-scale expert information within the internal domain using a hierarchical modulation and fusion strategy. The FDSM extracts multi-scale local information in the spatial domain, while also modeling global dependencies in the frequency domain. Extensive experiments show that our model achieves state-of-the-art performance across six benchmark datasets.

CVNov 26, 2024
SCASeg: Strip Cross-Attention for Efficient Semantic Segmentation

Guoan Xu, Jiaming Chen, Wenfeng Huang et al.

The Vision Transformer (ViT) has achieved notable success in computer vision, with its variants extensively validated across various downstream tasks, including semantic segmentation. However, designed as general-purpose visual encoders, ViT backbones often overlook the specific needs of task decoders, revealing opportunities to design decoders tailored to efficient semantic segmentation. This paper proposes Strip Cross-Attention (SCASeg), an innovative decoder head explicitly designed for semantic segmentation. Instead of relying on the simple conventional skip connections, we employ lateral connections between the encoder and decoder stages, using encoder features as Queries for the cross-attention modules. Additionally, we introduce a Cross-Layer Block that blends hierarchical feature maps from different encoder and decoder stages to create a unified representation for Keys and Values. To further boost computational efficiency, SCASeg compresses queries and keys into strip-like patterns to optimize memory usage and inference speed over the traditional vanilla cross-attention. Moreover, the Cross-Layer Block incorporates the local perceptual strengths of convolution, enabling SCASeg to capture both global and local context dependencies across multiple layers. This approach facilitates effective feature interaction at different scales, improving the overall performance. Experiments show that the adaptable decoder of SCASeg produces competitive performance across different setups, surpassing leading segmentation architectures on all benchmark datasets, including ADE20K, Cityscapes, COCO-Stuff 164k, and Pascal VOC2012, even under varying computational limitations.

CVNov 20, 2025
Multi-Order Matching Network for Alignment-Free Depth Super-Resolution

Zhengxue Wang, Zhiqiang Yan, Yuan Wu et al.

Recent guided depth super-resolution methods are premised on the assumption of strictly spatial alignment between depth and RGB, achieving high-quality depth reconstruction. However, in real-world scenarios, the acquisition of strictly aligned RGB-D is hindered by inherent hardware limitations (e.g., physically separate RGB-D sensors) and unavoidable calibration drift induced by mechanical vibrations or temperature variations. Consequently, existing approaches often suffer inevitable performance degradation when applied to misaligned real-world scenes. In this paper, we propose the Multi-Order Matching Network (MOMNet), a novel alignment-free framework that adaptively retrieves and selects the most relevant information from misaligned RGB. Specifically, our method begins with a multi-order matching mechanism, which jointly performs zero-order, first-order, and second-order matching to comprehensively identify RGB information consistent with depth across multi-order feature spaces. To effectively integrate the retrieved RGB and depth, we further introduce a multi-order aggregation composed of multiple structure detectors. This strategy uses multi-order priors as prompts to facilitate the selective feature transfer from RGB to depth. Extensive experiments demonstrate that MOMNet achieves state-of-the-art performance and exhibits outstanding robustness.

CVNov 18, 2025
Measurement-Constrained Sampling for Text-Prompted Blind Face Restoration

Wenjie Li, Yulun Zhang, Guangwei Gao et al.

Blind face restoration (BFR) may correspond to multiple plausible high-quality (HQ) reconstructions under extremely low-quality (LQ) inputs. However, existing methods typically produce deterministic results, struggling to capture this one-to-many nature. In this paper, we propose a Measurement-Constrained Sampling (MCS) approach that enables diverse LQ face reconstructions conditioned on different textual prompts. Specifically, we formulate BFR as a measurement-constrained generative task by constructing an inverse problem through controlled degradations of coarse restorations, which allows posterior-guided sampling within text-to-image diffusion. Measurement constraints include both Forward Measurement, which ensures results align with input structures, and Reverse Measurement, which produces projection spaces, ensuring that the solution can align with various prompts. Experiments show that our MCS can generate prompt-aligned results and outperforms existing BFR methods. Codes will be released after acceptance.

CVOct 24, 2025
WaveSeg: Enhancing Segmentation Precision via High-Frequency Prior and Mamba-Driven Spectrum Decomposition

Guoan Xu, Yang Xiao, Wenjing Jia et al.

While recent semantic segmentation networks heavily rely on powerful pretrained encoders, most employ simplistic decoders, leading to suboptimal trade-offs between semantic context and fine-grained detail preservation. To address this, we propose a novel decoder architecture, WaveSeg, which jointly optimizes feature refinement in spatial and wavelet domains. Specifically, high-frequency components are first learned from input images as explicit priors to reinforce boundary details at early stages. A multi-scale fusion mechanism, Dual Domain Operation (DDO), is then applied, and the novel Spectrum Decomposition Attention (SDA) block is proposed, which is developed to leverage Mamba's linear-complexity long-range modeling to enhance high-frequency structural details. Meanwhile, reparameterized convolutions are applied to preserve low-frequency semantic integrity in the wavelet domain. Finally, a residual-guided fusion integrates multi-scale features with boundary-aware representations at native resolution, producing semantically and structurally rich feature maps. Extensive experiments on standard benchmarks demonstrate that WaveSeg, leveraging wavelet-domain frequency prior with Mamba-based attention, consistently outperforms state-of-the-art approaches both quantitatively and qualitatively, achieving efficient and precise segmentation.
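
To make the wavelet-domain split concrete, here is a one-level 2D Haar transform in NumPy. The Haar basis is an assumption chosen for simplicity; the abstract does not say which wavelet WaveSeg uses, and sub-band naming conventions vary between libraries.

```python
import numpy as np

def haar_dwt2(x):
    """One-level orthonormal 2D Haar transform of an even-sized array."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]   # top-left, top-right of each 2x2 block
    c, d = x[1::2, 0::2], x[1::2, 1::2]   # bottom-left, bottom-right
    ll = (a + b + c + d) / 2.0            # low-frequency average
    lh = (a + b - c - d) / 2.0            # difference between rows
    hl = (a - b + c - d) / 2.0            # difference between columns
    hh = (a - b - c + d) / 2.0            # diagonal difference
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w))
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 0::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x

rng = np.random.default_rng(0)
image = rng.standard_normal((4, 6))
ll, lh, hl, hh = haar_dwt2(image)
restored = haar_idwt2(ll, lh, hl, hh)
```

The `ll` band carries the low-frequency content, while the other three bands carry the high-frequency structure that a decoder could treat as a boundary prior.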

CVSep 26, 2025
DeLiVR: Differential Spatiotemporal Lie Bias for Efficient Video Deraining

Shuning Sun, Jialang Lu, Xiang Chen et al.

Videos captured in the wild often suffer from rain streaks, blur, and noise. In addition, even slight changes in camera pose can amplify cross-frame mismatches and temporal artifacts. Existing methods rely on optical flow or heuristic alignment, which are computationally expensive and less robust. To address these challenges, Lie groups provide a principled way to represent continuous geometric transformations, making them well-suited for enforcing spatial and temporal consistency in video modeling. Building on this insight, we propose DeLiVR, an efficient video deraining method that injects spatiotemporal Lie-group differential biases directly into attention scores of the network. Specifically, the method introduces two complementary components. First, a rotation-bounded Lie relative bias predicts the in-plane angle of each frame using a compact prediction module, where normalized coordinates are rotated and compared with base coordinates to achieve geometry-consistent alignment before feature aggregation. Second, a differential group displacement computes angular differences between adjacent frames to estimate a velocity. This bias computation combines temporal decay and attention masks to focus on inter-frame relationships while precisely matching the direction of rain streaks. Extensive experimental results demonstrate the effectiveness of our method on publicly available benchmarks.
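
The step in which normalized coordinates are rotated and compared with base coordinates can be sketched as follows. The angle is fixed by hand here rather than predicted by a module, and how the comparison becomes an attention bias is not specified in the abstract, so the sketch stops at the per-position displacement; all names are hypothetical.

```python
import numpy as np

def normalized_grid(height, width):
    """Pixel-centre coordinates scaled to [-1, 1], shape (height * width, 2)."""
    ys, xs = np.meshgrid(np.linspace(-1.0, 1.0, height),
                         np.linspace(-1.0, 1.0, width), indexing="ij")
    return np.stack([xs.ravel(), ys.ravel()], axis=-1)

def rotate(coords, theta):
    """Apply an in-plane rotation by theta radians to (x, y) coordinates."""
    c, s = np.cos(theta), np.sin(theta)
    rotation = np.array([[c, -s], [s, c]])
    return coords @ rotation.T

base = normalized_grid(4, 5)
theta = 0.1                       # hypothetical in-plane angle for one frame
rotated = rotate(base, theta)
# Comparison with the base coordinates: per-position displacement.
displacement = np.linalg.norm(rotated - base, axis=-1)
radius = np.linalg.norm(base, axis=-1)
```

A rotation moves a point at distance `radius` from the centre by `2 * radius * sin(theta / 2)`, so positions near the frame border are displaced most.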

CVJun 17, 2025
FADPNet: Frequency-Aware Dual-Path Network for Face Super-Resolution

Siyu Xu, Wenjie Li, Guangwei Gao et al.

Face super-resolution (FSR) under limited computational costs remains an open problem. Existing approaches typically treat all facial pixels equally, resulting in suboptimal allocation of computational resources and degraded FSR performance. CNN is relatively sensitive to high-frequency facial features, such as component contours and facial outlines. Meanwhile, Mamba excels at capturing low-frequency features like facial color and fine-grained texture, and does so with lower complexity than Transformers. Motivated by these observations, we propose FADPNet, a Frequency-Aware Dual-Path Network that decomposes facial features into low- and high-frequency components and processes them via dedicated branches. For low-frequency regions, we introduce a Mamba-based Low-Frequency Enhancement Block (LFEB), which combines state-space attention with squeeze-and-excitation operations to extract low-frequency global interactions and emphasize informative channels. For high-frequency regions, we design a CNN-based Deep Position-Aware Attention (DPA) module to enhance spatially-dependent structural details, complemented by a lightweight High-Frequency Refinement (HFR) module that further refines frequency-specific representations. Through the above designs, our method achieves an excellent balance between FSR quality and model efficiency, outperforming existing approaches.

CVJun 10, 2025
A PDE-Based Image Dehazing Method via Atmospheric Scattering Theory

Liubing Hu, Pu Wang, Guangwei Gao et al.

This paper introduces a novel partial differential equation (PDE) framework for single-image dehazing. We embed the atmospheric scattering model into a PDE featuring edge-preserving diffusion and a nonlocal operator to maintain both local details and global structures. A key innovation is an adaptive regularization mechanism guided by the dark channel prior, which adjusts smoothing strength based on haze density. The framework's mathematical well-posedness is rigorously established by proving the existence and uniqueness of its weak solution in $H_0^1(Ω)$. An efficient, GPU-accelerated fixed-point solver is used for implementation. Experiments confirm our method achieves effective haze removal while preserving high image fidelity, offering a principled alternative to purely data-driven techniques.
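
For context, the atmospheric scattering model is commonly written as $I = J t + A (1 - t)$, with hazy image $I$, scene radiance $J$, transmission $t$, and airlight $A$. The sketch below inverts that model directly and computes a dark channel. It does not reproduce the PDE, the nonlocal operator, or the adaptive regularization of the paper, and the patch size and `t_min` are assumed values.

```python
import numpy as np

def dark_channel(image, patch=3):
    """Minimum over colour channels, then over a patch x patch neighbourhood."""
    min_rgb = image.min(axis=2)
    pad = patch // 2
    padded = np.pad(min_rgb, pad, mode="edge")
    h, w = min_rgb.shape
    windows = [padded[i:i + h, j:j + w] for i in range(patch) for j in range(patch)]
    return np.min(windows, axis=0)

def dehaze(hazy, airlight, transmission, t_min=0.1):
    """Invert I = J * t + A * (1 - t) for J, clamping t away from zero."""
    t = np.maximum(transmission, t_min)[..., None]
    return (hazy - airlight) / t + airlight

rng = np.random.default_rng(0)
clear = rng.random((6, 6, 3))            # synthetic scene radiance J in [0, 1]
transmission = np.full((6, 6), 0.6)      # synthetic, spatially constant t
airlight = 0.9
hazy = clear * transmission[..., None] + airlight * (1.0 - transmission[..., None])

recovered = dehaze(hazy, airlight, transmission)
haze_indicator = dark_channel(hazy)      # larger values suggest denser haze
```

With the true transmission and airlight the inversion is exact; in practice both must be estimated, which is where priors such as the dark channel come in.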

CVMay 28, 2025
S2AFormer: Strip Self-Attention for Efficient Vision Transformer

Guoan Xu, Wenfeng Huang, Wenjing Jia et al.

Vision Transformer (ViT) has made significant advancements in computer vision, thanks to its token mixer's sophisticated ability to capture global dependencies between all tokens. However, the quadratic growth in computational demands as the number of tokens increases limits its practical efficiency. Although recent methods have combined the strengths of convolutions and self-attention to achieve better trade-offs, the expensive pairwise token affinity and complex matrix operations inherent in self-attention remain a bottleneck. To address this challenge, we propose S2AFormer, an efficient Vision Transformer architecture featuring novel Strip Self-Attention (SSA). We design simple yet effective Hybrid Perception Blocks (HPBs) to effectively integrate the local perception capabilities of CNNs with the global context modeling of Transformer's attention mechanisms. A key innovation of SSA lies in its reducing the spatial dimensions of $K$ and $V$ while compressing the channel dimensions of $Q$ and $K$. This design significantly reduces computational overhead while preserving accuracy, striking an optimal balance between efficiency and effectiveness. We evaluate the robustness and efficiency of S2AFormer through extensive experiments on multiple vision benchmarks, including ImageNet-1k for image classification, ADE20k for semantic segmentation, and COCO for object detection and instance segmentation. Results demonstrate that S2AFormer achieves significant accuracy gains with superior efficiency in both GPU and non-GPU environments, making it a strong candidate for efficient vision Transformers.

CVApr 23, 2025
Cross Paradigm Representation and Alignment Transformer for Image Deraining

Shun Zou, Yi Zou, Juncheng Li et al.

Transformer-based networks have achieved strong performance in low-level vision tasks like image deraining by utilizing spatial or channel-wise self-attention. However, irregular rain patterns and complex geometric overlaps challenge single-paradigm architectures, necessitating a unified framework to integrate complementary global-local and spatial-channel representations. To address this, we propose a novel Cross Paradigm Representation and Alignment Transformer (CPRAformer). Its core idea is the hierarchical representation and alignment, leveraging the strengths of both paradigms (spatial-channel and global-local) to aid image reconstruction. It bridges the gap within and between paradigms, aligning and coordinating them to enable deep interaction and fusion of features. Specifically, we use two types of self-attention in the Transformer blocks: sparse prompt channel self-attention (SPC-SA) and spatial pixel refinement self-attention (SPR-SA). SPC-SA enhances global channel dependencies through dynamic sparsity, while SPR-SA focuses on spatial rain distribution and fine-grained texture recovery. To address the feature misalignment and knowledge differences between them, we introduce the Adaptive Alignment Frequency Module (AAFM), which aligns and interacts with features in a two-stage progressive manner, enabling adaptive guidance and complementarity. This reduces the information gap within and between paradigms. Through this unified cross-paradigm dynamic interaction framework, we achieve the extraction of the most valuable interactive fusion information from the two paradigms. Extensive experiments demonstrate that our model achieves state-of-the-art performance on eight benchmark datasets and further validates CPRAformer's robustness in other image restoration tasks and downstream applications.

CVMar 25, 2021
Hierarchical Deep CNN Feature Set-Based Representation Learning for Robust Cross-Resolution Face Recognition

Guangwei Gao, Yi Yu, Jian Yang et al.

Cross-resolution face recognition (CRFR), which is important in intelligent surveillance and biometric forensics, refers to the problem of matching a low-resolution (LR) probe face image against high-resolution (HR) gallery face images. Existing shallow learning-based and deep learning-based methods focus on mapping the HR-LR face pairs into a joint feature space where the resolution discrepancy is mitigated. However, few works consider how to extract and utilize the intermediate discriminative features from the noisy LR query faces to further mitigate the resolution discrepancy due to the resolution limitations. In this study, we aim to fully exploit the multi-level deep convolutional neural network (CNN) feature set for robust CRFR. In particular, our contributions are threefold. (i) To learn more robust and discriminative features, we adaptively fuse the contextual features from different layers. (ii) To fully exploit these contextual features, we design a feature set-based representation learning (FSRL) scheme to collaboratively represent the hierarchical features for more accurate recognition. Moreover, FSRL utilizes the primitive form of feature maps to keep the latent structural information, especially in noisy cases. (iii) To further improve the recognition performance, we fuse the hierarchical recognition outputs from different stages. Meanwhile, the discriminability from different scales can also be fully integrated. By exploiting these advantages, the proposed method achieves its efficiency. Experimental results on several face datasets have verified the superiority of the presented algorithm over other competitive CRFR approaches.

CVMar 25, 2021
JDSR-GAN: Constructing An Efficient Joint Learning Network for Masked Face Super-Resolution

Guangwei Gao, Lei Tang, Fei Wu et al.

With the growing importance of preventing the COVID-19 virus, face images obtained in most video surveillance scenarios are simultaneously low-resolution and occluded by masks. However, most previous face super-resolution solutions cannot handle both tasks in one model. In this work, we treat the mask occlusion as image noise and construct a joint and collaborative learning network, called JDSR-GAN, for the masked face super-resolution task. Given a low-quality face image with the mask as input, the role of the generator, composed of a denoising module and a super-resolution module, is to produce a high-quality high-resolution face image. The discriminator utilizes some carefully designed loss functions to ensure the quality of the recovered face images. Moreover, we incorporate the identity information and attention mechanism into our network for feasible correlated feature expression and informative feature learning. By jointly performing denoising and face super-resolution, the two tasks can complement each other and attain promising performance. Extensive qualitative and quantitative results show the superiority of our proposed JDSR-GAN over some comparable methods which perform the previous two tasks separately.

CVMar 24, 2021
MSCFNet: A Lightweight Network With Multi-Scale Context Fusion for Real-Time Semantic Segmentation

Guangwei Gao, Guoan Xu, Yi Yu et al.

In recent years, how to strike a good trade-off between accuracy and inference speed has become the core issue for real-time semantic segmentation applications, which play a vital role in real-world scenarios such as autonomous driving systems and drones. In this study, we devise a novel lightweight network using a multi-scale context fusion (MSCFNet) scheme, which explores an asymmetric encoder-decoder architecture to address this problem. More specifically, the encoder adopts efficient asymmetric residual (EAR) modules, which are composed of factorized depth-wise convolution and dilated convolution. Meanwhile, instead of complicated computation, simple deconvolution is applied in the decoder to further reduce the number of parameters while still maintaining high segmentation accuracy. Also, MSCFNet has branches with efficient attention modules from different stages of the network to capture multi-scale contextual information. We then combine them before the final classification to enhance the expression of the features and improve the segmentation efficiency. Comprehensive experiments on challenging datasets have demonstrated that the proposed MSCFNet, which contains only 1.15M parameters, achieves 71.9% mean IoU on the Cityscapes testing dataset and can run at over 50 FPS on a single Titan XP GPU configuration.
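To make the lightweight argument above more concrete, the following is a minimal sketch of why factorized depth-wise and dilated convolutions are cheap. It is plain parameter arithmetic, not the authors' EAR module: the channel width of 64, the kernel size of 3, the 1x1 point-wise mixing layer, and all function names are assumptions chosen for illustration.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a dense k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def factorized_depthwise_params(channels, k):
    """A k x 1 depth-wise conv followed by a 1 x k depth-wise conv.

    Depth-wise means one filter per channel, so each of the two
    layers holds channels * k weights.
    """
    return 2 * channels * k


def pointwise_params(c_in, c_out):
    """A 1 x 1 convolution that mixes channels (assumed here, not from the abstract)."""
    return c_in * c_out


def effective_kernel_size(k, dilation):
    """Spatial extent covered by a dilated kernel; dilation adds no weights."""
    return k + (k - 1) * (dilation - 1)


channels, k = 64, 3  # hypothetical layer width and kernel size
dense = standard_conv_params(channels, channels, k)
light = factorized_depthwise_params(channels, k) + pointwise_params(channels, channels)

print(dense, light, round(dense / light, 2))               # 36864 4480 8.23
print([effective_kernel_size(k, d) for d in (1, 2, 4)])    # [3, 5, 9]
```

Under these assumptions the factorized variant needs roughly an eighth of the weights of a dense 3x3 layer, while dilation widens the receptive field at no parameter cost.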

IVMar 24, 2021
Lightweight Image Super-Resolution with Multi-scale Feature Interaction Network

Zhengxue Wang, Guangwei Gao, Juncheng Li et al.

Recently, single image super-resolution (SISR) approaches with deep and complex convolutional neural network structures have achieved promising performance. However, those methods improve performance at the cost of higher memory consumption, which makes them difficult to apply on mobile devices with limited storage and computing resources. To solve this problem, we present a lightweight multi-scale feature interaction network (MSFIN). For lightweight SISR, MSFIN expands the receptive field and adequately exploits the informative features of the low-resolution observed images from various scales and interactive connections. In addition, we design a lightweight recurrent residual channel attention block (RRCAB) so that the network can benefit from the channel attention mechanism while being sufficiently lightweight. Extensive experiments on some benchmarks have confirmed that our proposed MSFIN can achieve performance comparable to state-of-the-art methods with a more lightweight model.
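As a rough illustration of the channel attention mechanism mentioned above, the sketch below implements a generic squeeze-and-excitation style gate in plain Python. It is not the RRCAB block from the paper: the recurrent and residual parts are omitted, and the tiny feature maps and the weight matrices `w_down` and `w_up` are hypothetical values picked so the result is easy to follow.

```python
import math


def channel_attention(feature_maps, w_down, w_up):
    """Rescale each channel by a learned gate in (0, 1).

    feature_maps: list of C channels, each an H x W nested list.
    w_down: R x C bottleneck weights, w_up: C x R expansion weights
    (both hypothetical; a real network would learn them).
    """
    # Squeeze: global average pooling gives one scalar per channel.
    pooled = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
              for ch in feature_maps]
    # Excitation: bottleneck layer with ReLU, then expansion with sigmoid.
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled)))
              for row in w_down]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w_up]
    # Rescale: multiply every value in a channel by that channel's gate.
    scaled = [[[v * g for v in row] for row in ch]
              for ch, g in zip(feature_maps, gates)]
    return scaled, gates


features = [
    [[1.0, 1.0], [1.0, 1.0]],  # channel 0
    [[3.0, 3.0], [3.0, 3.0]],  # channel 1
]
w_down = [[0.5, 0.5]]    # C=2 -> R=1
w_up = [[1.0], [-1.0]]   # R=1 -> C=2

scaled, gates = channel_attention(features, w_down, w_up)
print([round(g, 4) for g in gates])  # [0.8808, 0.1192]
```

With these example weights the first channel is largely preserved and the second is suppressed, which is the kind of per-channel reweighting the mechanism is meant to provide at a cost of only a few small matrix products.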

CVJul 20, 2020
Cross-View Image Synthesis with Deformable Convolution and Attention Mechanism

Hao Ding, Songsong Wu, Hao Tang et al.

Learning to generate natural scenes has always been a daunting task in computer vision. This is even more laborious when generating images with very different views. When the views are very different, the fields of view have little overlap or objects are occluded, which makes the task very challenging. In this paper, we propose to use Generative Adversarial Networks (GANs) based on deformable convolution and an attention mechanism to solve the problem of cross-view image synthesis (see Fig. 1). It is difficult to understand and transform scene appearance and semantic information from another view, so we use deformable convolution in the U-Net network to improve the network's ability to extract features of objects at different scales. Moreover, to better learn the correspondence between images from different views, we apply an attention mechanism to refine the intermediate feature map, thus generating more realistic images. A large number of experiments on images of different sizes from the Dayton dataset [1] show that our model can produce better results than state-of-the-art methods.

CVMay 7, 2019
LEDNet: A Lightweight Encoder-Decoder Network for Real-Time Semantic Segmentation

Yu Wang, Quan Zhou, Jia Liu et al.

The extensive computational burden limits the usage of CNNs in mobile devices for dense estimation tasks. In this paper, we present a lightweight network to address this problem, namely LEDNet, which employs an asymmetric encoder-decoder architecture for the task of real-time semantic segmentation. More specifically, the encoder adopts a ResNet as the backbone network, where two new operations, channel split and shuffle, are utilized in each residual block to greatly reduce computation cost while maintaining higher segmentation accuracy. On the other hand, an attention pyramid network (APN) is employed in the decoder to further lighten the entire network complexity. Our model has fewer than 1M parameters and is able to run at over 71 FPS on a single GTX 1080Ti GPU. The comprehensive experiments demonstrate that our approach achieves state-of-the-art results in terms of the speed and accuracy trade-off on the Cityscapes dataset.
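The channel split and shuffle operations named above can be sketched on a plain list of channel indices. This follows the commonly used split-in-half and group-interleave formulation; whether LEDNet uses exactly this grouping is an assumption, and the function names and the eight-channel example are hypothetical.

```python
def channel_split(channels):
    """Split the channel list into two halves, one per residual branch."""
    half = len(channels) // 2
    return channels[:half], channels[half:]


def channel_shuffle(channels, groups):
    """Interleave channels across groups.

    Equivalent to reshaping to (groups, per_group), transposing,
    and flattening, so information can flow between branches.
    """
    n = len(channels)
    if n % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    per_group = n // groups
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]


channels = list(range(8))              # stand-ins for 8 feature channels
left, right = channel_split(channels)  # each branch processes half the channels
merged = left + right                  # concatenation after the two branches
shuffled = channel_shuffle(merged, groups=2)

print(left, right)  # [0, 1, 2, 3] [4, 5, 6, 7]
print(shuffled)     # [0, 4, 1, 5, 2, 6, 3, 7]
```

Splitting halves the width each branch has to convolve over, and the shuffle afterwards mixes the two halves so that the next block does not see the same partition again.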