Liang-Jian Deng

CV
h-index39
38papers
1,098citations
Novelty55%
AI Score61

38 Papers

CVJun 21, 2022Code
TCJA-SNN: Temporal-Channel Joint Attention for Spiking Neural Networks

Rui-Jie Zhu, Malu Zhang, Qihang Zhao et al.

Spiking Neural Networks (SNNs) are attracting widespread interest due to their biological plausibility, energy efficiency, and powerful spatio-temporal information representation ability. Given the critical role of attention mechanisms in enhancing neural network performance, the integration of SNNs and attention mechanisms exhibits potential to deliver energy-efficient and high-performance computing paradigms. We present a novel Temporal-Channel Joint Attention mechanism for SNNs, referred to as TCJA-SNN. The proposed TCJA-SNN framework can effectively assess the significance of spike sequence from both spatial and temporal dimensions. More specifically, our essential technical contribution lies on: 1) We employ the squeeze operation to compress the spike stream into an average matrix. Then, we leverage two local attention mechanisms based on efficient 1D convolutions to facilitate comprehensive feature extraction at the temporal and channel levels independently. 2) We introduce the Cross Convolutional Fusion (CCF) layer as a novel approach to model the inter-dependencies between the temporal and channel scopes. This layer breaks the independence of these two dimensions and enables the interaction between features. Experimental results demonstrate that the proposed TCJA-SNN outperforms SOTA by up to 15.7% accuracy on standard static and neuromorphic datasets, including Fashion-MNIST, CIFAR10-DVS, N-Caltech 101, and DVS128 Gesture. Furthermore, we apply the TCJA-SNN framework to image generation tasks by leveraging a variation autoencoder. To the best of our knowledge, this study is the first instance where the SNN-attention mechanism has been employed for image classification and generation tasks. Notably, our approach has achieved SOTA performance in both domains, establishing a significant advancement in the field. Codes are available at https://github.com/ridgerchu/TCJA.

NANov 17, 2018
A New Operator Splitting Method for Euler's Elastica Model

Liang-Jian Deng, Roland Glowinski, Xue-Cheng Tai

Euler's elastica model has a wide range of applications in Image Processing and Computer Vision. However, the non-convexity, the non-smoothness and the nonlinearity of the associated energy functional make its minimization a challenging task, further complicated by the presence of high order derivatives in the model. In this article we propose a new operator-splitting algorithm to minimize the Euler elastica functional. This algorithm is obtained by applying an operator-splitting based time discretization scheme to an initial value problem (dynamical flow) associated with the optimality system (a system of multivalued equations). The sub-problems associated with the three fractional steps of the splitting scheme have either closed form solutions or can be handled by fast dedicated solvers. Compared with earlier approaches relying on ADMM (Alternating Direction Method of Multipliers), the new method has, essentially, only the time discretization step as free parameter to choose, resulting in a very robust and stable algorithm. The simplicity of the sub-problems and its modularity make this algorithm quite efficient. Applications to the numerical solution of smoothing test problems demonstrate the efficiency and robustness of the proposed methodology.

CVApr 10, 2023
DDRF: Denoising Diffusion Model for Remote Sensing Image Fusion

ZiHan Cao, ShiQi Cao, Xiao Wu et al.

Denosing diffusion model, as a generative model, has received a lot of attention in the field of image generation recently, thanks to its powerful generation capability. However, diffusion models have not yet received sufficient research in the field of image fusion. In this article, we introduce diffusion model to the image fusion field, treating the image fusion task as image-to-image translation and designing two different conditional injection modulation modules (i.e., style transfer modulation and wavelet modulation) to inject coarse-grained style information and fine-grained high-frequency and low-frequency information into the diffusion UNet, thereby generating fused images. In addition, we also discussed the residual learning and the selection of training objectives of the diffusion model in the image fusion task. Extensive experimental results based on quantitative and qualitative assessments compared with benchmarks demonstrates state-of-the-art results and good generalization performance in image fusion tasks. Finally, it is hoped that our method can inspire other works and gain insight into this field to better apply the diffusion model to image fusion tasks. Code shall be released for better reproducibility.

CVAug 24, 2023
APLA: Additional Perturbation for Latent Noise with Adversarial Training Enables Consistency

Yupu Yao, Shangqi Deng, Zihan Cao et al.

Diffusion models have exhibited promising progress in video generation. However, they often struggle to retain consistent details within local regions across frames. One underlying cause is that traditional diffusion models approximate Gaussian noise distribution by utilizing predictive noise, without fully accounting for the impact of inherent information within the input itself. Additionally, these models emphasize the distinction between predictions and references, neglecting information intrinsic to the videos. To address this limitation, inspired by the self-attention mechanism, we propose a novel text-to-video (T2V) generation network structure based on diffusion models, dubbed Additional Perturbation for Latent noise with Adversarial training (APLA). Our approach only necessitates a single video as input and builds upon pre-trained stable diffusion networks. Notably, we introduce an additional compact network, known as the Video Generation Transformer (VGT). This auxiliary component is designed to extract perturbations from the inherent information contained within the input, thereby refining inconsistent pixels during temporal predictions. We leverage a hybrid architecture of transformers and convolutions to compensate for temporal intricacies, enhancing consistency between different frames within the video. Experiments demonstrate a noticeable improvement in the consistency of the generated videos both qualitatively and quantitatively.

CVJul 24, 2023
A Theoretically Guaranteed Quaternion Weighted Schatten p-norm Minimization Method for Color Image Restoration

Qing-Hua Zhang, Liang-Tian He, Yi-Lun Wang et al.

Inspired by the fact that the matrix formulated by nonlocal similar patches in a natural image is of low rank, the rank approximation issue have been extensively investigated over the past decades, among which weighted nuclear norm minimization (WNNM) and weighted Schatten $p$-norm minimization (WSNM) are two prevailing methods have shown great superiority in various image restoration (IR) problems. Due to the physical characteristic of color images, color image restoration (CIR) is often a much more difficult task than its grayscale image counterpart. However, when applied to CIR, the traditional WNNM/WSNM method only processes three color channels individually and fails to consider their cross-channel correlations. Very recently, a quaternion-based WNNM approach (QWNNM) has been developed to mitigate this issue, which is capable of representing the color image as a whole in the quaternion domain and preserving the inherent correlation among the three color channels. Despite its empirical success, unfortunately, the convergence behavior of QWNNM has not been strictly studied yet. In this paper, on the one side, we extend the WSNM into quaternion domain and correspondingly propose a novel quaternion-based WSNM model (QWSNM) for tackling the CIR problems. Extensive experiments on two representative CIR tasks, including color image denoising and deblurring, demonstrate that the proposed QWSNM method performs favorably against many state-of-the-art alternatives, in both quantitative and qualitative evaluations. On the other side, more importantly, we preliminarily provide a theoretical convergence analysis, that is, by modifying the quaternion alternating direction method of multipliers (QADMM) through a simple continuation strategy, we theoretically prove that both the solution sequences generated by the QWNNM and QWSNM have fixed-point convergence guarantees.

23.3CVApr 10
Fast Model-guided Instance-wise Adaptation Framework for Real-world Pansharpening with Fidelity Constraints

Zhiqi Yang, Jin-Liang Xiao, Shan Yin et al.

Pansharpening aims to generate high-resolution multispectral (HRMS) images by fusing low-resolution multispectral (LRMS) and high-resolution panchromatic (PAN) images while preserving both spectral and spatial information. Although deep learning (DL)-based pansharpening methods achieve impressive performance, they require high training cost and large datasets, and often degrade when the test distribution differs from training, limiting generalization. Recent zero-shot methods, trained on a single PAN/LRMS pair, offer strong generalization but suffer from limited fusion quality, high computational overhead, and slow convergence. To address these issues, we propose FMG-Pan, a fast and generalizable model-guided instance-wise adaptation framework for real-world pansharpening, achieving both cross-sensor generality and rapid training-inference. The framework leverages a pretrained model to guide a lightweight adaptive network through joint optimization with spectral and physical fidelity constraints. We further design a novel physical fidelity term to enhance spatial detail preservation. Extensive experiments on real-world datasets under both intra- and cross-sensor settings demonstrate state-of-the-art performance. On the WorldView-3 dataset, FMG-Pan completes training and inference for a 512x512x8 image within 3 seconds on an RTX 3090 GPU, significantly faster than existing zero-shot methods, making it suitable for practical deployment.

CVJul 14, 2023
Implicit Neural Feature Fusion Function for Multispectral and Hyperspectral Image Fusion

ShangQi Deng, RuoCheng Wu, Liang-Jian Deng et al.

Multispectral and Hyperspectral Image Fusion (MHIF) is a practical task that aims to fuse a high-resolution multispectral image (HR-MSI) and a low-resolution hyperspectral image (LR-HSI) of the same scene to obtain a high-resolution hyperspectral image (HR-HSI). Benefiting from powerful inductive bias capability, CNN-based methods have achieved great success in the MHIF task. However, they lack certain interpretability and require convolution structures be stacked to enhance performance. Recently, Implicit Neural Representation (INR) has achieved good performance and interpretability in 2D tasks due to its ability to locally interpolate samples and utilize multimodal content such as pixels and coordinates. Although INR-based approaches show promise, they require extra construction of high-frequency information (\emph{e.g.,} positional encoding). In this paper, inspired by previous work of MHIF task, we realize that HR-MSI could serve as a high-frequency detail auxiliary input, leading us to propose a novel INR-based hyperspectral fusion function named Implicit Neural Feature Fusion Function (INF). As an elaborate structure, it solves the MHIF task and addresses deficiencies in the INR-based approaches. Specifically, our INF designs a Dual High-Frequency Fusion (DHFF) structure that obtains high-frequency information twice from HR-MSI and LR-HSI, then subtly fuses them with coordinate information. Moreover, the proposed INF incorporates a parameter-free method named INR with cosine similarity (INR-CS) that uses cosine similarity to generate local weights through feature vectors. Based on INF, we construct an Implicit Neural Fusion Network (INFN) that achieves state-of-the-art performance for MHIF tasks of two public datasets, \emph{i.e.,} CAVE and Harvard. The code will soon be made available on GitHub.

41.8CVMay 10Code
Adaptive 3D Convolution for Remote Sensing Image Fusion

Siran Peng, Xiangyu Zhu, Shang-Qi Deng et al.

Remote sensing image fusion aims to create a high-resolution multi/hyper-spectral image from a high-resolution image with limited spectral information and a low-resolution image with abundant spectral data. Recently, deep learning (DL) techniques have shown significant effectiveness in this area. Most DL-based methods approach image fusion as a 2D problem by encoding spectral information into feature map channels. However, our research suggests that this strategy introduces notable spectral distortions. In contrast, some methods consider spectral data as an additional dimension, utilizing standard 3D convolutions to preserve spectral information. Nevertheless, in a standard 3D convolutional layer, the same set of kernels is applied across all input regions, which we have found to be sub-optimal for image fusion. Furthermore, standard 3D convolutions necessitate substantial computational resources. To address these challenges, we propose a novel convolutional paradigm called Adaptive 3D Convolution (Ada3D) for remote sensing image fusion. Ada3D applies a unique set of 3D kernels to each input voxel, enabling the capture of fine-grained details. These adaptive kernels are generated through a two-step process: (i) spatial and spectral kernels are derived from their respective image sources; (ii) these two types of kernels are then combined to form content-aware 3D kernels that effectively integrate spatial and spectral information. Additionally, adaptive biases are introduced to enhance the convolutional outcome at the voxel level. Furthermore, we incorporate the group convolution technique to reduce computational complexity. As a result, Ada3D offers full adaptivity in an efficient manner. Evaluation results across five datasets demonstrate that our method achieves SOTA performance, underscoring the superiority of Ada3D. The code is available at https://github.com/PSRben/Ada3D.

CVApr 11, 2024Code
Content-Adaptive Non-Local Convolution for Remote Sensing Pansharpening

Yule Duan, Xiao Wu, Haoyu Deng et al.

Currently, machine learning-based methods for remote sensing pansharpening have progressed rapidly. However, existing pansharpening methods often do not fully exploit differentiating regional information in non-local spaces, thereby limiting the effectiveness of the methods and resulting in redundant learning parameters. In this paper, we introduce a so-called content-adaptive non-local convolution (CANConv), a novel method tailored for remote sensing image pansharpening. Specifically, CANConv employs adaptive convolution, ensuring spatial adaptability, and incorporates non-local self-similarity through the similarity relationship partition (SRP) and the partition-wise adaptive convolution (PWAC) sub-modules. Furthermore, we also propose a corresponding network architecture, called CANNet, which mainly utilizes the multi-scale self-similarity. Extensive experiments demonstrate the superior performance of CANConv, compared with recent promising fusion methods. Besides, we substantiate the method's effectiveness through visualization, ablation experiments, and comparison with existing methods on multiple test sets. The source code is publicly available at https://github.com/duanyll/CANConv.

CVApr 11, 2024Code
FusionMamba: Efficient Remote Sensing Image Fusion with State Space Model

Siran Peng, Xiangyu Zhu, Haoyu Deng et al.

Remote sensing image fusion aims to generate a high-resolution multi/hyper-spectral image by combining a high-resolution image with limited spectral data and a low-resolution image rich in spectral information. Current deep learning (DL) methods typically employ convolutional neural networks (CNNs) or Transformers for feature extraction and information integration. While CNNs are efficient, their limited receptive fields restrict their ability to capture global context. Transformers excel at learning global information but are computationally expensive. Recent advancements in the state space model (SSM), particularly Mamba, present a promising alternative by enabling global perception with low complexity. However, the potential of SSM for information integration remains largely unexplored. Therefore, we propose FusionMamba, an innovative method for efficient remote sensing image fusion. Our contributions are twofold. First, to effectively merge spatial and spectral features, we expand the single-input Mamba block to accommodate dual inputs, creating the FusionMamba block, which serves as a plug-and-play solution for information integration. Second, we incorporate Mamba and FusionMamba blocks into an interpretable network architecture tailored for remote sensing image fusion. Our designs utilize two U-shaped network branches, each primarily composed of four-directional Mamba blocks, to extract spatial and spectral features separately and hierarchically. The resulting feature maps are sufficiently merged in an auxiliary network branch constructed with FusionMamba blocks. Furthermore, we improve the representation of spectral information through an enhanced channel attention module. Quantitative and qualitative valuation results across six datasets demonstrate that our method achieves SOTA performance. The code is available at https://github.com/PSRben/FusionMamba.

CVApr 14, 2024Code
A Novel State Space Model with Local Enhancement and State Sharing for Image Fusion

Zihan Cao, Xiao Wu, Liang-Jian Deng et al.

In image fusion tasks, images from different sources possess distinct characteristics. This has driven the development of numerous methods to explore better ways of fusing them while preserving their respective characteristics.Mamba, as a state space model, has emerged in the field of natural language processing. Recently, many studies have attempted to extend Mamba to vision tasks. However, due to the nature of images different from causal language sequences, the limited state capacity of Mamba weakens its ability to model image information. Additionally, the sequence modeling ability of Mamba is only capable of spatial information and cannot effectively capture the rich spectral information in images. Motivated by these challenges, we customize and improve the vision Mamba network designed for the image fusion task. Specifically, we propose the local-enhanced vision Mamba block, dubbed as LEVM. The LEVM block can improve local information perception of the network and simultaneously learn local and global spatial information. Furthermore, we propose the state sharing technique to enhance spatial details and integrate spatial and spectral information. Finally, the overall network is a multi-scale structure based on vision Mamba, called LE-Mamba. Extensive experiments show the proposed methods achieve state-of-the-art results on multispectral pansharpening and multispectral and hyperspectral image fusion datasets, and demonstrate the effectiveness of the proposed approach. Codes can be accessed at \url{https://github.com/294coder/Efficient-MIF}.

CVApr 17, 2024Code
SSDiff: Spatial-spectral Integrated Diffusion Model for Remote Sensing Pansharpening

Yu Zhong, Xiao Wu, Liang-Jian Deng et al.

Pansharpening is a significant image fusion technique that merges the spatial content and spectral characteristics of remote sensing images to generate high-resolution multispectral images. Recently, denoising diffusion probabilistic models have been gradually applied to visual tasks, enhancing controllable image generation through low-rank adaptation (LoRA). In this paper, we introduce a spatial-spectral integrated diffusion model for the remote sensing pansharpening task, called SSDiff, which considers the pansharpening process as the fusion process of spatial and spectral components from the perspective of subspace decomposition. Specifically, SSDiff utilizes spatial and spectral branches to learn spatial details and spectral features separately, then employs a designed alternating projection fusion module (APFM) to accomplish the fusion. Furthermore, we propose a frequency modulation inter-branch module (FMIM) to modulate the frequency distribution between branches. The two components of SSDiff can perform favorably against the APFM when utilizing a LoRA-like branch-wise alternative fine-tuning method. It refines SSDiff to capture component-discriminating features more sufficiently. Finally, extensive experiments on four commonly used datasets, i.e., WorldView-3, WorldView-2, GaoFen-2, and QuickBird, demonstrate the superiority of SSDiff both visually and quantitatively. The code will be made open source after possible acceptance.

CVMay 13, 2024Code
Exploring the Low-Pass Filtering Behavior in Image Super-Resolution

Haoyu Deng, Zijing Xu, Yule Duan et al.

Deep neural networks for image super-resolution (ISR) have shown significant advantages over traditional approaches like the interpolation. However, they are often criticized as 'black boxes' compared to traditional approaches with solid mathematical foundations. In this paper, we attempt to interpret the behavior of deep neural networks in ISR using theories from the field of signal processing. First, we report an intriguing phenomenon, referred to as `the sinc phenomenon.' It occurs when an impulse input is fed to a neural network. Then, building on this observation, we propose a method named Hybrid Response Analysis (HyRA) to analyze the behavior of neural networks in ISR tasks. Specifically, HyRA decomposes a neural network into a parallel connection of a linear system and a non-linear system and demonstrates that the linear system functions as a low-pass filter while the non-linear system injects high-frequency information. Finally, to quantify the injected high-frequency information, we introduce a metric for image-to-image tasks called Frequency Spectrum Distribution Similarity (FSDS). FSDS reflects the distribution similarity of different frequency components and can capture nuances that traditional metrics may overlook. Code, videos and raw experimental results for this paper can be found in: https://github.com/RisingEntropy/LPFInISR.

83.1GTMar 21
Evolutionary Dynamics of Variable Games in Structured Populations

Bin Pi, Minyu Feng, Liang-Jian Deng et al.

The game interactions among individuals in nature are often uncertain and dynamically evolving, significantly influencing the persistence of cooperation. However, it remains a formidable challenge to effectively characterize these dynamic properties in structured populations, derive theoretical conditions for cooperation, and identify the optimal game distribution for promoting cooperation. To address these issues, we propose the variable game framework in a structured population, where the game interactions between different individuals change over time. By means of the Markov chain and the pair approximation method, we derive theoretical conditions under which cooperation is favored by natural selection and when it is favored over defection under weak selection. Furthermore, we respectively formulate and solve two optimization problems to determine the optimal game distribution that most effectively fosters the evolution of cooperation by maximizing the gradient of cooperation selection and minimizing the fitness difference between defectors and cooperators. The theoretical predictions regarding both the conditions for cooperation and optimal game distribution are further validated by numerical calculations and extensive Monte Carlo simulations. Our findings offer novel insights into the mechanisms driving cooperative behavior in complex systems and provide theoretical guidance for designing optimal game environments that facilitate the evolution of cooperation.

CVMar 19, 2025Code
MMAIF: Multi-task and Multi-degradation All-in-One for Image Fusion with Language Guidance

Zihan Cao, Yu Zhong, Ziqi Wang et al.

Image fusion, a fundamental low-level vision task, aims to integrate multiple image sequences into a single output while preserving as much information as possible from the input. However, existing methods face several significant limitations: 1) requiring task- or dataset-specific models; 2) neglecting real-world image degradations (\textit{e.g.}, noise), which causes failure when processing degraded inputs; 3) operating in pixel space, where attention mechanisms are computationally expensive; and 4) lacking user interaction capabilities. To address these challenges, we propose a unified framework for multi-task, multi-degradation, and language-guided image fusion. Our framework includes two key components: 1) a practical degradation pipeline that simulates real-world image degradations and generates interactive prompts to guide the model; 2) an all-in-one Diffusion Transformer (DiT) operating in latent space, which fuses a clean image conditioned on both the degraded inputs and the generated prompts. Furthermore, we introduce principled modifications to the original DiT architecture to better suit the fusion task. Based on this framework, we develop two versions of the model: Regression-based and Flow Matching-based variants. Extensive qualitative and quantitative experiments demonstrate that our approach effectively addresses the aforementioned limitations and outperforms previous restoration+fusion and all-in-one pipelines. Codes are available at https://github.com/294coder/MMAIF.

CVDec 24, 2025
Matrix Completion Via Reweighted Logarithmic Norm Minimization

Zhijie Wang, Liangtian He, Qinghua Zhang et al.

Low-rank matrix completion (LRMC) has demonstrated remarkable success in a wide range of applications. To address the NP-hard nature of the rank minimization problem, the nuclear norm is commonly used as a convex and computationally tractable surrogate for the rank function. However, this approach often yields suboptimal solutions due to the excessive shrinkage of singular values. In this letter, we propose a novel reweighted logarithmic norm as a more effective nonconvex surrogate, which provides a closer approximation than many existing alternatives. We efficiently solve the resulting optimization problem by employing the alternating direction method of multipliers (ADMM). Experimental results on image inpainting demonstrate that the proposed method achieves superior performance compared to state-of-the-art LRMC approaches, both in terms of visual quality and quantitative metrics.

IVFeb 2
Hyperspectral Image Fusion with Spectral-Band and Fusion-Scale Agnosticism

Yu-Jie Liang, Zihan Cao, Liang-Jian Deng et al.

Current deep learning models for Multispectral and Hyperspectral Image Fusion (MS/HS fusion) are typically designed for fixed spectral bands and spatial scales, which limits their transferability across diverse sensors. To address this, we propose SSA, a universal framework for MS/HS fusion with spectral-band and fusion-scale agnosticism. Specifically, we introduce Matryoshka Kernel (MK), a novel operator that enables a single model to adapt to arbitrary numbers of spectral channels. Meanwhile, we build SSA upon an Implicit Neural Representation (INR) backbone that models the HS signal as a continuous function, enabling reconstruction at arbitrary spatial resolutions. Together, these two forms of agnosticism enable a single MS/HS fusion model that generalizes effectively to unseen sensors and spatial scales. Extensive experiments demonstrate that our single model achieves state-of-the-art performance while generalizing well to unseen sensors and scales, paving the way toward future HS foundation models.

AISep 18, 2025Code
A Knowledge-driven Adaptive Collaboration of LLMs for Enhancing Medical Decision-making

Xiao Wu, Ting-Zhu Huang, Liang-Jian Deng et al.

Medical decision-making often involves integrating knowledge from multiple clinical specialties, typically achieved through multidisciplinary teams. Inspired by this collaborative process, recent work has leveraged large language models (LLMs) in multi-agent collaboration frameworks to emulate expert teamwork. While these approaches improve reasoning through agent interaction, they are limited by static, pre-assigned roles, which hinder adaptability and dynamic knowledge integration. To address these limitations, we propose KAMAC, a Knowledge-driven Adaptive Multi-Agent Collaboration framework that enables LLM agents to dynamically form and expand expert teams based on the evolving diagnostic context. KAMAC begins with one or more expert agents and then conducts a knowledge-driven discussion to identify and fill knowledge gaps by recruiting additional specialists as needed. This supports flexible, scalable collaboration in complex clinical scenarios, with decisions finalized through reviewing updated agent comments. Experiments on two real-world medical benchmarks demonstrate that KAMAC significantly outperforms both single-agent and advanced multi-agent methods, particularly in complex clinical scenarios (i.e., cancer prognosis) requiring dynamic, cross-specialty expertise. Our code is publicly available at: https://github.com/XiaoXiao-Woo/KAMAC.

CVAug 21, 2025Code
Spiking Variational Graph Representation Inference for Video Summarization

Wenrui Li, Wei Han, Liang-Jian Deng et al.

With the rise of short video content, efficient video summarization techniques for extracting key information have become crucial. However, existing methods struggle to capture the global temporal dependencies and maintain the semantic coherence of video content. Additionally, these methods are also influenced by noise during multi-channel feature fusion. We propose a Spiking Variational Graph (SpiVG) Network, which enhances information density and reduces computational complexity. First, we design a keyframe extractor based on Spiking Neural Networks (SNN), leveraging the event-driven computation mechanism of SNNs to learn keyframe features autonomously. To enable fine-grained and adaptable reasoning across video frames, we introduce a Dynamic Aggregation Graph Reasoner, which decouples contextual object consistency from semantic perspective coherence. We present a Variational Inference Reconstruction Module to address uncertainty and noise arising during multi-channel feature fusion. In this module, we employ Evidence Lower Bound Optimization (ELBO) to capture the latent structure of multi-channel feature distributions, using posterior distribution regularization to reduce overfitting. Experimental results show that SpiVG surpasses existing methods across multiple datasets such as SumMe, TVSum, VideoXum, and QFVS. Our codes and pre-trained models are available at https://github.com/liwrui/SpiVG.

CVMar 19, 2025Code
Taming Flow Matching with Unbalanced Optimal Transport into Fast Pansharpening

Zihan Cao, Yu Zhong, Liang-Jian Deng

Pansharpening, a pivotal task in remote sensing for fusing high-resolution panchromatic and multispectral imagery, has garnered significant research interest. Recent advancements employing diffusion models based on stochastic differential equations (SDEs) have demonstrated state-of-the-art performance. However, the inherent multi-step sampling process of SDEs imposes substantial computational overhead, hindering practical deployment. While existing methods adopt efficient samplers, knowledge distillation, or retraining to reduce sampling steps (e.g., from 1,000 to fewer steps), such approaches often compromise fusion quality. In this work, we propose the Optimal Transport Flow Matching (OTFM) framework, which integrates the dual formulation of unbalanced optimal transport (UOT) to achieve one-step, high-quality pansharpening. Unlike conventional OT formulations that enforce rigid distribution alignment, UOT relaxes marginal constraints to enhance modeling flexibility, accommodating the intrinsic spectral and spatial disparities in remote sensing data. Furthermore, we incorporate task-specific regularization into the UOT objective, enhancing the robustness of the flow model. The OTFM framework enables simulation-free training and single-step inference while maintaining strict adherence to pansharpening constraints. Experimental evaluations across multiple datasets demonstrate that OTFM matches or exceeds the performance of previous regression-based models and leading diffusion-based methods while only needing one sampling step. Codes are available at https://github.com/294coder/PAN-OTFM.

CVMar 20, 2018Code
FastDeRain: A Novel Video Rain Streak Removal Method Using Directional Gradient Priors

Tai-Xiang Jiang, Ting-Zhu Huang, Xi-Le Zhao et al.

Rain streak removal is an important issue in outdoor vision systems and has recently been investigated extensively. In this paper, we propose a novel video rain streak removal approach FastDeRain, which fully considers the discriminative characteristics of rain streaks and the clean video in the gradient domain. Specifically, on the one hand, rain streaks are sparse and smooth along the direction of the raindrops, whereas on the other hand, clean videos exhibit piecewise smoothness along the rain-perpendicular direction and continuity along the temporal direction. Theses smoothness and continuity results in the sparse distribution in the different directional gradient domain, respectively. Thus, we minimize 1) the $\ell_1$ norm to enhance the sparsity of the underlying rain streaks, 2) two $\ell_1$ norm of unidirectional Total Variation (TV) regularizers to guarantee the anisotropic spatial smoothness, and 3) an $\ell_1$ norm of the time-directional difference operator to characterize the temporal continuity. A split augmented Lagrangian shrinkage algorithm (SALSA) based algorithm is designed to solve the proposed minimization model. Experiments conducted on synthetic and real data demonstrate the effectiveness and efficiency of the proposed method. According to comprehensive quantitative performance measures, our approach outperforms other state-of-the-art methods especially on account of the running time. The code of FastDeRain can be downloaded at https://github.com/TaiXiangJiang/FastDeRain.

18.5CVMar 15
G-ZAP: A Generalizable Zero-Shot Framework for Arbitrary-Scale Pansharpening

Zhiqi Yang, Shan Yin, Jingze Liang et al.

Pansharpening aims to fuse a high-resolution panchromatic (PAN) image and a low-resolution multispectral (LRMS) image to produce a high-resolution multispectral (HRMS) image. Recent deep models have achieved strong performance, yet they typically rely on large-scale pretraining and often generalize poorly to unseen real-world image pairs.Prior zero-shot approaches improve real-scene generalization but require per-image optimization, hindering weight reuse, and the above methods are usually limited to a fixed scale.To address this issue, we propose G-ZAP, a generalizable zero-shot framework for arbitrary-scale pansharpening, designed to handle cross-resolution, cross-scene, and cross-sensor generalization.G-ZAP adopts a feature-based implicit neural representation (INR) fusion network as the backbone and introduces a multi-scale, semi-supervised training scheme to enable robust generalization.Extensive experiments on multiple real-world datasets show that G-ZAP achieves state-of-the-art results under PAN-scale fusion in both visual quality and quantitative metrics.Notably, G-ZAP supports weight reuse across image pairs while maintaining competitiveness with per-pair retraining, demonstrating strong potential for efficient real-world deployment.

CVApr 23, 2024
Fourier-enhanced Implicit Neural Fusion Network for Multispectral and Hyperspectral Image Fusion

Yu-Jie Liang, Zihan Cao, Liang-Jian Deng et al.

Recently, implicit neural representations (INR) have made significant strides in various vision-related domains, providing a novel solution for Multispectral and Hyperspectral Image Fusion (MHIF) tasks. However, INR is prone to losing high-frequency information and is confined to the lack of global perceptual capabilities. To address these issues, this paper introduces a Fourier-enhanced Implicit Neural Fusion Network (FeINFN) specifically designed for MHIF task, targeting the following phenomena: The Fourier amplitudes of the HR-HSI latent code and LR-HSI are remarkably similar; however, their phases exhibit different patterns. In FeINFN, we innovatively propose a spatial and frequency implicit fusion function (Spa-Fre IFF), helping INR capture high-frequency information and expanding the receptive field. Besides, a new decoder employing a complex Gabor wavelet activation function, called Spatial-Frequency Interactive Decoder (SFID), is invented to enhance the interaction of INR features. Especially, we further theoretically prove that the Gabor wavelet activation possesses a time-frequency tightness property that favors learning the optimal bandwidths in the decoder. Experiments on two benchmark MHIF datasets verify the state-of-the-art (SOTA) performance of the proposed method, both visually and quantitatively. Also, ablation studies demonstrate the mentioned contributions. The code will be available on Anonymous GitHub (https://anonymous.4open.science/r/FeINFN-15C9/) after possible acceptance.

CVMar 1, 2025
Adaptive Rectangular Convolution for Remote Sensing Pansharpening

Xueyang Wang, Zhixin Zheng, Jiandong Shao et al.

Recent advancements in convolutional neural network (CNN)-based techniques for remote sensing pansharpening have markedly enhanced image quality. However, conventional convolutional modules in these methods have two critical drawbacks. First, the sampling positions in convolution operations are confined to a fixed square window. Second, the number of sampling points is preset and remains unchanged. Given the diverse object sizes in remote sensing images, these rigid parameters lead to suboptimal feature extraction. To overcome these limitations, we introduce an innovative convolutional module, Adaptive Rectangular Convolution (ARConv). ARConv adaptively learns both the height and width of the convolutional kernel and dynamically adjusts the number of sampling points based on the learned scale. This approach enables ARConv to effectively capture scale-specific features of various objects within an image, optimizing kernel sizes and sampling locations. Additionally, we propose ARNet, a network architecture in which ARConv is the primary convolutional module. Extensive evaluations across multiple datasets reveal the superiority of our method in enhancing pansharpening performance over previous techniques. Ablation studies and visualization further confirm the efficacy of ARConv.

CVApr 1, 2024
CMT: Cross Modulation Transformer with Hybrid Loss for Pansharpening

Wen-Jie Shu, Hong-Xia Dou, Rui Wen et al.

Pansharpening aims to enhance remote sensing image (RSI) quality by merging high-resolution panchromatic (PAN) with multispectral (MS) images. However, prior techniques struggled to optimally fuse PAN and MS images for enhanced spatial and spectral information, due to a lack of a systematic framework capable of effectively coordinating their individual strengths. In response, we present the Cross Modulation Transformer (CMT), a pioneering method that modifies the attention mechanism. This approach utilizes a robust modulation technique from signal processing, integrating it into the attention mechanism's calculations. It dynamically tunes the weights of the carrier's value (V) matrix according to the modulator's features, thus resolving historical challenges and achieving a seamless integration of spatial and spectral attributes. Furthermore, considering that RSI exhibits large-scale features and edge details along with local textures, we crafted a hybrid loss function that combines Fourier and wavelet transforms to effectively capture these characteristics, thereby enhancing both spatial and spectral accuracy in pansharpening. Extensive experiments demonstrate our framework's superior performance over existing state-of-the-art methods. The code will be publicly available to encourage further research.

CVApr 17, 2024
Neural Shrödinger Bridge Matching for Pansharpening

Zihan Cao, Xiao Wu, Liang-Jian Deng

Recent diffusion probabilistic models (DPM) in the field of pansharpening have been gradually gaining attention and have achieved state-of-the-art (SOTA) performance. In this paper, we identify shortcomings in directly applying DPMs to the task of pansharpening as an inverse problem: 1) initiating sampling directly from Gaussian noise neglects the low-resolution multispectral image (LRMS) as a prior; 2) low sampling efficiency often necessitates a higher number of sampling steps. We first reformulate pansharpening into the stochastic differential equation (SDE) form of an inverse problem. Building upon this, we propose a Schrödinger bridge matching method that addresses both issues. We design an efficient deep neural network architecture tailored for the proposed SB matching. In comparison to the well-established DL-regressive-based framework and the recent DPM framework, our method demonstrates SOTA performance with fewer sampling steps. Moreover, we discuss the relationship between SB matching and other methods based on SDEs and ordinary differential equations (ODEs), as well as its connection with optimal transport. Code will be available.

CVAug 10, 2025
Training and Inference within 1 Second -- Tackle Cross-Sensor Degradation of Real-World Pansharpening with Efficient Residual Feature Tailoring

Tianyu Xin, Jin-Liang Xiao, Zeyu Xia et al.

Deep learning methods for pansharpening have advanced rapidly, yet models pretrained on data from a specific sensor often generalize poorly to data from other sensors. Existing methods to tackle such cross-sensor degradation include retraining model or zero-shot methods, but they are highly time-consuming or even need extra training data. To address these challenges, our method first performs modular decomposition on deep learning-based pansharpening models, revealing a general yet critical interface where high-dimensional fused features begin mapping to the channel space of the final image. % may need revisement A Feature Tailor is then integrated at this interface to address cross-sensor degradation at the feature level, and is trained efficiently with physics-aware unsupervised losses. Moreover, our method operates in a patch-wise manner, training on partial patches and performing parallel inference on all patches to boost efficiency. Our method offers two key advantages: (1) $\textit{Improved Generalization Ability}$: it significantly enhance performance in cross-sensor cases. (2) $\textit{Low Generalization Cost}$: it achieves sub-second training and inference, requiring only partial test inputs and no external data, whereas prior methods often take minutes or even hours. Experiments on the real-world data from multiple datasets demonstrate that our method achieves state-of-the-art quality and efficiency in tackling cross-sensor degradation. For example, training and inference of $512\times512\times8$ image within $\textit{0.2 seconds}$ and $4000\times4000\times8$ image within $\textit{3 seconds}$ at the fastest setting on a commonly used RTX 3090 GPU, which is over 100 times faster than zero-shot methods.

CVJul 27, 2025
SWIFT: A General Sensitive Weight Identification Framework for Fast Sensor-Transfer Pansharpening

Zeyu Xia, Chenxi Sun, Tianyu Xin et al.

Pansharpening aims to fuse high-resolution panchromatic (PAN) images with low-resolution multispectral (LRMS) images to generate high-resolution multispectral (HRMS) images. Although deep learning-based methods have achieved promising performance, they generally suffer from severe performance degradation when applied to data from unseen sensors. Adapting these models through full-scale retraining or designing more complex architectures is often prohibitively expensive and impractical for real-world deployment. To address this critical challenge, we propose a fast and general-purpose framework for cross-sensor adaptation, SWIFT (Sensitive Weight Identification for Fast Transfer). Specifically, SWIFT employs an unsupervised sampling strategy based on data manifold structures to balance sample selection while mitigating the bias of traditional Farthest Point Sampling, efficiently selecting only 3\% of the most informative samples from the target domain. This subset is then used to probe a source-domain pre-trained model by analyzing the gradient behavior of its parameters, allowing for the quick identification and subsequent update of only the weight subset most sensitive to the domain shift. As a plug-and-play framework, SWIFT can be applied to various existing pansharpening models. Extensive experiments demonstrate that SWIFT reduces the adaptation time from hours to approximately one minute on a single NVIDIA RTX 4090 GPU. The adapted models not only substantially outperform direct-transfer baselines but also achieve performance competitive with, and in some cases superior to, full retraining, establishing a new state-of-the-art on cross-sensor pansharpening tasks for the WorldView-2 and QuickBird datasets.

CVMay 25, 2025
Fast Kernel-Space Diffusion for Remote Sensing Pansharpening

Hancong Jin, Zihan Cao, Liang-jian Deng et al.

Pansharpening seeks to fuse high-resolution panchromatic (PAN) and low-resolution multispectral (LRMS) images into a single image with both fine spatial and rich spectral detail. Despite progress in deep learning-based approaches, existing methods often fail to capture global priors inherent in remote sensing data distributions. Diffusion-based models have recently emerged as promising solutions due to their powerful distribution mapping capabilities, however, they suffer from heavy inference latency. We introduce KSDiff, a fast kernel-space diffusion framework that generates convolutional kernels enriched with global context to enhance pansharpening quality and accelerate inference. Specifically, KSDiff constructs these kernels through the integration of a low-rank core tensor generator and a unified factor generator, orchestrated by a structure-aware multi-head attention mechanism. We further introduce a two-stage training strategy tailored for pansharpening, facilitating integration into existing pansharpening architectures. Experiments show that KSDiff achieves superior performance compared to recent promising methods, and with over $500 \times$ faster inference than diffusion-based pansharpening baselines. Ablation studies, visualizations and further evaluations substantiate the effectiveness of our approach. Code will be released upon possible acceptance.

CVApr 14, 2025
CAT: A Conditional Adaptation Tailor for Efficient and Effective Instance-Specific Pansharpening on Real-World Data

Tianyu Xin, Jin-Liang Xiao, Zeyu Xia et al.

Pansharpening is a crucial remote sensing technique that fuses low-resolution multispectral (LRMS) images with high-resolution panchromatic (PAN) images to generate high-resolution multispectral (HRMS) imagery. Although deep learning techniques have significantly advanced pansharpening, many existing methods suffer from limited cross-sensor generalization and high computational overhead, restricting their real-time applications. To address these challenges, we propose an efficient framework that quickly adapts to a specific input instance, completing both training and inference in a short time. Our framework splits the input image into multiple patches, selects a subset for unsupervised CAT training, and then performs inference on all patches, stitching them into the final output. The CAT module, integrated between the feature extraction and channel transformation stages of a pre-trained network, tailors the fused features and fixes the parameters for efficient inference, generating improved results. Our approach offers two key advantages: (1) $\textit{Improved Generalization Ability}$: by mitigating cross-sensor degradation, our model--although pre-trained on a specific dataset--achieves superior performance on datasets captured by other sensors; (2) $\textit{Enhanced Computational Efficiency}$: the CAT-enhanced network can swiftly adapt to the test sample using the single LRMS-PAN pair input, without requiring extensive large-scale data retraining. Experiments on the real-world data from WorldView-3 and WorldView-2 datasets demonstrate that our method achieves state-of-the-art performance on cross-sensor real-world data, while achieving both training and inference of $512\times512$ image within $\textit{0.4 seconds}$ and $4000\times4000$ image within $\textit{3 seconds}$ at the fastest setting on a commonly used RTX 3090 GPU.

CVMar 30, 2022
SIT: A Bionic and Non-Linear Neuron for Spiking Neural Network

Cheng Jin, Rui-Jie Zhu, Xiao Wu et al.

Spiking Neural Networks (SNNs) have piqued researchers' interest because of their capacity to process temporal information and low power consumption. However, current state-of-the-art methods limited their biological plausibility and performance because their neurons are generally built on the simple Leaky-Integrate-and-Fire (LIF) model. Due to the high level of dynamic complexity, modern neuron models have seldom been implemented in SNN practice. In this study, we adopt the Phase Plane Analysis (PPA) technique, a technique often utilized in neurodynamics field, to integrate a recent neuron model, namely, the Izhikevich neuron. Based on the findings in the advancement of neuroscience, the Izhikevich neuron model can be biologically plausible while maintaining comparable computational cost with LIF neurons. By utilizing the adopted PPA, we have accomplished putting neurons built with the modified Izhikevich model into SNN practice, dubbed as the Standardized Izhikevich Tonic (SIT) neuron. For performance, we evaluate the suggested technique for image classification tasks in self-built LIF-and-SIT-consisted SNNs, named Hybrid Neural Network (HNN) on static MNIST, Fashion-MNIST, CIFAR-10 datasets and neuromorphic N-MNIST, CIFAR10-DVS, and DVS128 Gesture datasets. The experimental results indicate that the suggested method achieves comparable accuracy while exhibiting more biologically realistic behaviors on nearly all test datasets, demonstrating the efficiency of this novel strategy in bridging the gap between neurodynamics and SNN practice.

CVDec 4, 2021
A Triple-Double Convolutional Neural Network for Panchromatic Sharpening

Tian-Jing Zhang, Liang-Jian Deng, Ting-Zhu Huang et al.

Pansharpening refers to the fusion of a panchromatic image with a high spatial resolution and a multispectral image with a low spatial resolution, aiming to obtain a high spatial resolution multispectral image. In this paper, we propose a novel deep neural network architecture with level-domain based loss function for pansharpening by taking into account the following double-type structures, \emph{i.e.,} double-level, double-branch, and double-direction, called as triple-double network (TDNet). By using the structure of TDNet, the spatial details of the panchromatic image can be fully exploited and utilized to progressively inject into the low spatial resolution multispectral image, thus yielding the high spatial resolution output. The specific network design is motivated by the physical formula of the traditional multi-resolution analysis (MRA) methods. Hence, an effective MRA fusion module is also integrated into the TDNet. Besides, we adopt a few ResNet blocks and some multi-scale convolution kernels to deepen and widen the network to effectively enhance the feature extraction and the robustness of the proposed TDNet. Extensive experiments on reduced- and full-resolution datasets acquired by WorldView-3, QuickBird, and GaoFen-2 sensors demonstrate the superiority of the proposed TDNet compared with some recent state-of-the-art pansharpening approaches. An ablation study has also corroborated the effectiveness of the proposed approach.

CVSep 5, 2021
Fusformer: A Transformer-based Fusion Approach for Hyperspectral Image Super-resolution

Jin-Fan Hu, Ting-Zhu Huang, Liang-Jian Deng

Hyperspectral image has become increasingly crucial due to its abundant spectral information. However, It has poor spatial resolution with the limitation of the current imaging mechanism. Nowadays, many convolutional neural networks have been proposed for the hyperspectral image super-resolution problem. However, convolutional neural network (CNN) based methods only consider the local information instead of the global one with the limited kernel size of receptive field in the convolution operation. In this paper, we design a network based on the transformer for fusing the low-resolution hyperspectral images and high-resolution multispectral images to obtain the high-resolution hyperspectral images. Thanks to the representing ability of the transformer, our approach is able to explore the intrinsic relationships of features globally. Furthermore, considering the LR-HSIs hold the main spectral structure, the network focuses on the spatial detail estimation releasing from the burden of reconstructing the whole data. It reduces the mapping space of the proposed network, which enhances the final performance. Various experiments and quality indexes show our approach's superiority compared with other state-of-the-art methods.

CVJul 24, 2021
LAConv: Local Adaptive Convolution for Image Fusion

Zi-Rong Jin, Liang-Jian Deng, Tai-Xiang Jiang et al.

The convolution operation is a powerful tool for feature extraction and plays a prominent role in the field of computer vision. However, when targeting the pixel-wise tasks like image fusion, it would not fully perceive the particularity of each pixel in the image if the uniform convolution kernel is used on different patches. In this paper, we propose a local adaptive convolution (LAConv), which is dynamically adjusted to different spatial locations. LAConv enables the network to pay attention to every specific local area in the learning process. Besides, the dynamic bias (DYB) is introduced to provide more possibilities for the depiction of features and make the network more flexible. We further design a residual structure network equipped with the proposed LAConv and DYB modules, and apply it to two image fusion tasks. Experiments for pansharpening and hyperspectral image super-resolution (HISR) demonstrate the superiority of our method over other state-of-the-art methods. It is worth mentioning that LAConv can also be competent for other super-resolution tasks with less computation effort.

IVMay 29, 2020
Hyperspectral Image Super-resolution via Deep Spatio-spectral Convolutional Neural Networks

Jin-Fan Hu, Ting-Zhu Huang, Liang-Jian Deng et al.

Hyperspectral images are of crucial importance in order to better understand features of different materials. To reach this goal, they leverage on a high number of spectral bands. However, this interesting characteristic is often paid by a reduced spatial resolution compared with traditional multispectral image systems. In order to alleviate this issue, in this work, we propose a simple and efficient architecture for deep convolutional neural networks to fuse a low-resolution hyperspectral image (LR-HSI) and a high-resolution multispectral image (HR-MSI), yielding a high-resolution hyperspectral image (HR-HSI). The network is designed to preserve both spatial and spectral information thanks to an architecture from two folds: one is to utilize the HR-HSI at a different scale to get an output with a satisfied spectral preservation; another one is to apply concepts of multi-resolution analysis to extract high-frequency information, aiming to output high quality spatial details. Finally, a plain mean squared error loss function is used to measure the performance during the training. Extensive experiments demonstrate that the proposed network architecture achieves best performance (both qualitatively and quantitatively) compared with recent state-of-the-art hyperspectral image super-resolution approaches. Moreover, other significant advantages can be pointed out by the use of the proposed approach, such as, a better network generalization ability, a limited computational burden, and a robustness with respect to the number of training samples.

CVAug 26, 2018
Rain Streak Removal for Single Image via Kernel Guided CNN

Ye-Tao Wang, Xi-Le Zhao, Tai-Xiang Jiang et al.

Rain streak removal is an important issue and has recently been investigated extensively. Existing methods, especially the newly emerged deep learning methods, could remove the rain streaks well in many cases. However the essential factor in the generative procedure of the rain streaks, i.e., the motion blur, which leads to the line pattern appearances, were neglected by the deep learning rain streaks approaches and this resulted in over-derain or under-derain results. In this paper, we propose a novel rain streak removal approach using a kernel guided convolutional neural network (KGCNN), achieving the state-of-the-art performance with simple network architectures. We first model the rain streak interference with its motion blur mechanism. Then, our framework starts with learning the motion blur kernel, which is determined by two factors including angle and length, by a plain neural network, denoted as parameter net, from a patch of the texture component. Then, after a dimensionality stretching operation, the learned motion blur kernel is stretched into a degradation map with the same spatial size as the rainy patch. The stretched degradation map together with the texture patch is subsequently input into a derain convolutional network, which is a typical ResNet architecture and trained to output the rain streaks with the guidance of the learned motion blur kernel. Experiments conducted on extensive synthetic and real data demonstrate the effectiveness of the proposed method, which preserves the texture and the contrast while removing the rain streaks.

NADec 15, 2017
Multi-dimensional imaging data recovery via minimizing the partial sum of tubal nuclear norm

Tai-Xiang Jiang, Ting-Zhu Huang, Xi-Le Zhao et al.

In this paper, we investigate tensor recovery problems within the tensor singular value decomposition (t-SVD) framework. We propose the partial sum of the tubal nuclear norm (PSTNN) of a tensor. The PSTNN is a surrogate of the tensor tubal multi-rank. We build two PSTNN-based minimization models for two typical tensor recovery problems, i.e., the tensor completion and the tensor principal component analysis. We give two algorithms based on the alternating direction method of multipliers (ADMM) to solve proposed PSTNN-based tensor recovery models. Experimental results on the synthetic data and real-world data reveal the superior of the proposed PSTNN.

CVMar 12, 2015
Single image super-resolution by approximated Heaviside functions

Liang-Jian Deng, Weihong Guo, Ting-Zhu Huang

Image super-resolution is a process to enhance image resolution. It is widely used in medical imaging, satellite imaging, target recognition, etc. In this paper, we conduct continuous modeling and assume that the unknown image intensity function is defined on a continuous domain and belongs to a space with a redundant basis. We propose a new iterative model for single image super-resolution based on an observation: an image is consisted of smooth components and non-smooth components, and we use two classes of approximated Heaviside functions (AHFs) to represent them respectively. Due to sparsity of the non-smooth components, a $L_{1}$ model is employed. In addition, we apply the proposed iterative model to image patches to reduce computation and storage. Comparisons with some existing competitive methods show the effectiveness of the proposed method.